Re: [HACKERS] Patch: Implement failover on libpq connect level.

2016-09-08 Thread Mithun Cy
On Wed, Sep 7, 2016 at 7:26 PM, Victor Wagner  wrote:
> No, the algorithm here is more complicated. It must ensure that there is
> no second connection attempt to a host for which an attempt has already
> failed. So, there is a list rearrangement.
> An algorithm to pick a random list element in a single pass is quite trivial.

If the concern is only about excluding addresses that were tried previously,
then I think we can try this: randomly permute the given address list (for
example, with a Fisher-Yates shuffle) before trying any address in it.
After that, the new list can be treated exactly as we do for sequential
connections. I think this makes the code less complex.
-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] [sqlsmith] Failed assertion in joinrels.c

2016-09-08 Thread Dilip Kumar
On Thu, Sep 8, 2016 at 7:21 PM, Tom Lane  wrote:
> We could make it work like that without breaking the ABI if we were
> to add a NOERROR bit to the allowed "flags".  However, after looking
> around a bit I'm no longer convinced what I said above is a good idea.
> In particular, if we put the responsibility of reporting the error on
> callers then we'll lose the ability to distinguish "no such pg_type OID"
> from "type exists but it's only a shell".  So I'm now thinking it's okay
> to promote lookup_type_cache's elog to a full ereport, especially since
> it's already using ereport for some of the related error cases such as
> the shell-type failure.

Okay..
>
> That still leaves us with what to do for domain_in.  A really simple
> fix would be to move the InitDomainConstraintRef call before the
> getBaseTypeAndTypmod call, because the former fetches the typcache
> entry and would then benefit from lookup_type_cache's ereport.
> But that seems a bit magic.
>
> I'm tempted to propose that domain_state_setup start with
>
> typentry = lookup_type_cache(domainType, TYPECACHE_DOMAIN_INFO);
> if (typentry->typtype != TYPTYPE_DOMAIN)
>     ereport(ERROR,
>             (errcode(ERRCODE_DATATYPE_MISMATCH),
>              errmsg("type %s is not a domain",
>                     format_type_be(domainType))));
>
> removing the need to verify that after getBaseTypeAndTypmod.

Seems like a better option; I've done it this way.
Attaching the modified patch.
>
> The cache loading is basically free because InitDomainConstraintRef
> would do it anyway; so the extra cost here is only one dynahash search.
> You could imagine buying back those cycles by teaching the typcache
> to be able to cache the result of getBaseTypeAndTypmod, but I'm doubtful
> that we really care.  This whole setup sequence only happens once per
> query anyway.

Agreed.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


cache_lookup_failure_v5.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] feature request: explain "with details" option

2016-09-08 Thread Tom Lane
Craig Ringer  writes:
> On 9 September 2016 at 01:40, Roger Pack  wrote:
>> Today's explain tells us what loops and scans were used, and relative
>> costs, etc.  It doesn't seem to tell *why* the planner elected to use
>> what it did.

> One thing that's been discussed here is to have a way to see which
> potential plans are rejected and compare their costs.

> This isn't simple because there are often *lots* of variants. You
> don't just want to see the "top 10" candidate plans, because they're
> probably a bunch of small variants on the same plan; the ones you'll
> be interested in will probably be very different plans with very bad
> relative estimates.

The other big problem here is that the planner tries *very* hard to reject
losing paths early; so it does not even form an explainable plan for a
large fraction of the search space.  (And if it did, you'd be dead before
you got your EXPLAIN result back.)

People have experimented with making the planner log every candidate path
before the path enters the comparison tournament (and, typically, doesn't
survive the first round).  But I've never seen any version of that that
I thought would be intelligible to non-experts.  It's exceedingly verbose
and it certainly doesn't look anything like what we know as EXPLAIN output.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical Replication WIP

2016-09-08 Thread Peter Eisentraut
Review of 0003-Define-logical-replication-protocol-and-output-plugi.patch:

(This is still based on the Aug 31 patch set, but at quick glance I
didn't see any significant changes in the Sep 8 set.)

Generally, this all seems mostly fine.  Everything is encapsulated
well enough that problems are localized and any tweaks don't affect
the overall work.

Changes needed to build:

--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2158,8 +2158,8 @@ Logical Streaming Replication
Parameters
  
 
 Comma separated list of publication names for which to
subscribe
-   (receive changes). See
-for more info.
+   (receive changes). 

  
   
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -25,6 +25,7 @@
 #include "utils/builtins.h"
 #include "utils/inval.h"
 #include "utils/memutils.h"
+#include "utils/syscache.h"

 PG_MODULE_MAGIC;

This is all fixed in later patches.

AFAICT, pgoutput does not use libpq, so the mentions in
src/backend/replication/pgoutput/Makefile are not needed (perhaps
copied from libpqwalreceiver?).

The start_replication option pg_version option is not documented and
not used in any later patch.  We can probably do without it and just
rely on the protocol version.

In pgoutput_startup(), you check opt->output_type.  But it is not set
anywhere.  Actually, the startup callback is supposed to set it
itself.

In init_rel_sync_cache(), the way hash_flags is set seems kind of
weird.  I think that variable could be removed and the flags put
directly into the hash_create() call.

pgoutput_config.c seems over-engineered, e.g., converting cstring to
Datum and back.  Just do normal DefElem list parsing in pgoutput.c.
That's not pretty either, but at least it's a common coding pattern.

In the protocol documentation, explain the meaning of int64 as a
commit timestamp.

Also, the documentation should emphasize more clearly that all the
messages are not actually top-level protocol messages but are
contained inside binary copy data.

On the actual protocol messages:

Why do strings have a length byte?  That is not how other strings in
the protocol work.  As a minor side-effect, this would limit, for
example, column names to 255 characters.

The message structure doesn't match the documentation in some ways.
For example Attributes and TupleData are not separate messages but are
contained in Relation and Insert/Update/Delete messages.  So the
documentation needs to be structured a bit differently.

In the Attributes message (or actually Relation message), we don't
need the 'A' and 'C' bytes.

I'm not sure that pgoutput should concern itself with the client
encoding.  The client encoding should already be set by the initial
FE/BE protocol handshake.  I haven't checked that further yet, so it
might already work, or it should be made to work that way, or I might
be way off.

Slight abuse of pqformat functions.  We're not composing messages
using pq_beginmessage()/pq_endmessage(), and we're not using
pq_getmsgend() when reading.  The "proper" way to do this is probably
to define a custom set of PQcommMethods.  (low priority)

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CVE-2016-1238 fix breaks (at least) pg_rewind tests

2016-09-08 Thread Tom Lane
Andres Freund  writes:
> On 2016-09-08 20:15:46 -0400, Peter Eisentraut wrote:
>> We don't support build directories with spaces in them, but we support
>> installation directories with spaces in them.  So I guess that means
>> your point is valid.

> Even if not necessary in this specific case, I personally think it's
> just a good idea to quote halfway sensibly, especially when it's
> trivial to do so.

The problem is it's only halfway ... won't paths with quotes in them
still break it?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Stopping logical replication protocol

2016-09-08 Thread Craig Ringer
On 9 September 2016 at 10:37, Craig Ringer  wrote:

> I'm looking at the revised patch now.

Considerably improved.


I fixed a typo, "decondig" for "decoding", that appeared throughout the
callback names.


This test is wrong:

+while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL &&
+       rb->continue_decoding_cb != NULL &&
+       rb->continue_decoding_cb())

If continue_decoding_cb is NULL, the && makes the whole condition false
and the loop exits, so decoding would stop precisely when no callback is
registered. A NULL callback should mean "always continue decoding"; only
a registered callback returning false should stop it.

It also violates the PostgreSQL source formatting conventions; it
should be wrapped at 80 columns. (Yes, that's archaic, but it's kind of
useful when you've got multiple files open side-by-side, and at least
it's consistent across the codebase.)

so it should be:

while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL &&
       (rb->continue_decoding_cb == NULL ||
        rb->continue_decoding_cb()))

I'd prefer the continue_decoding_cb tests to be on the same line, but
it's slightly too wide. Eh, whatever.


Same logic issue later:

- if(rb->continue_decoding_cb != NULL && rb->continue_decoding_cb())
+if (rb->continue_decoding_cb == NULL || rb->continue_decoding_cb())


Commit message is slightly inaccurate. The walsender DID look for
client messages during ReorderBufferCommit decoding, but it never
reacted to CopyDone sent by a client when it saw it.


Rewrote comment on IsStreamingActive() per my original review advice.



I'm not sure why exactly you removed

-/* fast path */
-/* Try to flush pending output to the client */
-if (pq_flush_if_writable() != 0)
-WalSndShutdown();
-
-if (!pq_is_send_pending())
-return;
-

It should be OK if we flush pending output even after we receive
CopyDone. In fact, we have to, since we can't unqueue it and have to
send it before we can send our own CopyDone reply.
pq_flush_if_writable() should only return EOF if the socket is closed,
in which case fast bailout is the right thing to do.

Can you explain your thinking here and what the intended outcome is?



I've attached updated patches with a number of typo fixes,
copy-editing changes, a fix to the test logic, etc as noted above.

Setting "waiting on author" in CF per discussion of the need for tests.


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
From 14bbd25cf27c3555126b571ee7cb41524f2a8729 Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Tue, 10 May 2016 10:34:10 +0800
Subject: [PATCH 1/2] Respect client-initiated CopyDone in walsender

The walsender never reacted to CopyDone sent by the client unless it
had already decided it was done sending data and dispatched its own
CopyDone message.

It actually noticed CopyDone from the client when WalSndWriteData() called
ProcessRepliesIfAny() but it didn't react to it until it separately decided to
end streaming from the walsender end.

Modify the walsender so it checks for client-initiated CopyDone when in COPY
BOTH mode. It now cleans up what it's doing, replies with its own CopyDone and
returns to command mode. In logical decoding this will allow the client to end
a logical decoding session between transactions without just unilaterally
closing its connection. For physical walsender connections this allows the
client to end streaming before the end of a timeline.

This change does not allow a client to end a COPY BOTH session in the middle of
processing a logical decoding commit (in ReorderBufferCommit) or while decoding
a large WAL record, so there can still be a significant delay before the
walsender reacts to the client.
---
 src/backend/replication/walsender.c | 23 +++
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 1ea2a5c..ad94c13 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -759,6 +759,13 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 	/* make sure we have enough WAL available */
 	flushptr = WalSndWaitForWal(targetPagePtr + reqLen);
 
+	/*
+	 * If the client sent CopyDone while we were waiting,
+	 * bail out so we can wind up the decoding session.
+	 */
+	if (streamingDoneSending)
+		return -1;
+
 	/* more than one block available */
 	if (targetPagePtr + XLOG_BLCKSZ <= flushptr)
 		count = XLOG_BLCKSZ;
@@ -1220,8 +1227,11 @@ WalSndWaitForWal(XLogRecPtr loc)
 		 * It's important to do this check after the recomputation of
 		 * RecentFlushPtr, so we can send all remaining data before shutting
 		 * down.
+		 *
+		 * We'll also exit here if the client sent CopyDone because it wants
+		 * to return to command mode.
 		 */
-		if (walsender_ready_to_stop)
+		if (walsender_ready_to_stop || streamingDoneReceiving)
 			break;
 
 		/*
@@ 

Re: [HACKERS] GiST penalty functions [PoC]

2016-09-08 Thread Andrey Borodin
Thank you for your attention to details, Mikhail.

pack_float_good() looks good. But I'm not sure an inline struct initializer is 
allowed under ANSI C. Converting it to the regular ancient form b.fp = v; won't 
change the compiled result, would it?

Regards, Andrey Borodin.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-08 Thread Pavan Deolasee
On Thu, Sep 8, 2016 at 11:40 PM, Masahiko Sawada 
wrote:

> Making it possible for vacuum to choose between two data representations
> sounds good.
> I implemented a patch that changes the dead tuple representation to a
> bitmap before.
> I will measure the performance of the bitmap representation again and
> post the results.


Sounds great! I haven't seen your patch, but what I would suggest is to
compute the page density D = (dead + live tuples)/relpages and experiment with
bitmaps of D to 2D bits per page. May I also suggest that, instead of putting
effort into implementing the overflow area, you just count how many dead TIDs
would fall into the overflow area for a given choice of bitmap size.

It might be a good idea to experiment with different vacuum scale factors,
varying between 2% and 20% (say 2, 5, 10, 20). You can probably run a
longish pgbench test on a large table and then save the data directory for
repeated experiments, although I'm not sure pgbench will be a good
choice, because HOT will prevent accumulation of dead line pointers; in that
case you may try adding another index on the abalance column.

It'll be worth measuring the memory consumption of both representations as
well as the performance implications for index vacuum. I don't expect to see
any major difference in the heap scans.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-08 Thread Amit Kapila
On Fri, Sep 9, 2016 at 12:39 AM, Jeff Janes  wrote:
>
> I plan to do testing using my own testing harness after changing it to
> insert a lot of dummy tuples (ones with negative values in the pseudo-pk
> column, which are never queried by the core part of the harness) and
> deleting them at random intervals.  I think that none of pgbench's built in
> tests are likely to give the bucket splitting and squeezing code very much
> exercise.
>

The hash index tests [1] written by Mithun do cover part of that code,
and we have done testing by extending those tests to cover the splitting
and squeezing parts of the code.

> Is there a way to gather statistics on how many of each type of WAL record
> are actually getting sent over the replication link?  The only way I can
> think of is to turn on wal archving as well as replication, then using
> pg_xlogdump to gather the stats.
>

Sounds sensible, but what do you want to learn from the number of
each type of WAL record?  I understand it is important to cover all
the WAL record types for hash indexes (and I think Ashutosh has done that
during his tests [2]), but maybe sending the same record multiple times
could further strengthen the validation.

> I've run my original test for a while now and have not seen any problems.
> But I realized I forgot to compile with --enable-cassert, so I will have to
> redo it to make sure the assertion failures have been fixed.  In my original
> testing I did very rarely get a deadlock (or some kind of hang), and I
> haven't seen that again so far.  It was probably the same source as the one
> Mark observed, and so the same fix.
>

Thanks for the verification.


[1] - https://commitfest.postgresql.org/10/716/
[2] - 
https://www.postgresql.org/message-id/CAE9k0PkPumi4iWFuD%2BjHHkpcxn531%3DDJ8uH0dctsvF%2BdaZY6yQ%40mail.gmail.com
-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

2016-09-08 Thread Andrey Borodin
That storage assertion fired during a plain "UPDATE table SET x = random()" 
without conditions. Also, make check fails without it (for BRIN usage; GiST is 
OK with it).
I cannot type the quotation properly; I'm on a phone for a long time.

Regards, Andrey Borodin.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ICU integration

2016-09-08 Thread Peter Geoghegan
On Thu, Sep 8, 2016 at 6:48 PM, Craig Ringer  wrote:
> Pity ICU doesn't offer versioned collations within a single install.
> Though I can understand why they don't.

There are two separate issues with collator versioning. ICU can
probably be used in a way that clearly decouples these two issues,
which is very important. The first is that the rules of collations
change. The second is that the binary key that collators create (i.e.
the equivalent of strxfrm()) can change for various reasons that have
nothing to do with culture or natural languages -- purely technical
reasons. For example, they can add new optimizations to make
generating new binary keys faster. If there are bugs in how that
works, they can fix the bugs and increment the identifier [1], which
could allow Postgres to insist on a REINDEX (if abbreviated keys for
collated text were reenabled, although I don't think that problems
like that are limited to binary key generation).

So, to bring it back to that little program I wrote:

$ ./icu-coll-versions | head
Collator  | ICU Version | UCA Version
--------------------------+-------------+------------
Afrikaans | 99-38-00-00 | 07-00-00-00
Afrikaans (Namibia)   | 99-38-00-00 | 07-00-00-00
Afrikaans (South Africa)  | 99-38-00-00 | 07-00-00-00
Aghem | 99-38-00-00 | 07-00-00-00
Aghem (Cameroon)  | 99-38-00-00 | 07-00-00-00
Akan  | 99-38-00-00 | 07-00-00-00
Akan (Ghana)  | 99-38-00-00 | 07-00-00-00
Amharic   | 99-38-00-00 | 07-00-00-00

Here, what appears as "ICU version" has the identifier [1] baked in,
although this is undocumented (it also has any "custom tailorings"
that might be used, say if we had user defined customizations to
collations, as Firebird apparently does [2] [3]). I'm pretty sure that
UCA version relates to a version of the Unicode collation algorithm,
and its associated DUCET table (this is all subject to ISO
standardization). I gather that a particular collation is actually
comprised of a base UCA version (and DUCET table -- I think that ICU
sometimes calls this the "root"), with custom tailorings that a locale
provides for a given culture or country. These collators may in turn
be further "tailored" to get that fancy user defined customization
stuff.

In principle, and assuming I haven't gotten something wrong, it ought
to be possible to unambiguously identify a collation based on a
matching UCA version (i.e. DUCET table), plus the collation tailorings
matching exactly, even across ICU versions that happen to be based on
the same UCA version (they only seem to put out a new UCA version
about once a year [4]).  It *might* be fine, practically speaking, to
assume that a collation with a matching iso-code and UCA version is
compatible forever and always across any ICU version. If not, it might
instead be feasible to write a custom fingerprinter for collation
tailorings that ran at initdb time. Maybe the tailorings, which are
abstract rules, could even be stored in system catalogs, so the only
thing that need match is ICU's UCA version (the "root" collators must
still match), since replicas may reconstruct the serialized tailorings
that comprise a collation as needed [5][6], since the tailoring that a
default collator for a locale uses isn't special, technically
speaking.

Of course, this is all pretty hand-wavey right now, and much more
research is needed. I am very intrigued about the idea of storing the
collators in the system catalogs wholesale, since ICU provides
facilities that make that seem possible. If a "serialized unicode set"
built from a collator's tailoring rules, or, alternatively, a collator
saved as a binary representation [7] were stored in the system
catalogs, perhaps it wouldn't matter as much that the stuff
distributed with different ICU versions didn't match, at least in
theory. It's unclear that the system catalog representation could be
usable with a fair cross section of ICU versions, but if it could then
that would be perfect. This also seems to be how Firebird style
user-defined tailorings might be implemented anyway, and it seems very
appealing to add that as a light layer on top of how the base system
works, if at all possible.

[1] 
https://github.com/svn2github/libicu/blob/c43ec130ea0ee6cd565d87d70088e1d70d892f32/common/unicode/uvernum.h#L149
[2] http://www.firebirdsql.org/refdocs/langrefupd25-ddl-collation.html
[3] 
http://userguide.icu-project.org/collation/customization#TOC-Building-on-Existing-Locales
[4] http://unicode.org/reports/tr10/#Synch_14651_Table
[5] 
https://ssl.icu-project.org/apiref/icu4c/ucol_8h.html#a1982f184bca8adaa848144a1959ff235
[6] 

Re: [HACKERS] Stopping logical replication protocol

2016-09-08 Thread Michael Paquier
On Fri, Sep 9, 2016 at 11:37 AM, Craig Ringer  wrote:
> When writing TAP tests for Perl you must ensure you use only Perl
> 5.8.8-compatible code and only the core modules plus IPC::Run . You'll
> usually be fine if you just avoid importing things you don't see other
> modules already using. Don't bother actually installing Perl 5.8.8, we
> can run tests in an ancient-perl Docker container before committing
> and fix any issues. I don't see any reason you'd need anything special
> for this test.

On top of that, the Perl documentation is your friend if you have a
doubt about the versioning of a method:
http://perldoc.perl.org/
The important thing is that it is easy to navigate through Perl
versions down to 5.8.8, so you can easily check whether what you are
using is suitable for the buildfarm.

> If you think some of the code you add would be very reusable for other
> test modules in future, feel free to propose new methods for
> PostgresNode.pm . Ask if you're not sure.

That would be great if things generic enough could be extracted, but I
haven't read the patch to make an opinion regarding what could be
done.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-08 Thread Mark Kirkwood

On 09/09/16 07:09, Jeff Janes wrote:

On Wed, Sep 7, 2016 at 3:29 AM, Ashutosh Sharma wrote:


> Thanks to Ashutosh Sharma for doing the testing of the patch and
> helping me in analyzing some of the above issues.

Hi All,

I would like to summarize the test-cases that I have executed for
validating WAL logging in the hash index feature.

1) I mainly ran the pgbench test with a read-write workload at a scale
factor of 1000 and various client counts (16, 64 and 128), for durations
of 30 mins, 1 hr and 24 hrs. I executed this test on a highly configured
power2 machine with 128 cores and 512GB of RAM. I ran the test-case both
with and without a replication setup.

Please note that I have changed the schema of the pgbench tables created
during the initialisation phase.

The new schema of pgbench tables looks as shown below on both master
and standby:

postgres=# \d pgbench_accounts
   Table "public.pgbench_accounts"
  Column  | Type  | Modifiers
--+---+---
 aid  | integer   | not null
 bid  | integer   |
 abalance | integer   |
 filler   | character(84) |
Indexes:
"pgbench_accounts_pkey" PRIMARY KEY, btree (aid)
"pgbench_accounts_bid" hash (bid)

postgres=# \d pgbench_history
  Table "public.pgbench_history"
 Column |Type | Modifiers
+-+---
 tid| integer |
 bid| integer |
 aid| integer |
 delta  | integer |
 mtime  | timestamp without time zone |
 filler | character(22)   |
Indexes:
"pgbench_history_bid" hash (bid)


Hi Ashutosh,

This schema will test the maintenance of hash indexes, but it will 
never use hash indexes for searching, so it limits the amount of test 
coverage you will get.  While searching shouldn't generate novel types 
of WAL records (that I know of), it will generate locking and timing 
issues that might uncover bugs (if there are any left to uncover, of 
course).


I would drop the primary key on pgbench_accounts and replace it with a 
hash index and test it that way (except I don't have a 128 core 
machine at my disposal, so really I am suggesting that you do this...)


The lack of primary key and the non-uniqueness of the hash index 
should not be an operational problem, because the built in pgbench 
runs never attempt to violate the constraints anyway.


In fact, I'd replace all of the indexes on the rest of the pgbench 
tables with hash indexes, too, just for additional testing.


I plan to do testing using my own testing harness after changing it to 
insert a lot of dummy tuples (ones with negative values in the 
pseudo-pk column, which are never queried by the core part of the 
harness) and deleting them at random intervals.  I think that none of 
pgbench's built in tests are likely to give the bucket splitting and 
squeezing code very much exercise.


Is there a way to gather statistics on how many of each type of WAL 
record are actually getting sent over the replication link?  The only 
way I can think of is to turn on wal archving as well as replication, 
then using pg_xlogdump to gather the stats.


I've run my original test for a while now and have not seen any 
problems.  But I realized I forgot to compile with --enable-cassert, so 
I will have to redo it to make sure the assertion failures have been 
fixed.  In my original testing I did very rarely get a deadlock (or 
some kind of hang), and I haven't seen that again so far.  It was 
probably the same source as the one Mark observed, and so the same fix.


Cheers,

Jeff


Yeah, good suggestion about replacing (essentially) all the indexes with 
hash ones and testing. I did some short runs with this type of schema 
yesterday (actually to get a feel for whether hash performance was 
comparable to btree - it does seem to be), but longer runs with higher 
concurrency (as high as I can manage on a single-socket i7 anyway) are a 
good plan. If Ashutosh has access to seriously large numbers of cores 
then that is even better :-)


Cheers

Mark


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Stopping logical replication protocol

2016-09-08 Thread Craig Ringer
On 9 September 2016 at 03:56, Vladimir Gordiychuk  wrote:
>>It'd helpful if you summarize the changes made when posting revisions.
>
> Can we use the first message as a summary of the changes?

I meant "what did you change between the last patch I posted and the
one you just posted" ?

I'm looking at the revised patch now.

> Sounds good. Should these changes be included in this PR?

Yes, probably as a 3rd patch on top of these two that adds the
pg_recvlogical tests.

I suggest:

1 and 2: same patches as currently
3: send CopyDone on SIGINT in pg_recvlogical, second SIGINT just
closes per current behaviour
4: add tests using the code in 1-3

When you're editing such a patch stack, sometimes you'll find you want
to change something in patch 1 or 2 or 3 while working on patch 4. The
trick to doing this is to make sure to commit your changes in small
pieces and avoid mixing up changes for different patches. So you'll
land up with lots of small commits like

* "fix typo in ReorderBufferIsActive declaration"
* "forgot to test streamingDoneReceiving in SomeFunction"

etc.

Then when you're ready to send a new patch series, you tag your
current working branch (optional, but makes cleaning up after mistakes
easier), rebase the changes, tag the result, git format-patch it and
push it, e.g.:


git checkout my-branch
git tag my-branch-v4-before-rebase
# non-interactive rebase on top of current Pg git master
git rebase master
# Interactive rebase of all patches on top of master.
git rebase -i master

This will produce an editor window. Here you choose patches to reorder
and squash into existing patches, e.g if you start with:

pick 81b3f61 Respect client-initiated CopyDone in walsender
pick 43de23d Second part of Vladimir Gordiychuk's patch
pick 1231134 fix a typo in part 1
pick 123 forgot to check something in part 2
pick a213111 another typo

you might edit it to:

pick 81b3f61 Respect client-initiated CopyDone in walsender
fixup 1231134 fix a typo in part 1
fixup a213111 another typo
pick 43de23d Second part of Vladimir Gordiychuk's patch
fixup 123 forgot to check something in part 2

When you save and quit, git will reorder the patches and squash them
for you. If there are merge conflicts you fix them at each stage then
"git rebase --continue".

It takes some getting used to but isn't really hard at all. If you run
into trouble, just push your current working branch to a public tree
(github or whatever) and ask for help here.

Once you're used to it, git has a few helper features like "fixup!"
commits and --auto-squash that make this a bit quicker, but it's worth
doing manually first since it's easier to figure out problems.

There's lots of good reference material on this out there so I won't
go on about it endlessly.

> I'm not sure that I can do this before this commitfest completes. I need
> some time to understand how these tests work and how I can write a new
> one.

No problem. I don't think it matters much if the patch bumps to the
next commitfest. It's not a big intrusive change that needs to get in
early in the development cycle.

I'm not a committer and can't commit this myself. If I were I don't
think I'd do so without a working test anyway.

A good starting point for the tests is src/test/perl/README . You can
add a test to src/test/recovery/t . Follow the model used in the other
tests there.

I wrote some tests using logical decoding in the logical decoding
timeline following patch; see
https://commitfest.postgresql.org/10/779/ for the patch, or
https://github.com/2ndQuadrant/postgres/tree/dev/logical-decoding-timeline-following-pg10-v1
. They use the SQL interface not pg_recvlogical though.

There's also 
https://github.com/2ndQuadrant/postgres/blob/dev/failover-slots/src/test/recovery/t/008_failover_slots.pl
which has another example of controlling an external process, in this
case pg_receivexlog, using IPC::Run. See start_pg_receivexlog(...) and
its callers.

When writing TAP tests for Perl you must ensure you use only Perl
5.8.8-compatible code and only the core modules plus IPC::Run . You'll
usually be fine if you just avoid importing things you don't see other
modules already using. Don't bother actually installing Perl 5.8.8; we
can run tests in an ancient-perl Docker container before committing
and fix any issues. I don't see any reason you'd need anything special
for this test.

If you think some of the code you add would be very reusable for other
test modules in future, feel free to propose new methods for
PostgresNode.pm . Ask if you're not sure.

Ping me if you're stuck or need a hand. I really want more people to
have a clue about how the TAP tests work. I don't use IRC (wrong
timezone and life is busy) so email is best.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Re: [HACKERS] ICU integration

2016-09-08 Thread Craig Ringer
On 9 September 2016 at 08:51, Tatsuo Ishii  wrote:
>> On 9 September 2016 at 00:19, Peter Eisentraut
>>  wrote:
>>> On 9/8/16 11:16 AM, Tom Lane wrote:
 This is a problem, if ICU won't guarantee cross-version compatibility,
 because it destroys the argument that moving to ICU would offer us
 collation behavior stability.
>>>
>>> It would offer a significant upgrade over the current situation.
>>>
>>> First, it offers stability inside the same version.  Whereas glibc might
>>> change a collation in a minor upgrade, ICU won't do that.  And the
>>> postgres binary is bound to a major version of ICU by the soname (which
>>> changes with every major release).  So this would avoid the situation
>>> that a simple OS update could break collations.
>>
>> It also lets *users* and PostgreSQL-specific distributors bundle ICU
>> and get stable collations. We can't exactly bundle glibc.
>
> I would like to know the fate of community RPMs because if PostgreSQL
> does not include ICU source, the effort of integrating ICU is totally
> up to packagers.

CC'ing Devrim.

Personally I would be pretty reluctant to package libicu when it's
already in RH/Fedora. If it were bundled in Pg's source tree and a
private copy was built/installed by the build system so it was part of
the main postgresql-server package that'd be different. I can't really
imagine that being acceptable for upstream development though.

RH and Debian would instantly rip it out and replace it with their
packaged ICU anyway.

Pity ICU doesn't offer versioned collations within a single install.
Though I can understand why they don't.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] An extra error for client disconnection on Windows

2016-09-08 Thread Haribabu Kommi
On Thu, Jun 2, 2016 at 6:51 PM, Kyotaro HORIGUCHI <
horiguchi.kyot...@lab.ntt.co.jp> wrote:

> Hello.
>
> After a process termination without PQfinish() of a client,
> server emits the following log message not seen on Linux boxes.
>
> > LOG:  could not receive data from client: An existing connection was
> forcibly closed by the remote host.
>
> This is because pgwin32_recv returns an error ECONNRESET for the
> situation instead of returning non-error EOF as recv(2) does.
>
> This patch translates WSAECONNRESET of WSARecv to an EOF so that
> pgwin32_recv behaves the same way with Linux.
>
> The attached patch does this.
>

I reviewed and verified the changes. This patch works as it states.
Now there is no extra error message that occurs whenever a client
disconnects abnormally.

Marked the patch as "ready for committer".

Regards,
Hari Babu
Fujitsu Australia


Re: [HACKERS] feature request: explain "with details" option

2016-09-08 Thread Craig Ringer
On 9 September 2016 at 01:40, Roger Pack  wrote:
> My apologies if this was already requested before...
>
> I think it would be fantastic if postgres had an "explain the explain" option:
> Today's explain tells us what loops and scans were used, and relative
> costs, etc.  It doesn't seem to tell *why* the planner elected to use
> what it did.

One thing that's been discussed here is to have a way to see which
potential plans are rejected and compare their costs.

This isn't simple because there are often *lots* of variants. You
don't just want to see the "top 10" candidate plans, because they're
probably a bunch of small variants on the same plan; the ones you'll
be interested in will probably be very different plans with very bad
relative estimates.

Say you've got a query over 20 tables through five views. You're only
interested in a particular part that runs much slower than the
estimates say it should. You don't care about any of the other parts
of the plan. How can Pg help with this?

My thinking is that maybe the planner can expose an interface tools
can use to offer plan alternative drill-down. You'd start with the
original plan and the tool would ask "show me alternatives for this
sub-path". You'd explore the plan to see what alternatives were
considered. A way to see how cost estimates are calculated and based
on what stats would be needed.

That's hardly a simple explain (altplans), though.

If you have good ideas about how to choose a small subset of alternate
plans to show the user that'd be informative and not risk being even
more misleading, that'd be good. But in a cost-based planner that
explores many paths this might not be as simple as you expect.

> For instance, in the case of a corrupted index, it doesn't say why
> it's not using that index, it just doesn't use it, causing some
> confusion to end users.  At least causing confusion to me.

It doesn't have a "corrupted index" flag. What do you mean by this?

> Or in the case of where it iterates over an entire table (seq. scan)
> instead of using an index because the index range specified "is most
> of the table" (thus not helpful to use the index)...The choice is
> appropriate.  The reasoning why is not explicitly mentioned.  Again
> causing possibility for some delay as you try to "decipher the mind"
> of the planner.  Sometimes tables (ex: tables after having been first
> propagated) need an "analyze" run on them, but it's not clear from an
> "explain" output that the analyze statistics are faulty.  Not even a
> hint.

That one's not simple. If Pg knew the stats were wrong it'd say so,
but it has no idea. It'd have to consult its stats to figure out ...
oh, damn.

We could probably do a better job of identifying tables that have been
flagged as needing analyze but autovacuum hasn't got around to it yet,
though.
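Something along those lines is already approximately visible from SQL; here is a sketch (the 10% threshold is illustrative only, and autovacuum's real trigger formula also involves autovacuum_analyze_scale_factor and per-table reloptions):

```sql
-- Tables with lots of row churn since their last (auto)analyze.
-- n_mod_since_analyze exists in pg_stat_user_tables from 9.4 on.
SELECT relname,
       n_live_tup,
       n_mod_since_analyze,
       last_analyze,
       last_autoanalyze
FROM pg_stat_user_tables
WHERE n_mod_since_analyze > 0.1 * greatest(n_live_tup, 1)
ORDER BY n_mod_since_analyze DESC;
```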

> So this is a feature request for an "EXPLAIN DETAILS" option or
> something, basically like today's explain but with more "rationale"
> included.  This could be immensely useful to many Postgres users.
>
> I'd even be willing to chip in a couple hundred bucks if it would help
> grease the wheels for somebody taking up the challenge if that helps
> at all :)

I think you missed a zero or two, unfortunately. I don't think this is
a small project to do well and right. Doing it badly might just add
more confusing/misleading information. Then again, I'm not exactly a
planner expert.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Declarative partitioning - another take

2016-09-08 Thread Amit Langote
On 2016/09/08 21:38, Rajkumar Raghuwanshi wrote:
> On Wed, Sep 7, 2016 at 3:58 PM, Amit Langote wrote:
>> On 2016/09/07 17:56, Rajkumar Raghuwanshi wrote:
>>>
>>> In this case not sure how to create partition table. Do we have something
>>> like we have UNBOUNDED for range partition or oracle have "DEFAULT" for
>>> list partition.
>>>
>>> create table employee (empid int, dept varchar) partition by list(dept);
>>> create table emp_p1 partition of employee for values in ('IT');
>>> create table emp_p2 partition of employee for values in ('HR');
>>> create table emp_p3 partition of employee for values in (??);
>>
>> Sorry, no such feature is currently offered.  It might be possible to
>> offer something like a "default" list partition which accepts values other
>> than those specified for other existing partitions.  However, that means
>> if we add a non-default list partition after a default one has been
>> created, the implementation must make sure that it moves any values from
>> the default partition that now belong to the newly created partition.
> 
> Thanks for clarifying, but I see the same problem of moving data when
> adding a new non-default partition with an unbounded range partition.
> 
> In the example given here, I initially created a partitioned table test
> with test_p3 having an unbounded end,
> then tried to change test_p3 to contain values 7-9 only, while adding a new
> partition test_p4 containing 10-unbounded.
> 
> --create partition table and some leafs
> CREATE TABLE test (a int, b int) PARTITION BY RANGE(a);
> CREATE TABLE test_p1 PARTITION OF test FOR VALUES START (1) END (4);
> CREATE TABLE test_p2 PARTITION OF test FOR VALUES START (4) END (7);
> CREATE TABLE test_p3 PARTITION OF test FOR VALUES START (7) END UNBOUNDED;
> 
> --insert some data
> INSERT INTO test SELECT i, i*10 FROM generate_series(1,3) i;
> INSERT INTO test SELECT i, i*10 FROM generate_series(4,6) i;
> INSERT INTO test SELECT i, i*10 FROM generate_series(7,13) i;
> 
> --directly not able to attach test_p4 because of an overlap error, hence
> detached test_p3 and then attached test_p4
> SELECT tableoid::regclass,* FROM test;
>  tableoid | a  |  b
> ----------+----+-----
>  test_p1  |  1 |  10
>  test_p1  |  2 |  20
>  test_p1  |  3 |  30
>  test_p2  |  4 |  40
>  test_p2  |  5 |  50
>  test_p2  |  6 |  60
>  test_p3  |  7 |  70
>  test_p3  |  8 |  80
>  test_p3  |  9 |  90
>  test_p3  | 10 | 100
>  test_p3  | 11 | 110
>  test_p3  | 12 | 120
>  test_p3  | 13 | 130
> (13 rows)
> 
> ALTER TABLE test DETACH PARTITION test_p3;
> CREATE TABLE test_p4 (like test);
> ALTER TABLE test ATTACH PARTITION test_p4 FOR VALUES start (10) end
> UNBOUNDED;
> 
> --now cannot attach test_p3 because of overlap with test_p4, causing data
> loss from the main test table.
> ALTER TABLE test ATTACH PARTITION test_p3 FOR VALUES start (7) end (10);
> ERROR:  source table contains a row violating partition bound specification
> ALTER TABLE test ATTACH PARTITION test_p3 FOR VALUES start (7) end (13);
> ERROR:  partition "test_p3" would overlap partition "test_p4"

In this particular case, you will have to move any rows in test_p3 with
key > or >= 10 into the new partition test_p4 using DML (so as not to lose
any data).  Then attach test_p3 as partition for values start (7) end (10);
you won't get either of the above errors.
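Spelled out with the tables from the example above, the row movement could look like this (a sketch, assuming the patch's FOR VALUES START/END syntax and tuple routing on INSERT into the parent):

```sql
-- test_p3 is detached and test_p4 attached, as in the example above.
-- Move rows with a >= 10 out of test_p3; inserting through the parent
-- routes them into test_p4.
WITH moved AS (
    DELETE FROM test_p3 WHERE a >= 10 RETURNING *
)
INSERT INTO test SELECT * FROM moved;

-- test_p3 now holds only keys 7-9, so re-attaching neither violates the
-- bound specification nor overlaps test_p4.
ALTER TABLE test ATTACH PARTITION test_p3 FOR VALUES START (7) END (10);
```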

Looking forward, what we need I think is a split partition command.
Adding a new partition that overlaps the default list partition or
unbounded range partition could be done by splitting the latter. Perhaps
something like:

alter table test
  split partition test_p3 at (10) [inclusive | exclusive] with test_p4;

The above command would make test_p3 into 2 partitions:

  test_p3 start (7) end (10) [inclusive | exclusive]
  test_p4 start (10) [exclusive | inclusive] end unbounded

Any rows in test_p3 with key > or >= 10 will be moved into the newly
created test_p4 as part of the execution of this command.

For your list partitioning example:

create table employee (empid int, dept varchar) partition by list(dept);
create table emp_p1 partition of employee for values in ('IT');
create table emp_p2 partition of employee for values in ('HR');
create table emp_p3 partition of employee for values in (default);

alter table employee
  split partition emp_p3 with emp_p3 ('ACCT') emp_p4 (default);

Any rows in emp_p3 with key != 'ACCT' will be moved into the newly created
default partition emp_p4.


But for time being, I think we could provide the syntax and mechanism for
default list partition seeing as we have the same for range partitioned
table (namely a range partition with unbounded start or end).  Although
with the limitations as discussed.

Thoughts?

Thanks,
Amit






Re: [HACKERS] Quorum commit for multiple synchronous replication.

2016-09-08 Thread Michael Paquier
On Thu, Sep 8, 2016 at 6:26 PM, Masahiko Sawada  wrote:
> "k (n1, n2, n3)" == "first k (n1, n2, n3)" doesn't break backward
> compatibility, but most users would read "k(n1, n2, n3)" as quorum
> once quorum is introduced.
> I wish we could change the s_s_names syntax of 9.6 to the "first k(n1, n2,
> n3)" style before 9.6 is released, if we get consensus.

Considering breaking backward compatibility in the next release does
not sound like a good idea to me for a new feature that is going to be
GA soon.
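For reference, the two spellings discussed above would appear in postgresql.conf like this (a sketch; the bare k (n1, n2, n3) form is what 9.6 actually accepts, and the first-prefixed form is only the proposed syntax):

```
# 9.6 syntax: wait for 2 of the listed standbys, chosen in priority order
synchronous_standby_names = '2 (n1, n2, n3)'

# proposed explicit spelling of the same priority-based behavior
synchronous_standby_names = 'first 2 (n1, n2, n3)'
```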
-- 
Michael




Re: Install extensions using update scripts (was Re: [HACKERS] Remove superuser() checks from pgstattuple)

2016-09-08 Thread Andres Freund
On 2016-09-07 13:44:28 -0400, Tom Lane wrote:
> +   
> +Installing Extensions using Update Scripts
> +
> +
> + An extension that has been around for awhile will probably exist in

Wanted to cry typo for 'awhile' here, but apparently that's actually a
word. The German in me is pleased.


> + several versions, for which the author will need to write update 
> scripts.
> + For example, if you have released a foo extension in
> + versions 1.0, 1.1, and 1.2, there
> + should be update scripts foo--1.0--1.1.sql
> + and foo--1.1--1.2.sql.
> + Before PostgreSQL 10, it was necessary to create new
> + script files foo--1.1.sql and foo--1.2.sql
> + containing the same changes, or else the newer versions could not be

Maybe replace "same changes" with "directly creating the extension's
contents" or such?

- Andres




Re: Install extensions using update scripts (was Re: [HACKERS] Remove superuser() checks from pgstattuple)

2016-09-08 Thread Andres Freund
Hi,


On 2016-09-07 09:46:32 -0400, Tom Lane wrote:
> At this point it's awfully tempting to make ALTER EXTENSION UPDATE grow
> a CASCADE option to allow automatic installation of new requirements
> of the new version, but I didn't do that here.

Sounds like a plan. After the refactoring that should be easy.


> @@ -1086,6 +1097,9 @@ identify_update_path(ExtensionControlFil
>   * Apply Dijkstra's algorithm to find the shortest path from evi_start to
>   * evi_target.
>   *
> + * If reject_indirect is true, ignore paths that go through installable
> + * versions (presumably, caller will consider starting from such versions).

Reading this I was initially confused, only after reading
find_install_path() this made sense. It's to cut the search short,
right?  Rephrasing this a bit might make sense.



> + /*
> +  * No FROM, so we're installing from scratch.  If there is an install
> +  * script for the desired version, we only need to run that one.
> +  */
> + char   *filename;
> + struct stat fst;
> +
>   oldVersionName = NULL;
> - updateVersions = NIL;
> +
> + filename = get_extension_script_filename(pcontrol, NULL, versionName);
> + if (stat(filename, &fst) == 0)
> + {
> + /* Easy, no extra scripts */
> + updateVersions = NIL;
> + }
> + else
> + {
> + /* Look for best way to install this version */
> + List   *evi_list;
> + ExtensionVersionInfo *evi_target;
> +
> + /* Extract the version update graph from the script directory */
> + evi_list = get_ext_ver_list(pcontrol);
> +
> + /* Identify the target version */
> + evi_target = get_ext_ver_info(versionName, &evi_list);
> +
> + /*
> +  * We don't expect it to be installable, but maybe somebody added
> +  * a suitable script right after our stat() test.
> +  */
> + if (evi_target->installable)
> + {
> + /* Easy, no extra scripts */
> + updateVersions = NIL;
> + }

Heh, that's an odd thing to handle.


> + else
> + {
> + /* Identify best path to reach target */
> + ExtensionVersionInfo *evi_start;
> +
> + evi_start = find_install_path(evi_list, evi_target,
> +                               &updateVersions);
> +
> + /* Fail if no path ... */
> + if (evi_start == NULL)
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +  errmsg("extension \"%s\" has no installation script for version \"%s\"",
> + pcontrol->name, versionName)));

Maybe, and I mean maybe, we should rephrase this to hint at indirect
installability?


Looks good to me.


Greetings,

Andres Freund




Re: [HACKERS] ICU integration

2016-09-08 Thread Tatsuo Ishii
> On 9 September 2016 at 00:19, Peter Eisentraut
>  wrote:
>> On 9/8/16 11:16 AM, Tom Lane wrote:
>>> This is a problem, if ICU won't guarantee cross-version compatibility,
>>> because it destroys the argument that moving to ICU would offer us
>>> collation behavior stability.
>>
>> It would offer a significant upgrade over the current situation.
>>
>> First, it offers stability inside the same version.  Whereas glibc might
>> change a collation in a minor upgrade, ICU won't do that.  And the
>> postgres binary is bound to a major version of ICU by the soname (which
>> changes with every major release).  So this would avoid the situation
>> that a simple OS update could break collations.
> 
> It also lets *users* and PostgreSQL-specific distributors bundle ICU
> and get stable collations. We can't exactly bundle glibc.

I would like to know the fate of community RPMs because if PostgreSQL
does not include ICU source, the effort of integrating ICU is totally
up to packagers.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




Re: [HACKERS] _mdfd_getseg can be expensive

2016-09-08 Thread Andres Freund
On 2016-08-31 15:15:16 -0700, Peter Geoghegan wrote:
> On Wed, Aug 31, 2016 at 3:08 PM, Andres Freund  wrote:
> > On August 31, 2016 3:06:23 PM PDT, Peter Geoghegan  wrote:
> >
> >>In other painfully pedantic news, I should point out that
> >>sizeof(size_t) isn't necessarily word size (the most generic
> >>definition of word size for the architecture), contrary to my reading
> >>of the 0002-* patch comments. I'm mostly talking thinking about x86_64
> >>here, of course.
> >
> > Uh?
> 
> Sorry, I really should have not said anything. It is true that x86_64
> word size is sometimes reported as 16 and/or 32 bits [1], because of
> legacy issues.

I think native word size describes the issue well enough. And, more
importantly, I can't think of an equally short but more accurate
description.

I've pushed the patches.  Thanks for the review.

Andres




Re: [HACKERS] mdtruncate leaking fd.c handles?

2016-09-08 Thread Andres Freund
On 2016-09-08 15:50:42 -0700, Andres Freund wrote:
> > Andres Freund  writes:
> > > Am I missing something or is md.c:mdtruncate() leaking open files?

> Will fix.

And done.




Re: [HACKERS] CVE-2016-1238 fix breaks (at least) pg_rewind tests

2016-09-08 Thread Andres Freund
On 2016-09-08 20:15:46 -0400, Peter Eisentraut wrote:
> On 9/8/16 6:04 PM, Alvaro Herrera wrote:
> > Andres Freund wrote:
> >> On 2016-09-08 18:13:06 -0300, Alvaro Herrera wrote:
> >>> I suppose -I$(srcdir) should be fine.  (Why the quotes?)
> >>
> >> Because quoting correctly seems like a good thing to do? Most people
> >> won't have whitespace in there, but it doesn't seem impossible?
> > 
> > Well, I think they are used without quotes in so many places that doing
> > it here seems rather pointless.
> 
> We don't support build directories with spaces in them, but we support
> installation directories with spaces in them.  So I guess that means
> your point is valid.

Even if not necessary in this specific case, I personally think it's
just good practice to quote halfway sensibly. Especially when it's trivial to
do so.




Re: [HACKERS] CVE-2016-1238 fix breaks (at least) pg_rewind tests

2016-09-08 Thread Peter Eisentraut
On 9/8/16 6:04 PM, Alvaro Herrera wrote:
> Andres Freund wrote:
>> On 2016-09-08 18:13:06 -0300, Alvaro Herrera wrote:
>>> I suppose -I$(srcdir) should be fine.  (Why the quotes?)
>>
>> Because quoting correctly seems like a good thing to do? Most people
>> won't have whitespace in there, but it doesn't seem impossible?
> 
> Well, I think they are used without quotes in so many places that doing
> it here seems rather pointless.

We don't support build directories with spaces in them, but we support
installation directories with spaces in them.  So I guess that means
your point is valid.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] ICU integration

2016-09-08 Thread Craig Ringer
On 9 September 2016 at 00:19, Peter Eisentraut
 wrote:
> On 9/8/16 11:16 AM, Tom Lane wrote:
>> This is a problem, if ICU won't guarantee cross-version compatibility,
>> because it destroys the argument that moving to ICU would offer us
>> collation behavior stability.
>
> It would offer a significant upgrade over the current situation.
>
> First, it offers stability inside the same version.  Whereas glibc might
> change a collation in a minor upgrade, ICU won't do that.  And the
> postgres binary is bound to a major version of ICU by the soname (which
> changes with every major release).  So this would avoid the situation
> that a simple OS update could break collations.

It also lets *users* and PostgreSQL-specific distributors bundle ICU
and get stable collations. We can't exactly bundle glibc.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] ICU integration

2016-09-08 Thread Peter Geoghegan
On Thu, Sep 8, 2016 at 8:16 AM, Tom Lane  wrote:
>> I understand that in principle, but I don't see operating system
>> providers shipping a bunch of ICU versions to facilitate that.  They
>> will usually ship one.
>
> I agree with that estimate, and I would further venture that even if we
> wanted to bundle ICU into our tarballs, distributors would rip it out
> again on security grounds.

I agree that we're not going to bundle our own ICU. And, that
packagers have to be more or less on board with whatever plan we come
up with for any of this to be of much practical value. The plan itself is
at least as important as the patch.

> This is a problem, if ICU won't guarantee cross-version compatibility,
> because it destroys the argument that moving to ICU would offer us
> collation behavior stability.

Not exactly. Peter E. didn't seem to be aware that there is an ICU
collator versioning concept (perhaps I misunderstood, though). It
might be that in practice, the locales are very stable, so it almost
doesn't matter that it's annoying when they change. Note that
"collators" are versioned in a sophisticated way, not locales.

You can build the attached simple C program to see the versions of
available collators from each locale, as follows:

$ gcc icu-test.c -licui18n -licuuc -o icu-coll-versions
$ ./icu-coll-versions | head -n 20
Collator  | ICU Version | UCA Version
------------------------------+-------------+-------------
Afrikaans | 99-38-00-00 | 07-00-00-00
Afrikaans (Namibia)   | 99-38-00-00 | 07-00-00-00
Afrikaans (South Africa)  | 99-38-00-00 | 07-00-00-00
Aghem | 99-38-00-00 | 07-00-00-00
Aghem (Cameroon)  | 99-38-00-00 | 07-00-00-00
Akan  | 99-38-00-00 | 07-00-00-00
Akan (Ghana)  | 99-38-00-00 | 07-00-00-00
Amharic   | 99-38-00-00 | 07-00-00-00
Amharic (Ethiopia)| 99-38-00-00 | 07-00-00-00
Arabic| 99-38-1B-01 | 07-00-00-00
Arabic (World)| 99-38-1B-01 | 07-00-00-00
Arabic (United Arab Emirates) | 99-38-1B-01 | 07-00-00-00
Arabic (Bahrain)  | 99-38-1B-01 | 07-00-00-00
Arabic (Djibouti) | 99-38-1B-01 | 07-00-00-00
Arabic (Algeria)  | 99-38-1B-01 | 07-00-00-00
Arabic (Egypt)| 99-38-1B-01 | 07-00-00-00
Arabic (Western Sahara)   | 99-38-1B-01 | 07-00-00-00
Arabic (Eritrea)  | 99-38-1B-01 | 07-00-00-00

I also attach a full list from my Ubuntu 16.04 laptop. I'll try to
find some other system to generate output from, to see how close it
matches what I happen to have here.

"ICU version" here is an opaque 32-bit integer [1]. I'd be interested
to see how much the output of this program differs from one major
version of ICU to the next. Collations will change, of course, but not
that often. It's not the end of the world if somebody has to REINDEX
when they change major OS version. It would be nice if everything just
continued to work with no further input from the user, but it's not
essential, assuming that collation are pretty stable in practice,
which I think they are. It is a total disaster if a mismatch in
collations is initially undetected, though.

Another issue that nobody has mentioned here, I think, is that the
glibc people just don't seem to care about our use-case (Carlos
O'Donnell basically said as much, during the strxfrm() debacle earlier
this year, but it wasn't limited to how we were relying on strxfrm()
at that time). Since it's almost certainly true that other major
database systems are critically reliant on ICU's strxfrm() agreeing
with strcoll (substitute ICU equivalent spellings), and issues beyond
that, it stands to reason that they take that stuff very seriously. It
would be really nice to get back abbreviated keys for collated text,
IMV. I think ICU gets us that. Even if we used ICU in exactly the same
way as we use the C standard library today, that general sense of
stability being critical that ICU has would still be a big advantage.
If ICU drops the ball on collation stability, or strxfrm() disagreeing
with strcoll(), it's a huge problem for lots of groups of people, not
just us.

[1] 
https://ssl.icu-project.org/apiref/icu4c/ucol_8h.html#af756972781ac556a62e48cbd509ea4a6
-- 
Peter Geoghegan

Re: [HACKERS] High-CPU consumption on information_schema (only) query

2016-09-08 Thread Andres Freund
On 2016-09-07 23:37:31 +, Robins Tharakan wrote:
> If someone asks for it I could provide SQL + EXPLAIN, but it feels irrelevant
> here.

Why is that? information_schema are normal sql queries, and some of them
far from trivial.

Andres




Re: [HACKERS] High-CPU consumption on information_schema (only) query

2016-09-08 Thread Jeff Janes
On Wed, Sep 7, 2016 at 4:37 PM, Robins Tharakan  wrote:

> Hi,
>
> An SQL (with only information_schema related JOINs), when triggered, runs
> with max CPU (and never ends - killed after 2 days).
> - It runs similarly (very slow) on a replicated server that acts as a
> read-only slave.
> - Top shows only postgres as hitting max CPU (nothing else). When the query
> is killed, CPU is near 0%.
> - When the DB is restored on a separate test server (with the exact
> postgresql.conf) the same query works fine.
> - There is no concurrent usage on the replicated / test server (although
> the primary is a Production server and has concurrent users).
>
> Questions:
> - If this was a postgres bug or a configuration issue, the query on the
> restored DB should have been slow too. Is there something very basic I am
> missing here?
>
> If someone asks for it I could provide SQL + EXPLAIN, but it feels
> irrelevant here. I'm not looking for a specific solution, but what else
> should I be looking for here?

strace -ttt -T -y the process to see what system calls it is making.  If it
is not doing many system calls, or they are uninformative, then attach the
gdb debugger to it and periodically interrupt the process (ctrl c) and take
a back trace (bt), then restart it (c) and repeat.  If all the stack traces
look similar, you will know where the time is going.

Cheers,

Jeff


Re: [HACKERS] Let file_fdw access COPY FROM PROGRAM

2016-09-08 Thread Corey Huinker
On Thu, Sep 8, 2016 at 6:59 PM, Craig Ringer 
wrote:

> On 9 Sep. 2016 03:45, "Corey Huinker"  wrote:
> >
> >
>
> > Stylistically, would a separate .pl file for the emitter be preferable
> to something inline like
> >
> >> perl -e 'print "a\tb\tcc\t4\n"; print "b\tc\tdd\t5\n"'
>
> I'd be fine with that and a suitable comment. Just be careful with
> different platforms' shell escaping rules.
>
Do perl command switches on Windows/VMS use /e instead of -e?  If so,
that'd be a great argument for doing just "perl filename".


Re: [HACKERS] Let file_fdw access COPY FROM PROGRAM

2016-09-08 Thread Craig Ringer
On 9 Sep. 2016 03:45, "Corey Huinker"  wrote:
>
>

> Stylistically, would a separate .pl file for the emitter be preferable to
something inline like
>
>> perl -e 'print "a\tb\tcc\t4\n"; print "b\tc\tdd\t5\n"'

I'd be fine with that and a suitable comment. Just be careful with
different platforms' shell escaping rules.


Re: [HACKERS] mdtruncate leaking fd.c handles?

2016-09-08 Thread Andres Freund
On 2016-09-08 18:39:56 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > Am I missing something or is md.c:mdtruncate() leaking open files?
> 
> Yeah, I think you're right.
> 
> > This only really matters for VACUUM truncate and ON COMMIT TRUNCATE temp
> > table truncation afaics.
> 
> Also, you'd need a table > 1GB to leak anything at all, which probably
> helps explain why it hasn't been noticed.

Heh, this bug is of some older vintage... Appears to originate in
1a5c450f3024ac57cd6079186c71b3baf39e84eb - before that we did a
FileUnlink(), which includes a FileClose().

Will fix.

Andres




Re: [HACKERS] mdtruncate leaking fd.c handles?

2016-09-08 Thread Tom Lane
Andres Freund  writes:
> Am I missing something or is md.c:mdtruncate() leaking open files?

Yeah, I think you're right.

> This only really matters for VACUUM truncate and ON COMMIT TRUNCATE temp
> table truncation afaics.

Also, you'd need a table > 1GB to leak anything at all, which probably
helps explain why it hasn't been noticed.

regards, tom lane




[HACKERS] mdtruncate leaking fd.c handles?

2016-09-08 Thread Andres Freund
Hi,

Am I missing something or is md.c:mdtruncate() leaking open files?
The relevant piece of code is:

	if (priorblocks > nblocks)
	{
		/*
		 * This segment is no longer active (and has already been unlinked
		 * from the mdfd_chain). We truncate the file, but do not delete
		 * it, for reasons explained in the header comments.
		 */
		if (FileTruncate(v->mdfd_vfd, 0) < 0)
			ereport(ERROR,
					(errcode_for_file_access(),
					 errmsg("could not truncate file \"%s\": %m",
							FilePathName(v->mdfd_vfd))));

		if (!SmgrIsTemp(reln))
			register_dirty_segment(reln, forknum, v);
		v = v->mdfd_chain;
		Assert(ov != reln->md_fd[forknum]); /* we never drop the 1st
											 * segment */
		pfree(ov);
	}

note that we're freeing the chain entry, which contains the mdfd_vfd,
without previously calling FileClose().

This only really matters for VACUUM truncate and ON COMMIT TRUNCATE temp
table truncation afaics.  But it seems like we should fix it everywhere
nonetheless.

Greetings,

Andres Freund




Re: [HACKERS] Preventing deadlock on parallel backup

2016-09-08 Thread Lucas
Tom,

Yes, that is what I mean. It is what pg_dump uses to get things
synchronized. It seems to me a clear marker that the same task is using
more than one connection to accomplish a single job.

Em 08/09/2016 6:34 PM, "Tom Lane"  escreveu:

> Lucas  writes:
> > The queue jumping logic can not use the distributed transaction id?
>
> If we had such a thing as a distributed transaction id, maybe the
> answer could be yes.  We don't.
>
> I did wonder whether using a shared snapshot might be a workable proxy
> for that, but haven't pursued it.
>
> regards, tom lane
>


Re: [HACKERS] CVE-2016-1238 fix breaks (at least) pg_rewind tests

2016-09-08 Thread Alvaro Herrera
Andres Freund wrote:
> On 2016-09-08 18:13:06 -0300, Alvaro Herrera wrote:
> > I suppose -I$(srcdir) should be fine.  (Why the quotes?)
> 
> Because quoting correctly seems like a good thing to do? Most people
> won't have whitespace in there, but it doesn't seem impossible?

Well, I think they are used without quotes in so many places that doing
it here seems rather pointless.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

2016-09-08 Thread Alvaro Herrera
Tom Lane wrote:
> Alvaro Herrera  writes:
> > Tom Lane wrote:
> >> That comment seems utterly wrong to me, because both PageIndexTupleDelete
> >> and PageIndexMultiDelete certainly contain assertions that every item on
> >> the page has storage.  Are you expecting that any BRIN index items
> >> wouldn't?  If they wouldn't, is adjusting their lp_off as if they did
> >> have storage sensible?
> 
> > It is possible in BRIN to have empty intermediate tuples; for example it
> > is possible for lp 1 and 3 to contain index tuples, while lp 2 does not.
> 
> Hm.  So apparently, the only reason this stuff works at all is that
> BRIN isn't using either PageIndexTupleDelete or PageIndexMultiDelete.

Yes, this is why the NoCompact variant was introduced in the first
place.

> > Now if this loop is concerned only with live lps and does not move lps,
> > then it should be fine to add the assertion.
> 
> No, it iterates over all lps on the page.  I'm inclined to think it
> should be written like
> 
>   if (ItemIdHasStorage(ii) && ItemIdGetOffset(ii) <= offset)
>   ii->lp_off += size_diff;
> 
> because futzing with the lp_off field in an item that isn't really
> pointing at storage feels wrong.  We might need to do that to
> PageIndexTupleDelete and/or PageIndexMultiDelete someday, too.

I suppose it is conceivable that we start using lp_off for other
purposes in the future, so I don't disagree.  I don't think index pages
currently do any funny business with it.

> I notice that PageIndexDeleteNoCompact, which seems to express what
> BRIN is expecting in a rather underdocumented way, forces any
> items without storage into "unused" state.  I don't really think
> it's bufpage.c's place to do that, though.  Do you think that code
> is actually supposed to fire, or is it just there for lack of a
> better idea?

I just put it there only because I didn't see any reason not to, really.
I don't think BRIN relies on it.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

2016-09-08 Thread Tom Lane
Alvaro Herrera  writes:
> Tom Lane wrote:
>> That comment seems utterly wrong to me, because both PageIndexTupleDelete
>> and PageIndexMultiDelete certainly contain assertions that every item on
>> the page has storage.  Are you expecting that any BRIN index items
>> wouldn't?  If they wouldn't, is adjusting their lp_off as if they did
>> have storage sensible?

> It is possible in BRIN to have empty intermediate tuples; for example it
> is possible for lp 1 and 3 to contain index tuples, while lp 2 does not.

Hm.  So apparently, the only reason this stuff works at all is that
BRIN isn't using either PageIndexTupleDelete or PageIndexMultiDelete.

> Now if this loop is concerned only with live lps and does not move lps,
> then it should be fine to add the assertion.

No, it iterates over all lps on the page.  I'm inclined to think it
should be written like

if (ItemIdHasStorage(ii) && ItemIdGetOffset(ii) <= offset)
ii->lp_off += size_diff;

because futzing with the lp_off field in an item that isn't really
pointing at storage feels wrong.  We might need to do that to
PageIndexTupleDelete and/or PageIndexMultiDelete someday, too.

I notice that PageIndexDeleteNoCompact, which seems to express what
BRIN is expecting in a rather underdocumented way, forces any
items without storage into "unused" state.  I don't really think
it's bufpage.c's place to do that, though.  Do you think that code
is actually supposed to fire, or is it just there for lack of a
better idea?

regards, tom lane




Re: [HACKERS] Is tuplesort_heap_siftup() a misnomer?

2016-09-08 Thread Claudio Freire
On Thu, Sep 8, 2016 at 2:19 PM, Peter Geoghegan  wrote:
>> Peter, looking at your "displace" patch in this light, I think
>> tuplesort_heap_root_displace() and tuplesort_heap_delete_top() (as I'm
>> calling it now), should share a common subroutine. Displace replaces the top
>> node with the new node passed as argument, and then calls the subroutine to
>> push it down to the right place. Delete_top replaces the top node with the
>> last value in the heap, and then calls the subroutine. Or perhaps delete_top
>> should remove the last value from the heap, and then call displace() with
>> it, to re-insert it.
>
> I can see the value in that, but I'd still rather not. The code reuse
> win isn't all that large, and having a separate
> tuplesort_heap_root_displace() routine allows us to explain what's
> going on for merging, to assert what we're able to assert with tape
> numbers vs. heap position, and to lose the HEAPCOMPARE()/checkIndex
> cruft that the existing routines need (if only barely) to support
> replacement selection. I would be surprised if with a very tight inner
> loop like this, HEAPCOMPARE() has an appreciable overhead (maybe it
> increases pipeline stalls).

BTW, regarding this, since I read in some other thread that it was ok
to use static inline functions, I believe the compiler is smart enough
to optimize out the constant true/false in checkIndex for inlined
calls, so that may be viable (on modern compilers).




Re: [HACKERS] CVE-2016-1238 fix breaks (at least) pg_rewind tests

2016-09-08 Thread Andres Freund
On 2016-09-08 18:13:06 -0300, Alvaro Herrera wrote:
> I suppose -I$(srcdir) should be fine.  (Why the quotes?)

Because quoting correctly seems like a good thing to do? Most people
won't have whitespace in there, but it doesn't seem impossible?


> > check-world appears to mostly run (still doing so, but it's mostly
> > through everything relevant).

Passed successfully since.


> > I can't vouch for the windows stuff, and
> > the invocations indeed look vulnerable. I'm not sure if the fix actually
> > matters on windows, given . is the default for pretty much everything
> > there.
> 
> Well, maybe it doesn't matter now but as I understand the fix is going
> to enter the next stable upstream perl, so it'll fail eventually.  It'd
> be saner to just fix the thing completely so that we can forget about
> it.

Yea, it'd need input from somebody on windows. Michael? What happens if
you put a line removing . from @INC (like upthread) in the msvc stuff?


Regards,

Andres




Re: [HACKERS] Is tuplesort_heap_siftup() a misnomer?

2016-09-08 Thread Peter Geoghegan
On Thu, Sep 8, 2016 at 2:09 PM, Tom Lane  wrote:
> Well, my vote is that it ain't broke and we shouldn't fix it.

To take a step back, what prompted this whole discussion is the patch
that I wrote that shifts down, replacing calls to
tuplesort_heap_siftup() and tuplesort_heap_insert with one new call to
a function I've called tuplesort_heap_root_displace() (today, Claudio
reports that it makes some of his test queries go 25% faster). This
new function shifts down. It's not clear what I'm supposed to say
about that, given the current understanding. So, in a sense, it's
blocking on this.


-- 
Peter Geoghegan




Re: [HACKERS] Preventing deadlock on parallel backup

2016-09-08 Thread Tom Lane
Lucas  writes:
> The queue jumping logic can not use the distributed transaction id?

If we had such a thing as a distributed transaction id, maybe the
answer could be yes.  We don't.

I did wonder whether using a shared snapshot might be a workable proxy
for that, but haven't pursued it.

regards, tom lane




Re: [HACKERS] Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

2016-09-08 Thread Alvaro Herrera
Tom Lane wrote:
> Hey Alvaro, can you comment on this bit in the proposed patch?
> 
> +	for (i = FirstOffsetNumber; i <= itemcount; i++)
> +	{
> +		ItemId		ii = PageGetItemId(phdr, i);
> +
> +		/* Normally here was following assertion
> +		 * Assert(ItemIdHasStorage(ii));
> +		 * This assertion was introduced in PageIndexTupleDelete
> +		 * But if this function will be used from BRIN index
> +		 * this assertion will fail. Thus, here we do not check
> +		 * that item has the storage.
> +		 */
> +		if (ItemIdGetOffset(ii) <= offset)
> +			ii->lp_off += size_diff;
> +	}
> + }
> 
> That comment seems utterly wrong to me, because both PageIndexTupleDelete
> and PageIndexMultiDelete certainly contain assertions that every item on
> the page has storage.  Are you expecting that any BRIN index items
> wouldn't?  If they wouldn't, is adjusting their lp_off as if they did
> have storage sensible?

It is possible in BRIN to have empty intermediate tuples; for example it
is possible for lp 1 and 3 to contain index tuples, while lp 2 does not.

This is a tricky case to reproduce, but I think this should do it:
consider a table with pages_per_range=1.  Page 1 is summarized by index
tuple 1, page 2 is summarized by itup 2, page 3 is summarized by index
tuple 3.  On heap page 2 a new tuple is inserted and the summary data is
now much larger, such that the summary tuple no longer fits in the index
page, so it is moved to a new index page.  Then index tuple 2 in the
first index page becomes unused.  Revmap still points to lp 3, so it
can't be moved to lp 2.  The lp_offs can be adjusted, as long as the lp
itself is not moved from its original position.

Now if this loop is concerned only with live lps and does not move lps,
then it should be fine to add the assertion.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Add support for restrictive RLS policies

2016-09-08 Thread Tom Lane
Stephen Frost  writes:
> * Alvaro Herrera (alvhe...@2ndquadrant.com) wrote:
>> Can't you keep those words as Sconst or something (DefElems?) until the
>> execution phase, so that they don't need to be keywords at all?

> Seems like we could do that, though I'm not convinced that it really
> gains us all that much.  These are only unreserved keywords, of course,
> so they don't impact users the way reserved keywords (of any kind) can.
> While there may be some places where we use a string to represent a set
> of defined options, I don't believe that's typical

-1 for having to write them as string literals; but I think what Alvaro
really means is to arrange for the words to just be identifiers in the
grammar, which you strcmp against at execution.  See for example
reloption_list.  (Whether you use DefElem as the internal representation
is a minor detail, though it might help for making the parsetree
copyObject-friendly.)

vacuum_option_elem shows another way to avoid making a word into a
keyword, although to me that one is more of an antipattern; it'd be better
to leave the strcmp to execution, since there's so much other code that
does things that way.

regards, tom lane




Re: [HACKERS] Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

2016-09-08 Thread Tom Lane
Hey Alvaro, can you comment on this bit in the proposed patch?

+	for (i = FirstOffsetNumber; i <= itemcount; i++)
+	{
+		ItemId		ii = PageGetItemId(phdr, i);
+
+		/* Normally here was following assertion
+		 * Assert(ItemIdHasStorage(ii));
+		 * This assertion was introduced in PageIndexTupleDelete
+		 * But if this function will be used from BRIN index
+		 * this assertion will fail. Thus, here we do not check
+		 * that item has the storage.
+		 */
+		if (ItemIdGetOffset(ii) <= offset)
+			ii->lp_off += size_diff;
+	}
+	}

That comment seems utterly wrong to me, because both PageIndexTupleDelete
and PageIndexMultiDelete certainly contain assertions that every item on
the page has storage.  Are you expecting that any BRIN index items
wouldn't?  If they wouldn't, is adjusting their lp_off as if they did
have storage sensible?

regards, tom lane




Re: [HACKERS] CVE-2016-1238 fix breaks (at least) pg_rewind tests

2016-09-08 Thread Alvaro Herrera
Andres Freund wrote:
> On 2016-09-08 17:58:03 -0300, Alvaro Herrera wrote:
> > Andres Freund wrote:
> > 
> > > ISTM that the easiest fix is to just tack  -I '$(srcdir)' into the prove
> > > flags like:
> > > PROVE = @PROVE@
> > > PG_PROVE_FLAGS = -I $(top_srcdir)/src/test/perl/ -I '$(srcdir)'
> > > PROVE_FLAGS = --verbose
> > > 
> > > I don't think there's any security concerns for us here.
> > 
> > Maybe not, but we could just as well use -I$(top_srcdir)/src/test/perl
> > and not have to think about it.
> 
> That doesn't fix the issue - RewindTest is in src/bin/pg_rewind for
> example. There's already an -I for /src/test/perl.

Doh, you're right.  And we have a .pm in src/test/ssl too, which I
assume you didn't catch only because the ssl test is not run by default.

I suppose -I$(srcdir) should be fine.  (Why the quotes?)

> > But we have other .pm's ... are there other things that would break once
> > the fix for that problem propagates?  I think the msvc stuff will break,
> > for one.
> 
> check-world appears to mostly run (still doing so, but it's mostly
> through everything relevant). I can't vouch for the windows stuff, and
> the invocations indeed look vulnerable. I'm not sure if the fix actually
> matters on windows, given . is the default for pretty much everything
> there.

Well, maybe it doesn't matter now but as I understand the fix is going
to enter the next stable upstream perl, so it'll fail eventually.  It'd
be saner to just fix the thing completely so that we can forget about
it.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Is tuplesort_heap_siftup() a misnomer?

2016-09-08 Thread Tom Lane
Peter Geoghegan  writes:
> On Thu, Sep 8, 2016 at 1:14 PM, Andres Freund  wrote:
>> Is this issue really worth keeping several hackers busy?

> I don't think it's fair to put it to me specifically that I'm doing
> that. That said, I would like to see this resolved without further
> bikeshedding.

Well, my vote is that it ain't broke and we shouldn't fix it.

regards, tom lane




Re: [HACKERS] CVE-2016-1238 fix breaks (at least) pg_rewind tests

2016-09-08 Thread Andres Freund
On 2016-09-08 17:58:03 -0300, Alvaro Herrera wrote:
> Andres Freund wrote:
> 
> > ISTM that the easiest fix is to just tack  -I '$(srcdir)' into the prove
> > flags like:
> > PROVE = @PROVE@
> > PG_PROVE_FLAGS = -I $(top_srcdir)/src/test/perl/ -I '$(srcdir)'
> > PROVE_FLAGS = --verbose
> > 
> > I don't think there's any security concerns for us here.
> 
> Maybe not, but we could just as well use -I$(top_srcdir)/src/test/perl
> and not have to think about it.

That doesn't fix the issue - RewindTest is in src/bin/pg_rewind for
example. There's already an -I for /src/test/perl.


> But we have other .pm's ... are there other things that would break once
> the fix for that problem propagates?  I think the msvc stuff will break,
> for one.

check-world appears to mostly run (still doing so, but it's mostly
through everything relevant). I can't vouch for the windows stuff, and
the invocations indeed look vulnerable. I'm not sure if the fix actually
matters on windows, given . is the default for pretty much everything
there.

Greetings,

Andres Freund




Re: [HACKERS] CVE-2016-1238 fix breaks (at least) pg_rewind tests

2016-09-08 Thread Alvaro Herrera
Andres Freund wrote:

> ISTM that the easiest fix is to just tack  -I '$(srcdir)' into the prove
> flags like:
> PROVE = @PROVE@
> PG_PROVE_FLAGS = -I $(top_srcdir)/src/test/perl/ -I '$(srcdir)'
> PROVE_FLAGS = --verbose
> 
> I don't think there's any security concerns for us here.

Maybe not, but we could just as well use -I$(top_srcdir)/src/test/perl
and not have to think about it.

But we have other .pm's ... are there other things that would break once
the fix for that problem propagates?  I think the msvc stuff will break,
for one.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




[HACKERS] CVE-2016-1238 fix breaks (at least) pg_rewind tests

2016-09-08 Thread Andres Freund
Hi,

On Debian unstable I just got a failure when running the regression
tests:
andres@alap4:~/build/postgres/dev-assert/vpath/src/bin/pg_rewind$ make check
rm -rf '/home/andres/build/postgres/dev-assert/vpath'/tmp_install
/bin/mkdir -p '/home/andres/build/postgres/dev-assert/vpath'/tmp_install/log
make -C '../../..' 
DESTDIR='/home/andres/build/postgres/dev-assert/vpath'/tmp_install install 
>'/home/andres/build/postgres/dev-assert/vpath'/tmp_install/log/install.log 2>&1
rm -rf 
/home/andres/build/postgres/dev-assert/vpath/src/bin/pg_rewind/tmp_check/log
cd /home/andres/src/postgresql/src/bin/pg_rewind && 
TESTDIR='/home/andres/build/postgres/dev-assert/vpath/src/bin/pg_rewind' 
PATH="/home/andres/build/postgres/dev-assert/vpath/tmp_install/home/andres/build/postgres/dev-assert//install/bin:$PATH"
 
LD_LIBRARY_PATH="/home/andres/build/postgres/dev-assert/vpath/tmp_install/home/andres/build/postgres/dev-assert//install/lib"
 PGPORT='65432' 
PG_REGRESS='/home/andres/build/postgres/dev-assert/vpath/src/bin/pg_rewind/../../../src/test/regress/pg_regress'
 prove -I /home/andres/src/postgresql/src/test/perl/ --verbose t/*.pl
t/001_basic.pl  
1..8
Can't locate RewindTest.pm in @INC (you may need to install the RewindTest 
module) (@INC contains: /home/andres/src/postgresql/src/test/perl /etc/perl 
/usr/local/lib/x86_64-linux-gnu/perl/5.22.2 /usr/local/share/perl/5.22.2 
/usr/lib/x86_64-linux-gnu/perl5/5.22 /usr/share/perl5 
/usr/lib/x86_64-linux-gnu/perl/5.22 /usr/share/perl/5.22 
/usr/local/lib/site_perl /usr/lib/x86_64-linux-gnu/perl-base) at t/001_basic.pl 
line 6.
BEGIN failed--compilation aborted at t/001_basic.pl line 6.
# Looks like your test exited with 2 before it could output anything.
Dubious, test returned 2 (wstat 512, 0x200)

Debian's perl changelog says:
perl (5.22.2-3) unstable; urgency=high

  * [SECURITY] CVE-2016-1238: opportunistic loading of optional
modules can make many programs unintentionally load code
from the current working directory (which might be changed to
another directory without the user realising).
+ allow user configurable removal of "." from @INC in
  /etc/perl/sitecustomize.pl for a transitional period. (See: #588017)
+ backport patches from [perl #127834] to fix known vulnerabilities
  even if the user does not configure "." to be removed from @INC
+ backport patches from [perl #127810] to fix various classes of
  build failures in perl and CPAN modules if "." is removed from
  @INC

and sitecustomize notes:

# This script is only provided as a transition mechanism for
# removing the current working directory from the library search path
# while leaving a temporary way to override this locally.
#
# If you really need "." to be on @INC globally, you can comment
# this away for now. However, please note that this facility
# is expected to be removed after the Debian stretch release,
# at which point any code in this file will not have any effect.
#
# Please see CVE-2016-1238 for background information on the risks
# of having "." on @INC.

pop @INC if $INC[-1] eq '.' and !$ENV{PERL_USE_UNSAFE_INC};

ISTM that the easiest fix is to just tack  -I '$(srcdir)' into the prove
flags like:
PROVE = @PROVE@
PG_PROVE_FLAGS = -I $(top_srcdir)/src/test/perl/ -I '$(srcdir)'
PROVE_FLAGS = --verbose

I don't think there's any security concerns for us here.

Greetings,

Andres Freund




[HACKERS] Why does PageIndexTupleDelete insist tuple size be maxaligned?

2016-09-08 Thread Tom Lane
While perusing the proposed PageIndexTupleOverwrite patch, I noticed
that PageIndexTupleDelete throws an error if size != MAXALIGN(size),
where "size" is the ItemIdGetLength value.  This seems wrong now that
I look at it, because PageAddItem does not insist on tuple sizes being
maxaligned.  It does maxalign the amount of space actually allocated,
but what it stores into the ItemId length field is the pre-alignment
tuple size.

Also, PageRepairFragmentation and PageIndexMultiDelete are on board
with having to maxalign ItemIdGetLength values.  The latter makes
things particularly problematic here, because depending on how many
tuples are to be deleted, we might or might not enforce the restriction
about size being maxaligned.

So I think this is wrong, or at least trouble waiting to happen.
It probably works because all index AMs using these functions maxalign
their claimed tuple sizes before calling any bufpage.c functions,
but since we don't do that for heap tuples, it seems wrong to insist
that index AMs do so.

Barring objection, I plan to modify PageIndexTupleDelete to not expect
maxalignment of the stored size.

regards, tom lane




Re: [HACKERS] Preventing deadlock on parallel backup

2016-09-08 Thread Lucas
I agree. It is an ugly hack.

But to me, the reduced window for failure is important. And that way a
failure will happen right away and can be reported to my operators as soon
as possible.

The queue jumping logic can not use the distributed transaction id?

In my logic, if a connection requests a shared lock that is already granted
to another connection in the same distributed transaction, it should be
granted right away... does that make sense?

Em 08/09/2016 4:15 PM, "Tom Lane"  escreveu:

> Lucas  writes:
> > I made a small modification in pg_dump to prevent parallel backup
> failures
> > due to exclusive lock requests made by other tasks.
>
> > The modification I made take shared locks for each parallel backup worker
> > at the very beginning of the job. That way, any other job that attempts
> to
> > acquire exclusive locks will wait for the backup to finish.
>
> I do not think this would eliminate the problem; all it's doing is making
> the window for trouble a bit narrower.  Also, it implies taking out many
> locks that would never be used, since no worker process will be touching
> all of the tables.
>
> I think a real solution involves teaching the backend to allow a worker
> process to acquire a lock as long as its master already has the same lock.
> There's already queue-jumping logic of that sort in the lock manager, but
> it doesn't fire because we don't see that there's a potential deadlock.
> What needs to be worked out, mostly, is how we can do that without
> creating security hazards (since the backend would have to accept a
> command enabling this behavior from the client).  Maybe it's good enough
> to insist that leader and follower be same user ID, or maybe not.
>
> There's some related problems in parallel query, which AFAIK we just have
> an ugly kluge solution for ATM.  It'd be better if there were a clear
> model of when to allow a parallel worker to get a lock out-of-turn.
>
> regards, tom lane
>


Re: [HACKERS] Is tuplesort_heap_siftup() a misnomer?

2016-09-08 Thread Peter Geoghegan
On Thu, Sep 8, 2016 at 1:14 PM, Andres Freund  wrote:
> Is this issue really worth keeping several hackers busy?

I don't think it's fair to put it to me specifically that I'm doing
that. That said, I would like to see this resolved without further
bikeshedding.


-- 
Peter Geoghegan




Re: [HACKERS] Is tuplesort_heap_siftup() a misnomer?

2016-09-08 Thread Andres Freund
Hi,

Is this issue really worth keeping several hackers busy?

Andres




Re: [HACKERS] Is tuplesort_heap_siftup() a misnomer?

2016-09-08 Thread Peter Geoghegan
On Thu, Sep 8, 2016 at 12:46 PM, Tom Lane  wrote:
>
>  /*
>   * The tuple at state->memtuples[0] has been removed from the heap.
> - * Decrement memtupcount, and sift up to maintain the heap invariant.
> + * Decrement memtupcount, and shift down to maintain the heap invariant.
>   */
>  static void
> -tuplesort_heap_siftup(Tuplesortstate *state, bool checkIndex)
> +tuplesort_heap_delete_top(Tuplesortstate *state, bool checkIndex)
>
> I don't find this clearer at all, because
>
> (1) As noted in the comment, the heap top has *already* been removed,
> we're just trying to re-establish the heap invariant.  So maybe
> "delete_top" isn't the best name after all.

This is why I suggested tuplesort_heap_compact(). When merging, there
is a comment associated with a call to the function, "compact the
heap". I based my name off of that, thinking that that was really no
different to any other caller.

> (2) "shift down" seems a bit weird, since we're moving data closer to
> the heap top, not further away from it; why isn't that direction "up"?

Well, the fact that the routine does that makes it a higher level
thing than something like a sift or a shift. It might just as easily
be some other tuple (e.g., from the caller), as far as "shifting down"
goes. (I wrote a patch where for merging, it *is* from the caller, and
the heap stays the same size, which Heikki mentioned on this thread.)

> (3) "shift" makes it sound like it ought to be a simple memmove
> kind of operation, which it surely is not.

Then, how about the Sedgewick terminology, "sink"? I'm really not
attached to that aspect. I do think that "shift down" is unambiguous,
but I'd just as soon use "sink", or even explicitly spell it out.

-- 
Peter Geoghegan




Re: [HACKERS] sequence data type

2016-09-08 Thread Peter Eisentraut
On 9/3/16 2:41 PM, Vik Fearing wrote:
> On 08/31/2016 06:22 AM, Peter Eisentraut wrote:
>> Here is a patch that adds the notion of a data type to a sequence.  So
>> it might be CREATE SEQUENCE foo AS integer.  The types are restricted to
>> int{2,4,8} as now.
> 
> This patch does not apply cleanly to current master (=600dc4c).

Updated patch attached.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 693bba6619d6e3283c6f7973f932044df2b4f695 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Thu, 8 Sep 2016 12:00:00 -0400
Subject: [PATCH v2] Add CREATE SEQUENCE AS  clause

This stores a data type, required to be an integer type, with the
sequence.  The sequence's min and max values default to the range
supported by the type, and they cannot be set to values exceeding that
range.  The internal implementation of the sequence is not affected.

Change the serial types to create sequences of the appropriate type.
This makes sure that the min and max values of the sequence for a serial
column match the range of values supported by the table column.  So the
sequence can no longer overflow the table column.

This also makes monitoring for sequence exhaustion/wraparound easier,
which currently requires various contortions to cross-reference the
sequences with the table columns they are used with.
---
 doc/src/sgml/information_schema.sgml  |  4 +-
 doc/src/sgml/ref/create_sequence.sgml | 37 +++
 src/backend/catalog/information_schema.sql|  4 +-
 src/backend/commands/sequence.c   | 92 --
 src/backend/parser/gram.y |  6 +-
 src/backend/parser/parse_utilcmd.c|  2 +-
 src/bin/pg_dump/pg_dump.c | 94 +--
 src/bin/pg_dump/t/002_pg_dump.pl  |  2 +
 src/include/catalog/pg_proc.h |  2 +-
 src/include/commands/sequence.h   |  6 +-
 src/include/pg_config_manual.h|  6 --
 src/test/modules/test_pg_dump/t/001_base.pl   |  1 +
 src/test/regress/expected/sequence.out| 45 +
 src/test/regress/expected/sequence_1.out  | 45 +
 src/test/regress/expected/updatable_views.out |  3 +-
 src/test/regress/sql/sequence.sql | 20 +-
 16 files changed, 269 insertions(+), 100 deletions(-)

diff --git a/doc/src/sgml/information_schema.sgml b/doc/src/sgml/information_schema.sgml
index c43e325..a3a19ce 100644
--- a/doc/src/sgml/information_schema.sgml
+++ b/doc/src/sgml/information_schema.sgml
@@ -4653,9 +4653,7 @@ sequences Columns
   data_type
   character_data
   
-   The data type of the sequence.  In
-   PostgreSQL, this is currently always
-   bigint.
+   The data type of the sequence.
   
  
 
diff --git a/doc/src/sgml/ref/create_sequence.sgml b/doc/src/sgml/ref/create_sequence.sgml
index 62ae379..f31b595 100644
--- a/doc/src/sgml/ref/create_sequence.sgml
+++ b/doc/src/sgml/ref/create_sequence.sgml
@@ -21,7 +21,9 @@
 
  
 
-CREATE [ TEMPORARY | TEMP ] SEQUENCE [ IF NOT EXISTS ] name [ INCREMENT [ BY ] increment ]
+CREATE [ TEMPORARY | TEMP ] SEQUENCE [ IF NOT EXISTS ] name
+[ AS data_type ]
+[ INCREMENT [ BY ] increment ]
 [ MINVALUE minvalue | NO MINVALUE ] [ MAXVALUE maxvalue | NO MAXVALUE ]
 [ START [ WITH ] start ] [ CACHE cache ] [ [ NO ] CYCLE ]
 [ OWNED BY { table_name.column_name | NONE } ]
@@ -111,6 +113,21 @@ Parameters

 

+data_type
+
+ 
+  The optional
+  clause AS data_type
+  specifies the data type of the sequence.  Valid types are
+  smallint, integer,
+  and bigint.  bigint is the
+  default.  The data type determines the default minimum and maximum
+  values of the sequence.
+ 
+
+   
+
+   
 increment
 
  
@@ -132,9 +149,9 @@ Parameters
   class="parameter">minvalue determines
   the minimum value a sequence can generate. If this clause is not
   supplied or NO MINVALUE is specified, then
-  defaults will be used.  The defaults are 1 and
-  -263-1 for ascending and descending sequences,
-  respectively.
+  defaults will be used.  The default for an ascending sequence is 1.  The
+  default for a descending sequence is the minimum value of the data type
+  plus 1.
  
 

@@ -148,9 +165,9 @@ Parameters
   class="parameter">maxvalue determines
   the maximum value for the sequence. If this clause is not
   supplied or NO MAXVALUE is specified, then
-  default values will be used.  The defaults are
-  263-1 and -1 for ascending and descending
-  sequences, respectively.
+  default values will be used.  The default for an ascending sequence is
+  the maximum value of the data type.  The default for a descending
+  sequence is -1.
  
 

@@ -349,12 +366,6 @@ 

Re: [HACKERS] Add support for restrictive RLS policies

2016-09-08 Thread Stephen Frost
Alvaro,

* Alvaro Herrera (alvhe...@2ndquadrant.com) wrote:
> Stephen Frost wrote:
> > * Stephen Frost (sfr...@snowman.net) wrote:
> > > Based on Robert's suggestion and using Thom's verbiage, I've tested this
> > > out:
> > > 
> > > CREATE POLICY pol ON tab AS [PERMISSIVE|RESTRICTIVE] ...
> 
> Can't you keep those words as Sconst or something (DefElems?) until the
> execution phase, so that they don't need to be keywords at all?  I'm
> fairly sure we do that kind of thing elsewhere.  Besides, that let you
> throw errors such as "keyword 'foobarive' not recognized" instead of a
> generic "syntax error" if the user enters a bogus permissivity (?)
> keyword.

Seems like we could do that, though I'm not convinced that it really
gains us all that much.  These are only unreserved keywords, of course,
so they don't impact users the way reserved keywords (of any kind) can.
While there may be some places where we use a string to represent a set
of defined options, I don't believe that's typical- certainly something
like DISCARD has a set of explicit values, same for CASCADE vs.
RESTRICT, replica_identity, TableLikeOption, etc..

We do have a few 'not recognized' messages in the backend, though
they're usually 'option %s not recognized' (there aren't any which use
'keyword') and they're in places where we support a list of options to
be specified (which also means they require additional code to check for
conflicting/redundant options).  We could possibly rearrange the entire
CREATE POLICY command to use a list of options instead, along the lines
of what we do for views:

CREATE POLICY p1 ON t1
WITH (command=select,combine_using=AND)
USING ...;

but that hardly seems better.

Perhaps I'm not understanding what you're getting at though- is there
something in the existing grammar, in particular, that you're comparing
this to?

> Is the permissive/restrictive dichotomy enough to support all
> interesting use cases?  What I think is the equivalent concept in PAM
> uses required/requisite/sufficient/optional as possibilities, which
> allows for finer grained control.  Even there that's just the historical
> interface, and the replacement syntax has more gadgets.

Restrictive vs. Permissive very simply maps to the logical AND and OR
operators.  Those are the only binary logical operators we have and it
seems unlikely that we're going to get any more anytime soon.

I don't believe the comparison to PAM is really fair, as PAM is trying
to support the flexibility we already have by allowing users to specify
an expression in the policy itself.

Perhaps we may wish to come up with a more complex approach for how to
combine policies, but in that case, I'd think we'd do something like:

CREATE POLICY p1 ON t1 COMBINING ((policy1 AND policy2) OR policy3);

though I've not yet come across a case that requires something more
complicated than what we can do already with policies and the
restrictive / permissive options (note that the case above can be
handled that way, in fact, by making policy1 and policy2 restrictive and
policy3 permissive).  Perhaps that's because that more complicated logic
is generally understood and expected to be part of the policy expression
itself instead.

Also, as mentioned at the start of this thread, the capability for
restrictive policies has existed since 9.5, this change is simply
exposing that to users without having to require using an extension.

Thanks!

Stephen




Re: [HACKERS] Stopping logical replication protocol

2016-09-08 Thread Vladimir Gordiychuk
>It'd helpful if you summarize the changes made when posting revisions.

Can we use the first message as the summary of changes? If not, the
summary could be something like this:

These patches fix scenarios where logical replication is interrupted
from the client side, and allow the client to end a logical decoding
session between transactions without just unilaterally closing its
connection. The client sends a CopyDone message and waits for
replication to stop as quickly as possible, but in two cases the
walsender stops replication too slowly. The first is interruption while
waiting for new WAL, with replication in the streaming state: the
walsender ignores messages from the client until a new WAL record is
generated. The second is interruption while decoding a long transaction
that contains many changes.

>Yeah, testing it isn't trivial in postgres core, since of course none of
the tools know to send a client-initiated CopyDone.

>My suggestion is to teach pg_recvlogical to send CopyDone on the first
SIGINT (control-C) and wait until the server ends the data-stream and
returns to command mode. A second control-C should unilaterally close the
connection and it should close after a timeout too. This >will be backward
compatible with 9.4/5/6 (just with a delay on exit by sigint).

>Then in a TAP test in src/test/recovery set up a logical decoding session
with test_decoding and pg_recvlogical. Do a test xact then sigint
pg_recvlogical and verify that it exits. To make sure it exited by mutual
agreement with server, probably run pg_recvlogival at a higher debug level
and text search for a message you print from pg_recvlogical when it gets
server CopyDone in the response to client CopyDone. I don't think a
different pg_recvlogical numeric exit code could be justified for this.

>It sounds like more work than I think it would actually be.

Sounds good. Should those changes be included in this patch? I'm not
sure I can finish this before this commitfest completes; I need some
time to understand how these tests work and how to write a new one.


2016-09-08 12:29 GMT+03:00 Craig Ringer :

> On 7 September 2016 at 15:44, Vladimir Gordiychuk 
> wrote:
>
> > Fixed patch in attach.
>
> It'd helpful if you summarize the changes made when posting revisions.
>
> >> * There are no tests. I don't see any really simple way to test this,
> >> though.
> >
> > I will be grateful if you specify the best way how to do it. I tests this
> > patches by pgjdbc driver tests that also build on head of postgres. You
> can
> > check test scenario for both patches in
> > PRhttps://github.com/pgjdbc/pgjdbc/pull/550
> >
> > org.postgresql.replication.LogicalReplicationTest#
> testDuringSendBigTransactionReplicationStreamCloseNotActive
> > org.postgresql.replication.LogicalReplicationTest#
> testAfterCloseReplicationStreamDBSlotStatusNotActive
>
> Yeah, testing it isn't trivial in postgres core, since of course none of
> the tools know to send a client-initiated CopyDone.
>
> My suggestion is to teach pg_recvlogical to send CopyDone on the first
> SIGINT (control-C) and wait until the server ends the data-stream and
> returns to command mode. A second control-C should unilaterally close the
> connection and it should close after a timeout too. This will be backward
> compatible with 9.4/5/6 (just with a delay on exit by sigint).
>
> Then in a TAP test in src/test/recovery set up a logical decoding session
> with test_decoding and pg_recvlogical. Do a test xact then sigint
> pg_recvlogical and verify that it exits. To make sure it exited by mutual
> agreement with server, probably run pg_recvlogival at a higher debug level
> and text search for a message you print from pg_recvlogical when it gets
> server CopyDone in the response to client CopyDone. I don't think a
> different pg_recvlogical numeric exit code could be justified for this.
>
> It sounds like more work than I think it would actually be.
>
> --
> Craig Ringer http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
>


Re: [HACKERS] Is tuplesort_heap_siftup() a misnomer?

2016-09-08 Thread Tom Lane
Peter Geoghegan  writes:
> Attached patch does it that way, then. I stuck with the reference to
> "shift down", though, since I think we all agree that that is
> unambiguous.

I dunno.  What you've now got is

 /*
  * The tuple at state->memtuples[0] has been removed from the heap.
- * Decrement memtupcount, and sift up to maintain the heap invariant.
+ * Decrement memtupcount, and shift down to maintain the heap invariant.
  */
 static void
-tuplesort_heap_siftup(Tuplesortstate *state, bool checkIndex)
+tuplesort_heap_delete_top(Tuplesortstate *state, bool checkIndex)

I don't find this clearer at all, because

(1) As noted in the comment, the heap top has *already* been removed,
we're just trying to re-establish the heap invariant.  So maybe
"delete_top" isn't the best name after all.

(2) "shift down" seems a bit weird, since we're moving data closer to
the heap top, not further away from it; why isn't that direction "up"?

(3) "shift" makes it sound like it ought to be a simple memmove
kind of operation, which it surely is not.

regards, tom lane




Re: [HACKERS] Let file_fdw access COPY FROM PROGRAM

2016-09-08 Thread Corey Huinker
On Tue, Sep 6, 2016 at 11:44 PM, Craig Ringer 
wrote:

> On 7 September 2016 at 11:37, Corey Huinker 
> wrote:
> > On Tue, Sep 6, 2016 at 11:24 PM, Craig Ringer <
> craig.rin...@2ndquadrant.com>
> > wrote:
> >>
> >> On 7 September 2016 at 11:21, Corey Huinker 
> >> wrote:
> >> > On Tue, Sep 6, 2016 at 6:53 PM, Craig Ringer
> >> > 
> >>
> >> > And the TAP test would detect the operating system and know to create
> an
> >> > FDW
> >> > that has the PROGRAM value 'cat test_data.csv' on Unix, 'type
> >> > test_data.csv'
> >> > on windows, and 'type test_data.csv;1' on VMS?
> >>
> >> Right. Or just "perl emit_test_data.pl" that works for all of them,
> >> since TAP is perl so you can safely assume you have Perl.
> >
> >
> > Thanks. I was mentally locked into more basic OS commands. Am I right in
> > thinking perl is about the *only* OS command you can be sure is on every
> > architecture?
>
> Probably, there's a lot of crazy out there.
>
> TAP tests can be conditionally run based on architecture, but
> something like this is probably worth testing as widely as possible.
>
> --
>  Craig Ringer   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


Stylistically, would a separate .pl file for the emitter be preferable to
something inline like

perl -e 'print "a\tb\tcc\t4\n"; print "b\tc\tdd\t5\n"'


?

I'm inclined to go inline to cut down on the number of moving parts, but I
can see where perl's readability is a barrier to some, and either way I
want to follow established patterns.


[*] For those who don't perl, the command prints:

a   b   cc  4
b   c   dd  5


Re: [HACKERS] Is tuplesort_heap_siftup() a misnomer?

2016-09-08 Thread Peter Geoghegan
Sift means shift up. There is no such thing as sift down, though, only
shift down. That is my understanding, based on the Wikipedia article on
heaps.

--
Peter Geoghegan


Re: [HACKERS] Is tuplesort_heap_siftup() a misnomer?

2016-09-08 Thread Claudio Freire
On Thu, Sep 8, 2016 at 4:20 PM, Peter Geoghegan  wrote:
> On Thu, Sep 8, 2016 at 10:40 AM, Tom Lane  wrote:
>>> On Thu, Sep 8, 2016 at 12:01 AM, Heikki Linnakangas  wrote:
 I still think tuplesort_heap_siftup is a confusing name, although I'm not
 sure that Peter's "compact" is much better. I suggest that we rename it to
 tuplesort_heap_delete_top(). In comments within the function, explain that
 the *loop* corresponds to the "siftup" in Knuth's book.
>>
>>> I'm also fine with that name.
>>
>> I can live with it too.
>
> Attached patch does it that way, then. I stuck with the reference to
> "shift down", though, since I think we all agree that that is
> unambiguous.

I believe the term is "sift" not "shift"




Re: [HACKERS] Add support for restrictive RLS policies

2016-09-08 Thread Alvaro Herrera
Stephen Frost wrote:
> Greetings!
> 
> * Stephen Frost (sfr...@snowman.net) wrote:
> > Based on Robert's suggestion and using Thom's verbiage, I've tested this
> > out:
> > 
> > CREATE POLICY pol ON tab AS [PERMISSIVE|RESTRICTIVE] ...

Can't you keep those words as Sconst or something (DefElems?) until the
execution phase, so that they don't need to be keywords at all?  I'm
fairly sure we do that kind of thing elsewhere.  Besides, that let you
throw errors such as "keyword 'foobarive' not recognized" instead of a
generic "syntax error" if the user enters a bogus permissivity (?)
keyword.

Is the permissive/restrictive dichotomy enough to support all
interesting use cases?  What I think is the equivalent concept in PAM
uses required/requisite/sufficient/optional as possibilities, which
allows for finer grained control.  Even there that's just the historical
interface, and the replacement syntax has more gadgets.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

2016-09-08 Thread Alvaro Herrera
Tom Lane wrote:
> Andrew Borodin  writes:
> > Thank you for your corrections.
> > Here is the patch with suggestions taken into account, except 6th.
> 
> I'll pick this up, unless some other committer is already on it.
> 
> I think that we should buy back the addition of PageIndexTupleOverwrite
> to bufpage.c by getting rid of PageIndexDeleteNoCompact, as suggested
> earlier by Alvaro.  The latter seems a lot messier in concept, and it's
> more general than BRIN needs.

Yeah, BRIN only wants to remove one item nowadays; that function was
written when the BRIN code wanted to remove multiple items in one go.

> It also kinda looks like we could get rid of PAI_ALLOW_FAR_OFFSET,
> which is a messy kluge in itself.

That'd be nice.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Is tuplesort_heap_siftup() a misnomer?

2016-09-08 Thread Peter Geoghegan
On Thu, Sep 8, 2016 at 10:40 AM, Tom Lane  wrote:
>> On Thu, Sep 8, 2016 at 12:01 AM, Heikki Linnakangas  wrote:
>>> I still think tuplesort_heap_siftup is a confusing name, although I'm not
>>> sure that Peter's "compact" is much better. I suggest that we rename it to
>>> tuplesort_heap_delete_top(). In comments within the function, explain that
>>> the *loop* corresponds to the "siftup" in Knuth's book.
>
>> I'm also fine with that name.
>
> I can live with it too.

Attached patch does it that way, then. I stuck with the reference to
"shift down", though, since I think we all agree that that is
unambiguous.

-- 
Peter Geoghegan
From e9af6dbd2bb80b83efa50f748cbe8be0882d1f9c Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Wed, 7 Sep 2016 13:51:38 -0700
Subject: [PATCH] Rename tuplesort_heap_siftup function

tuplesort_heap_siftup is renamed to tuplesort_heap_delete_top.  The new
name is chosen to describe the function at a slightly higher level,
since the underlying heap operation is just an implementation detail.
---
 src/backend/utils/sort/tuplesort.c | 27 +--
 1 file changed, 13 insertions(+), 14 deletions(-)

diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index aa8e0e4..92e3db4 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -570,7 +570,7 @@ static void sort_bounded_heap(Tuplesortstate *state);
 static void tuplesort_sort_memtuples(Tuplesortstate *state);
 static void tuplesort_heap_insert(Tuplesortstate *state, SortTuple *tuple,
 	  int tupleindex, bool checkIndex);
-static void tuplesort_heap_siftup(Tuplesortstate *state, bool checkIndex);
+static void tuplesort_heap_delete_top(Tuplesortstate *state, bool checkIndex);
 static void reversedirection(Tuplesortstate *state);
 static unsigned int getlen(Tuplesortstate *state, int tapenum, bool eofOK);
 static void markrunend(Tuplesortstate *state, int tapenum);
@@ -1617,9 +1617,9 @@ puttuple_common(Tuplesortstate *state, SortTuple *tuple)
 			}
 			else
 			{
-/* discard top of heap, sift up, insert new tuple */
+/* discard top of heap, and insert new tuple */
 free_sort_tuple(state, &state->memtuples[0]);
-tuplesort_heap_siftup(state, false);
+tuplesort_heap_delete_top(state, false);
 tuplesort_heap_insert(state, tuple, 0, false);
 			}
 			break;
@@ -1987,7 +1987,7 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
  * more generally.
  */
 *stup = state->memtuples[0];
-tuplesort_heap_siftup(state, false);
+tuplesort_heap_delete_top(state, false);
 if ((tupIndex = state->mergenext[srcTape]) == 0)
 {
 	/*
@@ -2643,7 +2643,7 @@ mergeonerun(Tuplesortstate *state)
 		spaceFreed = state->availMem - priorAvail;
 		state->mergeavailmem[srcTape] += spaceFreed;
 		/* compact the heap */
-		tuplesort_heap_siftup(state, false);
+		tuplesort_heap_delete_top(state, false);
 		if ((tupIndex = state->mergenext[srcTape]) == 0)
 		{
 			/* out of preloaded data on this tape, try to read more */
@@ -3275,13 +3275,12 @@ dumptuples(Tuplesortstate *state, bool alltuples)
 			 * Still holding out for a case favorable to replacement
 			 * selection. Still incrementally spilling using heap.
 			 *
-			 * Dump the heap's frontmost entry, and sift up to remove it from
-			 * the heap.
+			 * Dump the heap's top entry, and delete it from the heap.
 			 */
 			Assert(state->memtupcount > 0);
 			WRITETUP(state, state->tp_tapenum[state->destTape],
	 &state->memtuples[0]);
-			tuplesort_heap_siftup(state, true);
+			tuplesort_heap_delete_top(state, true);
 		}
 		else
 		{
@@ -3663,7 +3662,7 @@ make_bounded_heap(Tuplesortstate *state)
 			if (state->memtupcount > state->bound)
 			{
 free_sort_tuple(state, &state->memtuples[0]);
-tuplesort_heap_siftup(state, false);
+tuplesort_heap_delete_top(state, false);
 			}
 		}
 	}
@@ -3693,8 +3692,8 @@ sort_bounded_heap(Tuplesortstate *state)
 	{
 		SortTuple	stup = state->memtuples[0];
 
-		/* this sifts-up the next-largest entry and decreases memtupcount */
-		tuplesort_heap_siftup(state, false);
+		/* this puts next-largest entry at top, and decreases memtupcount */
+		tuplesort_heap_delete_top(state, false);
 		state->memtuples[state->memtupcount] = stup;
 	}
 	state->memtupcount = tupcount;
@@ -3784,10 +3783,10 @@ tuplesort_heap_insert(Tuplesortstate *state, SortTuple *tuple,
 
 /*
  * The tuple at state->memtuples[0] has been removed from the heap.
- * Decrement memtupcount, and sift up to maintain the heap invariant.
+ * Decrement memtupcount, and shift down to maintain the heap invariant.
  */
 static void
-tuplesort_heap_siftup(Tuplesortstate *state, bool checkIndex)
+tuplesort_heap_delete_top(Tuplesortstate *state, bool checkIndex)
 {
 	SortTuple  *memtuples = state->memtuples;
 	SortTuple  *tuple;
@@ -3801,7 +3800,7 @@ tuplesort_heap_siftup(Tuplesortstate *state, bool checkIndex)
 	

Re: [HACKERS] Preventing deadlock on parallel backup

2016-09-08 Thread Tom Lane
Lucas  writes:
> I made a small modification in pg_dump to prevent parallel backup failures
> due to exclusive lock requests made by other tasks.

> The modification I made take shared locks for each parallel backup worker
> at the very beginning of the job. That way, any other job that attempts to
> acquire exclusive locks will wait for the backup to finish.

I do not think this would eliminate the problem; all it's doing is making
the window for trouble a bit narrower.  Also, it implies taking out many
locks that would never be used, since no worker process will be touching
all of the tables.

I think a real solution involves teaching the backend to allow a worker
process to acquire a lock as long as its master already has the same lock.
There's already queue-jumping logic of that sort in the lock manager, but
it doesn't fire because we don't see that there's a potential deadlock.
What needs to be worked out, mostly, is how we can do that without
creating security hazards (since the backend would have to accept a
command enabling this behavior from the client).  Maybe it's good enough
to insist that leader and follower be same user ID, or maybe not.

There's some related problems in parallel query, which AFAIK we just have
an ugly kluge solution for ATM.  It'd be better if there were a clear
model of when to allow a parallel worker to get a lock out-of-turn.

regards, tom lane




[HACKERS] COPY command with RLS bug

2016-09-08 Thread Adam Brightwell
All,

I have discovered a bug with the COPY command, specifically related to RLS.

The issue:

When running COPY as superuser on a table that has RLS enabled, RLS is
bypassed and therefore no issue exists.  However, when performing a
COPY as a non-privileged user on the same table causes issues when
more than one column is specified as part of the command.

Assuming:

CREATE TABLE foo (a int, b int, c int);

... set up RLS

Connecting as a non-privileged user provides the following results:

COPY foo TO stdout; -- pass
COPY foo (a) TO stdout; -- pass
COPY foo (a, b, c) TO stdout; -- fail

The error related to the failure is the following:

ERROR:  missing FROM-clause entry for table "b"
LINE 1: copy foo (a, b, c) to stdout;

I don't believe this to be a vulnerability simply because it doesn't
'leak' any data to a non-privileged user, it will just throw an error.
As well, this is only an issue when more than one column is specified
and '*' isn't implied. I have attached a 'test' file that can be used
to observe this behavior.

I have verified that this is an issue on REL9_5_STABLE, REL9_6_STABLE
and master.

Solution:

The issue seems to be that the target list for the resulting SELECT
statement is not being built correctly. I have attached a proposed
patch to fix this issue.  As well, I have added a few regression tests
for this case.

If deemed appropriate, then I'll add this to the currently open Commitfest.

Please let me know if there are any questions.

Thanks,
Adam


test-copy-rls.sql
Description: application/sql
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 432b0ca..a3777d9 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -871,6 +871,7 @@ DoCopy(ParseState *pstate, const CopyStmt *stmt, uint64 *processed)
 			ColumnRef  *cr;
 			ResTarget  *target;
 			RangeVar   *from;
+			List	   *targetList = NIL;
 
 			if (is_from)
 ereport(ERROR,
@@ -878,21 +879,62 @@ DoCopy(ParseState *pstate, const CopyStmt *stmt, uint64 *processed)
    errmsg("COPY FROM not supported with row-level security"),
 		 errhint("Use INSERT statements instead.")));
 
-			/* Build target list */
-			cr = makeNode(ColumnRef);
-
+			/*
+			 * Build target list
+			 *
+			 * If no columns are specified in the attribute list of the COPY
+			 * command, then the target list is 'all' columns. Therefore, '*'
+			 * should be used as the target list for the resulting SELECT
+			 * statement.
+			 *
+			 * In the case that columns are specified in the attribute list,
+			 * create a ColumnRef and ResTarget for each column and add them to
+			 * the target list for the resulting SELECT statement.
+			 */
 			if (!stmt->attlist)
+			{
+cr = makeNode(ColumnRef);
 cr->fields = list_make1(makeNode(A_Star));
+cr->location = 1;
+
+target = makeNode(ResTarget);
+target->name = NULL;
+target->indirection = NIL;
+target->val = (Node *) cr;
+target->location = 1;
+
+targetList = list_make1(target);
+			}
 			else
-cr->fields = stmt->attlist;
+			{
+ListCell   *lc;
+int			location = 1;
+
+foreach(lc, stmt->attlist)
+{
+	/*
+	 * Build the ColumnRef for each column.  The ColumnRef
+	 * 'fields' property is a String 'Value' node (see
+	 * nodes/value.h) that correspond to the column name
+	 * respectively.
+	 */
+	cr = makeNode(ColumnRef);
+	cr->fields = list_make1(lfirst(lc));
+	cr->location = location;
+
+	/* Build the ResTarget and add the ColumnRef to it. */
+	target = makeNode(ResTarget);
+	target->name = NULL;
+	target->indirection = NIL;
+	target->val = (Node *) cr;
+	target->location = location;
 
-			cr->location = 1;
+	/* Add each column to the SELECT statements target list */
+	targetList = lappend(targetList, target);
 
-			target = makeNode(ResTarget);
-			target->name = NULL;
-			target->indirection = NIL;
-			target->val = (Node *) cr;
-			target->location = 1;
+	location += 1;
+}
+			}
 
 			/*
 			 * Build RangeVar for from clause, fully qualified based on the
@@ -903,7 +945,7 @@ DoCopy(ParseState *pstate, const CopyStmt *stmt, uint64 *processed)
 
 			/* Build query */
 			select = makeNode(SelectStmt);
-			select->targetList = list_make1(target);
+			select->targetList = targetList;
 			select->fromClause = list_make1(from);
 
 			query = (Node *) select;
diff --git a/src/test/regress/expected/copy2.out b/src/test/regress/expected/copy2.out
index 5f6260a..13251c6 100644
--- a/src/test/regress/expected/copy2.out
+++ b/src/test/regress/expected/copy2.out
@@ -460,9 +460,57 @@ select * from check_con_tbl;

 (2 rows)
 
+-- test with RLS enabled.
+CREATE USER regress_rls_copy_user;
+CREATE TABLE rls_t1 (a int, b int, c int);
+COPY rls_t1 (a, b, c) from stdin;
+CREATE POLICY p1 ON rls_t1 FOR SELECT USING (a % 2 = 0);
+ALTER TABLE rls_t1 ENABLE ROW LEVEL SECURITY;
+ALTER TABLE rls_t1 FORCE ROW LEVEL SECURITY;
+GRANT SELECT ON TABLE rls_t1 

Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-08 Thread Jeff Janes
On Wed, Sep 7, 2016 at 3:29 AM, Ashutosh Sharma 
wrote:

> > Thanks to Ashutosh Sharma for doing the testing of the patch and
> > helping me in analyzing some of the above issues.
>
> Hi All,
>
> I would like to summarize the test-cases that i have executed for
> validating WAL logging in hash index feature.
>
> 1) I have mainly ran the pgbench test with read-write workload at the
> scale factor of 1000 and various client counts like 16, 64 and 128 for
> time duration of 30 mins, 1 hr and 24 hrs. I have executed this test
> on highly configured power2 machine with 128 cores and 512GB of RAM. I
> ran the test-case both with and without the replication setup.
>
> Please note that i have changed the schema of pgbench tables created
> during initialisation phase.
>
> The new schema of pgbench tables looks as shown below on both master
> and standby:
>
> postgres=# \d pgbench_accounts
>Table "public.pgbench_accounts"
>   Column  | Type  | Modifiers
> --+---+---
>  aid  | integer   | not null
>  bid  | integer   |
>  abalance | integer   |
>  filler   | character(84) |
> Indexes:
> "pgbench_accounts_pkey" PRIMARY KEY, btree (aid)
> "pgbench_accounts_bid" hash (bid)
>
> postgres=# \d pgbench_history
>   Table "public.pgbench_history"
>  Column |Type | Modifiers
> +-+---
>  tid| integer |
>  bid| integer |
>  aid| integer |
>  delta  | integer |
>  mtime  | timestamp without time zone |
>  filler | character(22)   |
> Indexes:
> "pgbench_history_bid" hash (bid)
>
>
Hi Ashutosh,

This schema will test the maintenance of hash indexes, but it will never
use hash indexes for searching, so it limits the amount of test coverage
you will get.  While searching shouldn't generate novel types of WAL
records (that I know of), it will generate locking and timing issues that
might uncover bugs (if there are any left to uncover, of course).

I would drop the primary key on pgbench_accounts and replace it with a hash
index and test it that way (except I don't have a 128 core machine at my
disposal, so really I am suggesting that you do this...)

The lack of primary key and the non-uniqueness of the hash index should not
be an operational problem, because the built in pgbench runs never attempt
to violate the constraints anyway.

In fact, I'd replace all of the indexes on the rest of the pgbench tables
with hash indexes, too, just for additional testing.

I plan to do testing using my own testing harness after changing it to
insert a lot of dummy tuples (ones with negative values in the pseudo-pk
column, which are never queried by the core part of the harness) and
deleting them at random intervals.  I think that none of pgbench's built in
tests are likely to give the bucket splitting and squeezing code very much
exercise.

Is there a way to gather statistics on how many of each type of WAL record
are actually getting sent over the replication link?  The only way I can
think of is to turn on WAL archiving as well as replication, then use
pg_xlogdump to gather the stats.

I've run my original test for a while now and have not seen any problems.
But I realized I forgot to compile with --enable-cassert, so I will have to
redo it to make sure the assertion failures have been fixed.  In my
original testing I did very rarely get a deadlock (or some kind of hang),
and I haven't seen that again so far.  It probably had the same source as
the one Mark observed, and so the same fix.

Cheers,

Jeff


Re: [HACKERS] Tuplesort merge pre-reading

2016-09-08 Thread Heikki Linnakangas

On 09/06/2016 10:26 PM, Peter Geoghegan wrote:

On Tue, Sep 6, 2016 at 12:08 PM, Peter Geoghegan  wrote:

Offhand, I would think that taken together this is very important. I'd
certainly want to see cases in the hundreds of megabytes or gigabytes
of work_mem alongside your 4MB case, even just to be able to talk
informally about this. As you know, the default work_mem value is very
conservative.


I spent some more time polishing this up, and also added some code to 
logtape.c, to use larger read buffers, to compensate for the fact that 
we don't do pre-reading from tuplesort.c anymore. That should trigger 
the OS read-ahead, and make the I/O more sequential, which was the 
purpose of the old pre-reading code. But simpler. I haven't tested that 
part much yet, but I plan to run some tests on larger data sets that 
don't fit in RAM, to make the I/O effects visible.


I wrote a little testing toolkit, see third patch. I'm not proposing to 
commit that, but that's what I used for testing. It creates four tables, 
about 1GB in size each (it also creates smaller and larger tables, but I 
used the "medium" sized ones for these tests). Two of the tables contain 
integers, and two contain text strings. Two of the tables are completely 
ordered, two are in random order. To measure, it runs ORDER BY queries 
on the tables, with different work_mem settings.


Attached are the full results. In summary, these patches improve 
performance in some of the tests, and are a wash on others. The patches 
help in particular in the randomly ordered cases, with up to 512 MB of 
work_mem.


For example, with work_mem=256MB, which is enough to get a single merge 
pass:


with patches:

ordered_ints: 7078 ms, 6910 ms, 6849 ms
random_ints: 15639 ms, 15575 ms, 15625 ms
ordered_text: 11121 ms, 12318 ms, 10824 ms
random_text: 53462 ms, 53420 ms, 52949 ms

unpatched master:

ordered_ints: 6961 ms, 7040 ms, 7044 ms
random_ints: 19205 ms, 18797 ms, 18955 ms
ordered_text: 11045 ms, 11377 ms, 11203 ms
random_text: 57117 ms, 54489 ms, 54806 ms

(The same queries were run three times in a row; that's what the three 
numbers on each row mean. I.e. the differences between the numbers on the 
same row are noise.)



It looks like your benchmark relies on multiple passes, which can be
misleading. I bet it suffers some amount of problems from palloc()
fragmentation. When very memory constrained, that can get really bad.

Non-final merge passes (merges with more than one run -- real or dummy
-- on any given tape) can have uneven numbers of runs on each tape.
So, tuplesort.c needs to be prepared to dole out memory among tapes
*unevenly* there (same applies to memtuples "slots"). This is why
batch memory support is so hard for those cases (the fact that they're
so rare anyway also puts me off it). As you know, I wrote a patch that
adds batch memory support to cases that require randomAccess (final
output on a materialized tape), for their final merge. These final
merges happen to not be a final on-the-fly merge only due to this
randomAccess requirement from caller. It's possible to support these
cases in the future, with that patch, only because I am safe to assume
that each run/tape is the same size there (well, the assumption is
exactly as safe as it was for the 9.6 final on-the-fly merge, at
least).

My point about non-final merges is that you have to be very careful
that you're comparing apples to apples, memory accounting wise, when
looking into something like this. I'm not saying that you didn't, but
it's worth considering.


I'm not 100% sure I'm accounting for all the memory correctly. But I 
didn't touch the way the initial quicksort works, nor the way the runs 
are built. And the merge passes don't actually need or benefit from a 
lot of memory, so I doubt it's very sensitive to that.


In this patch, the memory available for the read buffers is just divided 
evenly across maxTapes. The buffers for the tapes that are not currently 
active are wasted. It could be made smarter, by freeing all the 
currently-unused buffers for tapes that are not active at the moment. 
Might do that later, but this is what I'm going to benchmark for now. I 
don't think adding buffers is helpful beyond a certain point, so this is 
probably good enough in practice. Although it would be nice to free the 
memory we don't need earlier, in case there are other processes that 
could make use of it.
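The even split described in the paragraph above can be sketched like this. This is a hypothetical helper, not the actual logtape.c code; BLCKSZ_ here stands in for PostgreSQL's 8 kB block size:

```c
#include <stddef.h>

/*
 * Hypothetical sketch of dividing the available read-buffer memory evenly
 * across maxTapes, as described above.  Buffers assigned to currently
 * inactive tapes are simply wasted, which is the inefficiency noted in
 * the mail.
 */
static size_t
per_tape_buffer_size(size_t avail_mem, int max_tapes)
{
    const size_t BLCKSZ_ = 8192;        /* assumed PostgreSQL block size */
    size_t      per_tape = avail_mem / max_tapes;

    /* Round down to a whole number of blocks, but keep at least one. */
    per_tape -= per_tape % BLCKSZ_;
    if (per_tape < BLCKSZ_)
        per_tape = BLCKSZ_;
    return per_tape;
}
```

A smarter scheme, as the mail itself suggests, would reclaim the buffers of tapes that are not active in the current merge step instead of letting them sit idle.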



FWIW, I did try an increase in the buffer size in LogicalTape at one
time several months back, and saw no benefit there (at least, with no
other changes).


Yeah, unless you get rid of the pre-reading in tuplesort.c, you're just 
double-buffering.


- Heikki

>From d4d89c88c5e26be70c976a756e874af65ad6ec55 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Thu, 8 Sep 2016 14:31:31 +0300
Subject: [PATCH 1/3] Don't bother to pre-read tuples into SortTuple slots
 during merge.

That only seems to add overhead. We're doing the same 

Re: [HACKERS] Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

2016-09-08 Thread Tom Lane
Andrew Borodin  writes:
> Thank you for your corrections.
> Here is the patch with suggestions taken into account, except 6th.

I'll pick this up, unless some other committer is already on it.

I think that we should buy back the addition of PageIndexTupleOverwrite
to bufpage.c by getting rid of PageIndexDeleteNoCompact, as suggested
earlier by Alvaro.  The latter seems a lot messier in concept, and it's
more general than BRIN needs.  It also kinda looks like we could get
rid of PAI_ALLOW_FAR_OFFSET, which is a messy kluge in itself.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Index Onlys Scan for expressions

2016-09-08 Thread Vladimir Sitnikov
Ildar> Could you please try the patch and tell if it works for you?

I've tested patch6 against recent head. The patch applies with no problems.

The previous case (filter on top of i-o-s) is fixed. Great work.

Here are the test cases and results:
https://gist.github.com/vlsi/008e18e18b609fcaaec53d9cc210b7e2

However, it looks like there are issues when accessing non-indexed columns.
The error is "ERROR: variable not found in subplan target list".
The case is 02_case2_fails.sql (see the gist link above).

The essence of the case is "create index on substr(vc, 1, 128)",
assuming that the majority of the rows have length(vc)<128.
Under those conditions, it would be nice to do an index-only scan
and filter (like in my previous case), but detect "long" rows
and do an additional recheck for them.

Vladimir


[HACKERS] Preventing deadlock on parallel backup

2016-09-08 Thread Lucas
People,

I made a small modification in pg_dump to prevent parallel backup failures
due to exclusive lock requests made by other tasks.

The modification I made takes shared locks for each parallel backup worker
at the very beginning of the job. That way, any other job that attempts to
acquire exclusive locks will wait for the backup to finish.

In my case, each server was taking a day to complete the backup; now with
parallel backup one is taking 3 hours and the others less than an hour.

The code below is not very elegant, but it works for me. My wishlist for
the backup is:

1) replace the plpgsql by C code reading the backup TOC and assembling the lock
commands.
2) add a timeout to the locks.
3) broadcast the end of copy to every worker in order to release the locks
as early as possible;
4) create a monitor thread that prioritizes a copy job based on an exclusive
lock acquired;
5) grant the lock to another connection of the same distributed transaction
if it is held by any connection of the same distributed transaction. Is
there some side effect of that which I can't see?

1 to 4 are within my capabilities and I may do them in the future. 5 is too
advanced for me and I do not dare to mess with something so fundamental
right now.

Is anyone else working on that?

In Parallel.c, in void RunWorker(...), add:

PQExpBuffer query;
PGresult   *res;

query = createPQExpBuffer();
resetPQExpBuffer(query);
appendPQExpBuffer(query,
    "do language 'plpgsql' $$"
    " declare "
    "    x record;"
    " begin"
    "    for x in select * from pg_tables where schemaname not in ('pg_catalog','information_schema') loop"
    "        raise info 'lock table %.%', x.schemaname, x.tablename;"
    "        execute 'LOCK TABLE '||quote_ident(x.schemaname)||'.'||quote_ident(x.tablename)||' IN ACCESS SHARE MODE NOWAIT';"
    "    end loop;"
    " end"
    "$$" );

res = PQexec(AH->connection, query->data);

if (!res || PQresultStatus(res) != PGRES_COMMAND_OK)
    exit_horribly(modulename, "Could not lock the tables to begin the work\n\n");
PQclear(res);
destroyPQExpBuffer(query);


Re: [HACKERS] Add support for restrictive RLS policies

2016-09-08 Thread Stephen Frost
Greetings!

* Stephen Frost (sfr...@snowman.net) wrote:
> Based on Robert's suggestion and using Thom's verbiage, I've tested this
> out:
> 
> CREATE POLICY pol ON tab AS [PERMISSIVE|RESTRICTIVE] ...
> 
> and it appears to work fine with the grammar, etc.
> 
> Unless there's other thoughts on this, I'll update the patch to reflect
> this grammar in a couple days.

Updated patch attached which uses the above approach, includes
some initial documentation, and has fixes for the tab completion.

I'm planning to add more documentation.  Otherwise, testing and code
review would certainly be appreciated.

Thanks!

Stephen
From 6fad316de3cc50f4cc2c3bbe3c6fac91afc881e6 Mon Sep 17 00:00:00 2001
From: Stephen Frost 
Date: Thu, 1 Sep 2016 02:11:30 -0400
Subject: [PATCH] Add support for restrictive RLS policies

We have had support for restrictive RLS policies since 9.5, but they
were only available through extensions which use the appropriate hooks.
This adds support into the grammar, catalog, psql and pg_dump for
restrictive RLS policies, thus reducing the cases where an extension is
necessary.
---
 doc/src/sgml/ref/create_policy.sgml   |  16 +++
 src/backend/commands/policy.c |   9 ++
 src/backend/nodes/copyfuncs.c |   1 +
 src/backend/nodes/equalfuncs.c|   1 +
 src/backend/parser/gram.y |  32 +++--
 src/backend/rewrite/rowsecurity.c |   7 +-
 src/bin/pg_dump/pg_dump.c |  38 --
 src/bin/pg_dump/pg_dump.h |   1 +
 src/bin/pg_dump/t/002_pg_dump.pl  |  39 +-
 src/bin/psql/describe.c   | 109 
 src/bin/psql/tab-complete.c   |  29 -
 src/include/catalog/pg_policy.h   |  16 ++-
 src/include/nodes/parsenodes.h|   1 +
 src/include/parser/kwlist.h   |   2 +
 src/include/rewrite/rowsecurity.h |   1 +
 src/test/regress/expected/rowsecurity.out | 209 ++
 src/test/regress/sql/rowsecurity.sql  |  24 +++-
 17 files changed, 417 insertions(+), 118 deletions(-)

diff --git a/doc/src/sgml/ref/create_policy.sgml b/doc/src/sgml/ref/create_policy.sgml
index 89d2787..d930052 100644
--- a/doc/src/sgml/ref/create_policy.sgml
+++ b/doc/src/sgml/ref/create_policy.sgml
@@ -22,6 +22,7 @@ PostgreSQL documentation
  
 
 CREATE POLICY name ON table_name
+[ AS ( PERMISSIVE | RESTRICTIVE ) ]
 [ FOR { ALL | SELECT | INSERT | UPDATE | DELETE } ]
 [ TO { role_name | PUBLIC | CURRENT_USER | SESSION_USER } [, ...] ]
 [ USING ( using_expression ) ]
@@ -120,6 +121,21 @@ CREATE POLICY name ON 
 

+permissive
+
+ 
+  Whether the policy is a "permissive" or "restrictive" policy.  Permissive
+  policies are the default and always add visibility - they are "OR"d
+  together to allow the user access to all rows through any of the
+  permissive policies they have access to.  Alternatively, a policy can
+  instead be "restrictive", meaning that the policy will be "AND"d
+  with other restrictive policies and with the result of all of the
+  permissive policies on the table.
+ 
+
+   
+
+   
 command
 
  
diff --git a/src/backend/commands/policy.c b/src/backend/commands/policy.c
index d694cf8..70e22c1 100644
--- a/src/backend/commands/policy.c
+++ b/src/backend/commands/policy.c
@@ -235,6 +235,7 @@ RelationBuildRowSecurity(Relation relation)
 		{
 			Datum		value_datum;
 			char		cmd_value;
+			bool		permissive_value;
 			Datum		roles_datum;
 			char	   *qual_value;
 			Expr	   *qual_expr;
@@ -257,6 +258,12 @@ RelationBuildRowSecurity(Relation relation)
 			Assert(!isnull);
 			cmd_value = DatumGetChar(value_datum);
 
+			/* Get policy permissive or restrictive */
+			value_datum = heap_getattr(tuple, Anum_pg_policy_polpermissive,
+	   RelationGetDescr(catalog), &isnull);
+			Assert(!isnull);
+			permissive_value = DatumGetBool(value_datum);
+
 			/* Get policy name */
 			value_datum = heap_getattr(tuple, Anum_pg_policy_polname,
	   RelationGetDescr(catalog), &isnull);
@@ -298,6 +305,7 @@ RelationBuildRowSecurity(Relation relation)
 			policy = palloc0(sizeof(RowSecurityPolicy));
 			policy->policy_name = pstrdup(policy_name_value);
 			policy->polcmd = cmd_value;
+			policy->permissive = permissive_value;
 			policy->roles = DatumGetArrayTypePCopy(roles_datum);
 			policy->qual = copyObject(qual_expr);
 			policy->with_check_qual = copyObject(with_check_qual);
@@ -796,6 +804,7 @@ CreatePolicy(CreatePolicyStmt *stmt)
 	values[Anum_pg_policy_polname - 1] = DirectFunctionCall1(namein,
 		 CStringGetDatum(stmt->policy_name));
 	values[Anum_pg_policy_polcmd - 1] = CharGetDatum(polcmd);
+	values[Anum_pg_policy_polpermissive - 1] = BoolGetDatum(stmt->permissive);
 	values[Anum_pg_policy_polroles - 1] = PointerGetDatum(role_ids);
 
 	/* Add qual if present. */
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-08 Thread Masahiko Sawada
On Thu, Sep 8, 2016 at 11:54 PM, Pavan Deolasee
 wrote:
>
>
> On Wed, Sep 7, 2016 at 10:18 PM, Claudio Freire 
> wrote:
>>
>> On Wed, Sep 7, 2016 at 12:12 PM, Greg Stark  wrote:
>> > On Wed, Sep 7, 2016 at 1:45 PM, Simon Riggs 
>> > wrote:
>> >> On 6 September 2016 at 19:59, Tom Lane  wrote:
>> >>
>> >>> The idea of looking to the stats to *guess* about how many tuples are
>> >>> removable doesn't seem bad at all.  But imagining that that's going to
>> >>> be
>> >>> exact is folly of the first magnitude.
>> >>
>> >> Yes.  Bear in mind I had already referred to allowing +10% to be safe,
>> >> so I think we agree that a reasonably accurate, yet imprecise
>> >> calculation is possible in most cases.
>> >
>> > That would all be well and good if it weren't trivial to do what
>> > Robert suggested. This is just a large unsorted list that we need to
>> > iterate throught. Just allocate chunks of a few megabytes and when
>> > it's full allocate a new chunk and keep going. There's no need to get
>> > tricky with estimates and resizing and whatever.
>>
>> I agree. While the idea of estimating the right size sounds promising
>> a priori, considering the estimate can go wrong and over or
>> underallocate quite severely, the risks outweigh the benefits when you
>> consider the alternative of a dynamic allocation strategy.
>>
>> Unless the dynamic strategy has a bigger CPU impact than expected, I
>> believe it's a superior approach.
>>
>
> How about a completely different representation for the TID array? Now this
> is probably not something new, but I couldn't find if the exact same idea
> was discussed before. I also think it's somewhat orthogonal to what we are
> trying to do here, and will probably be a bigger change. But I thought I'll
> mention since we are at the topic.
>
> What I propose is to use a simple bitmap to represent the tuples. If a tuple
> at (blkno, offset) is dead then the corresponding bit in the bitmap is set.
> So clearly the searches through dead tuples are O(1) operations, important
> for very large tables and large arrays.
>
> Challenge really is that a heap page can theoretically have MaxOffsetNumber
> of line pointers (or to be precise maximum possible offset number). For a 8K
> block, that comes be about 2048. Having so many bits per page is neither
> practical nor optimal. But in practice the largest offset on a heap page
> should not be significantly greater than MaxHeapTuplesPerPage, which is a
> more reasonable value of 291 on my machine. Again, that's with zero sized
> tuple and for real life large tables, with much wider tuples, the number may
> go down even further.
>
> So we cap the offsets represented in the bitmap to some realistic value,
> computed by looking at page density and then multiplying it by a small
> factor (not more than two) to take into account LP_DEAD and LP_REDIRECT line
> pointers. That should practically represent majority of the dead tuples in
> the table, but we then keep an overflow area to record tuples beyond the
> limit set for per page. The search routine will do a direct lookup for
> offsets less than the limit and search in the sorted overflow area for
> offsets beyond the limit.
>
> For example, for a table with 60 bytes wide tuple (including 24 byte
> header), each page can approximately have 8192/60 = 136 tuples. Say we
> provision for 136*2 = 272 bits per page i.e. 34 bytes per page for the
> bitmap. First 272 offsets in every page are represented in the bitmap and
> anything greater than are in overflow region. On the other hand, the current
> representation will need about 16 bytes per page assuming 2% dead tuples, 40
> bytes per page assuming 5% dead tuples and 80 bytes assuming 10% dead
> tuples. So bitmap will take more space for small tuples or when vacuum is
> run very aggressively, both seems unlikely for very large tables. Of course
> the calculation does not take into account the space needed by the overflow
> area, but I expect that too be small.
>
> I guess we can make a choice between two representations at the start
> looking at various table stats. We can also be smart and change from bitmap
> to traditional representation as we scan the table and see many more tuples
> in the overflow region than we provisioned for. There will be some
> challenges in converting representation mid-way, especially in terms of
> memory allocation, but I think those can be sorted out if we think that the
> idea has merit.
>

Making the vacuum possible to choose between two data representations
sounds good.
I implemented the patch that changes dead tuple representation to bitmap before.
I will measure the performance of bitmap representation again and post them.
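The per-page bitmap proposal quoted above can be sketched roughly as follows. This is a toy model, not PostgreSQL code: the table size and per-page bit budget are fixed constants here, and the sorted overflow area for offsets beyond the cap is omitted for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative cap from the example in the mail: 272 bits per page. */
#define BITS_PER_PAGE  272
#define WORDS_PER_PAGE ((BITS_PER_PAGE + 63) / 64)
#define NPAGES         4            /* fixed table size, for illustration */

typedef struct DeadTupleBitmap
{
    uint64_t    words[NPAGES * WORDS_PER_PAGE];
} DeadTupleBitmap;

static void
dtb_set_dead(DeadTupleBitmap *dtb, uint32_t blkno, uint32_t offnum)
{
    assert(blkno < NPAGES && offnum < BITS_PER_PAGE);
    dtb->words[blkno * WORDS_PER_PAGE + offnum / 64] |=
        UINT64_C(1) << (offnum % 64);
}

/* O(1) lookup, versus a binary search over the current sorted TID array. */
static bool
dtb_is_dead(const DeadTupleBitmap *dtb, uint32_t blkno, uint32_t offnum)
{
    assert(blkno < NPAGES && offnum < BITS_PER_PAGE);
    return (dtb->words[blkno * WORDS_PER_PAGE + offnum / 64]
            >> (offnum % 64)) & 1;
}
```

In the real thing, offsets at or above BITS_PER_PAGE would fall back to the sorted overflow array described in the mail, and the search routine would pick the lookup path based on the offset.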

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Default make target in test modules

2016-09-08 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
> I happened to notice that if you type "make" in
> src/test/modules/test_pg_dump, you will get a "make check" action
> not "make all".  I hope this is just somebody being thoughtless
> about Makefile ordering and not an intentional override of the
> default make target.  If the latter, I beg to differ about it
> being a good idea.

No, it wasn't intentional.

> > Done.
> 
> Thanks.

Thanks to you both!

Stephen




Re: [HACKERS] Cache Hash Index meta page.

2016-09-08 Thread Jesper Pedersen

On 09/05/2016 02:50 PM, Mithun Cy wrote:

On Sep 2, 2016 7:38 PM, "Jesper Pedersen" 
wrote:

Could you provide a rebased patch based on Amit's v5 ?


Please find the patch, based on Amit's V5.

I have fixed the following things

1. Now in "_hash_first" we check if (opaque->hasho_prevblkno ==
InvalidBlockNumber) to see if the bucket is from an older-version hash index
and has been upgraded. Since, as of now, InvalidBlockNumber is one greater
than the maximum value the variable "metap->hashm_maxbucket" can be set to (see
_hash_expandtable), we can distinguish it from the rest. I tested the upgrade
issue reported by Amit. It is fixed now.

2. One case in which a bucket's hasho_prevblkno is used is where we do a backward
scan. So now, before testing for the previous block number, I test whether
the current page is a bucket page; if so, we end the bucket scan (see changes in
_hash_readprev). In the other places where hasho_prevblkno is used it is not
for a bucket page, so I have not put in any extra check to verify whether it is
a bucket page.



I think that the

+ pageopaque->hasho_prevblkno = metap->hashm_maxbucket;

trick should be documented in the README, as hashm_maxbucket is defined 
as uint32 whereas hasho_prevblkno is a BlockNumber.


(All bucket variables should probably use the Bucket definition instead 
of uint32).


For the archives, this patch conflicts with the WAL patch [1].

[1] 
https://www.postgresql.org/message-id/CAA4eK1JS%2BSiRSQBzEFpnsSmxZKingrRH7WNyWULJeEJSj1-%3D0w%40mail.gmail.com


Best regards,
 Jesper





Re: [HACKERS] Default make target in test modules

2016-09-08 Thread Tom Lane
Alvaro Herrera  writes:
> Tom Lane wrote:
>> Alvaro Herrera  writes:
>>> I suppose this is a very easy mistake to make -- but also fortunately an
>>> easy one to correct.  Do you want me to fix the affected modules?

>> I was going to do it, but if you want to, feel free.

> Done.

Thanks.

regards, tom lane




Re: [HACKERS] Default make target in test modules

2016-09-08 Thread Alvaro Herrera
Tom Lane wrote:
> Alvaro Herrera  writes:

> > I suppose this is a very easy mistake to make -- but also fortunately an
> > easy one to correct.  Do you want me to fix the affected modules?
> 
> I was going to do it, but if you want to, feel free.

Done.

> (BTW, I notice that test_pg_dump's Makefile misquotes its own path name,
> probably copy-paste sloppiness.)

Uh, I hadn't noticed.  Fixed that one too.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




[HACKERS] feature request: explain "with details" option

2016-09-08 Thread Roger Pack
My apologies if this was already requested before...

I think it would be fantastic if postgres had an "explain the explain" option:
Today's explain tells us what loops and scans were used, and relative
costs, etc.  It doesn't seem to tell *why* the planner elected to use
what it did.

For instance, in the case of a corrupted index, it doesn't say why
it's not using that index, it just doesn't use it, causing some
confusion to end users.  At least causing confusion to me.

Or in the case of where it iterates over an entire table (seq. scan)
instead of using an index because the index range specified "is most
of the table" (thus not helpful to use the index)...The choice is
appropriate.  The reasoning why is not explicitly mentioned.  Again
causing the possibility of some delay as you try to "decipher the mind"
of the planner.  Sometimes tables (ex: tables after having been first
propagated) need an "analyze" run on them, but it's not clear from an
"explain" output that the analyze statistics are faulty.  Not even a
hint.

So this is a feature request for an "EXPLAIN DETAILS" option or
something, basically like today's explain but with more "rationale"
included.  This could be immensely useful to many Postgres users.

I'd even be willing to chip in a couple hundred bucks to help
grease the wheels for somebody taking up the challenge, if that helps
at all :)

Thank you for your consideration in this regard.
-roger-




Re: [HACKERS] Is tuplesort_heap_siftup() a misnomer?

2016-09-08 Thread Tom Lane
Peter Geoghegan  writes:
> On Thu, Sep 8, 2016 at 12:01 AM, Heikki Linnakangas  wrote:
>> I still think tuplesort_heap_siftup is a confusing name, although I'm not
>> sure that Peter's "compact" is much better. I suggest that we rename it to
>> tuplesort_heap_delete_top(). In comments within the function, explain that
>> the *loop* corresponds to the "siftup" in Knuth's book.

> I'm also fine with that name.

I can live with it too.

>> Interestingly, as Knuth explains the siftup algorithm, it performs a
>> "replace the top" operation, rather than "remove the top". But we use it to
>> remove the top, by moving the last element to the top and running the
>> algorithm after that. Peter's other patch, to change the way we currently
>> replace the top node, to do it in one pass rather than as delete+insert, is
>> actually Knuth's siftup algorithm.

> Knuth must have a strange sense of humor.

I'm too lazy to get the book back down off the shelf to check, but I think
that Knuth uses sift-up to describe two different algorithms that have
essentially the same inner loop.

regards, tom lane




[HACKERS] logical replication WIP: FailedAssertion("!(list_length(stmt->tables) > 0)", File: "publicationcmds.c", Line: 350)

2016-09-08 Thread Erik Rijkers

Hi,

I found these steps to a reliable crash (I know the patch is WIP but I 
assume you want to hear about these).


(Running a single instance suffices)


psql (10devel_logical_replication_20160901_1237_6f7c0ea32f80)
Type "help" for help.

(PID 9809) [testdb]  # CREATE PUBLICATION pub_all;
CREATE PUBLICATION

(PID 9809) [testdb]  # ALTER PUBLICATION pub_all FOR ALL TABLES;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
Time: 40851.246 ms
(PID ) [] ? >

Logfile:
[...]
2016-09-08 19:24:43.157 CEST 9765 LOG:  MultiXact member wraparound 
protections are now enabled
2016-09-08 19:24:43.160 CEST 9770 LOG:  logical replication launcher 
started
2016-09-08 19:24:43.160 CEST 9764 LOG:  database system is ready to 
accept connections
TRAP: FailedAssertion("!(list_length(stmt->tables) > 0)", File: 
"publicationcmds.c", Line: 350)
2016-09-08 19:25:45.673 CEST 9764 LOG:  server process (PID 9809) was 
terminated by signal 6: Aborted
2016-09-08 19:25:45.673 CEST 9764 DETAIL:  Failed process was running: 
ALTER PUBLICATION pub_all FOR ALL TABLES;
2016-09-08 19:25:45.674 CEST 9764 LOG:  terminating any other active 
server processes
2016-09-08 19:25:45.676 CEST 9764 LOG:  all server processes terminated; 
reinitializing
2016-09-08 19:25:47.463 CEST 9819 LOG:  database system was interrupted; 
last known up at 2016-09-08 19:24:43 CEST
2016-09-08 19:25:47.465 CEST 9820 FATAL:  the database system is in 
recovery mode
2016-09-08 19:25:47.493 CEST 9819 LOG:  database system was not properly 
shut down; automatic recovery in progress

2016-09-08 19:25:47.494 CEST 9819 LOG:  redo starts at 0/A6D8C070
2016-09-08 19:25:47.494 CEST 9819 LOG:  invalid record length at 
0/A6D8D498: wanted 24, got 0

2016-09-08 19:25:47.494 CEST 9819 LOG:  redo done at 0/A6D8D460
2016-09-08 19:25:47.494 CEST 9819 LOG:  last completed transaction was 
at log time 2016-09-08 19:25:00.774988+02
2016-09-08 19:25:47.511 CEST 9819 LOG:  MultiXact member wraparound 
protections are now enabled
2016-09-08 19:25:47.514 CEST 9826 LOG:  logical replication launcher 
started
2016-09-08 19:25:47.515 CEST 9764 LOG:  database system is ready to 
accept connections



Thanks,

Erik Rijkers


(BTW, the issue I reported a few days ago was indeed avoided when I 
created a receiving table subscriber-side, thanks)






[HACKERS] DISABLE_PAGE_SKIPPING option for vacuumdb

2016-09-08 Thread Masahiko Sawada
Hi,

Attached patch adds a --disable-page-skipping option to the vacuumdb command for 9.6.
If by any chance the freeze map causes a serious data corruption bug
then the user will want to run pg_check_visible() and vacuum with the
DISABLE_PAGE_SKIPPING option on the multiple tables having the problem.
In that case, I think this option for vacuumdb would be convenient.

Please give me feedback.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


disable_page_skipping_option_for_vacuumdb_v1.patch
Description: Binary data



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-09-08 Thread Peter Geoghegan
On Thu, Sep 8, 2016 at 10:18 AM, Claudio Freire  wrote:
> That particular case I believe is using work_mem=4MB, easy strings, 1.5GB 
> table.

Cool. I wonder where this leaves Heikki's draft patch, that completely
removes batch memory, etc.



-- 
Peter Geoghegan




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-09-08 Thread Claudio Freire
On Thu, Sep 8, 2016 at 2:13 PM, Peter Geoghegan  wrote:
> On Thu, Sep 8, 2016 at 8:53 AM, Claudio Freire  wrote:
>> setup:
>>
>> create table lotsofitext(i text, j text, w text, z integer, z2 bigint);
>> insert into lotsofitext select cast(random() * 10.0 as text)
>> || 'blablablawblabla', cast(random() * 10.0 as text) ||
>> 'blablablawjjjblabla', cast(random() * 10.0 as text) ||
>> 'blablablawwwabla', random() * 10.0, random() * 1.0 from
>> generate_series(1, 1000);
>>
>> timed:
>>
>> select count(*) FROM (select * from lotsofitext order by i, j, w, z, z2) t;
>>
>> Unpatched Time: 100351.251 ms
>> Patched Time: 75180.787 ms
>>
>> That's like a 25% speedup on random input. As we say over here, rather
>> badly translated, not a turkey's boogers (meaning "nice!")
>
> Cool! What work_mem setting were you using here?

The script iterates over a few variations of string patterns (easy
comparisons vs hard comparisons), work mem (4MB, 64MB, 256MB, 1GB,
4GB), and table sizes (~350M, ~650M, ~1.5G).

That particular case I believe is using work_mem=4MB, easy strings, 1.5GB table.




Re: [HACKERS] Is tuplesort_heap_siftup() a misnomer?

2016-09-08 Thread Peter Geoghegan
On Thu, Sep 8, 2016 at 12:01 AM, Heikki Linnakangas  wrote:
> I still think tuplesort_heap_siftup is a confusing name, although I'm not
> sure that Peter's "compact" is much better. I suggest that we rename it to
> tuplesort_heap_delete_top(). In comments within the function, explain that
> the *loop* corresponds to the "siftup" in Knuth's book.

I'm also fine with that name.

> Interestingly, as Knuth explains the siftup algorithm, it performs a
> "replace the top" operation, rather than "remove the top". But we use it to
> remove the top, by moving the last element to the top and running the
> algorithm after that. Peter's other patch, to change the way we currently
> replace the top node, to do it in one pass rather than as delete+insert, is
> actually Knuth's siftup algorithm.

Knuth must have a strange sense of humor.

> Peter, looking at your "displace" patch in this light, I think
> tuplesort_heap_root_displace() and tuplesort_heap_delete_top() (as I'm
> calling it now), should share a common subroutine. Displace replaces the top
> node with the new node passed as argument, and then calls the subroutine to
> push it down to the right place. Delete_top replaces the top node with the
> last value in the heap, and then calls the subroutine. Or perhaps delete_top
> should remove the last value from the heap, and then call displace() with
> it, to re-insert it.
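The refactoring sketched above can be illustrated in miniature on a bare int min-heap (names and layout here are hypothetical; the real tuplesort routines operate on SortTuple arrays and carry extra state, but the shape is the same): both "displace" and "delete top" reduce to a single sift-down subroutine.

```c
#include <assert.h>

typedef struct
{
	int		   *vals;
	int			count;
} IntHeap;

/* Push the value at the root down until the heap property holds again. */
static void
sift_down(IntHeap *h)
{
	int			i = 0;

	for (;;)
	{
		int			l = 2 * i + 1;
		int			r = l + 1;
		int			min = i;
		int			tmp;

		if (l < h->count && h->vals[l] < h->vals[min])
			min = l;
		if (r < h->count && h->vals[r] < h->vals[min])
			min = r;
		if (min == i)
			break;
		tmp = h->vals[i];
		h->vals[i] = h->vals[min];
		h->vals[min] = tmp;
		i = min;
	}
}

/* "Displace": replace the root with a new value, then restore heap order. */
static void
heap_root_displace(IntHeap *h, int newval)
{
	assert(h->count > 0);
	h->vals[0] = newval;
	sift_down(h);
}

/* "Delete top": move the last element to the root, then restore heap order. */
static void
heap_delete_top(IntHeap *h)
{
	assert(h->count > 0);
	h->vals[0] = h->vals[--h->count];
	sift_down(h);
}
```

As suggested in the quoted message, heap_delete_top() could equally pop the last element and call heap_root_displace() with it; the two spellings are equivalent.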

I can see the value in that, but I'd still rather not. The code reuse
win isn't all that large, and having a separate
tuplesort_heap_root_displace() routine allows us to explain what's
going on for merging, to assert what we're able to assert with tape
numbers vs. heap position, and to lose the HEAPCOMPARE()/checkIndex
cruft that the existing routines need (if only barely) to support
replacement selection. I would be surprised if with a very tight inner
loop like this, HEAPCOMPARE() has an appreciable overhead (maybe it
increases pipeline stalls).

-- 
Peter Geoghegan




Re: [HACKERS] pg_dump with tables created in schemas created by extensions

2016-09-08 Thread Tom Lane
Michael Paquier  writes:
> On Tue, Aug 30, 2016 at 5:43 AM, Martín Marqués  
> wrote:
>> This is v4 of the patch, which is actually a cleaner version from the
>> v2 one Michael sent.

> Let's do as you suggest then, and just focus on the schema issue. I
> just had an extra look at the patch and it looks fine to me. So the
> patch is now switched as ready for committer.

Pushed with cosmetic adjustments (mainly, I thought the comment needed
to be much more explicit).

I'm not sure whether we want to try to fix this in older branches.
It would be a significantly larger patch, and maybe more of a behavior
change than we want to make in stable branches.  So I'm inclined to call
it done at this point.

regards, tom lane




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-09-08 Thread Peter Geoghegan
On Thu, Sep 8, 2016 at 8:53 AM, Claudio Freire  wrote:
> setup:
>
> create table lotsofitext(i text, j text, w text, z integer, z2 bigint);
> insert into lotsofitext select cast(random() * 10.0 as text)
> || 'blablablawblabla', cast(random() * 10.0 as text) ||
> 'blablablawjjjblabla', cast(random() * 10.0 as text) ||
> 'blablabl
> awwwabla', random() * 10.0, random() * 1.0 from
> generate_series(1, 1000);
>
> timed:
>
> select count(*) FROM (select * from lotsofitext order by i, j, w, z, z2) t;
>
> Unpatched Time: 100351.251 ms
> Patched Time: 75180.787 ms
>
> That's like a 25% speedup on random input. As we say over here, rather
> badly translated, not a turkey's boogers (meaning "nice!")

Cool! What work_mem setting were you using here?

>> More than using "n" or "memtupcount" what I'm saying is to assert that
>> memtuples[imin] is inside the heap, which would catch the same errors
>> the original assert would, and more.
>>
>> Assert(imin < state->memtupcount)
>>
>> If you prefer.
>>
>> The original asserts allows any value of imin for memtupcount>1, and
>> that's my main concern. It shouldn't.
>
> So, for the assertions to properly avoid clobbering/reading out of
> bounds memory, you need both the above assert:
>
>  +*/
>  +   memtuples[i] = memtuples[imin];
>  +   i = imin;
>  +   }
>  +
>>+   Assert(imin < state->memtupcount);
>  +   memtuples[imin] = *newtup;
>  +}
>
> And another one at the beginning, asserting:
>
>  +   SortTuple  *memtuples = state->memtuples;
>  +   int n,
>  +   imin,
>  +   i;
>  +
>>+   Assert(state->memtupcount > 0 && memtuples[0].tupindex == 
>>newtup->tupindex);
>  +
>  +   CHECK_FOR_INTERRUPTS();
>
> It's worth making that change, IMHO, unless I'm missing something.

You're supposed to just not call it with an empty heap, so the
assertions trust that much. I'll look into that.

Currently, producing a new revision of this entire patchset. Improving
the cost model (used when the parallel_workers storage parameter is
not specified within CREATE INDEX) is taking a bit of time, but hope
to have it out in the next couple of days.

-- 
Peter Geoghegan




Re: [HACKERS] Aggregate Push Down - Performing aggregation on foreign server

2016-09-08 Thread Ashutosh Bapat
> While checking for shippability, we build the target list which is passed
> to
> the foreign server as fdw_scan_tlist. The target list contains
> a. All the GROUP BY expressions
> b. Shippable entries from the target list of upper relation
> c. Var and Aggref nodes from non-shippable entries from the target list of
> upper relation
>

The code in the patch doesn't seem to add Var nodes explicitly. It assumes
that the Var nodes will be part of the GROUP BY clause. The code is
correct, I think.


> d. Var and Aggref nodes from non-shippable HAVING conditions.
>

This needs to be changed as per the comments below.

>
> Costing:
>
> If use_foreign_estimate is true for input relation, like JOIN case, we use
> EXPLAIN output to get the cost of query with aggregation/grouping on the
> foreign server. If not we calculate the costs locally. Similar to core, we
> use
> get_agg_clause_costs() to get costs for aggregation and then using logic
> similar to cost_agg() we calculate startup and total cost. Since we have no
> idea which aggregation strategy will be used at foreign side, we add all
> startup cost (startup cost of input relation, aggregates etc.) into startup
> cost for the grouping path and similarly for total cost.
>

This looks OK to me.


>
> Deparsing the query:
>
> Target list created while checking for shippability is deparsed using
> deparseExplicitTargetList(). sortgroupref are adjusted according to this
> target list. Most of the logic to deparse an Aggref is inspired from
> get_agg_expr(). For an upper relation, FROM and WHERE clauses come from the
> underlying scan relation and thus for simplicity, FROM clause deparsing
> logic
> is moved from deparseSelectSql() to a new function deparseFromClause(). The
> same function adds WHERE clause to the remote SQL.
>

Ok.


>
>
> Area of future work:
>
> 1. Adding path with path-keys to push ORDER BY clause along with
> aggregation/
> grouping.  Should be supported as a separate patch.
>
> 2. Grouping Sets/Rollup/Cube is not supported in current version. I have
> left
> this aside to keep patch smaller. If required I can add that support in the
> next version of the patch.
>

I am fine with these limitations for first cut of this feature.

I think we should try to measure the performance gain from aggregate
pushdown. The EXPLAIN output doesn't show the actual improvement in
execution times.

Here are the comments on the patch.

Patch compiles without errors/warnings - Yes
make check passes - Yes.
make check in contrib/postgres_fdw passes - Yes

Attached patch (based on your patch) has some typos corrected, some comments
rephrased. It also has some code changes, as explained in various comments
below. Please see if those look good.

1. Usually for any deparseSomething function, Something is the type of node
produced by the parser when the string output by that function is parsed.
deparseColumnRef, for example, produces a string which, when parsed,
produces a ColumnRef node. There are no nodes of type FromClause or
AggOrderBy; I guess you meant FromExpr instead of FromClause.
deparseAggOrderBy() may be renamed to appendOrderByFromList() or something
similar. It may be utilized for window functions if required.

2. Can we infer the value of foragg flag from the RelOptInfo passed to
is_foreign_expr()? For any upper relation, it is ok to have aggregate in
there. For any other relation aggregate is not expected.

3. In function foreign_grouping_ok(), should we use classifyConditions()?
The function is written and used for base relations, but there's nothing
in it that prohibits its use for other relations. I feel that
foreign_join_ok() should have used the same function to classify the
other clauses.

4. May be the individual criteria in the comment block
+/*
+ * Aggregate is safe to pushdown if
+ * 1. It is a built-in aggregate
+ * 2. All its arguments are safe to push-down
+ * 3. The functions involved are immutable.
+ * 4. Other expressions involved like aggorder,
aggdistinct are
+ *safe to be pushed down.
+ */
should be associated with the code blocks which implement them. As the
criteria change, it's difficult to keep such a numbered list in sync with
the code.

5. The comment
+/* Aggregates other than simple one are non-pushable. */
should read /* Only non-split aggregates are pushable. */ as AGGSPLIT_SIMPLE
means a complete, non-split aggregation step.

6. An aggregate function has transition, combination and finalization
functions associated with it. Should we check whether all of those
functions are shippable? It probably suffices to check whether the
aggregate function as a whole is shippable using is_shippable(), since
it's the whole aggregate we are interested in and not the intermediate
results. We should probably add a comment explaining why it's sufficient
to check the aggregate function

Re: [HACKERS] Default make target in test modules

2016-09-08 Thread Tom Lane
Alvaro Herrera  writes:
> Tom Lane wrote:
>> I happened to notice that if you type "make" in
>> src/test/modules/test_pg_dump, you will get a "make check" action
>> not "make all".

> Strange.  Don't all these makefiles depend on the pgxs stuff emitting
> something sane, which would have "all" as the first one? ...  Ooh, I see
> that test_pg_dump adds "check" before including pgxs, which is what
> AFAICS causes the problem.

Right.

> I suppose this is a very easy mistake to make -- but also fortunately an
> easy one to correct.  Do you want me to fix the affected modules?

I was going to do it, but if you want to, feel free.

(BTW, I notice that test_pg_dump's Makefile misquotes its own path name,
probably copy-paste sloppiness.)

regards, tom lane




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-08 Thread Pavan Deolasee
On Thu, Sep 8, 2016 at 8:42 PM, Claudio Freire 
wrote:

> On Thu, Sep 8, 2016 at 11:54 AM, Pavan Deolasee
>  wrote:
> > For example, for a table with 60 bytes wide tuple (including 24 byte
> > header), each page can approximately have 8192/60 = 136 tuples. Say we
> > provision for 136*2 = 272 bits per page i.e. 34 bytes per page for the
> > bitmap. First 272 offsets in every page are represented in the bitmap and
> > anything greater than are in overflow region. On the other hand, the
> current
> > representation will need about 16 bytes per page assuming 2% dead
> tuples, 40
> > bytes per page assuming 5% dead tuples and 80 bytes assuming 10% dead
> > tuples. So bitmap will take more space for small tuples or when vacuum is
> > run very aggressively, both seems unlikely for very large tables. Of
> course
> > the calculation does not take into account the space needed by the
> overflow
> > area, but I expect that too be small.
>
> I thought about something like this, but it could be extremely
> inefficient for mostly frozen tables, since the bitmap cannot account
> for frozen pages without losing the O(1) lookup characteristic
>

Well, that's correct. But I thought the whole point is to handle the case
where there are a large number of dead tuples, which requires a large
amount of memory. If my math above is correct, then even at 5% dead tuples
the bitmap representation consumes approximately the same memory but
provides O(1) search time.
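The arithmetic can be checked in a few lines (figures assumed from the thread: 8192-byte pages, 60-byte tuples, 2 bits provisioned per tuple slot for the bitmap, and 6 bytes per dead-tuple pointer, i.e. sizeof(ItemPointerData), for the current representation; the helper names are hypothetical):

```c
/* ~136 tuples of 60 bytes fit in an 8192-byte page */
#define TUPLES_PER_PAGE (8192.0 / 60)

/* Bytes per page for the proposed bitmap: 2 bits provisioned per tuple. */
static double
bitmap_bytes_per_page(void)
{
	return 2 * TUPLES_PER_PAGE / 8;		/* ~34 bytes */
}

/* Bytes per page for the current dead-TID array: 6-byte ItemPointers. */
static double
list_bytes_per_page(double dead_frac)
{
	return dead_frac * TUPLES_PER_PAGE * 6;
}
```

At 2% dead tuples the list wins (~16 vs ~34 bytes/page); at 5% they are comparable (~41 vs ~34); at 10% the bitmap wins clearly (~82 vs ~34), matching the figures quoted upthread.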

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Re: [COMMITTERS] pgsql: Make initdb's suggested "pg_ctl start" command line more reliabl

2016-09-08 Thread Tom Lane
Peter Eisentraut  writes:
> More generally, I'm concerned that appendShellString() looks pretty
> attractive for future use.  It's not inconceivable that someone will
> want to use it for say calling pg_dump from pg_dumpall or pg_upgrade at
> some point, and then maybe we'll accidentally disallow LF/CR in
> tablespace names, say.

That's fair, but I'm not sure how you propose to avoid it, given that
we don't have a method for safely quoting those characters on Windows.

I would certainly far rather find a way to do that and then remove the
prohibition.  But lacking such a solution, we're going to be forced into
messy compromises.

regards, tom lane




Re: [HACKERS] Default make target in test modules

2016-09-08 Thread Alvaro Herrera
Tom Lane wrote:
> I happened to notice that if you type "make" in
> src/test/modules/test_pg_dump, you will get a "make check" action
> not "make all".  I hope this is just somebody being thoughtless
> about Makefile ordering and not an intentional override of the
> default make target.  If the latter, I beg to differ about it
> being a good idea.

Strange.  Don't all these makefiles depend on the pgxs stuff emitting
something sane, which would have "all" as the first one? ...  Ooh, I see
that test_pg_dump adds "check" before including pgxs, which is what
AFAICS causes the problem.

I suppose this is a very easy mistake to make -- but also fortunately an
easy one to correct.  Do you want me to fix the affected modules?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




[HACKERS] Default make target in test modules

2016-09-08 Thread Tom Lane
I happened to notice that if you type "make" in
src/test/modules/test_pg_dump, you will get a "make check" action
not "make all".  I hope this is just somebody being thoughtless
about Makefile ordering and not an intentional override of the
default make target.  If the latter, I beg to differ about it
being a good idea.

regards, tom lane




Re: [HACKERS] Remove superuser() checks from pgstattuple

2016-09-08 Thread Peter Eisentraut
On 9/4/16 7:36 PM, Stephen Frost wrote:
> Perhaps if the versioned install script was actually a non-versioned
> install script in the source tree, and the Makefile simply installed it
> into the correct place, then we could eliminate that part.  (All very
> hand-wavy, I've not looked at what it'd take to do.)

A number of external extensions actually do it that way, but it also has
its problems.  For example, if you don't preserve the old base install
scripts, you can't actually test whether your upgrade scripts work in
the sense that they upgrade you to the same place.

This is now being obsoleted by the later idea of allowing base installs
from a chain of upgrade scripts.  But if your upgrade scripts contain
ALTER TABLE commands, you will probably still want to write base install
scripts that do a fresh CREATE TABLE instead.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Surprising behaviour of \set AUTOCOMMIT ON

2016-09-08 Thread Daniel Verite
Rahila Syed wrote:

> However, including the check here will require returning the status
> of command from this hook. i.e if we throw error inside this
> hook we will need to return false indicating the value has not changed.

Looking at the other variable hooks, they already emit errors and can
deny the effect of a new value without informing the caller. Why would
autocommit be different?

For example echo_hook does this:

/* ...if the value is in (queries,errors,all,none) then
   assign pset.echo accordingly ... */
else
{
  psql_error("unrecognized value \"%s\" for \"%s\"; assuming \"%s\"\n",
   newval, "ECHO", "none");
  pset.echo = PSQL_ECHO_NONE;
}


If the user issues \set ECHO FOOBAR, then it produces the above error
message and makes the same internal change as if \set ECHO none
had been issued.

But, the value of the variable as seen by the user is still FOOBAR:

\set
[...other variables...]
ECHO = 'FOOBAR'

The proposed patch doesn't follow that behavior, as it denies
a new value for AUTOCOMMIT. You might argue that it's better,
but on the other hand, there are two reasons against it:

- it's not consistent with the other uses of \set that affect psql,
such as ECHO, ECHO_HIDDEN,  ON_ERROR_ROLLBACK,
 COMP_KEYWORD_CASE... and even AUTOCOMMIT as
 \set AUTOCOMMIT FOOBAR is not denied, just reinterpreted.

- it doesn't address the case of another method than \set being used
 to set the variable, as in : SELECT 'on' as "AUTOCOMMIT" \gset
 whereas the hook would work in that case.
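The convention described above can be sketched as follows (hedged sketch with hypothetical names; the real psql hooks live in startup.c and differ in detail): the hook interprets the new value and warns on garbage, but never rejects the assignment, so the variable's textual value stays whatever the user typed.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

static bool pset_autocommit = true;	/* stand-in for pset.autocommit */

/* Called whenever the AUTOCOMMIT variable is assigned, by any method. */
static void
autocommit_hook(const char *newval)
{
	if (strcmp(newval, "on") == 0 || strcmp(newval, "1") == 0)
		pset_autocommit = true;
	else if (strcmp(newval, "off") == 0 || strcmp(newval, "0") == 0)
		pset_autocommit = false;
	else
	{
		/* reinterpret bad input, as the ECHO hook does; don't reject it */
		fprintf(stderr,
				"unrecognized value \"%s\" for \"AUTOCOMMIT\"; assuming \"on\"\n",
				newval);
		pset_autocommit = true;
	}
}
```

Because the hook fires on every assignment path, it also covers the \gset case mentioned above, which a check confined to \set would miss.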

Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite




Re: [HACKERS] ICU integration

2016-09-08 Thread Peter Eisentraut
On 9/8/16 11:16 AM, Tom Lane wrote:
> This is a problem, if ICU won't guarantee cross-version compatibility,
> because it destroys the argument that moving to ICU would offer us
> collation behavior stability.

It would offer a significant upgrade over the current situation.

First, it offers stability inside the same version.  Whereas glibc might
change a collation in a minor upgrade, ICU won't do that.  And the
postgres binary is bound to a major version of ICU by the soname (which
changes with every major release).  So this would avoid the situation
where a simple OS update could break collations.

Second, it offers a way to detect that something has changed.  With
glibc, you don't know anything unless you read the source diffs.  With
ICU, you can compare the collation version before and after and at least
tell the user that they need to refresh indexes or whatever.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Forbid use of LF and CR characters in database and role names

2016-09-08 Thread Peter Eisentraut
On 9/6/16 1:42 PM, Robert Haas wrote:
> If we were talking about pathnames containing spaces, I would agree,
> but I've never heard of a legitimate pathname containing CR or LF.  I
> can't see us losing much by refusing to allow such pathnames, except
> for security holes.

The flip side of that is that if we're doing a half-way job of only
prohibiting these characters in 67% of cases, then a new generation of
tools will be written on top of that with the assumption that these
characters cannot appear.  But then those tools will be easy to break or
exploit because it's possible to sneak stuff in in creative ways.  So
we're on the road to having an endless stream of "I can sneak in a CR/LF
character in here" bugs.

The current setup is more robust:  We are prohibiting these characters
in specific locations where we know we can't handle them.  But we don't
give any guarantees about anything else.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Re: [COMMITTERS] pgsql: Make initdb's suggested "pg_ctl start" command line more reliabl

2016-09-08 Thread Peter Eisentraut
On 9/6/16 1:08 PM, Tom Lane wrote:
>> As just mentioned elsewhere, this accidentally introduces a failure if
>> > the PostgreSQL installation path contains LF/CR, because of the use of
>> > appendShellString().
> I think that's intentional, not accidental.  What actual use case is
> there for allowing such paths?

There probably isn't one.  But we ought to be introducing this change in
a more intentional and consistent way.

For example, pg_basebackup has no such restriction.  So using
pg_basebackup, then promote, then pg_upgrade will (probably) fail now
for some paths.

More generally, I'm concerned that appendShellString() looks pretty
attractive for future use.  It's not inconceivable that someone will
want to use it for say calling pg_dump from pg_dumpall or pg_upgrade at
some point, and then maybe we'll accidentally disallow LF/CR in
tablespace names, say.

Also, if we're concerned about the potential for confusion that these
characters can cause, maybe we should be disallowing more control
characters in similar places.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [PATCH v2] Add overflow checks to money type input function

2016-09-08 Thread Peter Eisentraut
I have updated the patch with additional tests and comments per your
review.  Final(?) version attached.
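The core idiom of the patch, accumulating the magnitude as a negative number (since INT64_MIN has greater magnitude than INT64_MAX) and detecting multiplication overflow by dividing back, can be sketched standalone. This is a simplified, hypothetical helper, not the actual cash_in() code; like the patch, the divide-back check assumes wraparound on overflow (PostgreSQL builds with -fwrapv).

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Parse a string of decimal digits into an int64, reporting overflow via
 * the return value.  The magnitude is built as a NON-POSITIVE number and
 * the sign is flipped at the end, so INT64_MIN itself is representable.
 */
static bool
parse_digits(const char *s, bool negative, int64_t *out)
{
	int64_t		value = 0;		/* built in the negative */

	for (; *s; s++)
	{
		int64_t		newvalue;

		if (*s < '0' || *s > '9')
			return false;
		newvalue = value * 10 - (*s - '0');
		if (newvalue / 10 != value)
			return false;		/* multiplication overflowed */
		value = newvalue;
	}

	if (negative)
		*out = value;
	else
	{
		if (value == INT64_MIN)
			return false;		/* -INT64_MIN is not representable */
		*out = -value;
	}
	return true;
}
```

Note how the asymmetry pays off: "9223372036854775808" is rejected as a positive value but accepted as a negative one (INT64_MIN), exactly the boundary a positive accumulator would get wrong.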

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From ee34d7d64a4b10c9f7fbe8c905a56cea1584c8c9 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Thu, 8 Sep 2016 12:00:00 -0400
Subject: [PATCH v3] Add overflow checks to money type input function

The money type input function did not have any overflow checks at all.
There were some regression tests that purported to check for overflow,
but they actually checked for the overflow behavior of the int8 type
before casting to money.  Remove those unnecessary checks and add some
that actually check the money input function.

Reviewed-by: Fabien COELHO 
---
 src/backend/utils/adt/cash.c        | 53 ++--
 src/test/regress/expected/money.out | 98 ++---
 src/test/regress/sql/money.sql      | 30 ++--
 3 files changed, 165 insertions(+), 16 deletions(-)

diff --git a/src/backend/utils/adt/cash.c b/src/backend/utils/adt/cash.c
index b336185..a146b0a 100644
--- a/src/backend/utils/adt/cash.c
+++ b/src/backend/utils/adt/cash.c
@@ -189,13 +189,30 @@ cash_in(PG_FUNCTION_ARGS)
 	printf("cashin- string is '%s'\n", s);
 #endif
 
+	/*
+	 * We accumulate the absolute amount in "value" and then apply the sign at
+	 * the end.  (The sign can appear before or after the digits, so it would
+	 * be more complicated to do otherwise.)  Because of the larger range of
+	 * negative signed integers, we build "value" in the negative and then
+	 * flip the sign at the end, catching most-negative-number overflow if
+	 * necessary.
+	 */
+
 	for (; *s; s++)
 	{
 		/* we look for digits as long as we have found less */
 		/* than the required number of decimal places */
 		if (isdigit((unsigned char) *s) && (!seen_dot || dec < fpoint))
 		{
-			value = (value * 10) + (*s - '0');
+			Cash newvalue = (value * 10) - (*s - '0');
+
+			if (newvalue / 10 != value)
+				ereport(ERROR,
+						(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+						 errmsg("value \"%s\" is out of range for type money",
+								str)));
+
+			value = newvalue;
 
 			if (seen_dot)
 				dec++;
@@ -214,11 +231,27 @@ cash_in(PG_FUNCTION_ARGS)
 
 	/* round off if there's another digit */
 	if (isdigit((unsigned char) *s) && *s >= '5')
-		value++;
+		value--;  /* remember we build the value in the negative */
+
+	if (value > 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+				 errmsg("value \"%s\" is out of range for type money",
+						str)));
 
 	/* adjust for less than required decimal places */
 	for (; dec < fpoint; dec++)
-		value *= 10;
+	{
+		Cash newvalue = value * 10;
+
+		if (newvalue / 10 != value)
+			ereport(ERROR,
+					(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+					 errmsg("value \"%s\" is out of range for type money",
+							str)));
+
+		value = newvalue;
+	}
 
 	/*
 	 * should only be trailing digits followed by whitespace, right paren,
@@ -247,7 +280,19 @@ cash_in(PG_FUNCTION_ARGS)
 			str)));
 	}
 
-	result = value * sgn;
+	/* If the value is supposed to be positive, flip the sign, but check for
+	 * the most negative number. */
+	if (sgn > 0)
+	{
+		result = -value;
+		if (result < 0)
+			ereport(ERROR,
+					(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+					 errmsg("value \"%s\" is out of range for type money",
+							str)));
+	}
+	else
+		result = value;
 
 #ifdef CASHDEBUG
 	printf("cashin- result is " INT64_FORMAT "\n", result);
diff --git a/src/test/regress/expected/money.out b/src/test/regress/expected/money.out
index 538235c..5695f87 100644
--- a/src/test/regress/expected/money.out
+++ b/src/test/regress/expected/money.out
@@ -185,6 +185,96 @@ SELECT * FROM money_data;
  $123.46
 (1 row)
 
+-- input checks
+SELECT '1234567890'::money;
+   money   
+---
+ $1,234,567,890.00
+(1 row)
+
+SELECT '12345678901234567'::money;
+   money
+
+ $12,345,678,901,234,567.00
+(1 row)
+
+SELECT '123456789012345678'::money;
+ERROR:  value "123456789012345678" is out of range for type money
+LINE 1: SELECT '123456789012345678'::money;
+   ^
+SELECT '9223372036854775807'::money;
+ERROR:  value "9223372036854775807" is out of range for type money
+LINE 1: SELECT '9223372036854775807'::money;
+   ^
+SELECT '-12345'::money;
+money
+-
+ -$12,345.00
+(1 row)
+
+SELECT '-1234567890'::money;
+   money
+
+ -$1,234,567,890.00
+(1 row)
+
+SELECT '-12345678901234567'::money;
+money
+-
+ -$12,345,678,901,234,567.00
+(1 row)
+
+SELECT '-123456789012345678'::money;
+ERROR:  value "-123456789012345678" is out of range for type money
+LINE 1: SELECT '-123456789012345678'::money;
+   ^
+SELECT 
