Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2016-11-14 Thread Haribabu Kommi
On Sat, Nov 12, 2016 at 10:12 PM, Pavan Deolasee 
wrote:

>
>
> On Tue, Nov 8, 2016 at 9:13 AM, Haribabu Kommi 
> wrote:
>
>>
>> Thanks for the patch. This shows a very good performance improvement.
>>
>>
> Thank you. Can you please share the benchmark you ran, results and
> observations?
>

I just ran a performance test on my laptop with a minimal configuration;
it didn't show much improvement. Currently I don't have access to a big
machine to test the performance.



>> I started reviewing the patch, and during this process I ran the regression
>> test on the WARM patch. I observed a failure in the create_index test.
>> This may be a bug in the code, or expected output that needs to be corrected.
>>
>
> Can you please share the diff? I ran the regression tests after applying the
> patch on current master and did not find any change. Does it happen consistently?
>


Yes, it is happening consistently. I ran make installcheck and attached
the regression.diffs file with the failed test.
I applied the previous WARM patch on top of this commit:
e3e66d8a9813d22c2aa027d8f373a96d4d4c1b15
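For reference, the reproduction steps were roughly the following (the patch
file name is a placeholder):

    git checkout e3e66d8a9813d22c2aa027d8f373a96d4d4c1b15
    patch -p1 < warm.patch        # placeholder name for the posted WARM patch
    ./configure && make && make install
    make installcheck             # regression suite against the running server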


Regards,
Hari Babu
Fujitsu Australia


regression.diffs
Description: Binary data



Re: [HACKERS] Fix checkpoint skip logic on idle systems by tracking LSN progress

2016-11-14 Thread Kyotaro HORIGUCHI
Hello,

At Mon, 14 Nov 2016 16:53:35 +0530, Amit Kapila  wrote 
in 
> On Mon, Nov 14, 2016 at 9:33 AM, Michael Paquier
> >>> Is it right to set XLOG_NO_PROGRESS flag in LogAccessExclusiveLocks?
> >>> This function is called not only in LogStandbySnapshot(), but during
> >>> DDL operations as well where lockmode >= AccessExclusiveLock.
> >>
> >> This does not remove any record from WAL, so theoretically any
> >> kind of record can be NO_PROGRESS, but in practice only as long as
> >> checkpoints are not unreasonably suppressed. Any explicit
> >> database operation must be accompanied by at least a commit
> >> record, which triggers a checkpoint. NO_PROGRESSing there doesn't
> >> seem to me to harm database durability, for this reason.
> >>
> 
> By this theory, you could even mark an insert record as no-progress,
> which would not be good.

Of course. That is why we carefully choose which kinds of records to
mark. If we marked all xlog records as SKIP_PROGRESS,
archive_timeout would become useless, and as a result vacuum might leave
a certain number of records unremoved for a problematic length of time.

> >> The objective of this patch is to skip WAL emission in a
> >> completely idle state, and the NO_PROGRESSing is necessary for that. Of
> >> course we could distinguish exclusive locks with PROGRESS from those
> >> without PROGRESS, but that would be unnecessary complexity.
> >
> > The point that applies here is that logging the exclusive lock
> > information is necessary for the *standby* recovery conflicts, not the
> > primary, which is why it should not influence the checkpoint activity
> > that is happening on the primary. So marking this record with
> > NO_PROGRESS is actually fine to me.
> >
> 
> The progress parameter is used not only for checkpoint activity but also
> by the bgwriter for logging the standby snapshot.  If you want to keep
> this record under the no_progress category (which I don't endorse), then
> it might be better to add a comment, so that it will be easier for
> readers of this code to understand the reason.

I rather agree with that. But how detailed should it be?

>LogAccessExclusiveLocks(int nlocks, xl_standby_lock *locks)
>{
>...
>   XLogRegisterData((char *) locks, nlocks * sizeof(xl_standby_lock));
>   /* Needs XLOG_SKIP_PROGRESS because called from LogStandbySnapshot */
>   XLogSetFlags(XLOG_SKIP_PROGRESS);

or 

>   /*
>* Needs XLOG_SKIP_PROGRESS because called from LogStandbySnapshot.
>* See the comment for LogCurrentRunningXact for details.
>*/

or more detailed?

The term "WAL activity' is used in the comment for
GetProgressRecPtr. Its meaning is not clear but not well
defined. Might need a bit detailed explanation about that or "WAL
activity tracking". What do you think about this?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center






Re: [HACKERS] [bug fix] Cascading standby cannot catch up and get stuck emitting the same message repeatedly

2016-11-14 Thread Tsunakawa, Takayuki
From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas
> Let me try to be more clear.  I will not commit this patch if it is not
> properly commented.  I doubt that anyone else will, either.
> 
> The fact that those code changes already exist in 9.4+ is not a reason to
> back-port them to earlier releases without a proper explanation of why we
> are doing it.  Very possibly, we should also improve the comments in newer
> branches so that future authors don't reintroduce whatever bugs were fixed
> by these changes.  But whether we do that or not, I am not going to commit
> uncommented patches to complex code in order to fix obscure bugs in
> 3+-year-old branches.  I think that is a non-starter.
> 

OK. Although I'm not perfectly sure what to add as a comment, I added an
example scenario as a comment, because I thought a concrete situation would help
readers understand the existing two paragraphs.  Is this good?

Regards
Takayuki Tsunakawa



cascading_standby_stuck_v2.patch
Description: cascading_standby_stuck_v2.patch



Re: [HACKERS] Adding the optional clause 'AS' in CREATE TRIGGER

2016-11-14 Thread Tom Lane
"Okano, Naoki"  writes:
> I would like to add the following support for a trigger. 
> This implementation makes it possible to create a trigger efficiently
> in a single command.

> It has been discussed before. The link is shown below.
> https://www.postgresql.org/message-id/CAA-aLv4m%3Df9cc1zcUzM49pE8%2B2NpytUDraTgfBmkTOkMN_wO2w%40mail.gmail.com

I think people pretty much lost interest in that proposal, which is
why it never got finished.  Aside from the definitional issues discussed
in that thread, I can think of a few other problems:

1. The impression I have is that most people write trigger functions
so that they can be shared across multiple uses.  That'd be impossible
with anonymous trigger function blocks.  So you'd be trading off merging
two commands into one, versus having to write out the whole body of the
trigger multiple times, which wouldn't be much of a win.

2. I am concerned that every time we add any syntax to CREATE FUNCTION,
we'll have to think about whether we need to add it to CREATE TRIGGER.
(Or more likely, we'll forget and then users will complain.)

3. There's a lot of infrastructure that's built up around CREATE FUNCTION,
such as psql's \ef and \sf commands.  We'd soon find ourselves having to
duplicate all that for triggers.

So personally I feel that the value-for-work-expended ratio here is
pretty low.

But in any case it would be a serious mistake to do this without first
implementing CREATE OR REPLACE TRIGGER.  I think that's an entirely
separate proposal and you would be well advised to treat it as such.
It's far from trivial because of locking considerations.  You probably
have to take an exclusive lock on the owning relation in order to change
trigger properties.  (An advantage of the separate-function-definition
approach is that you can replace the function body with a much weaker
lock.)

> Supporting the 'AS' clause in CREATE TRIGGER syntax will enable the option of
> defining the trigger in a single command.
> As a bonus, it will be compatible with Oracle.

It certainly would not match Oracle's syntax for trigger bodies.  It might
slightly reduce the conversion pain, but only slightly.

regards, tom lane




Re: [HACKERS] [bug fix] Cascading standby cannot catch up and get stuck emitting the same message repeatedly

2016-11-14 Thread Tsunakawa, Takayuki
From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Amit Kapila
> It looks like the code in 9.3 or later version uses the recptr as the target
> segment location
> (targetSegmentPtr) whereas 9.2 uses recptr as beginning of segment (readOff
> = 0;).  If above understanding is right then it will set different values
> for latestPagePtr in 9.2 and 9.3 onwards code.
> 

In 9.2, the relevant variable is not recptr but recaddr.  recaddr in 9.2 and 
recptr in later releases point to the beginning of a page just read, which is 
not always the beginning of the segment (targetSegmentPtr).

Regards
Takayuki Tsunakawa




Re: [HACKERS] Pinning a buffer in TupleTableSlot is unnecessary

2016-11-14 Thread Noah Misch
On Mon, Nov 14, 2016 at 10:21:53AM -0800, Peter Geoghegan wrote:
> BTW, I recently noticed that the latest version of Valgrind, 3.12,
> added this new feature:
> 
> * Memcheck:
> 
>   - Added meta mempool support for describing a custom allocator which:
>  - Auto-frees all chunks assuming that destroying a pool destroys all
>objects in the pool
>  - Uses itself to allocate other memory blocks
> 
> It occurred to me that we might be able to make good use of this.

PostgreSQL memory contexts don't acquire blocks from other contexts; they all
draw from malloc() directly.  Therefore, I don't see this feature giving us
something that the older Memcheck MEMPOOL support does not give us.
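For reference, that older MEMPOOL integration amounts to calls like the
following (a minimal sketch using Valgrind's client macros from
valgrind/memcheck.h, not the actual memory-context code):

    #include <valgrind/memcheck.h>

    /* at context creation: register the context as a pool */
    VALGRIND_CREATE_MEMPOOL(context, 0, false);

    /* when a chunk is handed out from a malloc'd block */
    VALGRIND_MEMPOOL_ALLOC(context, chunk, size);

    /* when the chunk is returned to the context */
    VALGRIND_MEMPOOL_FREE(context, chunk);

    /* at context reset or deletion */
    VALGRIND_DESTROY_MEMPOOL(context);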




Re: [HACKERS] Remove the comment on the countereffectiveness of large shared_buffers on Windows

2016-11-14 Thread Tsunakawa, Takayuki
From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Amit Kapila
> Okay, not a problem.  However, I am not sure the results in this thread
> are sufficient proof: for read-only tests, there is no noticeable win
> from increasing shared buffers, and the read-write tests seem too short
> (60 seconds) to rely on.

I think the reason why increasing shared_buffers didn't improve read-only
performance as much as you expected is that the relation files are cached in
the filesystem cache.  The purpose of this verification is to show that the
effective upper limit is not 512MB (which is too small now), and I think that
purpose has been achieved.  There may be another threshold, say 32GB or 128GB,
over which performance degrades due to the PostgreSQL implementation, but that's
another topic, and one which also applies to other OSes.

How about 3 minutes for read-write tests?  How long do you typically run?
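For example, a three-minute run of each type with pgbench would look like this
(scale factor and client count are placeholders):

    pgbench -i -s 300 bench                # initialize at scale factor 300
    pgbench -c 32 -j 4 -T 180 bench        # 3-minute read-write (TPC-B-like) run
    pgbench -c 32 -j 4 -T 180 -S bench     # 3-minute select-only run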

Regards
Takayuki Tsunakawa







[HACKERS] Adding the optional clause 'AS' in CREATE TRIGGER

2016-11-14 Thread Okano, Naoki
Hi, 

I would like to add the following support for a trigger. 
This implementation makes it possible to create a trigger efficiently
in a single command.

It has been discussed before. The link is shown below.
https://www.postgresql.org/message-id/CAA-aLv4m%3Df9cc1zcUzM49pE8%2B2NpytUDraTgfBmkTOkMN_wO2w%40mail.gmail.com

Currently, PostgreSQL requires two steps to create a trigger:
1. Create a trigger function.
2. Define a trigger whose action is the already-created function.
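For example (table, column, and function names are illustrative only):

    CREATE FUNCTION set_updated_at() RETURNS trigger AS $$
    BEGIN
        NEW.updated_at := now();   -- assumes the table has an updated_at column
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER accounts_set_updated_at
        BEFORE UPDATE ON accounts
        FOR EACH ROW EXECUTE PROCEDURE set_updated_at();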

Supporting the 'AS' clause in CREATE TRIGGER syntax will enable the option of
defining the trigger in a single command.
As a bonus, it will be compatible with Oracle.

Also, the optional clause 'OR REPLACE' is required, as discussed below.
https://www.postgresql.org/message-id/CAA-aLv6KYgVt2CwaRdcnptzWVngEm72Cp4mUFnF-MfeH0gS91g%40mail.gmail.com

Currently, to change the definition of a trigger, the trigger needs to
be dropped first and created again with the new definition.
To change the definition of a function embedded in the CREATE TRIGGER syntax,
the trigger would likewise need to be dropped first and created again with the
new definition!
So we need to add the optional clause 'OR REPLACE'.

Adding the optional clauses 'AS' and 'OR REPLACE' to the CREATE TRIGGER syntax
makes it convenient to define, or redefine, a trigger that contains its
function definition, in a single command.

Here is the syntax based on the previous discussion.  

CREATE [ OR REPLACE ] [ CONSTRAINT ] TRIGGER name { BEFORE | AFTER | INSTEAD OF }
    { event [ OR ... ] }
    ON table_name
    [ FROM referenced_table_name ]
    [ NOT DEFERRABLE | [ DEFERRABLE ] { INITIALLY IMMEDIATE | INITIALLY DEFERRED } ]
    [ FOR [ EACH ] { ROW | STATEMENT } ]
    [ WHEN ( condition ) ]
    { EXECUTE PROCEDURE function_name ( arguments )
      | AS 'trigger function definition' [ LANGUAGE lang_name ]
          [ SET configuration_parameter { TO value | = value | FROM CURRENT } ]
    }
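Under the proposed syntax, the same trigger as in the earlier two-step example
could hypothetically be written in one command:

    CREATE OR REPLACE TRIGGER accounts_set_updated_at
        BEFORE UPDATE ON accounts
        FOR EACH ROW
        AS $$
    BEGIN
        NEW.updated_at := now();
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;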

If you have an opinion on this concept, please share it.

Regards, 
Okano Naoki 
Fujitsu  




Re: [HACKERS] Patch: Implement failover on libpq connect level.

2016-11-14 Thread Tsunakawa, Takayuki
From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Mithun Cy
> Thanks. My concern is: suppose you have 3 servers in a cluster, A (new version),
> B (new version), and C (old version). If we implement it as above, only new servers
> will send the ParameterStatus message to indicate what type of server we are
> connected to. Server C will not send the same. So we will not be able to use the
> new feature "failover to new master" for such a cluster.

No, streaming replication requires the same major release for all member
servers, so there's no concern about a mixed-version cluster.

Sorry, pmState can only be used in the postmaster.  In our context, the backend
can use RecoveryInProgress().  Anyway, in addition to saving a round trip, the
libpq code would be much simpler.
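(For comparison, without a ParameterStatus message a client today has to issue
an extra query after connecting, for example:

    SELECT pg_is_in_recovery();  -- true on a standby, false on a primary

which is exactly the round trip the proposal would save.)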

Regards
Takayuki Tsunakawa





Re: [HACKERS] 9.6 TAP tests and extensions

2016-11-14 Thread Craig Ringer
On 14 November 2016 at 14:55, Craig Ringer  wrote:
> On 27 October 2016 at 00:42, Robert Haas  wrote:
>> On Wed, Oct 26, 2016 at 7:17 AM, Andres Freund  wrote:
>>> On 2016-09-23 16:04:32 -0400, Tom Lane wrote:
>>>> Looking back over the thread, I see that you also proposed installing
>>>> isolationtester and pg_isolation_regress for the benefit of extensions.
>>>> I'm very much less excited about that idea.  It'd be substantially more
>>>> dead weight in typical installations, and I'm not sure that it'd be useful
>>>> to common extensions, and I'm not eager to treat isolationtester's API
>>>> and behavior as something we need to hold stable for extension use.
>>>
>>> FWIW, I'd be quite happy if it were installed. Running isolationtester
>>> when compiling extensions against distribution postgres packages would
>>> be quite useful.
>>
>> +1.
>
> Patch attached.

As Andres pointed out elsewhere, this is only half the picture. It
should really add PGXS support for ISOLATION_TESTS and for bringing up
the test harness, too.
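A PGXS extension Makefile using such support might hypothetically look like
this (the ISOLATION_TESTS handling in pgxs.mk is the part that does not exist
yet):

    MODULES = my_extension
    ISOLATION_TESTS = basic concurrency     # spec files under specs/

    PG_CONFIG = pg_config
    PGXS := $(shell $(PG_CONFIG) --pgxs)
    include $(PGXS)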


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] PATCH: two slab-like memory allocators

2016-11-14 Thread Tomas Vondra

Attached is v6 of the patch series, fixing most of the points:

* common bits (valgrind/randomization/wipe) moved to memdebug.h/c

Instead of introducing a new header file, I've added the prototypes to 
memdebug.h (which was already used for the valgrind stuff anyway), and 
the implementations to a new memdebug.c file. Not sure what you meant by 
"static inlines" though.


So the patch series now has three parts - 0001 with memdebug stuff, 0002 
with slab and 0003 with gen (still a poor name).


* removing AllocSet references from both new memory contexts

* using FLEXIBLE_ARRAY_MEMBER in SlabContext

* using dlist instead of the custom linked list

However, I've kept SlabContext->freelist as an array, because there may
be many blocks with the same number of free chunks, in which case moving
a block within a single ordered list would be expensive. This way it's
simply dlist_delete + dlist_push.
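In other words, the shape is roughly this (a simplified sketch, not the exact
patch code):

    typedef struct SlabContext
    {
        MemoryContextData header;       /* standard memory-context fields */
        int         chunksPerBlock;
        /* freelist[i] links blocks that have exactly i free chunks */
        dlist_head  freelist[FLEXIBLE_ARRAY_MEMBER];
    } SlabContext;

    /* when a chunk is freed, its block moves between per-count lists */
    dlist_delete(&block->node);
    dlist_push_head(&slab->freelist[block->nfree], &block->node);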


* use StandardChunkHeader instead of the common fields

* removing pointer to context from block header for both contexts

* fix format in FreeInfo/AllocInfo (including for AllocSet)

* improved a bunch of comments (bitmap size, chunksPerBlock formula)

* did a pgindent run on the patch

* implement the missing methods in Gen (Stats/Check)

* fix a few minor bugs in both contexts

I haven't done anything with hiding pointers behind typedefs, because I 
don't know what's so wrong about that.


I also haven't done anything with the bitmap access in SlabAlloc - I
haven't found any reasonable case where the overhead would be measurable,
and I don't expect it to be measurable in practice.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


slab-allocators-v6.tgz
Description: application/compressed-tar



[HACKERS] DECLARE STATEMENT setting up a connection in ECPG

2016-11-14 Thread Ideriha, Takeshi
Hi,
To make developing ECPG applications more efficient and to improve portability,
I'd like to suggest DECLARE STATEMENT support in ECPG.

DECLARE STATEMENT is a statement that lets users declare an
identifier pointing to a connection.
This identifier can then be used in other embedded dynamic SQL statements,
such as PREPARE, EXECUTE, DECLARE CURSOR, and so on.
(Oracle implements this.)
https://docs.oracle.com/cd/B10501_01/appdev.920/a42525/apf.htm#declare_stmt

Under the current system, a user must use the AT clause in every SQL statement
when executing dynamic SQL on a non-default connection.
https://www.postgresql.org/docs/current/static/ecpg-connect.html

When a user needs a non-default connection,
the AT clause can be used once in DECLARE STATEMENT and need not appear in
every dynamic SQL statement.

This helps users build ECPG applications easily and efficiently,
without explicitly designating a connection for each SQL statement.

Moreover, writing code without designating the connection explicitly
improves portability.

[Use-case]
It is very useful when the data needed for a report or business decision is
spread across several data sources,
and one application needs to connect to multiple database servers.

Especially these days, multiple database servers can be set up quickly and
cheaply, because virtualization technologies such as Docker and
microservices are in fashion.

This trend makes this handy connection-switching
feature increasingly important.

[Interface]
The syntax for the DECLARE STATEMENT in ECPG is as follows:

 EXEC SQL [ AT connection-name ] DECLARE statement-name STATEMENT

where "statement-name" is an SQL identifier
and "connection-name" points to the connection that will be used to execute
the dynamic SQL statements.

[Example]
 EXEC SQL AT con1 DECLARE sql_stmt STATEMENT;
 EXEC SQL DECLARE cursor_name CURSOR FOR sql_stmt;
 EXEC SQL PREPARE sql_stmt FROM :dyn_string;
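A slightly fuller sketch of the intended usage (connection, table, and host
variables are hypothetical; the key point is that PREPARE and the cursor
operations need no AT clause):

    EXEC SQL BEGIN DECLARE SECTION;
    char dyn_string[64];
    int  val;
    EXEC SQL END DECLARE SECTION;

    EXEC SQL CONNECT TO testdb AS con1;
    EXEC SQL AT con1 DECLARE sql_stmt STATEMENT;

    strcpy(dyn_string, "SELECT val FROM t1");
    EXEC SQL PREPARE sql_stmt FROM :dyn_string;      /* runs on con1 */
    EXEC SQL DECLARE cur CURSOR FOR sql_stmt;
    EXEC SQL OPEN cur;
    EXEC SQL FETCH cur INTO :val;
    EXEC SQL CLOSE cur;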

[System Design Plan]
To support the above functionality, the ecpg precompiler should:
 - understand the DECLARE STATEMENT syntax
 - translate the DECLARE STATEMENT into a new function with parameters;
   these parameters carry information such as connection_name and
   statement_name
 - the new function itself is defined in the ECPG library

 The following functions are going to be modified:
  - ECPGprepare
  - ECPGdeallocate
  - ECPGdescribe
  - ECPGdo
 But I think there is room for discussion about modifying ECPGdo,
 because it's a very common function that maps many SQL statements,
 including SELECT, INSERT, EXECUTE, CURSOR, and so on.

It seems to me that there has been no discussion on this topic.
But if there has been, would you let me know?

Regards. 
Ideriha Takeshi, 
Fujitsu (Fujitsu Enterprise Postgres )





Re: [HACKERS] Do we need use more meaningful variables to replace 0 in catalog head files?

2016-11-14 Thread Tom Lane
Andrew Dunstan  writes:
> On 11/13/2016 11:11 AM, Tom Lane wrote:
>> 1. Are we going to try to keep these things in the .h files, or split
>> them out?  I'd like to get them out, as that eliminates both the need
>> to keep the things looking like macro calls, and the need for the data
>> within the macro call to be at least minimally parsable as C.

> That would work fine for pg_proc.h, less so for pg_type.h where we have 
> a whole lot of
> #define FOOOID nn
> directives in among the data lines. Moving these somewhere remote from 
> the catalog lines they relate to seems like quite a bad idea.

We certainly don't want multiple files to be sources of truth for that.
What I was anticipating is that those #define's would also be generated
from the same input files, much as fmgroids.h is handled today.  We
could imagine driving the creation of a macro off an additional, optional
field in the data entries, say "macro=FOOOID", if we want only selected
entries to have #defines.  Or we could do like we do with pg_proc.h and
generate macros for everything according to some fixed naming rule.
I could see approaching pg_type that way, but am less excited about
pg_operator, pg_opclass, etc, where we only need macros for a small
fraction of the entries.
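For instance, an input-file entry carrying such a field might hypothetically
look like:

    oid=16 name=bool typlen=1 typbyval=t ... macro=BOOLOID

where the converter emits "#define BOOLOID 16" only for entries that carry the
macro field.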

>> 2. Andrew's example above implies some sort of mapping between the
>> keywords and the actual column names (or at least column positions).
>> Where and how is that specified?

> There are several possibilities. The one I was leaning towards was to 
> parse out the Anum_pg_foo_* definitions.

I'm okay with that if the field labels used in the data entries are to be
exactly the same as the column names.  Your example showed abbreviated
names (omitting "pro"), which is something I doubt we want to try to
hard-wire a rule for.  Also, if we are going to abbreviate at all,
I think it might be useful to abbreviate *a lot*, say like "v" for
"provolatile", and that would be something that ought to be set up with
some explicit manually-provided declarations.

>> 3. Also where are we going to provide the per-column default values?
>> How does the converter script know which columns to convert to type oids,
>> proc oids, etc?  Is it going to do any data validation beyond that, and
>> if so on what basis?

> a) something like DATA_DEFAULTS( foo=bar );
> b) something like DATA_TYPECONV ( rettype argtypes allargtypes );

I'm thinking a bit about per-column declarations in the input file,
along the line of this for provolatile:

declare v col=15 type=char default='v'

Some of those items could be gotten out of pg_proc.h, but not all.
I guess a second alternative would be to add the missing info to
pg_proc.h and have the conversion script parse it out of there.

>> I think we want to do them all.  pg_proc.h is actually one of the easier
>> catalogs to work on presently, IMO, because the only kind of
>> cross-references it has are type OIDs.  Things like pg_amop are a mess.
>> And I really don't want to be dealing with multiple notations for catalog
>> data.  Also I think this will be subject to Polya's paradox: designing a
>> general solution will be easier and cleaner than a hack that works only
>> for one catalog.

> I don't know that we need to handle everything at once, as long as the 
> solution is sufficiently general.

Well, we could convert the catalogs one at a time if that seems useful,
but I don't want to be rewriting the bki-generation script repeatedly.

regards, tom lane




Re: [HACKERS] [PATCH] Allow TAP tests to be run individually

2016-11-14 Thread Michael Paquier
On Tue, Nov 15, 2016 at 12:02 AM, Peter Eisentraut
 wrote:
> On 11/14/16 3:52 AM, Michael Paquier wrote:
>> I don't mind. This patch uses the following pattern:
>> $(or $(PROVE_TESTS),t/*.pl)
>> While something more widespread in the Postgres source would be something like this:
>> $(if $(PROVE_TESTS),$(PROVE_TESTS),t/*.pl)
>> It seems to me that we'd prefer that for consistency, but I see no
>> reason to not keep your patch as well. I am marking that as ready for
>> committer.
>
> ($or ...) is a newer feature of GNU make, so we have avoided that so
> far.  I have committed your v2 with $(if ...).

Thanks, I am just going to use it...
-- 
Michael
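For the record, usage looks roughly like this (directory and test file are
just an example):

    # run a single TAP test instead of all of t/*.pl
    make -C src/bin/pg_ctl check PROVE_TESTS='t/001_start_stop.pl'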




Re: [HACKERS] Something is broken about connection startup

2016-11-14 Thread Tom Lane
I wrote:
> Robert Haas  writes:
>> I assume you're going to back-patch this, and the consequences of
>> failing to reset it before going idle could easily be vastly worse
>> than the problem you're trying to fix.  So I'd rather not make
>> assumptions like "the client will probably never sleep between Bind
>> and Execute".  I bet that's actually false, and I certainly wouldn't
>> want to bet the farm on it being true.

> No, I'm not at all proposing to assume that.  But I may be willing to
> assume that "we don't hold a CatalogSnapshot between Bind and Execute
> unless we're also holding a transaction snapshot".  I need to do a bit
> more research to see if that's actually true, though.

Turns out it's not true.

I still don't much like having the main loop in PostgresMain know about
this hack, so I experimented with putting the InvalidateCatalogSnapshot()
calls into places in postgres.c that were already dealing with transaction
commit/abort or snapshot management.  I ended up needing five such calls
(as in the attached patch), which is just about equally ugly.  So at this
point I'm inclined to hold my nose and stick a call into step (1) of the
main loop instead.

Also, wherever we end up putting those calls, is it worth providing a
variant invalidation function that only kills the catalog snapshot when
it's the only one outstanding?  (If it isn't, the transaction snapshot
should be older, so there's no chance of advancing our xmin by killing
it.)  In principle this would save some catalog snapshot rebuilds for
inside-a-transaction-block cases, but I'm not sure it's worth sweating
that when we're doing client message exchange anyway.

Lastly, I find myself disliking the separate CatalogSnapshotStale flag
variable.  The other special snapshots in snapmgr.c are managed by setting
the pointer to NULL when it's not valid, so I wonder why CatalogSnapshot
wasn't done that way.  Since this patch is touching almost every use of
that flag already, it wouldn't take much to switch it over.
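Putting those two ideas together, such a variant might look roughly like this
(a sketch only, assuming the NULL-pointer representation just described):

	/*
	 * Drop the catalog snapshot only if it is the sole registered snapshot,
	 * i.e. the only thing that could be holding back our advertised xmin.
	 */
	void
	InvalidateCatalogSnapshotConditionally(void)
	{
		if (CatalogSnapshot != NULL &&
			!ActiveSnapshotSet() &&
			pairingheap_is_singular(&RegisteredSnapshots))
			InvalidateCatalogSnapshot();
	}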

Comments?

regards, tom lane

diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 599874e..c9da8ed 100644
*** a/src/backend/tcop/postgres.c
--- b/src/backend/tcop/postgres.c
*** exec_parse_message(const char *query_str
*** 1352,1357 
--- 1352,1358 
  		/* Done with the snapshot used for parsing */
  		if (snapshot_set)
  			PopActiveSnapshot();
+ 		InvalidateCatalogSnapshot();
  	}
  	else
  	{
*** exec_bind_message(StringInfo input_messa
*** 1778,1783 
--- 1779,1785 
  	/* Done with the snapshot used for parameter I/O and parsing/planning */
  	if (snapshot_set)
  		PopActiveSnapshot();
+ 	InvalidateCatalogSnapshot();
  
  	/*
  	 * And we're ready to start portal execution.
*** exec_execute_message(const char *portal_
*** 2000,2005 
--- 2002,2012 
  			 * those that start or end a transaction block.
  			 */
  			CommandCounterIncrement();
+ 
+ 			/*
+ 			 * Flush catalog snapshot after every query, too.
+ 			 */
+ 			InvalidateCatalogSnapshot();
  		}
  
  		/* Send appropriate CommandComplete to client */
*** finish_xact_command(void)
*** 2456,2461 
--- 2463,2471 
  
  		CommitTransactionCommand();
  
+ 		/* Flush catalog snapshot if any */
+ 		InvalidateCatalogSnapshot();
+ 
  #ifdef MEMORY_CONTEXT_CHECKING
  		/* Check all memory contexts that weren't freed during commit */
  		/* (those that were, were checked before being deleted) */
*** PostgresMain(int argc, char *argv[],
*** 3871,3876 
--- 3881,3888 
  		 */
  		AbortCurrentTransaction();
  
+ 		InvalidateCatalogSnapshot();
+ 
  		if (am_walsender)
  			WalSndErrorCleanup();
  
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 1ec9f70..842e135 100644
*** a/src/backend/utils/time/snapmgr.c
--- b/src/backend/utils/time/snapmgr.c
*** GetTransactionSnapshot(void)
*** 316,321 
--- 316,327 
  	/* First call in transaction? */
  	if (!FirstSnapshotSet)
  	{
+ 		/*
+ 		 * Don't allow catalog snapshot to be older than xact snapshot.  Must
+ 		 * do this first to allow the empty-heap Assert to succeed.
+ 		 */
+ 		InvalidateCatalogSnapshot();
+ 
  		Assert(pairingheap_is_empty());
  		Assert(FirstXactSnapshot == NULL);
  
*** GetTransactionSnapshot(void)
*** 347,355 
  		else
  			CurrentSnapshot = GetSnapshotData();
  
- 		/* Don't allow catalog snapshot to be older than xact snapshot. */
- 		CatalogSnapshotStale = true;
- 
  		FirstSnapshotSet = true;
  		return CurrentSnapshot;
  	}
--- 353,358 
*** GetTransactionSnapshot(void)
*** 358,364 
  		return CurrentSnapshot;
  
  	/* Don't allow catalog snapshot to be older than xact snapshot. */
! 	CatalogSnapshotStale = true;
  
  	CurrentSnapshot = GetSnapshotData();
  
--- 361,367 
  		return CurrentSnapshot;
  
  	/* Don't allow catalog snapshot to be older than 

Re: [HACKERS] Physical append-only tables

2016-11-14 Thread Greg Stark
On Sun, Nov 13, 2016 at 3:45 PM, Magnus Hagander  wrote:
> For a scenario like this, would it make sense to have an option that could
> be set on an individual table making it physical append only? Basically
> VACUUM would run as normal and clean up the old space when rows are deleted
> back in history, but when new space is needed for a row the system would
> never look at the old blocks, and only append to the end.

I don't think "appending" is the right way to think about this. It
happens to address the problem but only accidentally and only
partially. More generally what you have is two different kinds of data
with two different access patterns and storage requirements in the
same table. They're logically similar but have different practical
requirements.

If there were some way to teach the database that your table is made of
two different types of data, and how to distinguish the two types, then
when an update occurs it could move the row to the right section of
storage... This might be something the new partitioning could handle,
or it might need something more low-level and implicit.
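As a rough sketch of what the partitioning route could look like (using the
declarative-partitioning syntax currently under development; names and ranges
are illustrative):

    CREATE TABLE events (
        id         bigint,
        created_at timestamptz,
        payload    text
    ) PARTITION BY RANGE (created_at);

    -- "hot" partition that still receives inserts and updates
    CREATE TABLE events_2016_11 PARTITION OF events
        FOR VALUES FROM ('2016-11-01') TO ('2016-12-01');

    -- "cold", effectively read-only history
    CREATE TABLE events_2016_10 PARTITION OF events
        FOR VALUES FROM ('2016-10-01') TO ('2016-11-01');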

That said, I don't think the "maintain clustering a bit better using
BRIN" is a bad idea. It's just the bit about turning a table
append-only to deal with update-once data that I think is overreach.

-- 
greg




Re: [HACKERS] Re: [COMMITTERS] pgsql: Change the way that LWLocks for extensions are allocated.

2016-11-14 Thread Alvaro Herrera
Andres Freund wrote:
> On 2016-11-12 11:30:42 -0500, Tom Lane wrote:
> > Andres Freund  writes:
> > > Committed after simplifying the Makefile.
> > 
> > While I have no particular objection to adding these tests, the
> > commit log's claim that this will improve buildfarm testing is
> > quite wrong.  The buildfarm runs "make installcheck" not
> > "make check" in contrib.
> 
> Gah.  It was more aimed at the coverage stuff, but that might work the
> same. Alvaro?

Already replied on IM, but for the record, coverage.postgresql.org
currently runs this:

make -j4 >> $LOG 2>&1
make check-world >> $LOG 2>&1
make -C src/test/ssl check >> $LOG 2>&1

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Re: [COMMITTERS] pgsql: Change the way that LWLocks for extensions are allocated.

2016-11-14 Thread Andres Freund
On 2016-11-14 12:14:10 -0800, Andres Freund wrote:
> On 2016-11-12 11:42:12 -0500, Tom Lane wrote:
> > Andres Freund  writes:
> > > On 2016-11-12 11:30:42 -0500, Tom Lane wrote:
> > >> which is a rather blatant waste of cycles. I would suggest an explicit
> > >> do-nothing installcheck rule rather than the hack you came up with here.
> >
> > > I had that at first, but that generates a warning about overwriting the
> > > makefile target - which afaics cannot be fixed.
> >
> > Hm.  What about inventing an additional macro NO_INSTALLCHECK that
> > prevents pgxs.mk from generating an installcheck rule?  It's not
> > like we don't have similar issues elsewhere, eg contrib/sepgsql.
>
> Well, that one seems a bit different.  Seems easy enough to do. Do we
> want to make that a macro that prevents installcheck from being defined,
> or one that forces it to be empty? The former has the disadvantage that
> one has to be careful to define a target, to avoid breaking recursion
> from the upper levels.

Oh, that disadvantage doesn't actually exist, because installcheck is a
.PHONY target...

For now I've not added a variant that prevents plain 'make check' from
doing anything. I guess it could be useful when a contrib module wants
to run tests via something other than pg_regress?

Patch attached.

Andres
>From 906051a23831c5d337556e56d19962f2e857 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Mon, 14 Nov 2016 12:29:30 -0800
Subject: [PATCH] Provide NO_INSTALLCHECK for pgxs.

This allows us to avoid running the regression tests in contrib modules
like pg_stat_statement in a less ugly manner.

Discussion: <22432.1478968...@sss.pgh.pa.us>
---
 contrib/pg_stat_statements/Makefile | 7 +++
 doc/src/sgml/extend.sgml| 9 +
 src/makefiles/pgxs.mk   | 4 
 3 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/contrib/pg_stat_statements/Makefile b/contrib/pg_stat_statements/Makefile
index f1a45eb..298951a 100644
--- a/contrib/pg_stat_statements/Makefile
+++ b/contrib/pg_stat_statements/Makefile
@@ -13,6 +13,9 @@ LDFLAGS_SL += $(filter -lm, $(LIBS))
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/pg_stat_statements/pg_stat_statements.conf
 REGRESS = pg_stat_statements
+# Disabled because these tests require "shared_preload_libraries=pg_stat_statements",
+# which typical installcheck users do not have (e.g. buildfarm clients).
+NO_INSTALLCHECK = 1
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
@@ -24,7 +27,3 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
-
-# Disabled because these tests require "shared_preload_libraries=pg_stat_statements",
-# which typical installcheck users do not have (e.g. buildfarm clients).
-installcheck: REGRESS=
diff --git a/doc/src/sgml/extend.sgml b/doc/src/sgml/extend.sgml
index e19c657..f9d91a3 100644
--- a/doc/src/sgml/extend.sgml
+++ b/doc/src/sgml/extend.sgml
@@ -1194,6 +1194,15 @@ include $(PGXS)
  
 
  
+  NO_INSTALLCHECK
+  
+   
+don't define an installcheck target, useful e.g. if tests require special configuration, or don't use pg_regress
+   
+  
+ 
+
+ 
   EXTRA_CLEAN
   

diff --git a/src/makefiles/pgxs.mk b/src/makefiles/pgxs.mk
index 2b4d684..c27004e 100644
--- a/src/makefiles/pgxs.mk
+++ b/src/makefiles/pgxs.mk
@@ -40,6 +40,8 @@
 # which need to be built first
 #   REGRESS -- list of regression test cases (without suffix)
 #   REGRESS_OPTS -- additional switches to pass to pg_regress
+#   NO_INSTALLCHECK -- don't define an installcheck target, useful e.g. if
+# tests require special configuration, or don't use pg_regress
 #   EXTRA_CLEAN -- extra files to remove in 'make clean'
 #   PG_CPPFLAGS -- will be added to CPPFLAGS
 #   PG_LIBS -- will be added to PROGRAM link line
@@ -268,8 +270,10 @@ ifndef PGXS
 endif
 
 # against installed postmaster
+ifndef NO_INSTALLCHECK
 installcheck: submake $(REGRESS_PREP)
 	$(pg_regress_installcheck) $(REGRESS_OPTS) $(REGRESS)
+endif
 
 ifdef PGXS
 check:
-- 
2.10.2




Re: [HACKERS] Re: [COMMITTERS] pgsql: Change the way that LWLocks for extensions are allocated.

2016-11-14 Thread Andres Freund
On 2016-11-12 11:42:12 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > On 2016-11-12 11:30:42 -0500, Tom Lane wrote:
> >> which is a rather blatant waste of cycles. I would suggest an explicit
> >> do-nothing installcheck rule rather than the hack you came up with here.
> 
> > I had that at first, but that generates a warning about overwriting the
> > makefile target - which afaics cannot be fixed.
> 
> Hm.  What about inventing an additional macro NO_INSTALLCHECK that
> prevents pgxs.mk from generating an installcheck rule?  It's not
> like we don't have similar issues elsewhere, eg contrib/sepgsql.

Well, that one seems a bit different.  Seems easy enough to do. Do we
want to make that a macro that prevents installcheck from being defined,
or one that forces it to be empty? The former has the disadvantage that
one has to be careful to define a target, to avoid breaking recursion
from the upper levels.

Greetings,

Andres Freund




Re: [HACKERS] Do we need use more meaningful variables to replace 0 in catalog head files?

2016-11-14 Thread Andrew Dunstan



On 11/13/2016 11:11 AM, Tom Lane wrote:
> Andrew Dunstan  writes:
>> I'm not convinced the line prefix part is necessary, though. What I'm
>> thinking of is something like this:
>> PROCDATA( oid=1242 name=boolin isstrict=t volatile=i parallel=s nargs=1
>>   rettype=bool argtypes="cstring" src=boolin );
>
> We could go in that direction too, but the apparent flexibility to split
> entries into multiple lines is an illusion, at least if you try to go
> beyond a few lines; you'd end up with duplicated line sequences in
> different entries and thus ambiguity for patch(1).  I don't have any
> big objection to the above, but it's not obviously better either.

Yeah, I looked and there are too many cases where the name would be
outside the normal 3 lines of context.

> Some things we should try to resolve before settling definitively on
> a data representation:
>
> 1. Are we going to try to keep these things in the .h files, or split
> them out?  I'd like to get them out, as that eliminates both the need
> to keep the things looking like macro calls, and the need for the data
> within the macro call to be at least minimally parsable as C.

That would work fine for pg_proc.h, less so for pg_type.h where we have
a whole lot of

   #define FOOOID nn

directives in among the data lines. Moving these somewhere remote from
the catalog lines they relate to seems like quite a bad idea.

> 2. Andrew's example above implies some sort of mapping between the
> keywords and the actual column names (or at least column positions).
> Where and how is that specified?

There are several possibilities. The one I was leaning towards was to
parse out the Anum_pg_foo_* definitions.

> 3. Also where are we going to provide the per-column default values?
> How does the converter script know which columns to convert to type oids,
> proc oids, etc?  Is it going to do any data validation beyond that, and
> if so on what basis?

a) something like DATA_DEFAULTS( foo=bar );
b) something like DATA_TYPECONV ( rettype argtypes allargtypes );

Hadn't thought about proc oids, but something similar.

> 4. What will we do about the #define's that some of the .h files provide
> for (some of) their object OIDs?  I assume that we want to move in the
> direction of autogenerating those macros a la fmgroids.h, but this needs
> a concrete spec as well.  If we don't want this change to result in a big
> hit to the source code, we're probably going to need to be able to specify
> the macro names to generate in the data files.

Yeah, as I noted above it's a bit messy.

> 5. One of the requirements that was mentioned in previous discussions
> was to make it easier to add new columns to catalogs.  This format
> does that only to the extent that you don't have to touch entries that
> can use the default value for such a column.  Is that good enough, and
> if not, what might we be able to do to make it better?

I think it is good enough, at least for a first cut.

>> I'd actually like to roll up the DESCR lines in pg_proc.h into this too,
>> they strike me as a bit of a wart. But I'm flexible on that.
>
> +1, if we can come up with a better syntax.  This together with the
> OID-macro issue suggests that there will be items in each data entry that
> correspond to something other than columns of the target catalog.  But
> that seems fine.
>
>> If we can generalize this to other catalogs, then that will be good, but
>> my inclination is to handle the elephant in the room (pg_proc.h) and
>> worry about the gnats later.
>
> I think we want to do them all.  pg_proc.h is actually one of the easier
> catalogs to work on presently, IMO, because the only kind of
> cross-references it has are type OIDs.  Things like pg_amop are a mess.
> And I really don't want to be dealing with multiple notations for catalog
> data.  Also I think this will be subject to Polya's paradox: designing a
> general solution will be easier and cleaner than a hack that works only
> for one catalog.

I don't know that we need to handle everything at once, as long as the
solution is sufficiently general.

cheers

andrew




Re: [HACKERS] Pinning a buffer in TupleTableSlot is unnecessary

2016-11-14 Thread Peter Geoghegan
On Mon, Nov 14, 2016 at 10:17 AM, Andres Freund  wrote:
> I think so, yes. IIRC I discussed it with Noah and Peter G. at a
> conference recently. We'd basically mark the content of shared buffers
> inaccessible at backend startup, and mark it accessible whenever a
> PinBuffer() happens, and then inaccessible during unpinning. We probably
> have to exclude the page header though, as we intentionally access them
> unpinned in some cases IIRC.

BTW, I recently noticed that the latest version of Valgrind, 3.12,
added this new feature:

* Memcheck:

  - Added meta mempool support for describing a custom allocator which:
 - Auto-frees all chunks assuming that destroying a pool destroys all
   objects in the pool
 - Uses itself to allocate other memory blocks

It occurred to me that we might be able to make good use of this. To
be clear, I don't think that there is reason to tie it to adding the
PinBuffer() stuff, which we've been talking about for years now. It
just caught my eye.

-- 
Peter Geoghegan




Re: [HACKERS] Pinning a buffer in TupleTableSlot is unnecessary

2016-11-14 Thread Andres Freund
On 2016-11-14 13:12:28 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > On 2016-11-14 12:32:53 -0500, Tom Lane wrote:
> >> Basically my concern is that this restriction isn't documented anywhere
> >> and I'm not entirely certain it's been adhered to everywhere.  I'd feel
> >> much better about it if there were some way we could verify that.
> 
> > Would support for valgrind complaining about access to unpinned buffers
> > suffice?
> 
> I don't think it directly addresses the issue, but certainly it'd help.

Well, it detects situations where removed pins cause "unprotected
access", but of course that doesn't protect against cases where
independent pins hide that issue.


> Do you think that's easily doable?

I think so, yes. IIRC I discussed it with Noah and Peter G. at a
conference recently. We'd basically mark the content of shared buffers
inaccessible at backend startup, and mark it accessible whenever a
PinBuffer() happens, and then inaccessible during unpinning. We probably
have to exclude the page header though, as we intentionally access them
unpinned in some cases IIRC.

- Andres




Re: [HACKERS] Pinning a buffer in TupleTableSlot is unnecessary

2016-11-14 Thread Tom Lane
Andres Freund  writes:
> On 2016-11-14 12:32:53 -0500, Tom Lane wrote:
>> Basically my concern is that this restriction isn't documented anywhere
>> and I'm not entirely certain it's been adhered to everywhere.  I'd feel
>> much better about it if there were some way we could verify that.

> Would support for valgrind complaining about access to unpinned buffers
> suffice?

I don't think it directly addresses the issue, but certainly it'd help.
Do you think that's easily doable?

regards, tom lane




Re: [HACKERS] Pinning a buffer in TupleTableSlot is unnecessary

2016-11-14 Thread Andres Freund
On 2016-11-14 12:32:53 -0500, Tom Lane wrote:
> Heikki Linnakangas  writes:
> > On 11/14/2016 06:18 PM, Tom Lane wrote:
> >> You're implicitly assuming that a scan always returns its results in the
> >> same slot, and that no other slot could contain a copy of that data, but
> >> there is no guarantee of either.  See bug #14344 and d8589946d for a
> >> pretty recent example where that failed to be true --- admittedly, for
> >> a tuplesort scan not a table scan.
> 
> > It's the other way round. ExecProcNode might not always return its 
> > result in the same slot, but all the callers must assume that it might.
> 
> Basically my concern is that this restriction isn't documented anywhere
> and I'm not entirely certain it's been adhered to everywhere.  I'd feel
> much better about it if there were some way we could verify that.

Would support for valgrind complaining about access to unpinned buffers
suffice?

Andres




Re: [HACKERS] Pinning a buffer in TupleTableSlot is unnecessary

2016-11-14 Thread Peter Geoghegan
On Mon, Nov 14, 2016 at 9:22 AM, Heikki Linnakangas  wrote:
> I think that difference in the API is exactly what caught Peter by surprise
> and led to bug #14344. And I didn't see it either, until you two debugged
> it.

That is accurate, of course.


-- 
Peter Geoghegan




Re: [HACKERS] Pinning a buffer in TupleTableSlot is unnecessary

2016-11-14 Thread Tom Lane
Heikki Linnakangas  writes:
> On 11/14/2016 06:18 PM, Tom Lane wrote:
>> You're implicitly assuming that a scan always returns its results in the
>> same slot, and that no other slot could contain a copy of that data, but
>> there is no guarantee of either.  See bug #14344 and d8589946d for a
>> pretty recent example where that failed to be true --- admittedly, for
>> a tuplesort scan not a table scan.

> It's the other way round. ExecProcNode might not always return its 
> result in the same slot, but all the callers must assume that it might.

Basically my concern is that this restriction isn't documented anywhere
and I'm not entirely certain it's been adhered to everywhere.  I'd feel
much better about it if there were some way we could verify that.

regards, tom lane




Re: [HACKERS] Pinning a buffer in TupleTableSlot is unnecessary

2016-11-14 Thread Heikki Linnakangas

On 11/14/2016 06:18 PM, Tom Lane wrote:
> Robert Haas  writes:
>> On Mon, Nov 14, 2016 at 10:23 AM, Tom Lane  wrote:
>>> I don't believe Andres' claim anyway.  There are certainly cases where an
>>> allegedly-valid slot could be pointing at garbage, but table scans aren't
>>> one of them, precisely because of the pin held by the slot.  It would take
>>> a fairly wide-ranging code review to convince me that it's okay to lose
>>> that mechanism.
>>
>> I don't understand your objection.  It seems to me that the
>> TupleTableSlot is holding a pin, and the scan is also holding a pin,
>> so one of them is redundant.  You speculated that the slot could
>> continue to point at the tuple after the scan has moved on, but how
>> could such a thing actually happen?
>
> You're implicitly assuming that a scan always returns its results in the
> same slot, and that no other slot could contain a copy of that data, but
> there is no guarantee of either.  See bug #14344 and d8589946d for a
> pretty recent example where that failed to be true --- admittedly, for
> a tuplesort scan not a table scan.

It's the other way round. ExecProcNode might not always return its
result in the same slot, but all the callers must assume that it might.

The tuplesort interface is slightly different: the caller passes a slot
to tuplesort_gettupleslot() as argument, and tuplesort_gettupleslot()
places the next tuple in that slot. With executor nodes, a node returns
a slot that it allocated itself, or it got from a child node. After you
call ExecProcNode(), you can *not* assume that the old tuple in the slot
is still valid. The child node can, and in most cases will, reuse the
same slot, overwriting its contents.

I think that difference in the API is exactly what caught Peter by
surprise and led to bug #14344. And I didn't see it either, until you
two debugged it.

> Also, there might well be places that are relying on the ability of a
> slot to hold a pin for slots that are not simply the return slot of
> a plan node.  We could perhaps remove the *use* of this slot feature in
> the normal table-scan case, without removing the feature itself.

I didn't see any such use.

- Heikki





Re: [HACKERS] Pinning a buffer in TupleTableSlot is unnecessary

2016-11-14 Thread Robert Haas
On Mon, Nov 14, 2016 at 11:18 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Mon, Nov 14, 2016 at 10:23 AM, Tom Lane  wrote:
>>> I don't believe Andres' claim anyway.  There are certainly cases where an
>>> allegedly-valid slot could be pointing at garbage, but table scans aren't
>>> one of them, precisely because of the pin held by the slot.  It would take
>>> a fairly wide-ranging code review to convince me that it's okay to lose
>>> that mechanism.
>
>> I don't understand your objection.  It seems to me that the
>> TupleTableSlot is holding a pin, and the scan is also holding a pin,
>> so one of them is redundant.  You speculated that the slot could
>> continue to point at the tuple after the scan has moved on, but how
>> could such a thing actually happen?
>
> You're implicitly assuming that a scan always returns its results in the
> same slot, and that no other slot could contain a copy of that data, but
> there is no guarantee of either.  See bug #14344 and d8589946d for a
> pretty recent example where that failed to be true --- admittedly, for
> a tuplesort scan not a table scan.
>
> We could certainly set up some global convention that would make this
> universally true, but I think we'd have to do a lot of code-reading
> to ensure that everything is following that convention.
>
> Also, there might well be places that are relying on the ability of a
> slot to hold a pin for slots that are not simply the return slot of
> a plan node.  We could perhaps remove the *use* of this slot feature in
> the normal table-scan case, without removing the feature itself.

I'm still not getting it.  We don't have to guess who is using this
feature; we can find it out by looking at every place that calls
ExecStoreTuple and looking at whether they are passing InvalidBuffer.
As Heikki's original patch upthread demonstrates, most of them are.
The exceptions are in BitmapHeapNext, IndexNext, ExecOnConflictUpdate,
SampleNext, SeqNext, and TidNext.  If all of those places are or can
be made to hold a buffer pin for as long as the slot would have held
the same pin, we're done.
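For reference, the distinction is visible in the call itself (signature as of
this era of the code):

    /* slot takes responsibility for the pin on the tuple's buffer */
    ExecStoreTuple(tuple, slot, scandesc->rs_cbuf, false);

    /* tuple is not in a shared buffer; slot frees the palloc'd copy */
    ExecStoreTuple(tuple, slot, InvalidBuffer, true);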

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Pinning a buffer in TupleTableSlot is unnecessary

2016-11-14 Thread Tom Lane
Robert Haas  writes:
> On Mon, Nov 14, 2016 at 10:23 AM, Tom Lane  wrote:
>> I don't believe Andres' claim anyway.  There are certainly cases where an
>> allegedly-valid slot could be pointing at garbage, but table scans aren't
>> one of them, precisely because of the pin held by the slot.  It would take
>> a fairly wide-ranging code review to convince me that it's okay to lose
>> that mechanism.

> I don't understand your objection.  It seems to me that the
> TupleTableSlot is holding a pin, and the scan is also holding a pin,
> so one of them is redundant.  You speculated that the slot could
> continue to point at the tuple after the scan has moved on, but how
> could such a thing actually happen?

You're implicitly assuming that a scan always returns its results in the
same slot, and that no other slot could contain a copy of that data, but
there is no guarantee of either.  See bug #14344 and d8589946d for a
pretty recent example where that failed to be true --- admittedly, for
a tuplesort scan not a table scan.

We could certainly set up some global convention that would make this
universally true, but I think we'd have to do a lot of code-reading
to ensure that everything is following that convention.

Also, there might well be places that are relying on the ability of a
slot to hold a pin for slots that are not simply the return slot of
a plan node.  We could perhaps remove the *use* of this slot feature in
the normal table-scan case, without removing the feature itself.

regards, tom lane




Re: [HACKERS] Proposal: scan key push down to heap [WIP]

2016-11-14 Thread Dilip Kumar
On Mon, Nov 14, 2016 at 9:30 PM, Robert Haas  wrote:
> Couldn't we just change the current memory context before calling
> heap_getnext()?  And then change back?

Right, it seems like that will not cause any problem.
>
> Also, what if we abandoned the idea of pushing qual evaluation all the
> way down into the heap and just tried to do HeapKeyTest in SeqNext
> itself?  Would that be almost as fast, or would it give up most of the
> benefits?
We can definitely test this. I will test and post the data.

Thanks for the suggestion.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Proposal: scan key push down to heap [WIP]

2016-11-14 Thread Robert Haas
On Sun, Nov 13, 2016 at 12:16 AM, Dilip Kumar  wrote:
> Problem 1:  As Andres has mentioned, HeapKeyTest uses heap_getattr,
> whereas ExecQual uses slot_getattr().  So we can have a worst-case
> performance problem when very few tuples are getting filtered out and we
> have a table with many columns with quals on most of the columns.
>
> Problem 2.  In HeapKeyTest we are under per_query_ctx, whereas in
> ExecQual we are under per_tuple_ctx, so in the former we cannot afford
> to do any palloc.

Couldn't we just change the current memory context before calling
heap_getnext()?  And then change back?
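
A hedged sketch of what that would look like around the call (a sketch
only; scandesc/econtext are the usual executor locals, not patch code):

    #include "postgres.h"
    #include "access/heapam.h"
    #include "executor/executor.h"
    #include "nodes/execnodes.h"

    /* Run heap_getnext() -- and any pushed-down key tests -- under the
     * per-tuple context, resetting it first, then switch back. */
    static HeapTuple
    fetch_next_filtered(HeapScanDesc scandesc, ExprContext *econtext)
    {
        MemoryContext oldcxt;
        HeapTuple   tuple;

        ResetExprContext(econtext);
        oldcxt = MemoryContextSwitchTo(econtext->ecxt_per_tuple_memory);
        tuple = heap_getnext(scandesc, ForwardScanDirection);
        MemoryContextSwitchTo(oldcxt);
        return tuple;
    }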

Also, what if we abandoned the idea of pushing qual evaluation all the
way down into the heap and just tried to do HeapKeyTest in SeqNext
itself?  Would that be almost as fast, or would it give up most of the
benefits?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Proposal for changes to recovery.conf API

2016-11-14 Thread Robert Haas
On Sat, Nov 12, 2016 at 11:09 AM, Andres Freund  wrote:
> I'm very tempted to rename this during the move to GUCs
...
> Slightly less so, but still tempted to also rename these. They're not
> actually necessarily pointing towards a primary, and namespace-wise
> they're not grouped with recovery_*, which has become more important now
> that recovery.conf isn't a separate namespace anymore.

-1 for renaming these.  I don't think the current names are
particularly bad, and I think trying to agree on what would be better
could easily sink the whole patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Proposal for changes to recovery.conf API

2016-11-14 Thread Robert Haas
On Fri, Nov 4, 2016 at 10:17 AM, Abhijit Menon-Sen  wrote:
> At 2016-11-04 10:35:05 +, si...@2ndquadrant.com wrote:
>> Since the "lots of parameters" approach is almost exactly what we have
>> now, I think we should do that first, get this patch committed and
>> then sit back and discuss an improved API and see what downsides it
>> introduces.
>
> Thanks, I agree strongly with this.
>
> Someone (Michael?) earlier mentioned the potential for introducing bugs
> with this patch (the idea of merging recovery.conf into postgresql.conf
> at all, not this particular incarnation).
>
> I think the current proposed approach with
>
> recovery_target=xid
> recovery_target_xid=xxx
>
> is preferable because (a) it doesn't introduce much new code, e.g., new
> parsing functions, nor (b) need many changes in the documentation—all we
> need is to say that of the various recovery_target_xxx parameters, the
> one that has priority is the one named by recovery_target.
>
> If I were to introduce recovery_target='xid xxx', I would at a minimum
> need to factor out (or duplicate) parsing and error handling code, make
> a type/value union-in-struct to store in the GUC *extra, then make sure
> that we handle the older values in a backwards-compatible way, and move
> a bunch of documentation around.

Just to be clear, the sum total of the additional "parsing" we are
talking about is looking for the first sequence of 1 or more spaces in
the input string and separating the stuff before them from the stuff
after them.  I agree that there's some work there, but I'm surprised
to hear that you think it's a lot of work.
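
For scale, a minimal sketch of that split (illustrative only -- no GUC
plumbing, error reporting, or backward-compatibility handling shown):

    #include "postgres.h"

    /* Split a value like "xid 1234" into "xid" and "1234" at the first
     * run of one or more spaces.  Returns false if there is no argument. */
    static bool
    split_recovery_target(const char *value, char **kind, char **arg)
    {
        const char *sep = strchr(value, ' ');

        if (sep == NULL)
            return false;
        *kind = pnstrdup(value, sep - value);
        while (*sep == ' ')
            sep++;
        *arg = pstrdup(sep);
        return true;
    }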

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Pinning a buffer in TupleTableSlot is unnecessary

2016-11-14 Thread Robert Haas
On Mon, Nov 14, 2016 at 10:23 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Sat, Nov 12, 2016 at 10:28 AM, Andres Freund  wrote:
>>> On 2016-08-30 07:38:10 -0400, Tom Lane wrote:
 I think this is probably wrong, or at least very dangerous to remove.
 The reason for the feature is that the slot may continue to point at
 the tuple after the scan has moved on.
>
>>> FWIW, that's not safe to assume in upper layers *anyway*. If you want to
>>> do that, the slot has to be materialized, and that'd make a local
>>> copy. If you don't materialize tts_values/isnull can point into random
>>> old memory (common e.g. for projections and virtual tuples in general).
>
>> So, I think you are arguing in favor of proceeding with this patch?
>
> I don't believe Andres' claim anyway.  There are certainly cases where an
> allegedly-valid slot could be pointing at garbage, but table scans aren't
> one of them, precisely because of the pin held by the slot.  It would take
> a fairly wide-ranging code review to convince me that it's okay to lose
> that mechanism.

I don't understand your objection.  It seems to me that the
TupleTableSlot is holding a pin, and the scan is also holding a pin,
so one of them is redundant.  You speculated that the slot could
continue to point at the tuple after the scan has moved on, but how
could such a thing actually happen?  Once the scan finds a tuple, it's
going to store the tuple in the slot and return.  It won't get control
back to advance the scan until the next time it's asked for a tuple,
and then it has to update the slot before returning.  So I don't see
the problem.  If in the future somebody wants to write an executor
node that does random extra work - like advancing the scan - before
returning the tuple the scan already found, they'll have to take
special precautions, but why should anybody want to do that?

I'm kind of puzzled, here.  Perhaps I am missing something obvious.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Pinning a buffer in TupleTableSlot is unnecessary

2016-11-14 Thread Andres Freund
On 2016-11-14 10:09:02 -0500, Robert Haas wrote:
> On Sat, Nov 12, 2016 at 10:28 AM, Andres Freund  wrote:
> > On 2016-08-30 07:38:10 -0400, Tom Lane wrote:
> >> Heikki Linnakangas  writes:
> >> > While profiling some queries and looking at executor overhead, I
> >> > realized that we're not making much use of TupleTableSlot's ability to
> >> > hold a buffer pin. In a SeqScan, the buffer is held pinned by the
> >> > underlying heap-scan anyway. Same with an IndexScan, and the SampleScan.
> >>
> >> I think this is probably wrong, or at least very dangerous to remove.
> >> The reason for the feature is that the slot may continue to point at
> >> the tuple after the scan has moved on.
> >
> > FWIW, that's not safe to assume in upper layers *anyway*. If you want to
> > do that, the slot has to be materialized, and that'd make a local
> > copy. If you don't materialize tts_values/isnull can point into random
> > old memory (common e.g. for projections and virtual tuples in general).
>
> So, I think you are arguing in favor of proceeding with this patch?

Not really, no. I don't buy the argument here against it. I do think
the overhead is quite noticeable. But I also think it has quite the
potential for subtle bugs.  I think I'd feel better if we had some form
of instrumentation trapping buffer accesses without pins present. We've
previously discussed doing that with valgrind...

Andres


Re: [HACKERS] Something is broken about connection startup

2016-11-14 Thread Tom Lane
Robert Haas  writes:
> I assume you're going to back-patch this, and the consequences of
> failing to reset it before going idle could easily be vastly worse
> than the problem you're trying to fix.  So I'd rather not make
> assumptions like "the client will probably never sleep between Bind
> and Execute".  I bet that's actually false, and I certainly wouldn't
> want to bet the farm on it being true.

No, I'm not at all proposing to assume that.  But I may be willing to
assume that "we don't hold a CatalogSnapshot between Bind and Execute
unless we're also holding a transaction snapshot".  I need to do a bit
more research to see if that's actually true, though.

regards, tom lane


Re: [HACKERS] [bug fix] Cascading standby cannot catch up and get stuck emitting the same message repeatedly

2016-11-14 Thread Robert Haas
On Thu, Nov 10, 2016 at 12:13 AM, Tsunakawa, Takayuki
 wrote:
> From: pgsql-hackers-ow...@postgresql.org
>> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas
>> OK.  I agree that's a problem.  However, your patch adds zero new comment
>> text while removing some existing comments, so I can't easily tell how it
>> solves that problem or whether it does so correctly.  Even if I were smart
>> enough to figure it out, I wouldn't want to rely on the next person also
>> being that smart.  This is obviously a subtle problem in tricky code, so
>> a clear explanation of the fix seems like a very good idea.
>
> The comment describes what the code is trying to achieve.  Actually, I just 
> imitated the code and comment of later major releases.  The only difference 
> between later releases and my patch (for 9.2) is whether the state is stored 
> in XLogReaderStruct or as global variables.  Below is the comment from 9.6, 
> where the second paragraph describes what the two nested if conditions mean.  
> The removed comment lines are what became irrelevant, which is also not 
> present in later major releases.

Let me try to be more clear.  I will not commit this patch if it is
not properly commented.  I doubt that anyone else will, either.

The fact that those code changes already exist in 9.4+ is not a reason
to back-port them to earlier releases without a proper explanation of
why we are doing it.  Very possibly, we should also improve the
comments in newer branches so that future authors don't reintroduce
whatever bugs were fixed by these changes.  But whether we do that or
not, I am not going to commit uncommented patches to complex code in
order to fix obscure bugs in 3+-year-old branches.  I think that is a
non-starter.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Something is broken about connection startup

2016-11-14 Thread Robert Haas
On Mon, Nov 14, 2016 at 10:17 AM, Tom Lane  wrote:
>> I think the really important thing is that we don't leave xmin set
>> when the backend is idle.
>
> Agreed.  I did some stats-gathering on this over the weekend, seeing how
> often the various situations occur during the core regression tests.
> It's not as big a problem as I first thought, because we hold a snapshot
> while doing planning, but the case does arise for various utility commands
> (particularly LOCK TABLE) so it's not negligible.
>
> What is looking like the best answer to me is to treat CatalogSnapshot
> as registered, but forcibly unregister it before going idle.  We don't
> need to unregister it more often than that, because we'd hold a
> transaction snapshot of some description throughout any long-running
> command, and the CatalogSnapshot would be no older than that anyway.

Makes sense.

> "Before going idle" could be implemented with an InvalidateCatalogSnapshot
> call either in postgres.c's finish_xact_command(), or right in the main
> message-receipt loop.  The latter seems like a wart, but it would cover
> cases like a client going to sleep between Bind and Execute phases of an
> extended query.  On the other hand, I think we might be holding a
> transaction snapshot at that point anyway.  It's at least arguable that
> we should be going through finish_xact_command() at any point where it's
> likely that we have a snapshot to release.

I assume you're going to back-patch this, and the consequences of
failing to reset it before going idle could easily be vastly worse
than the problem you're trying to fix.  So I'd rather not make
assumptions like "the client will probably never sleep between Bind
and Execute".  I bet that's actually false, and I certainly wouldn't
want to bet the farm on it being true.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Pinning a buffer in TupleTableSlot is unnecessary

2016-11-14 Thread Tom Lane
Robert Haas  writes:
> On Sat, Nov 12, 2016 at 10:28 AM, Andres Freund  wrote:
>> On 2016-08-30 07:38:10 -0400, Tom Lane wrote:
>>> I think this is probably wrong, or at least very dangerous to remove.
>>> The reason for the feature is that the slot may continue to point at
>>> the tuple after the scan has moved on.

>> FWIW, that's not safe to assume in upper layers *anyway*. If you want to
>> do that, the slot has to be materialized, and that'd make a local
>> copy. If you don't materialize tts_values/isnull can point into random
>> old memory (common e.g. for projections and virtual tuples in general).

> So, I think you are arguing in favor of proceeding with this patch?

I don't believe Andres' claim anyway.  There are certainly cases where an
allegedly-valid slot could be pointing at garbage, but table scans aren't
one of them, precisely because of the pin held by the slot.  It would take
a fairly wide-ranging code review to convince me that it's okay to lose
that mechanism.

regards, tom lane


Re: [HACKERS] Something is broken about connection startup

2016-11-14 Thread Tom Lane
Robert Haas  writes:
> On Fri, Nov 11, 2016 at 3:15 PM, Tom Lane  wrote:
>> The basic problem here, therefore, is that SnapshotResetXmin isn't aware
>> that GetCatalogSnapshot is keeping a possibly-unregistered snapshot in
>> its hip pocket.  That has to change.  We could either treat the saved
>> CatalogSnapshot as always being registered, or we could add some logic
>> to force invalidating the CatalogSnapshot whenever we clear MyPgXact->xmin
>> or advance it past that snapshot's xmin.
>> 
>> Neither of those is very nice from a performance standpoint.  With the
>> first option we risk delaying global cleanup by holding back
>> MyPgXact->xmin to protect a CatalogSnapshot we might not actually use
>> anytime soon.  With the second option we will likely end up doing many
>> more GetSnapshotData calls than we do now, because in a transaction
>> that hasn't yet set a regular snapshot, we will need to get a new
>> CatalogSnapshot for every catalog access (since we'll go back to
>> MyPgXact->xmin = 0 after every access).  And the parser, in particular,
>> tends to do a lot of catalog accesses before we ever get a regular
>> transaction snapshot.
>> 
>> Ideally, perhaps, we'd treat the saved CatalogSnapshot as registered
>> but automatically kill it "every so often".  I'm not sure how to drive
>> that though.

> I think the really important thing is that we don't leave xmin set
> when the backend is idle.

Agreed.  I did some stats-gathering on this over the weekend, seeing how
often the various situations occur during the core regression tests.
It's not as big a problem as I first thought, because we hold a snapshot
while doing planning, but the case does arise for various utility commands
(particularly LOCK TABLE) so it's not negligible.

What is looking like the best answer to me is to treat CatalogSnapshot
as registered, but forcibly unregister it before going idle.  We don't
need to unregister it more often than that, because we'd hold a
transaction snapshot of some description throughout any long-running
command, and the CatalogSnapshot would be no older than that anyway.

"Before going idle" could be implemented with an InvalidateCatalogSnapshot
call either in postgres.c's finish_xact_command(), or right in the main
message-receipt loop.  The latter seems like a wart, but it would cover
cases like a client going to sleep between Bind and Execute phases of an
extended query.  On the other hand, I think we might be holding a
transaction snapshot at that point anyway.  It's at least arguable that
we should be going through finish_xact_command() at any point where it's
likely that we have a snapshot to release.
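
A minimal sketch of the first placement, assuming InvalidateCatalogSnapshot()
is taught to unregister the snapshot (the real finish_xact_command() in
postgres.c does considerably more than this):

    #include "postgres.h"
    #include "access/xact.h"
    #include "utils/snapmgr.h"

    static void
    finish_xact_command_sketch(void)
    {
        /* drop the catalog snapshot so its xmin cannot linger while idle */
        InvalidateCatalogSnapshot();
        CommitTransactionCommand();
    }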

Thoughts?

regards, tom lane


Re: [HACKERS] Pinning a buffer in TupleTableSlot is unnecessary

2016-11-14 Thread Robert Haas
On Sat, Nov 12, 2016 at 10:28 AM, Andres Freund  wrote:
> On 2016-08-30 07:38:10 -0400, Tom Lane wrote:
>> Heikki Linnakangas  writes:
>> > While profiling some queries and looking at executor overhead, I
>> > realized that we're not making much use of TupleTableSlot's ability to
>> > hold a buffer pin. In a SeqScan, the buffer is held pinned by the
>> > underlying heap-scan anyway. Same with an IndexScan, and the SampleScan.
>>
>> I think this is probably wrong, or at least very dangerous to remove.
>> The reason for the feature is that the slot may continue to point at
>> the tuple after the scan has moved on.
>
> FWIW, that's not safe to assume in upper layers *anyway*. If you want to
> do that, the slot has to be materialized, and that'd make a local
> copy. If you don't materialize tts_values/isnull can point into random
> old memory (common e.g. for projections and virtual tuples in general).

So, I think you are arguing in favor of proceeding with this patch?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Something is broken about connection startup

2016-11-14 Thread Robert Haas
On Fri, Nov 11, 2016 at 3:15 PM, Tom Lane  wrote:
> So this has pretty much been broken since we put in MVCC snapshots for
> catalog searches.  The problem would be masked when inside a transaction
> that has already got a transaction snapshot, but whenever we don't have
> one already, our catalog lookups are basically unprotected against
> premature row removal.  I don't see any reason to think that this is
> specific to connection startup.

Yeah.  I was focused on the case where the backend is already in a
transaction, and obviously didn't think hard enough about the scenario
where that isn't the case.

> The basic problem here, therefore, is that SnapshotResetXmin isn't aware
> that GetCatalogSnapshot is keeping a possibly-unregistered snapshot in
> its hip pocket.  That has to change.  We could either treat the saved
> CatalogSnapshot as always being registered, or we could add some logic
> to force invalidating the CatalogSnapshot whenever we clear MyPgXact->xmin
> or advance it past that snapshot's xmin.
>
> Neither of those is very nice from a performance standpoint.  With the
> first option we risk delaying global cleanup by holding back
> MyPgXact->xmin to protect a CatalogSnapshot we might not actually use
> anytime soon.  With the second option we will likely end up doing many
> more GetSnapshotData calls than we do now, because in a transaction
> that hasn't yet set a regular snapshot, we will need to get a new
> CatalogSnapshot for every catalog access (since we'll go back to
> MyPgXact->xmin = 0 after every access).  And the parser, in particular,
> tends to do a lot of catalog accesses before we ever get a regular
> transaction snapshot.
>
> Ideally, perhaps, we'd treat the saved CatalogSnapshot as registered
> but automatically kill it "every so often".  I'm not sure how to drive
> that though.

I think the really important thing is that we don't leave xmin set
when the backend is idle.  If we set MyProc->xmin during startup or in
some other out-of-transaction activity and never updated it until we
went idle, I think that wouldn't hurt much because we're not going to
spend tons of time in that state anyway.  But if we start leaving xmin
set for idle backends, our users will be howling in agony.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Logical Replication WIP

2016-11-14 Thread Petr Jelinek
On 13/11/16 10:21, Andres Freund wrote:
> On 2016-11-13 00:40:12 -0500, Peter Eisentraut wrote:
>> On 11/12/16 2:18 PM, Andres Freund wrote:
>>>>> I also wonder if we want an easier to
>>>>> extend form of pubinsert/update/delete (say to add pubddl, pubtruncate,
>>>>> pub ... without changing the schema).
>>>>
>>>> So like, text array that's then parsed everywhere (I am not doing
>>>> bitmask/int definitely)?
>>>
>>> Yes, that sounds good to me. Then convert it to individual booleans or a
>>> bitmask when loading the publications into the in-memory form (which you
>>> already do).
>>
>> I'm not sure why that would be better.  Adding catalog columns in future
>> versions is not a problem.
> 
> It can be extended from what core provides, for extended versions of
> replication solutions, for one. I presume publications/subscriptions
> aren't only going to be used by built-in code.
> 

I understand the desire here (especially as an author of such
out-of-core tools), but I am not sure this is a good place to start
having pluggable catalogs, given that we have no generic design for those.
Currently, plugins writing arbitrary data to catalogs will cause things
to break when those plugins get uninstalled (and we don't have a good
mechanism for cleaning that up when it happens). And that won't change
if we convert this into an array. Besides, shouldn't the code then check
that we only have expected data in that array anyway? Otherwise we might
miss corruption.

So if the main reason for turning this into an array is extensibility
for other providers, then I am -1 on the idea. IMHO that belongs to a
completely different effort, one that adds user catalogs with a proper
syscache-like interface and everything, but it has nothing to do with
publications.
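
For concreteness, roughly the fixed-boolean catalog shape being defended
here (a sketch; column names and the OID are taken from the patch under
review, not from any committed catalog):

    #include "catalog/genbki.h"

    CATALOG(pg_publication,6104)
    {
        NameData    pubname;        /* publication name */
        Oid         pubowner;       /* owner */
        bool        puballtables;   /* does it cover all tables? */
        bool        pubinsert;      /* replicate INSERT? */
        bool        pubupdate;      /* replicate UPDATE? */
        bool        pubdelete;      /* replicate DELETE? */
    } FormData_pg_publication;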

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2016-11-14 Thread Robert Haas
On Mon, Nov 14, 2016 at 9:57 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Fri, Nov 4, 2016 at 6:52 AM, Ashutosh Bapat
>>  wrote:
>>> Costing PartitionJoinPath needs more thought so that we don't end up
>>> with bad overall plans. Here's an idea. Partition-wise joins are
>>> better compared to the unpartitioned ones, because of the smaller
>>> sizes of partitions. If we think of join as O(MN) operation where M
>>> and N are sizes of unpartitioned tables being joined, partition-wise
>>> join computes P joins each with average O(M/P * N/P) order where P is
>>> the number of partitions, which is still O(MN) with constant factor
>>> reduced by P times. I think, we need to apply similar logic to
>>> costing. Let's say cost of a join is J(M, N) = S (M, N) + R (M, N)
>>> where S and R are setup cost and joining cost (for M, N rows) resp.
>>> Cost of partition-wise join would be P * J(M/P, N/P) = P * S(M/P, N/P)
>>> + P * R(M/P, N/P). Each of the join methods will have different S and
>>> R functions and may not be linear on the number of rows. So,
>>> PartitionJoinPath costs are obtained from corresponding regular path
>>> costs subjected to above transformation. This way, we will be
>>> protected from choosing a PartitionJoinPath when it's not optimal.
>
>> I'm not sure that I really understand the stuff with big-O notation
>> and M, N, and P.  But I think what you are saying is that we could
>> cost a PartitionJoinPath by costing some of the partitions (it might
>> be a good idea to choose the biggest ones) and assuming the cost for
>> the remaining ones will be roughly proportional.  That does seem like
>> a reasonable strategy to me.
>
> I'm not sure to what extent the above argument depends on the assumption
> that join is O(MN), but I will point out that in no case of practical
> interest for large tables is it actually O(MN).  That would be true
> only for the stupidest possible nested-loop join method.  It would be
> wise to convince ourselves that the argument holds for more realistic
> big-O costs, eg hash join is more like O(M+N) if all goes well.

Yeah, I agree.  To recap briefly, the problem we're trying to solve
here is how to build a path for a partitionwise join without an
explosion in the amount of memory the planner uses or the number of
paths created.  In the initial design, if there are N partitions per
relation, the total number of paths generated by the planner increases
by a factor of N+1, which gets ugly if, say, N = 1000, or even N =
100.  To rein that in, we want to do a rough cut at costing the
partitionwise join that will be good enough to let us throw away
obviously inferior paths, and then work out the exact paths we're
going to use only for partitionwise joins that are actually selected.
I think costing one or a few of the larger sub-joins and assuming
those costs are representative is probably a reasonable approach to
that problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] [PATCH] Allow TAP tests to be run individually

2016-11-14 Thread Peter Eisentraut
On 11/14/16 3:52 AM, Michael Paquier wrote:
> I don't mind. This patch uses the following pattern:
> $(or $(PROVE_TESTS),t/*.pl)
> While something more spread in Postgres source would be something like that:
> $(if $(PROVE_TESTS),$(PROVE_TESTS),t/*.pl)
> It seems to me that we'd prefer that for consistency, but I see no
> reason to not keep your patch as well. I am marking that as ready for
> committer.

$(or ...) is a newer feature of GNU make, so we have avoided that so
far.  I have committed your v2 with $(if ...).

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2016-11-14 Thread Tom Lane
Robert Haas  writes:
> On Fri, Nov 4, 2016 at 6:52 AM, Ashutosh Bapat
>  wrote:
>> Costing PartitionJoinPath needs more thought so that we don't end up
>> with bad overall plans. Here's an idea. Partition-wise joins are
>> better compared to the unpartitioned ones, because of the smaller
>> sizes of partitions. If we think of join as O(MN) operation where M
>> and N are sizes of unpartitioned tables being joined, partition-wise
>> join computes P joins each with average O(M/P * N/P) order where P is
>> the number of partitions, which is still O(MN) with constant factor
>> reduced by P times. I think, we need to apply similar logic to
>> costing. Let's say cost of a join is J(M, N) = S (M, N) + R (M, N)
>> where S and R are setup cost and joining cost (for M, N rows) resp.
>> Cost of partition-wise join would be P * J(M/P, N/P) = P * S(M/P, N/P)
>> + P * R(M/P, N/P). Each of the join methods will have different S and
>> R functions and may not be linear on the number of rows. So,
>> PartitionJoinPath costs are obtained from corresponding regular path
>> costs subjected to above transformation. This way, we will be
>> protected from choosing a PartitionJoinPath when it's not optimal.

> I'm not sure that I really understand the stuff with big-O notation
> and M, N, and P.  But I think what you are saying is that we could
> cost a PartitionJoinPath by costing some of the partitions (it might
> be a good idea to choose the biggest ones) and assuming the cost for
> the remaining ones will be roughly proportional.  That does seem like
> a reasonable strategy to me.

I'm not sure to what extent the above argument depends on the assumption
that join is O(MN), but I will point out that in no case of practical
interest for large tables is it actually O(MN).  That would be true
only for the stupidest possible nested-loop join method.  It would be
wise to convince ourselves that the argument holds for more realistic
big-O costs, eg hash join is more like O(M+N) if all goes well.
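
As a quick back-of-the-envelope check of that point (a sketch, using the
J(M,N) = S(M,N) + R(M,N) decomposition quoted above):

    \begin{align*}
    \text{nested loop: } R(M,N) = c\,MN
      &\;\Rightarrow\; P \cdot R(M/P, N/P)
       = P \cdot c\,\tfrac{M}{P}\,\tfrac{N}{P} = \tfrac{c\,MN}{P} \\
    \text{hash join: } R(M,N) \approx c\,(M+N)
      &\;\Rightarrow\; P \cdot R(M/P, N/P)
       = P \cdot c\left(\tfrac{M}{P}+\tfrac{N}{P}\right) = c\,(M+N)
    \end{align*}

So for a linear-cost join the joining term is unchanged by partitioning;
any win there has to come from the setup term S (e.g. smaller hash tables
fitting in work_mem), which argues for checking the transformation per
join method rather than assuming a P-fold constant-factor reduction.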

regards, tom lane


Re: [HACKERS] Minor improvement to delete.sgml

2016-11-14 Thread Robert Haas
On Sun, Nov 13, 2016 at 10:55 PM, Etsuro Fujita
 wrote:
> On 2016/10/19 2:51, Robert Haas wrote:
>>
>> On Fri, Oct 14, 2016 at 12:05 AM, Etsuro Fujita
>>  wrote:
>>>
>>> I think it's better to mention that an alias is needed for the target
>>> table
>>> specified in the USING clause of a DELETE statement, to set up a
>>> self-join,
>>> as the documentation on the from_list parameter of UPDATE does.  Please
>>> find
>>> attached a patch.
>
>> The statement you are proposing to add to the documentation isn't true.
>
> Consider a counterexample of DELETE doing a self-join of a target table:
>
> postgres=# create table t1 (c1 int);
> CREATE TABLE
> postgres=# insert into t1 values (1);
> INSERT 0 1
> postgres=# delete from t1 using t1 where t1.c1 = t1.c1;
> ERROR:  table name "t1" specified more than once
>
> Giving an alias to the target table t1 in the USING clause,
>
> postgres=# delete from t1 using t1 r1 where t1.c1 = r1.c1;
> DELETE 1
>
> Am I missing something?

Well, you could also alias the target table, like this:

delete from t1 q1 using t1 where q1.c1 = t1.c1;

The point is that, while it's true that you can't have the same table
alias twice at the same query level, you can fix that in more than one
way.  Your suggestion of adding an alias to the appearance in the
using list is one approach, but not the only one.

I don't think there's any real need for a documentation change here.
The fact that repeating a table alias doesn't work is not unique to
DELETE, nor is it unique to self-joins.  The documentation here just
needs to explain that repeating the table name will set up a
self-join; it doesn't need to describe every SQL mistake that you
could make while trying to do so.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Physical append-only tables

2016-11-14 Thread Magnus Hagander
On Mon, Nov 14, 2016 at 3:39 PM, Tom Lane  wrote:

> Magnus Hagander  writes:
> > What I'm talking about is something that would be a lot simpler than
> > auto-clustering. I'm not even talking about trying to detect if the row was
> > about to go into the right place -- this might be expensive and certainly
> > more complicated. I'm only talking about a simple case where we *never* put
> > anything anywhere other than at the end of the table, period. That should
> > make the check both cheap and simple.
>
> It also makes it so much of a corner case that even a cheap check could be
> a net performance degradation, especially for people whose usage pattern
> doesn't match this.
>
>
I agree that it definitely solves just one problem. But it seems to be a
fairly common problem, particularly for BRIN users.

Full auto-clustering would cover many more usecases, but would also be a
lot more expensive to maintain.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2016-11-14 Thread Robert Haas
On Fri, Nov 4, 2016 at 6:52 AM, Ashutosh Bapat
 wrote:
> Costing PartitionJoinPath needs more thought so that we don't end up
> with bad overall plans. Here's an idea. Partition-wise joins are
> better compared to the unpartitioned ones, because of the smaller
> sizes of partitions. If we think of join as O(MN) operation where M
> and N are sizes of unpartitioned tables being joined, partition-wise
> join computes P joins each with average O(M/P * N/P) order where P is
> the number of partitions, which is still O(MN) with constant factor
> reduced by P times. I think, we need to apply similar logic to
> costing. Let's say cost of a join is J(M, N) = S (M, N) + R (M, N)
> where S and R are setup cost and joining cost (for M, N rows) resp.
> Cost of partition-wise join would be P * J(M/P, N/P) = P * S(M/P, N/P)
> + P * R(M/P, N/P). Each of the join methods will have different S and
> R functions and may not be linear on the number of rows. So,
> PartitionJoinPath costs are obtained from corresponding regular path
> costs subjected to above transformation. This way, we will be
> protected from choosing a PartitionJoinPath when it's not optimal.

I'm not sure that I really understand the stuff with big-O notation
and M, N, and P.  But I think what you are saying is that we could
cost a PartitionJoinPath by costing some of the partitions (it might
be a good idea to choose the biggest ones) and assuming the cost for
the remaining ones will be roughly proportional.  That does seem like
a reasonable strategy to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Physical append-only tables

2016-11-14 Thread Tom Lane
Magnus Hagander  writes:
> What I'm talking about is something that would be a lot simpler than
> auto-clustering. I'm not even talking about trying to detect if the row was
> about to go into the right place -- this might be expensive and certainly
> more complicated. I'm only talking about a simple case where we *never* put
> anything anywhere other than at the end of the table, period. That should
> make the check both cheap and simple.

It also makes it so much of a corner case that even a cheap check could be
a net performance degradation, especially for people whose usage pattern
doesn't match this.

regards, tom lane


Re: [HACKERS] Physical append-only tables

2016-11-14 Thread Magnus Hagander
On Mon, Nov 14, 2016 at 2:35 AM, Alvaro Herrera 
wrote:

> Magnus Hagander wrote:
>
> > But then consider the same table. Except rows are typically updated once
> > or twice when they are new, and *then* go read only. And we also have a
> > process that at some point deletes *some* old rows (but not all - in fact,
> > only a small portion).
> >
> > In this case, the next INSERT once VACUUM has run is likely to stick a
> > "new" row somewhere very "far back" in the table, since there is now free
> > space there. This more or less completely ruins the BRIN index usability,
> > as the "old" blocks will now contain a single row from a "new" series.
>
> Yeah.  When we initially discussed BRIN, there was a mention of allowing
> a BRIN index to guide new tuple location -- something like
> auto-clustering, if you will.  We haven't discussed the exact details
> but I think something along those lines is worth considering.
>

What I'm talking about is something that would be a lot simpler than
auto-clustering. I'm not even talking about trying to detect if the row was
about to go into the right place -- this might be expensive and certainly
more complicated. I'm only talking about a simple case where we *never* put
anything anywhere other than at the end of the table, period. That should
make the check both cheap and simple.
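
A hedged sketch of how cheap that policy could be at the point where a
target page is chosen (append_only as a reloption is hypothetical;
GetPageWithFreeSpace() is the existing FSM lookup):

    #include "postgres.h"
    #include "storage/freespace.h"
    #include "utils/rel.h"

    static BlockNumber
    choose_target_block(Relation rel, Size len, Size saveFreeSpace,
                        bool append_only)
    {
        if (append_only)
            return InvalidBlockNumber;  /* always extend at the end */

        return GetPageWithFreeSpace(rel, len + saveFreeSpace);
    }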

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] Transaction user id through logical decoding

2016-11-14 Thread valeriof
Hi Craig,
Thanks for your answer. I'll need to dig deep into the solutions you
suggested, although I was looking for something less convoluted.
Valerio



Re: [HACKERS] Danger of automatic connection reset in psql

2016-11-14 Thread Oleksandr Shulgin
On Fri, Nov 11, 2016 at 5:37 AM, Pavel Stehule 
wrote:

>
> 2016-11-11 5:14 GMT+01:00 Ashutosh Bapat:
>
>> >
>> > How about, instead of all this, adding an option to psql to suppress
>> > the automatic reconnect behavior?  When enabled, psql just exits
>> > instead of trying to reconnect.
>> >
>> +1. But existing users may not notice the addition of the new option and
>> still continue to face the problem. If we add the option and make the
>> default not to reconnect, they will notice it and use the option to get
>> the older behaviour, but that will break applications relying on the
>> current behaviour. Either way, users will have at least something to
>> control the connection reset.
>>
>
> Reconnecting in non-interactive mode is a bad idea - it should be
> disabled everywhere (it cannot break any application - the behaviour of a
> script when the server is restarted must be undefined, certainly if
> ON_ERROR_STOP is active). In interactive mode it can be controlled by a
> psql session variable, like AUTOCOMMIT.
>

Yes, I've never suggested it should affect non-interactive mode.

Automatic connection reset is a nice feature for server development, IMO.
Whether it is really useful for anything else is a good question.

At least an option to control that behavior seems like a good idea, maybe
even set it to 'no reconnect' by default, so that people who really use it
can make conscious choice about enabling it in their .psqlrc or elsewhere.

--
Alex


Re: [HACKERS] Fix checkpoint skip logic on idle systems by tracking LSN progress

2016-11-14 Thread Amit Kapila
On Mon, Nov 14, 2016 at 9:33 AM, Michael Paquier
 wrote:
> On Mon, Nov 14, 2016 at 12:49 PM, Kyotaro HORIGUCHI
>  wrote:
>> At Sat, 12 Nov 2016 10:28:56 +0530, Amit Kapila  
>> wrote in 
>>> I think it is good to check the performance impact of this patch on
>>> many core m/c.  Is it possible for you to once check with Alexander
>>> Korotkov to see if he can provide you access to his powerful m/c which
>>> has 70 cores (if I remember correctly)?
>
> I heard about a number like that, and there is no reason to not do
> tests to be sure. With that many cores we are more likely going to see
> the limitation of the number of XLOG insert slots popping up as a
> bottleneck, but that's just an assumption without any numbers.
> Alexander (added in CC), could it be possible to get an access to this
> machine?
>
>>> @@ -1035,6 +1038,7 @@ LogAccessExclusiveLocks(int nlocks,
>>> xl_standby_lock *locks)
>>>   XLogBeginInsert();
>>>   XLogRegisterData((char *) &xlrec, offsetof(xl_standby_locks, locks));
>>>   XLogRegisterData((char *) locks, nlocks * sizeof(xl_standby_lock));
>>> + XLogSetFlags(XLOG_NO_PROGRESS);
>>>
>>>
>>> Is it right to set XLOG_NO_PROGRESS flag in LogAccessExclusiveLocks?
>>> This function is called not only in LogStandbySnapshot(), but during
>>> DDL operations as well where lockmode >= AccessExclusiveLock.
>>
>> This does not remove any record from WAL. So theoretically any
>> kind of record can be NO_PROGRESS, but practically as long as
>> checkpoints are not unreasonably suppressed. Any explicit
>> database operation must be accompanied with at least commit
>> record that triggers checkpoint. NO_PROGRESSing there doesn't
>> seem to me to harm database durability for this reason.
>>

By this theory, you can even mark the insert record as no progress
which is not good.

>> The objective of this patch is skipping WALs on completely-idle
>> state and the NO_PROGRESSing is necessary to do its work. Of
>> course we can distinguish exclusive lock with PROGRESS and
>> without PROGRESS but it is unnecessary complexity.
>
> The point that applies here is that logging the exclusive lock
> information is necessary for the *standby* recovery conflicts, not the
> primary  which is why it should not influence the checkpoint activity
> that is happening on the primary. So marking this record with
> NO_PROGRESS is actually fine to me.
>

The progress parameter is used not only for checkpoint activity but by
bgwriter as well for logging standby snapshot.  If you want to keep
this record under no_progress category (which I don't endorse), then
it might be better to add a comment, so that it will be easier for the
readers of this code to understand the reason.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Patch: Implement failover on libpq connect level.

2016-11-14 Thread Mithun Cy
On Mon, Nov 14, 2016 at 1:37 PM, Tsunakawa, Takayuki <
tsunakawa.ta...@jp.fujitsu.com> wrote:
> No, there's no concern about compatibility.  Please look at this:
> https://www.postgresql.org/docs/devel/static/protocol-flow.html#PROTOCOL-ASYNC
Thanks. My concern is: suppose you have 3 servers in a cluster, A (new
version), B (new version), and C (old version). If we implement it as
above, only the new servers will send the ParameterStatus message
indicating what type of server we are connected to. Server C will not
send it. So we will not be able to use the new "failover to new master"
feature for such a cluster.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Floating point comparison inconsistencies of the geometric types

2016-11-14 Thread Emre Hasegeli
> What way of dealing with it do you have in mind? The problem hides
> behind operators. To fix it, a user would have to rewrite an expression
> using more primitive operators. For example, (line_a # point_a)
> should be rewritten as ((point_a <-> lineseg_a) < EPSILON), or in a
> more primitive way. I regard this as making the operator # simply
> useless.

Simple equations like this work well with the current algorithm:

> select '(0.1,0.1)'::point @ '(0,0),(1,1)'::line;

The operator does what you expect from it.  Users can use something
like you described to get fuzzy behaviour with an epsilon they choose.

> Regarding optimization, at least gcc generates seemingly not-so-
> different code for the two. Both generally generate inlined
> code directly calling isnan() and so on. Have you measured the
> performance of the two implementations (with -O2, without
> --enable-cassert)?  This kind of hand-optimization gains
> legitimacy when we see a significant difference, according to
> the convention here, I suppose.

I tested it with this program:

> int main()
> {
>     double  i,
>             j;
>     int     result = 0;
>
>     for (i = 0.1; i < 1.0; i += 1.0)
>         for (j = 0.1; j < 1.0; j += 1.0)
>             if (float8_lt(i, j))
>                 result = (result + 1) % 10;
>
>     return result;
> }

The one calling cmp() was noticeably slower.

./test1  0.74s user 0.00s system 99% cpu 0.748 total
./test2  0.89s user 0.00s system 99% cpu 0.897 total

This would probably not be very noticeable when calling SQL functions
which do only a few comparisons, but it may be necessary to do
many more comparisons in some places.  I don't find the optimised
versions less clear than calling the cmp().  I can change it the other
way, if you find that clearer.
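
For reference, a sketch of the two styles being compared (simplified
stand-ins for the patch's helpers, assuming the NaN convention quoted
below; test2 is then assumed to use float8_cmp(i, j) < 0):

    #include <math.h>
    #include <stdbool.h>

    /* NaNs are considered equal to each other and larger than any non-NaN */
    static inline bool
    float8_lt(double a, double b)
    {
        return !isnan(a) && (isnan(b) || a < b);
    }

    /* comparator built on top; the extra branches are the overhead measured */
    static inline int
    float8_cmp(double a, double b)
    {
        if (float8_lt(a, b))
            return -1;
        if (float8_lt(b, a))
            return 1;
        return 0;
    }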

> At least the comment you dropped by the patch,
>
>  int
>  float4_cmp_internal(float4 a, float4 b)
>  {
> -   /*
> -* We consider all NANs to be equal and larger than any non-NAN. This 
> is
> -* somewhat arbitrary; the important thing is to have a consistent 
> sort
> -* order.
> -*/
>
> seems very significant and should be kept anywhere relevant.

I will add it back on the next version.

> I searched the pgsql-general ML but couldn't find a complaint about
> the current behavior. Even though I'm not familiar with PostGIS, I
> found that it uses exactly the same EPSILON method as
> PostgreSQL.

Is it?  I understood from Paul Ramsey's comment on this thread [1]
that they don't.

> If we had a concrete plan to use them for something other than
> earth-scale(?) geometric usage, we could design what they should
> be. But without such a plan it is just breakage of the current
> usage.

We give no promises about the geometric types being useful in earth scale.

> What kind of (needless) complication are you talking about? The
> fuzziness seems to me essential for geometric comparisons to work
> practically. In addition to that, I don't think we're allowed to
> change the behavior in such an area of released versions time
> after time.

Even when it is a total mess?

> I don't think index scans and tolerant comparisons are contradictory.
> Could you give me a description of the indexing capabilities and the
> inconsistencies?

The first problem is that some operators are not using the epsilon.
This requires special treatment while developing index support for
operators.  I have tried to support point for BRIN using the box
operators, and failed because of that.

Comparing differences with epsilon is not applicable in the same way to
every operator.  Even with simple operators like "point in box" it
covers different distances outside the box depending on where the
point is.  For example, "point <-> box < EPSILON" wouldn't be
equivalent to "point <@ box" when the point is outside the corner of
the box.  Things get more complicated with lines.  Because of this,
we are easily violating basic expectations of the operators:

> regression=# select '{1000,0.01,0}'::line ?|| '{9,0.9,0}'::line;
>  ?column?
> ----------
>  f
> (1 row)
>
> regression=# select '{9,0.9,0}'::line ?|| '{1000,0.01,0}'::line;
>  ?column?
> ----------
>  t
> (1 row)
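
To make the corner case concrete (a sketch assuming per-coordinate fuzzy
comparisons, which is how the EPSILON tests behave near a corner):

    % point at offsets (dx, dy) outside a box corner:
    \text{fuzzy containment accepts } dx \le \varepsilon \wedge dy \le \varepsilon,
    \text{ i.e. corner distances up to } \sqrt{dx^2 + dy^2} = \sqrt{2}\,\varepsilon,
    \text{ while } \mathrm{dist}(point, box) < \varepsilon
    \text{ cuts off at } \varepsilon.

So the two predicates disagree for points whose distance from the corner
falls in the interval (epsilon, sqrt(2) * epsilon].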

Another problem is the lack of hash and btree operator classes.  In my
experience, the point datatype is by far the most used one.  People often
try to use it in DISTINCT, GROUP BY, or ORDER BY clauses and complain
when it doesn't work.  There are many complaints like this in the
archives.  If we get rid of the epsilon, we can easily add those
operator classes.

[1] 
https://www.postgresql.org/message-id/CACowWR0DBEjCfBscKKumdRLJUkObjB7D%3Diw7-0_ZwSFJM9_gpw%40mail.gmail.com


Re: [HACKERS] Remove the comment on the countereffectiveness of large shared_buffers on Windows

2016-11-14 Thread Amit Kapila
On Sun, Nov 13, 2016 at 1:23 AM, Jeff Janes  wrote:
> On Mon, Nov 7, 2016 at 7:03 PM, Amit Kapila  wrote:
>>
>> On Tue, Nov 8, 2016 at 5:12 AM, Jeff Janes  wrote:
>> > On Mon, Nov 7, 2016 at 5:55 AM, Amit Kapila 
>> > wrote:
>> >>
>> >>
>> >> Isn't it somewhat strange that writes are showing big win whereas
>> >> reads doesn't show much win?
>> >
>> >
>> > I don't find that unusual, and have seen the same thing on Linux.
>> >
>> > With small shared_buffers, you are constantly throwing dirty buffers at
>> > the
>> > kernel in no particular order, and the kernel does a poor job of
>> > predicting
>> > when the same buffer will be dirtied repeatedly and only needs the final
>> > version of the data actually written to disk.
>> >
>>
>> Okay and I think partially it might be because we don't have writeback
>> optimization (done in 9.6) for Windows.
>
>
> Is the writeback optimization the introduction of checkpoint_flush_after, or
> is it something else?
>

Yes.

> If it is checkpoint_flush_after, then I don't think that that is related.
> In fact, they operate in opposite directions.  The writeback optimization
> forces the kernel to be more eager about writing out dirty data, so it
> doesn't have a giant pile of it when the fsyncs comes at the end of the
> checkpoint.  Using a large shared_buffers forces it to be less eager, by not
> turning the dirty buffers over to the kernel as often.
>

Okay, I got your point.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Gather Merge

2016-11-14 Thread Thomas Munro
On Sat, Nov 12, 2016 at 1:56 AM, Rushabh Lathia
 wrote:
> On Fri, Nov 4, 2016 at 8:30 AM, Thomas Munro 
> wrote:
>> + * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
>> + * Portions Copyright (c) 1994, Regents of the University of California
>>
>> Shouldn't this say just "(c) 2016, PostgreSQL Global Development
>> Group"?
>
> Fixed.

The year also needs updating to 2016 in nodeGatherMerge.h.

>> + /* Per-tuple heap maintenance cost */
>> + run_cost += path->path.rows * comparison_cost * 2.0 * logN;
>>
>> Why multiply by two?  The comment above this code says "about log2(N)
>> comparisons to delete the top heap entry and another log2(N)
>> comparisons to insert its successor".  In fact gather_merge_getnext
>> calls binaryheap_replace_first, which replaces the top element without
>> any comparisons at all and then performs a sift-down in log2(N)
>> comparisons to find its new position.  There is no per-tuple "delete"
>> involved.  We "replace" the top element with the value it already had,
>> just to trigger the sift-down, because we know that our comparator
>> function might have a new opinion of the sort order of this element.
>> Very clever!  The comment and the 2.0 factor in cost_gather_merge seem
>> to be wrong though -- or am I misreading the code?
>>
> See cost_merge_append.

That just got tweaked in commit 34ca0905.

> Looking at the plan I realized that this is happening because of wrong
> costing for Gather Merge. Here in the plan we can see the row count
> estimated by Gather Merge is wrong. This is because the earlier patch had
> GM considering rows = subpath->rows, which is not true as the subpath is
> a partial path. So we need to multiply it by the number of workers. The
> attached patch also fixes this issue. I also ran the TPC-H benchmark with
> the patch and the results are the same as earlier.

In create_grouping_paths:
+   double  total_groups = gmpath->rows *
gmpath->parallel_workers;

This hides a variable of the same name in the enclosing scope.  Maybe confusing?

In some other places like create_ordered_paths:
+   double  total_groups = path->rows * path->parallel_workers;

Though it probably made sense to use this variable name in
create_grouping_paths, wouldn't total_rows be better here?

It feels weird to be working back to a total row count estimate from
the partial one by simply multiplying by path->parallel_workers.
Gather Merge will underestimate the total rows when parallel_workers <
4 if it uses partial row estimates ultimately from cost_seqscan, which
assumes some leader contribution.  I don't have a better idea though.
Reversing cost_seqscan's logic certainly doesn't seem right.  I don't
know how to make them agree on the leader's contribution AND give
principled answers, since there seems to be some kind of cyclic
dependency in the costing logic (cost_seqscan really needs to be given
a leader contribution estimate from its superpath which knows whether
it will allow the leader to pull tuples greedily/fairly or not, but
that superpath hasn't been created yet; cost_gather_merge needs the
row count from its subpath).  Or maybe I'm just confused.
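
For reference, a hedged paraphrase of this era's cost_seqscan divisor
(simplified; the real logic lives in costsize.c):

    #include "postgres.h"
    #include "nodes/relation.h"

    /* The leader is assumed to help until about 4 workers.  With 2 workers
     * the divisor is 2.4, so partial_rows * 2 recovers only ~83% of the
     * total -- the underestimate described above. */
    static double
    parallel_divisor_sketch(Path *path)
    {
        double      divisor = path->parallel_workers;
        double      leader_contribution = 1.0 - (0.3 * path->parallel_workers);

        if (leader_contribution > 0)
            divisor += leader_contribution;
        return divisor;
    }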

-- 
Thomas Munro
http://www.enterprisedb.com


Re: [HACKERS] Remove the comment on the countereffectiveness of large shared_buffers on Windows

2016-11-14 Thread Amit Kapila
On Sat, Nov 12, 2016 at 11:00 PM, Magnus Hagander  wrote:
> On Sat, Nov 12, 2016 at 4:03 AM, Amit Kapila 
> wrote:
>>
>> On Fri, Nov 11, 2016 at 3:01 PM, Magnus Hagander 
>> wrote:
>> >> > Based on this optimization we might want to keep the text that says
>> >> > large
>> >> > shared buffers on Windows aren't as effective perhaps,
>>
>> Sounds sensible or may add a line to say why it isn't as effective as on
>> Linux.
>
>
> Do we actually know *why*?
>

No, I have never investigated it for Windows.  I am just going by the
results reported in this thread.  I have seen that there is a
noticeable difference of read-only performance when data fits in
shared buffers as compared to when it doesn't fit on Linux systems.

>>
>> Right, but for other platforms, the recommendation seems to be 25% of
>> RAM, can we safely say that for Windows as well?  As per test results
>> in this thread, it seems the read-write performance degrades when
>> shared buffers are increased from 12.5% to 25%.  I think as the test
>> is done for a short duration, that degradation could be just run-to-run
>> variation; that's why I suggested doing a few more tests.
>
>
> We talk about 25%, but only up to a certain size. It's suggested as a
> starting point. The 25% value is probably good as a starting point, as it's
> recommended, but not as a "recommended setting". I'm fine with doing
> something similar for Windows -- say "10-15% as a starting point, but you
> have to check with your workload" kind of statements.
>

Okay, not a problem.  However, I am not sure the results in this
thread are sufficient proof: for read-only tests, there is no
noticeable win from increasing shared buffers, and the read-write tests
seem too short (60 seconds) to rely on.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] [RFC] Should we fix postmaster to avoid slow shutdown?

2016-11-14 Thread Ashutosh Bapat
On Mon, Nov 14, 2016 at 5:16 AM, Tsunakawa, Takayuki
 wrote:
> From: pgsql-hackers-ow...@postgresql.org
>> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Ashutosh Bapat
>> I have changed some comments around this block. See attached patch.
>> Let me know if that looks good.
>
> Thanks, it looks good.
>
Thanks. The patch 02_close_listen... closes the listen sockets
explicitly when it's known that the postmaster is going to stop all the
children and then die. I have tried to see if there's a possibility
that it closes the listen sockets but does not actually die, thus
leaving a server which doesn't accept any connections and doesn't die.
But I have not found such a possibility.

I do not have any further comments about either of the patches. No one
else has volunteered to review the patch so far. So, marking the entry
as ready for committer.


Re: [HACKERS] [PATCH] Allow TAP tests to be run individually

2016-11-14 Thread Craig Ringer
On 14 November 2016 at 16:52, Michael Paquier  wrote:
> On Mon, Nov 14, 2016 at 3:45 PM, Craig Ringer  wrote:
>> On 11 November 2016 at 18:13, Michael Paquier  
>> wrote:
>>> On Fri, Nov 11, 2016 at 6:10 PM, Craig Ringer  wrote:
 Please backpatch to at least 9.6 since it's trivial and we seem to be
 doing that for TAP. 9.5 and 9.4 would be nice too :)
>>>
>>> Yes please!
>>
>> No immediate takers, so adding to CF.
>>
>> I've taken the liberty of adding you as a reviewer based on your
>> response and the simplicity of the patch. If you get the chance to
>> test and verify, please set it ready for committer.
>
> I don't mind. This patch uses the following pattern:
> $(or $(PROVE_TESTS),t/*.pl)
> while a more widespread pattern in the Postgres source would be:
> $(if $(PROVE_TESTS),$(PROVE_TESTS),t/*.pl)
> It seems to me that we'd prefer the latter for consistency, but I see
> no reason not to keep your patch as it is. I am marking it as ready
> for committer.

Thanks.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Fix checkpoint skip logic on idle systems by tracking LSN progress

2016-11-14 Thread Michael Paquier
On Mon, Nov 14, 2016 at 6:10 PM, Kyotaro HORIGUCHI
 wrote:
> It applies to master, compiles cleanly, and passes the regression
> tests.  (I didn't confirm that the problem is still fixed, but there
> seems to be no problem.)

Thanks for double-checking.

> If I'm not missing something, at the worst we have a checkpoint
> after a checkpointer restart that should have been suppressed. Is
> it worth picking it up for the complexity?

I think so; that's not much code if you think about it, as there is
already a routine, used by checkpointer.c, to get the timestamp of the
last segment switch.
-- 
Michael




Re: [HACKERS] Fix checkpoint skip logic on idle systems by tracking LSN progress

2016-11-14 Thread Kyotaro HORIGUCHI
Hello,

It applies to master, compiles cleanly, and passes the regression
tests.  (I didn't confirm that the problem is still fixed, but there
seems to be no problem.)

At Mon, 14 Nov 2016 15:09:09 +0900, Michael Paquier  
wrote in 
> On Sat, Nov 12, 2016 at 9:01 PM, Andres Freund  wrote:
> > On 2016-11-11 16:42:43 +0900, Michael Paquier wrote:
> >
> >> + * This takes also
> >> + * advantage to avoid 8-byte torn reads on some platforms by using the
> >> + * fact that each insert lock is located on the same cache line.
> >
> > Something residing on the same cache line doens't provide that guarantee
> > on all platforms.
> 
> OK. Let's remove it then.
> 
> >> + * XXX: There is still room for more improvements here, particularly
> >> + * WAL operations related to unlogged relations (INIT_FORKNUM) should not
> >> + * update the progress LSN as those relations are reset during crash
> >> + * recovery so enforcing buffers of such relations to be flushed for
> >> + * example in the case of a load only on unlogged relations is a waste
> >> + * of disk write.
> >
> > Commit records still have to be written, everything else doesn't write
> > WAL. So I'm doubtful this matters much?
> 
> Hm, okay. In most cases this may not matter... Let's rip that off.
> 
> >> @@ -997,6 +1022,24 @@ XLogInsertRecord(XLogRecData *rdata, XLogRecPtr 
> >> fpw_lsn)
> >>   inserted = true;
> >>   }
> >>
> >> + /*
> >> +  * Update the LSN progress positions. At least one WAL insertion lock
> >> +  * is already taken appropriately before doing that, and it is 
> >> simpler
> >> +  * to do that here when the WAL record data and type are at hand.
> >
> > But we don't use the "WAL record data and type"?
> 
> Yes, at some point this patch did so...
> 
> >> + * GetProgressRecPtr -- Returns the newest WAL activity position, or in
> >> + * other words any activity not referring to standby logging or segment
> >> + * switches. Finding the last activity position is done by scanning each
> >> + * WAL insertion lock by taking directly the light-weight lock associated
> >> + * to it.
> >> + */
> >
> > I'd prefer not to list the specific records here - that's just
> > guaranteed to get out of date. Why not say something "any activity not
> > requiring a checkpoint to be triggered" or such?
> 
> OK. Makes sense to minimize maintenance.
> 
> >> +  * If this isn't a shutdown or forced checkpoint, and if there has 
> >> been no
> >> +  * WAL activity, skip the checkpoint.  The idea here is to avoid 
> >> inserting
> >> +  * duplicate checkpoints when the system is idle. That wastes log 
> >> space,
> >> +  * and more importantly it exposes us to possible loss of both 
> >> current and
> >> +  * previous checkpoint records if the machine crashes just as we're 
> >> writing
> >> +  * the update.
> >
> > Shouldn't this mention archiving and also that we also ignore some forms
> > of WAL activity?
> 
> I have reworded that as:
> "If this isn't a shutdown or forced checkpoint, and if there has been
> no WAL activity requiring a checkpoint, skip it."
> 
> >> -/* Should the in-progress insertion log the origin? */
> >> -static bool include_origin = false;
> >> +/* status flags of the in-progress insertion */
> >> +static uint8 status_flags = 0;
> >
> > that seems a bit too generic a name. 'curinsert_flags'?
> 
> OK.
> 
> >>   /*
> >> -  * only log if enough time has passed and some xlog 
> >> record has
> >> -  * been inserted.
> >> +  * Only log if enough time has passed, that some WAL 
> >> activity
> >> +  * has happened since last checkpoint, and that some 
> >> new WAL
> >> +  * records have been inserted since the last time we 
> >> came here.
> >
> > I think that sentence needs some polish.
> 
> Let's do this better:
> /*
> -* only log if enough time has passed and some xlog record has
> -* been inserted.
> +* Only log if all of the following conditions are satisfied since
> +* the last time we came here:
> +* - timeout has been reached.
> +* - WAL activity has happened since last checkpoint.
> +* - New WAL records have been inserted.
>  */
> 
> >>*/
> >>   if (now >= timeout &&
> >> - last_snapshot_lsn != GetXLogInsertRecPtr())
> >> + GetLastCheckpointRecPtr() < 
> >> current_progress_lsn &&
> >> + last_progress_lsn < current_progress_lsn)
> >>   {
> >
> > Hm. I don't think it's correct to use GetLastCheckpointRecPtr() here?
> > Don't we need to do the comparisons here (and when doing the checkpoint
> > itself) with the REDO pointer 

Re: [HACKERS] pg_dump, pg_dumpall and data durability

2016-11-14 Thread Michael Paquier
On Mon, Nov 14, 2016 at 6:04 PM, Albe Laurenz  wrote:
> Michael Paquier wrote:
>> Meh. I forgot docs and --help output updates.
>
> Looks good, except that you left the "N" option in the getopt_long
> call for pg_dumpall.  I fixed that in the attached patch.

No, v5 has removed it, but it does not matter much now...

> I'll mark the patch "ready for committer".

Thanks!
-- 
Michael




Re: [HACKERS] pg_dump, pg_dumpall and data durability

2016-11-14 Thread Albe Laurenz
Michael Paquier wrote:
> Meh. I forgot docs and --help output updates.

Looks good, except that you left the "N" option in the getopt_long
call for pg_dumpall.  I fixed that in the attached patch.

I'll mark the patch "ready for committer".

Yours,
Laurenz Albe


pgdump-sync-v5.patch
Description: pgdump-sync-v5.patch



Re: [HACKERS] [PATCH] Allow TAP tests to be run individually

2016-11-14 Thread Michael Paquier
On Mon, Nov 14, 2016 at 3:45 PM, Craig Ringer  wrote:
> On 11 November 2016 at 18:13, Michael Paquier  
> wrote:
>> On Fri, Nov 11, 2016 at 6:10 PM, Craig Ringer  wrote:
>>> Please backpatch to at least 9.6 since it's trivial and we seem to be
>>> doing that for TAP. 9.5 and 9.4 would be nice too :)
>>
>> Yes please!
>
> No immediate takers, so adding to CF.
>
> I've taken the liberty of adding you as a reviewer based on your
> response and the simplicity of the patch. If you get the chance to
> test and verify, please set it ready for committer.

I don't mind. This patch uses the following pattern:
$(or $(PROVE_TESTS),t/*.pl)
while a more widespread pattern in the Postgres source would be:
$(if $(PROVE_TESTS),$(PROVE_TESTS),t/*.pl)
It seems to me that we'd prefer the latter for consistency, but I see
no reason not to keep your patch as it is. I am marking it as ready
for committer.
-- 
Michael
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index ea61eb5..aa1fa65 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -354,12 +354,12 @@ ifeq ($(enable_tap_tests),yes)
 
 define prove_installcheck
 rm -rf $(CURDIR)/tmp_check/log
-cd $(srcdir) && TESTDIR='$(CURDIR)' PATH="$(bindir):$$PATH" PGPORT='6$(DEF_PGPORT)' top_builddir='$(CURDIR)/$(top_builddir)' PG_REGRESS='$(CURDIR)/$(top_builddir)/src/test/regress/pg_regress' $(PROVE) $(PG_PROVE_FLAGS) $(PROVE_FLAGS) t/*.pl
+cd $(srcdir) && TESTDIR='$(CURDIR)' PATH="$(bindir):$$PATH" PGPORT='6$(DEF_PGPORT)' top_builddir='$(CURDIR)/$(top_builddir)' PG_REGRESS='$(CURDIR)/$(top_builddir)/src/test/regress/pg_regress' $(PROVE) $(PG_PROVE_FLAGS) $(PROVE_FLAGS) $(if $(PROVE_TESTS),$(PROVE_TESTS),t/*.pl)
 endef
 
 define prove_check
 rm -rf $(CURDIR)/tmp_check/log
-cd $(srcdir) && TESTDIR='$(CURDIR)' $(with_temp_install) PGPORT='6$(DEF_PGPORT)' PG_REGRESS='$(CURDIR)/$(top_builddir)/src/test/regress/pg_regress' $(PROVE) $(PG_PROVE_FLAGS) $(PROVE_FLAGS) t/*.pl
+cd $(srcdir) && TESTDIR='$(CURDIR)' $(with_temp_install) PGPORT='6$(DEF_PGPORT)' PG_REGRESS='$(CURDIR)/$(top_builddir)/src/test/regress/pg_regress' $(PROVE) $(PG_PROVE_FLAGS) $(PROVE_FLAGS) $(if $(PROVE_TESTS),$(PROVE_TESTS),t/*.pl)
 endef
 
 else
diff --git a/src/test/perl/README b/src/test/perl/README
index 710a0d8..cfb45a1 100644
--- a/src/test/perl/README
+++ b/src/test/perl/README
@@ -6,6 +6,12 @@ across the source tree, particularly tests in src/bin and src/test. It's used
 to drive tests for backup and restore, replication, etc - anything that can't
 really be expressed using pg_regress or the isolation test framework.
 
+The tests are invoked via perl's 'prove' command, wrapped in PostgreSQL
+makefiles to handle instance setup etc. See the $(prove_check) and
+$(prove_installcheck) targets in Makefile.global. By default every test in the
+t/ subdirectory is run. Individual test(s) can be run instead by passing
+something like PROVE_TESTS="t/001_testname.pl t/002_othertestname.pl" to make.
+
 You should prefer to write tests using pg_regress in src/test/regress, or
 isolation tester specs in src/test/isolation, if possible. If not, check to
 see if your new tests make sense under an existing tree in src/test, like



Re: [HACKERS] Quorum commit for multiple synchronous replication.

2016-11-14 Thread Masahiko Sawada
On Tue, Nov 8, 2016 at 10:12 PM, Michael Paquier
 wrote:
> On Tue, Nov 8, 2016 at 12:25 AM, Masahiko Sawada  
> wrote:
>> On Tue, Oct 25, 2016 at 10:35 PM, Michael Paquier
>>  wrote:
>>> +   if (SyncRepConfig->sync_method == SYNC_REP_PRIORITY)
>>> +   return SyncRepGetSyncStandbysPriority(am_sync);
>>> +   else /* SYNC_REP_QUORUM */
>>> +   return SyncRepGetSyncStandbysQuorum(am_sync);
>>> Both routines share the same logic to detect if a WAL sender can be
>>> selected as a candidate for sync evaluation or not, still per the
>>> selection they do I agree that it is better to keep them as separate.
>>>
>>> +   /* In quroum method, all sync standby priorities are always 1 */
>>> +   if (found && SyncRepConfig->sync_method == SYNC_REP_QUORUM)
>>> +   priority = 1;
>>> Honestly I don't understand why you are enforcing that. Priority can
>>> be important for users willing to switch from ANY to FIRST to have a
>>> look immediately at what are the standbys that would become sync or
>>> potential.
>>
>> I thought that since all standbys appearing in the s_s_names list are
>> treated equally in the quorum method, these standbys should have the
>> same priority.
>> If these standbys had different sync_priority values, it would look as
>> if the master server replicates to the standbys based on priority.
>
> No actually, because we know that they are a quorum set, and that they
> work in the same set. The concept of priorities has no real meaning
> for quorum as there is no ordering of the elements. Another, perhaps
> cleaner, idea may be to mark the field as NULL.

We know that, but I'm concerned it might confuse the user.
If these priorities are the same, it obviously implies that all of
the standbys listed in s_s_names are handled equally.
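
For example, the user sees these priorities directly:

  SELECT application_name, sync_priority, sync_state
  FROM pg_stat_replication;

(Those columns exist today; what the patch should report in them for
quorum standbys is exactly the question here.)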

>>> It is not possible to guess from how many standbys this needs to wait
>>> for. One idea would be to mark the sync_state not as "quorum", but
>>> "quorum-N", or just add a new column to indicate how many in the set
>>> need to give a commit confirmation.
>>
>> As Simon suggested before, we could support another feature that
>> allows the client to control the quorum number.
>> Considering that feature, I thought it would be better to hold and
>> control that information as a GUC parameter.
>> Thoughts?
>
> Similarly that would be a SIGHUP parameter? Why not. Perhaps my worry
> is not that legitimate; users could just look at s_s_names to guess
> how many in the set a commit needs to wait for.

It would be PGC_USERSET, similar to synchronous_commit. The user could
set it at the statement level.
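
As an analogy, synchronous_commit is already PGC_USERSET and can be
controlled per transaction; the proposed quorum-number parameter
(which does not exist yet) could be set the same way:

  BEGIN;
  SET LOCAL synchronous_commit = off;  -- applies to this transaction only
  COMMIT;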

> +
> +FIRST and ANY are case-insensitive word
> +and the standby name having these words are must be double-quoted.
> +
> s/word/words/.
>
> +FIRST and ANY specify the method of
> +that how master server controls the standby servers.
> A little bit hard to understand, I would suggest:
> FIRST and ANY specify the method used by the master to control the
> standby servers.
>
> +The keyword FIRST, coupled with an integer
> +number N higher-priority standbys and makes transaction commit
> +when their WAL records are received.
> This is unclear to me. Here is a correction:
> The keyword FIRST, coupled with an integer N, makes transaction commit
> wait until WAL records are received from the N standbys with the
> highest priorities.
>
> +synchronous_standby_names. For example, a setting
> +of ANY 3 (s1, s2, s3, s4) makes transaction commits
> +wait until receiving receipts from at least any three standbys
> +of four listed servers s1, s2, 
> s3,
> This could just mention WAL records instead of "receipts".
>
> Instead of saying "an integer number N", we could use num_sync.
>
> + If FIRST or ANY are not specified,
> this parameter
> + behaves as ANY. Note that this grammar is
> incompatible with
> + PostgresSQL 9.6, where no keyword specified
> is equivalent
> + as if FIRST was specified. In short, there is no
> real need to
> + specify num_sync as 
> this
> + behavior does not have changed, as well as it is not
> necessary to mention
> + pre-9.6 versions are the multi-sync grammar has been added in 9.6.
> This paragraph could be reworked, say:
> if FIRST or ANY is not specified, this parameter behaves as if ANY
> were used. Note that this grammar is incompatible with PostgreSQL 9.6,
> which is the first version supporting multiple standbys with
> synchronous replication, where no such keyword FIRST or ANY can be
> used; there the keyword-less grammar behaves as if FIRST were used,
> which is incompatible with the post-9.6 behavior.
>
> + Synchronous state of this standby server. quorum-N
> + , where N is the number of synchronous standbys that transactions
> + need to wait for replies from, when standby is considered as a
> + candidate of quorum commit.
> 

Re: [HACKERS] WIP: Covering + unique indexes

2016-11-14 Thread Brad DeJong
Anastasia, et al,

This is a review of including_columns_9.7_v5.patch.

I looked through the commit fest list and this patch was interesting and I 
wanted to try it.

I have used include columns on Microsoft SQL Server; DB2 for Linux, Unix, 
Windows; and DB2 for z/OS.

After reading the e-mail discussions, there are still points that I am not 
clear on.

Given "create table foo (a int, b int, c int, d int)" and "create unique index 
foo_a_b on foo (a, b) including (c)".

                                                  index only?   heap tuple needed?
   select a, b, c from foo where a = 1            yes           no
   select a, b, d from foo where a = 1            no            yes
   select a, b    from foo where a = 1 and c = 1  ?             ?

It was clear from the discussion that the index scan/access method would not 
resolve the "c = 1" predicate but it was not clear if "c = 1" could be resolved 
without accessing the heap tuple.
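
To make the question concrete, this is the experiment I would run once
the patch applies, using this patch's INCLUDING syntax (whether the
filter on c avoids heap fetches is exactly what is unclear to me):

  CREATE TABLE foo (a int, b int, c int, d int);
  CREATE UNIQUE INDEX foo_a_b ON foo (a, b) INCLUDING (c);
  -- load some rows and VACUUM first so an index-only scan is possible
  SET enable_seqscan = off;
  EXPLAIN (ANALYZE, BUFFERS)
    SELECT a, b FROM foo WHERE a = 1 AND c = 1;
  -- An "Index Only Scan" with a Filter on c and "Heap Fetches: 0"
  -- would suggest the c = 1 predicate is resolved without the heap.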

Are included columns counted against the 32-column and 2712-byte index limits?
I did not see either explicitly mentioned in the discussion or the 
documentation. I only ask because in SQL Server the limits are different for 
include columns.


1. syntax - on 2016-08-14, Andrey Borodin wrote "I think MS SQL syntax
INCLUDE instead of INCLUDING would be better". I would go further than
that. This feature is already supported by 2 of the top 5 SQL
databases, and they both use INCLUDE. Using different syntax because of
an internal implementation detail seems short-sighted.

2. The patch does not seem to apply cleanly anymore.

> git checkout master
Already on 'master'
> git pull
From http://git.postgresql.org/git/postgresql
   d49cc58..06f5fd2  master -> origin/master
Updating d49cc58..06f5fd2
Fast-forward
 src/include/port.h | 2 +-
 src/port/dirmod.c  | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)
> patch -p1 < including_columns_9.7_v5.patch
patching file contrib/dblink/dblink.c
...
patching file src/backend/parser/gram.y
...
Hunk #2 FAILED at 375.
...
1 out of 12 hunks FAILED -- saving rejects to file 
src/backend/parser/gram.y.rej
...
patching file src/bin/pg_dump/pg_dump.c
...
Hunk #8 FAILED at 6399.
Hunk #9 FAILED at 6426.
...
2 out of 13 hunks FAILED -- saving rejects to file 
src/bin/pg_dump/pg_dump.c.rej
...

3. After "fixing" compilation errors (best guess based on a similar
change in another location), "make check" failed.

> make check
...
parallel group (3 tests):  index_including create_view create_index
 create_index ... ok
 create_view  ... ok
 index_including  ... FAILED
...

Looking at regression.diffs, it looks like the output format of \d tbl 
changed (lines 288,300) from the expected "Column | Type | Modifiers" to 
"Column | Type | Collation | Nullable | Default".

4. documentation - minor items (these are not actual diffs)
  create_index.sgml
  -[ INCLUDING ( column_name )]
  +[ INCLUDING ( column_name [, ...] )]

   An optional INCLUDING clause allows a list of 
columns to be
  -specified which will be included in the index, in the non-key 
portion of the index.
  +specified which will be included in the non-key portion of the 
index.

  The whole paragraph on INCLUDING does not include many of the points 
raised in the feature discussions.

  create_table.sgml (for both UNIQUE and PRIMARY KEY constraints) (similar 
change could be made to nbtree/README)
  -Optional clause INCLUDING allows to add into 
the index
  -a portion of columns on which the uniqueness is not enforced 
upon.
  -Note, that althogh constraint is not enforced on included 
columns, it still
  -depends on them. Consequently, some operations on these columns 
(e.g. DROP COLUMN)
  -can cause cascade constraint and index deletion.

  +An optional INCLUDING clause allows a list of 
columns
  +to be specified which will be included in the non-key portion of 
the index.
  +Although uniqueness is not enforced on the included columns, the 
constraint
  +still depends on them. Consequently, some operations on the 
included columns
  +(e.g. DROP COLUMN) can cause cascading 
constraint and index deletion.

  indexcmds.c
  -* are key columns, and which are just part of the INCLUDING list 
by check
  +* are key columns, and which are just part of the INCLUDING list 
by checking

ruleutils.c
  -   * meaningless there, so do not include them into the 
message.
  +   * meaningless there, so do not include them in the 
message.

pg_dump.c (does "if (fout->remoteVersion >= 90600)" also need to change?)
  -   * In 9.6 we add 

Re: [HACKERS] Patch: Implement failover on libpq connect level.

2016-11-14 Thread Tsunakawa, Takayuki
From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Mithun Cy
> If you are suggesting that I change the protocol messages, I think that
> would not be backward compatible with older servers. I also think such
> a level of protocol change will not be allowed. With connection status
> CONNECTION_SETENV, used for protocol version 2.0 setup, we send a query
> like "select pg_catalog.pg_client_encoding()" for the same purpose. So
> I think using "SELECT pg_is_in_recovery()" should be fine.

No, there's no concern about compatibility.  Please look at this:

https://www.postgresql.org/docs/devel/static/protocol-flow.html#PROTOCOL-ASYNC

[Excerpt]

ParameterStatus messages will be generated whenever the active value changes 
for any of the parameters the backend believes the frontend should know about. 
Most commonly this occurs in response to a SET SQL command executed by the 
frontend, and this case is effectively synchronous — but it is also possible 
for parameter status changes to occur because the administrator changed a 
configuration file and then sent the SIGHUP signal to the server. Also, if a 
SET command is rolled back, an appropriate ParameterStatus message will be 
generated to report the current effective value.

At present there is a hard-wired set of parameters for which ParameterStatus 
will be generated: they are server_version, server_encoding, client_encoding, 
application_name, is_superuser, session_authorization, DateStyle, 
IntervalStyle, TimeZone, integer_datetimes, and standard_conforming_strings. 
(server_encoding, TimeZone, and integer_datetimes were not reported by releases 
before 8.0; standard_conforming_strings was not reported by releases before 
8.1; IntervalStyle was not reported by releases before 8.4; application_name 
was not reported by releases before 9.0.) Note that server_version, 
server_encoding and integer_datetimes are pseudo-parameters that cannot change 
after startup. This set might change in the future, or even become 
configurable. Accordingly, a frontend should simply ignore ParameterStatus for 
parameters that it does not understand or care about.
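
For reference, both mechanisms can be observed from SQL.  The
application_name value below is an arbitrary example; it is one of the
hard-wired reported parameters listed above:

  SELECT pg_is_in_recovery();      -- polling: returns true on a standby
  SET application_name = 'probe';  -- changing a reported GUC makes the
                                   -- backend send ParameterStatus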




Regards
Takayuki Tsunakawa




Re: [HACKERS] Floating point comparison inconsistencies of the geometric types

2016-11-14 Thread Kyotaro HORIGUCHI
Hello,

At Sun, 13 Nov 2016 22:41:01 +0100, Emre Hasegeli  wrote in 

> > We can remove the fuzz factor altogether but I think we also
> > should provide a means usable to do similar things. At least "is
> > a point on a line" might be useless for most cases without any
> > fuzzing feature. (Nevertheless, it is a problem only when it is
> > being used to do that:) If we don't find reasonable policy on
> > fuzzing operations, it would be the proof that we shouldn't
> > change the behavior.
> 
> It was my initial idea to keep the fuzzy comparison behaviour in some
> places, but the more I got into it, the more I realised that it is
> almost impossible to get this right.  Instead, I re-implemented some
> operators to keep precision as much as possible.  The previous "is a
> point on a line" operator would *never* give the true result without
> the fuzzy comparison.  The new implementation would return true when
> precision is not lost.

There's no way to accurately store numbers other than some special
ones in a floating point format, where mantissa precision is limited.
Even 0.1 is not stored accurately with a binary mantissa. (It would be
concealed by machine rounding in most cases, though.)  The same can be
said for numeric: it cannot store 1/3 accurately and doesn't even
conceal the inaccuracy. And even if they could store numbers
accurately, a constant such as pi still does not have infinite
precision.  As a result, we have a tradition that the equality
operator on real numbers should generally be avoided. Numerics are
provided mainly for financial use, where finite-but-high-precision
decimal arithmetic is performed.
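
Both points are easy to see from SQL:

  SELECT 0.1::float8 + 0.2::float8 = 0.3::float8;  -- false: binary rounding
  SELECT (1::numeric / 3) * 3 = 1;                 -- false: numeric cannot
                                                   -- store 1/3 exactly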

> I think this is a behaviour people who are working with floating
> points are prepared to deal with.

What way of dealing with it do you have in mind? The problem hides
behind the operators. To work around it, a user has to rewrite an
expression using more primitive operators. For example,
(line_a # point_a) has to be rewritten as
((point_a <-> lineseg_a) < EPSILON), or in an even more primitive way.
I regard this as meaning that the operator # just becomes useless.


> By the way, "is a point on a line" operator is quite wrong with
> the fuzzy comparison at the moment [1].

The bug is a problem separate from the epsilon behavior. It of course
should be fixed, but in an "appropriate" way that may be defined in
this discussion.

> > The 0001 patch adds many FP comparison functions individually
> > considering NaN. As the result the sort order logic involving NaN
> > is scattered around into the functions, then, you implement
> > generic comparison function using them. It seems inside-out to
> > me. Defining ordering at one place, then comparison using it
> > seems to be reasonable.
> 
> I agree that it would be simpler to use the comparison function for
> implementing other operators.  I have done it the other way around to
> make them more optimised.  They are called very often.  I don't think
> checking the return code of the comparison function would be optimised
> the same way.  I could leave the comparison functions as they are, but
> re-implement them using the others to keep documentation of NaN
> comparison in a single place.

Regarding optimization, at least gcc generates seemingly
not-so-different code for the two. Both generally generate extended
code directly calling isnan() and so on. Have you measured the
performance of the two implementations (with -O2, without
--enable-cassert)?  This kind of hand-optimization gains legitimacy
when we see a significant difference, according to the convention
here, I suppose.

At least the comment the patch drops,

 int
 float4_cmp_internal(float4 a, float4 b)
 {
-   /*
-* We consider all NANs to be equal and larger than any non-NAN. This is
-* somewhat arbitrary; the important thing is to have a consistent sort
-* order.
-*/

seems very significant and should be kept somewhere relevant.
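
That ordering is user-visible, so it can be checked directly:

  SELECT 'NaN'::float4 = 'NaN'::float4;       -- true: all NaNs compare equal
  SELECT 'NaN'::float4 > 'Infinity'::float4;  -- true: NaN sorts above non-NaN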


> > If the center somehow goes extremely near to the origin, it could
> > result in a false error.
> >
> >> =# select @@ box'(-8e-324, -8e-324), (4.9e-324, 4.9e-324)';
> >> ERROR:  value out of range: underflow
> >
> > I don't think this underflow is an error, and actually it is a
> > change of the current behavior without a reasonable reason. More
> > significant (and maybe unacceptable) side-effect is that it
> > changes the behavior of ordinary operators. I don't think this is
> > acceptable. More consideration is needed.
> >
> >> =# select ('-8e-324'::float8 + '4.9e-324'::float8) / 2.0;
> >> ERROR:  value out of range: underflow
> 
> This is the current behaviour of float datatype.  My patch doesn't
> change that. 

Very sorry for the mistake! I somehow saw float8pl on master and
float8_div with your patch. Please forget it.

> This problem would probably also apply to multiplying
> very small values.  I agree that this is not the ideal behaviour.
> Though I am not sure if we should go to a