Re: [HACKERS] GIN pending list clean up exposure to SQL

2016-01-22 Thread Julien Rouhaud
On 20/01/2016 15:17, Fujii Masao wrote:
> 
> When I compiled PostgreSQL with the patch, I got the following error.
> ISTM that the inclusion of pg_am.h header file is missing in ginfast.c.
> 
> ginfast.c: In function 'gin_clean_pending_list':
> ginfast.c:980: error: 'GIN_AM_OID' undeclared (first use in this function)
> ginfast.c:980: error: (Each undeclared identifier is reported only once
> ginfast.c:980: error: for each function it appears in.)
> 
> gin_clean_pending_list() must check whether the server is in recovery or not.
> If it's in recovery, the function must exit with an error. That is, IMO,
> something like the following check must be added.
> 
>     if (RecoveryInProgress())
>         ereport(ERROR,
>                 (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>                  errmsg("recovery is in progress"),
>                  errhint("GIN pending list cannot be cleaned up during recovery.")));
> 
> It's better to make gin_clean_pending_list() check whether the target index
> is a temporary index of another session or not, like pgstatginindex() does.
> 
> +Relation    indexRel = index_open(indexoid, AccessShareLock);
> 
> ISTM that AccessShareLock is not safe when updating the pending list and
> GIN index main structure. Probably we should use RowExclusiveLock?
> 

This lock should be safe, as pointed out by Alvaro and Jaime earlier in this
thread
(http://www.postgresql.org/message-id/20151119161846.GK614468@alvherre.pgsql),
as ginInsertCleanup() can be called concurrently.
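
For reference, the other suggestion above (rejecting temporary indexes of
other sessions) is handled in pgstatginindex() along these lines; a sketch,
assuming the index relation is already opened as indexRel:

    if (RELATION_IS_OTHER_TEMP(indexRel))
        ereport(ERROR,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                 errmsg("cannot access temporary indexes of other sessions")));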

Also, since 38710a374ea the ginInsertCleanup() call must be fixed:

-   ginInsertCleanup(, false, true, );
+   ginInsertCleanup(, true, );


Regards.

-- 
Julien Rouhaud
http://dalibo.com - http://dalibo.org




Re: [HACKERS] Combining Aggregates

2016-01-22 Thread Jeff Janes
On Wed, Jan 20, 2016 at 11:06 AM, Robert Haas  wrote:
> On Wed, Jan 20, 2016 at 7:38 AM, David Rowley
>  wrote:
>> Agreed. So I've attached a version of the patch which does not have any of
>> the serialise/deserialise stuff in it.
>
> I re-reviewed this and have committed most of it with only minor
> kibitzing.  A few notes:


This commit has broken pg_dump.  At least, I think this is the thread
referencing this commit:


commit a7de3dc5c346e07e0439275982569996e645b3c2
Author: Robert Haas 
Date:   Wed Jan 20 13:46:50 2016 -0500
Support multi-stage aggregation.


If I add an aggregate to an otherwise empty 9.4.5 database:


CREATE OR REPLACE FUNCTION first_agg ( anyelement, anyelement )
RETURNS anyelement LANGUAGE sql IMMUTABLE STRICT AS $$
SELECT $1;
$$;

-- And then wrap an aggregate around it
-- aggregates do not support create or replace, alas.
CREATE AGGREGATE first (
sfunc= first_agg,
basetype = anyelement,
stype= anyelement
);


And then run pg_dump from 9.6.this on it, I get:


pg_dump: column number -1 is out of range 0..17
Segmentation fault (core dumped)

Program terminated with signal 11, Segmentation fault.
#0  0x00416b0b in dumpAgg (fout=0x1e551e0, agginfo=0x1e65ec0)
at pg_dump.c:12670
12670           if (strcmp(aggcombinefn, "-") != 0)
(gdb) bt
#0  0x00416b0b in dumpAgg (fout=0x1e551e0, agginfo=0x1e65ec0)
at pg_dump.c:12670
#1  0x0041df7a in main (argc=,
argv=) at pg_dump.c:810
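
For context: 9.6's pg_dump queries the new pg_aggregate.aggcombinefn
column; against a pre-9.6 server PQfnumber() returns -1 for it, hence the
crash above. A sketch of the usual pg_dump remedy, a server-version guard
(an illustration under that assumption, not the actual fix):

    if (fout->remoteVersion >= 90600)
        appendPQExpBufferStr(query, "aggcombinefn,\n");
    else
        appendPQExpBufferStr(query, "'-' AS aggcombinefn,\n");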


Cheers,

Jeff




Re: [HACKERS] Releasing in September

2016-01-22 Thread Simon Riggs
On 22 January 2016 at 05:07, Noah Misch  wrote:

> On Wed, Jan 20, 2016 at 06:58:24PM +, Simon Riggs wrote:
> > The main problem is the length of the integration phase, which is mostly
> > where nothing happens.
>
> The open items wiki page saw steady change from 30 April to 28 December[1];
> the two longest quiet periods were an eleven-day gap from 21 August and a
> nine-day gap from 16 December.  Nineteen distinct contributors were
> finding, fixing or otherwise editing data about 9.5 defects.  The
> integration work is too slow, yes, but this persistent myth of a dead
> period is false.
>
> For my part, I spent 1 July to 11 December reviewing 9.5 commits and
> reporting or fixing defective 9.5 work.  (I did spoil myself with a few breaks fixing
> pre-9.5 bugs.)  I doubt the 9.5 new bugs would have run dry within the next
> five months, either, but the release team picked a fair juncture to give up
> and call it dot-zero.
>
> nm
>
> [1]
> https://wiki.postgresql.org/index.php?title=PostgreSQL_9.5_Open_Items&limit=500&action=history
>

Thanks for explaining that, and thanks for that work. Had I known you were
doing that I wouldn't have said "nothing happens".

What I do observe is that we have N developers writing patches, C
committers committing patches and R people reviewing things during the
integration phase.

N > C > R so that N >> R

Meaning that during the integration phase our parallel efficiency drops
significantly, but as you say, not to zero. Given that the integration
phase is frequently at least as long as the development phase we end up
operating at about 0.5-0.6 of maximum efficiency. I would like to improve
on that, if possible.

I'll do what I can to shorten the integration phase, including the
prioritization/triage step I volunteered for, applied to the last CF.

-- 
Simon Riggs                http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Performance improvement for joins where outer side is unique

2016-01-22 Thread David Rowley
On 23 January 2016 at 05:36, Tomas Vondra  wrote:
> OK. I've looked at the patch again today, and it seems broken by 45be99f8 as
> the partial paths were not passing the unique_inner to the create_*_path()
> functions. The attached patch should fix that.
>

Thanks for looking at this again, and thanks for fixing that problem.

I've attached an updated patch which incorporates your change, and
also another 2 small tweaks that I made after making another pass over
this myself.

> Otherwise I think the patch is ready for committer - I think this is a
> valuable optimization and I see nothing wrong with the code.
>
>
> Any objections to marking it accordingly?

None from me. I think it's had plenty of review time over the last year.

The only other thing I can possibly think of is to apply most of the
analyzejoins.c changes as a refactor patch first. If committer wants
that, I can separate these out, but I'll hold off for a response
before doing that.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


unique_joins_ba5b9cb_2016-01-23.patch
Description: Binary data



Re: [HACKERS] Combining Aggregates

2016-01-22 Thread David Rowley
On 23 January 2016 at 09:17, Jeff Janes  wrote:
> On Wed, Jan 20, 2016 at 11:06 AM, Robert Haas  wrote:
>> On Wed, Jan 20, 2016 at 7:38 AM, David Rowley
>>  wrote:
>>> Agreed. So I've attached a version of the patch which does not have any of
>>> the serialise/deserialise stuff in it.
>>
>> I re-reviewed this and have committed most of it with only minor
>> kibitzing.  A few notes:
>
>
> This commit has broken pg_dump.  At least, I think this is the thread
> referencing this commit:
>
...
> pg_dump: column number -1 is out of range 0..17
> Segmentation fault (core dumped)
>
> Program terminated with signal 11, Segmentation fault.
> #0  0x00416b0b in dumpAgg (fout=0x1e551e0, agginfo=0x1e65ec0)
> at pg_dump.c:12670
> 12670           if (strcmp(aggcombinefn, "-") != 0)
> (gdb) bt
> #0  0x00416b0b in dumpAgg (fout=0x1e551e0, agginfo=0x1e65ec0)
> at pg_dump.c:12670
> #1  0x0041df7a in main (argc=,
> argv=) at pg_dump.c:810

Yikes :-( I will look at this with priority in the next few hours.

Thanks for the report.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Proposal: Trigonometric functions in degrees

2016-01-22 Thread Tom Lane
Dean Rasheed  writes:
> I tried using feclearexcept + fetestexcept as recommended by
> POSIX.1-2008, and that does indeed make the above examples compliant
> on my linux box. Unfortunately it introduces new errors -- asin starts
> generating FE_UNDERFLOW errors for numbers that are really not that
> small, such as asin(1e-9), although the function still returns the
> correct answer. A bigger problem though is that these are C99
> functions and so won't necessarily be available on all supported
> platforms.

The latter issue could be dealt with via configure tests, but I agree that
if we get new error conditions we weren't exactly looking for, it would
not be a net improvement.  I think ensuring that the Inf and NaN cases are
platform-independent is already a good step forward, so we can stop there
for now.
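
For reference, the C99 pattern Dean describes looks roughly like this (a
sketch, not the posted patch):

    #include <fenv.h>
    #include <math.h>

    double
    checked_asin(double x)
    {
        double      result;

        feclearexcept(FE_ALL_EXCEPT);   /* reset the FP exception flags */
        result = asin(x);
        if (fetestexcept(FE_INVALID))
        {
            /* input was outside [-1, 1]; report an error here */
        }
        return result;
    }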

> Attached are patches for this and the new functions in degrees, now
> with POSIX compatible Inf/NaN handling.

Pushed with minor, mostly cosmetic fixes.  I did renumber the function
OIDs to be closer to the original trig functions' numbers, using some
OIDs that were freed up by the recent index AM API change.

regards, tom lane




Re: [HACKERS] Declarative partitioning

2016-01-22 Thread Corey Huinker
>
>
> So for now, you create an empty partitioned table specifying all the
> partition keys without being able to  define any partitions in the same
> statement. Partitions (and partitions thereof, if any) will be created
> using CREATE PARTITION statements, one for each.
>

...and I would assume that any attempt to insert into a partitioned table
with no partitions (or lacking partitions at a defined level) would be an
error? If so, I'd be ok with that.


> Specifying range partitioning bound as PARTITION FOR RANGE 
> sounds like it offers some flexibility, which can be seen as a good thing.
> But it tends to make internal logic slightly complicated.
>
> Whereas, saying PARTITION FOR VALUES LESS THAN (max1, max2, ...) is
> notationally simpler still and easier to work with internally. Also, there
> will be no confusion about exclusivity of the bound if we document it so.


I understand wanting the internal rules to be simple. Oracle clearly went
with VALUES LESS THAN waterfalls for that reason.

What I'm hoping to avoid is:
- having to identify my "year2014" partition by VALUES LESS THAN
'2015-01-01', a bit of cognitive dissonance defining data by what it's not.
- and then hoping that there is a year2013 partition created by someone
with similar sensibilities, the partition definition being incomplete
outside of the context of other partition definitions.
- and then further hoping that nobody drops the year2013 partition, thus
causing new 2013 rows to fall into the year2014 partition, a side effect of
an operation that did not mention the year2014 partition.

Range types do that, and if we're concerned about range-type overhead,
we're only dealing with the ranges at DDL time; we can break down the ATR
rules into a more easily digestible form once the partition is modified.

Range continuity can be tested with -|-, but we'd only need to test for
overlaps: gaps in ranges are sometimes a feature, not a bug (ex: I don't
want any rows from future dates and we weren't in business before 1997).
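
For illustration, with built-in range types the two checks are one operator
each (hypothetical bounds):

-- overlap check: false, so these two partitions don't collide
SELECT numrange(0.0, 212.0, '[]') && numrange(212.0, 3000.0, '(]');
-- adjacency check: true, so there is no gap between them either
SELECT numrange(0.0, 212.0, '[]') -|- numrange(212.0, 3000.0, '(]');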

Also, VALUES LESS THAN forces us to use discrete values.  There is no way
to express with VALUES LESS THAN partitions that have float values for
temperature:
ice (,0.0), water [0.0,212.0], steam (212.0,3000.0], plasma (3000.0,).

Yes, I can calculate the day after the last day in a year, I can use
212.01, I can write code to rigorously check that all partitions
are in place. I'd just rather not.

p.s. I'm really excited about what this will bring to Postgres in general
and my organization in particular. This feature alone will help chip away
at our needs for Vertica and Redshift clusters. Let me know if there's
anything I can do to help.


Re: [HACKERS] [patch] Proposal for \crosstabview in psql

2016-01-22 Thread Daniel Verite
  Hi,

Here's an updated patch improving on how the horizontal and vertical
headers can be sorted.

The discussion upthread went into how it was desirable
to have independent sorts for these headers, possibly driven
by another column, in addition to the query's ORDER BY.

Thus the options now accepted are:

\crosstabview [ [-|+]colV[:scolV] [-|+]colH[:scolH]  [colG1[,colG2...]] ]

The optional scolV/scolH columns drive the sorts for colV/colH
respectively (colV:scolV roughly means SELECT colV FROM ... ORDER BY scolV).

colG1,... in the 3rd argument indicates the columns whose contents form the grid
cells, the typical use case being that there's only one such column.
By default it's all columns minus colV and colH.

For example,

SELECT
  cust_id,
  cust_name,
  cust_date,
  date_part('month', sales_date),
  to_char(sales_date, 'Mon') as month,
  amount
FROM sales_view
WHERE [predicates]
[ORDER BY ...]

If we want to look at amounts in a grid with month names across, sorted
by month number, and customer names in the vertical header, sorted by date of
acquisition, we could do this:

\crosstabview +cust_name:cust_date +5:4 amount

or, letting the vertical header be sorted by the query's ORDER BY,
and the horizontal header same as above:

\crosstabview cust_name +5:4 amount

or sorting vertically by name, if it happens that the ORDER BY is missing or
is on something else:

\crosstabview +cust_name +5:4 amount
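
With the first invocation above, the output would look something like this
(purely illustrative data):

 cust_name | Jan | Feb | Mar
-----------+-----+-----+-----
 Acme      | 100 |     | 250
 Bcorp     |  80 | 120 |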


Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index 6d0cb3d..a242ec4 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -990,6 +990,123 @@ testdb=
   
 
   
+\crosstabview [
+[-|+]colV
+[:scolV]
+[-|+]colH
+[:scolH]
+[colG1[,colG2...]]
+] 
+
+
+Execute the current query buffer (like \g) and show
+the results inside a crosstab grid.
+The output column colV
+becomes a vertical header, optionally sorted by scolV,
+and the output column colH
+becomes a horizontal header, optionally sorted by
+scolH.
+
+colG1[,colG2...]
+is the list of output columns to project into the grid.
+By default, all output columns of the query except 
+colV and
+colH
+are included in this list.
+
+
+
+All columns can be referred to by their position (starting at 1), or by
+their name. Normal case folding and quoting rules apply on column
+names. By default,
+colV corresponds to column 1
+and colH to column 2.
+A query having only one output column cannot be viewed in crosstab, and
+colH must differ from
+colV.
+
+
+
+The vertical header, displayed as the leftmost column,
+contains the set of all distinct values found in
+column colV, in the order
+of their first appearance in the query results,
+or in ascending order if a plus (+) sign precedes
+colV,
+or in descending order if it's a minus (-) sign.
+
+
+The horizontal header, displayed as the first row,
+contains the set of all distinct non-null values found in
+column colH.  They come
+by default in their order of appearance in the query results,
+or in ascending order if a plus (+) sign precedes
+colH,
+or in descending order if it's a minus (-) sign.
+
+Also, they can optionally be sorted by another column, if
+colV
+(respectively colH) is
+immediately followed by a colon and a reference to another column
+scolV
+(respectively scolH).
+
+
+
+Inside the crosstab grid,
+given a query output with N columns
+(including colV and
+colH),
+for each distinct value x of
+colH
+and each distinct value y of
+colV,
+the contents of a cell located at the intersection
+(x,y) is determined by these rules:
+
+
+
+ if there is no corresponding row in the query results such that the
+ value for colH
+ is x and the value
+ for colV
+ is y, the cell is empty.
+
+
+
+
+
+ if there is exactly one row such that the value
+ for colH
+ is x and the value
+ for colV
+ is y, then the N-2 other
+ columns or the columns listed in
+ colG1[,colG2...]
+ are displayed in the cell, separated between each other by
+ a space character if needed.
+
+ If N=2, the letter X is displayed
+ in the cell as if a virtual third column contained that character.
+
+
+
+
+
+ if there are 

Re: [HACKERS] Patch: Implement failover on libpq connect level.

2016-01-22 Thread Alvaro Herrera
You're editing the expected file for the libpq-regress thingy, but you
haven't added any new lines to test the new capability.  I think it'd be
good to add some there.  (I already said this earlier in the thread; is
there any reason you ignored it the first time?)

If the test program requires improvement to handle the new stuff, let's
do that.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Releasing in September

2016-01-22 Thread Magnus Hagander
On Fri, Jan 22, 2016 at 7:19 PM, Andres Freund  wrote:

> On 2016-01-22 08:18:45 -0600, Jim Nasby wrote:
> > Personally, I don't see why we have our scarcest resource doing what is
> > essentially a project management task, especially when at least one
> > commercial company has offered to donate paid staff time.
>
> Because so far all CFs that weren't managed by somebody involved on the
> code level worked even less than the rest. You need to be able to judge
> how complex and ready a patch is to some degree. You need to know who to
> prod for a specific patch.
>
> Yes, a PM can learn to do that. But it's not something a freshly hired
> person can do. You need a fair bit of community involvement.
>

I definitely agree.

However, if someone were to put forward a person who at least partially
qualifies towards that, I'm all for giving it a try at least once. But so
far I don't recall seeing any actual proposal of that *directly*, just very
vague "there are things others could do". But if/when they do, I definitely
think it's worth trying.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] Patch: Implement failover on libpq connect level.

2016-01-22 Thread Victor Wagner
On Tue, 19 Jan 2016 14:34:54 +
Thom Brown  wrote:

> 
> The segfault issue I originally reported now appears to be resolved.
> 
> But now I have another one:
> 
> psql
> 'postgresql://thom@127.0.0.1:5530,127.0.0.1:5531,127.0.0.1:5532,127.0.0.1:5533/postgres?hostorder=random&readonly=1&failover_timeout=5'
> -c 'show port'

Here is a new version of the patch. I've now reworked hostorder=random and
it seems to work as well as sequential. failover_timeout works too.
I've also found a case where a connection attempt fails upon receiving a
FATAL message from a server which is not properly up yet. So, that is fixed
too.

Additionally, error messages from all failed connection attempts are no
longer accumulated; only the last one is returned.





-- 
   Victor Wagner 
diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index 2328d8f..3c7bd15 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -792,7 +792,7 @@ host=localhost port=5432 dbname=mydb connect_timeout=10

The general form for a connection URI is:
 
-postgresql://[user[:password]@][netloc][:port][/dbname][?param1=value1...]
+postgresql://[user[:password]@][netloc][:port][,netloc[:port]...][/dbname][?param1=value1...]
 

 
@@ -809,6 +809,7 @@ postgresql://localhost/mydb
 postgresql://user@localhost
 postgresql://user:secret@localhost
postgresql://other@localhost/otherdb?connect_timeout=10&application_name=myapp
+postgresql://node1,node2:5433,node3:4432,node4/mydb?hostorder=random&readonly=1
 
 Components of the hierarchical part of the URI can also
 be given as parameters.  For example:
@@ -831,7 +832,9 @@ postgresql:///mydb?host=localhostport=5433

 For improved compatibility with JDBC connection URIs,
 instances of parameter ssl=true are translated into
-sslmode=require.
+sslmode=require and
+loadBalanceHosts=true  into
+hostorder=random.

 

@@ -841,6 +844,10 @@ postgresql:///mydb?host=localhostport=5433
 postgresql://[2001:db8::1234]/database
 

+   
+   There can be several host specifications, each optionally accompanied
+   by a port, separated by commas.
+   
 

 The host component is interpreted as described for the parameter PostgreSQL was built). On machines without
 Unix-domain sockets, the default is to connect to localhost.

+   
+   There can be more than one host parameter in
+   the connect string. In this case these hosts are considered
+   alternate entries for the same database, and if connecting to the first
+   one fails, the library tries to connect to the second, etc. This can be
+   used for a high-availability cluster or for load balancing. See
+parameter.
+   
+   
+   A network host name can be accompanied by a port number, separated by a
+   colon. If so, this port number is used only when connecting to
+   this host. If there is no port number, the port specified in the
+parameter would be used.
+   
   
  
 
@@ -942,8 +963,44 @@ postgresql://%2Fvar%2Flib%2Fpostgresql/dbname


   
-
+
+  
+  hostorder
+  
+  
+  Specifies how to choose a host from the list of alternate hosts,
+  specified in the  parameter.
+  
+  
+  If the value of this argument is sequential (the
+  default), the library connects to the hosts in the order they are
+  specified, and tries to connect to the second one only if connection
+  to the first fails.
+  
+  
+  If the value is random, the host to connect to is
+  randomly picked from the list. This allows balancing the load between
+  several cluster nodes. However, currently PostgreSQL doesn't support
+  multimaster clusters. So, without the use of third-party products,
+  only read-only connections can take advantage of the
+  load balancing. See 
+  
+  
+  
+  
+  readonly
+  
+  
+  If this parameter is 0 (the default), upon successful connection the
+  library checks if the host is in recovery, and if so,
+  tries the next host in the connect string. If this parameter is
+  non-zero, connections to warm standby nodes are considered
+  successful.
+  
+  
+  
   
+
port


@@ -985,7 +1042,6 @@ postgresql://%2Fvar%2Flib%2Fpostgresql/dbname
   
   
  
-
  
   connect_timeout
   
@@ -996,7 +1052,27 @@ postgresql://%2Fvar%2Flib%2Fpostgresql/dbname
   
   
  
-
+ 
+ failover_timeout
+ 
+ 
+ Maximum time to cyclically retry all the hosts in the connect string
+ (as a decimal integer number of seconds). If not specified, the
+ hosts are tried just once.
+ 
+ 
+ If we have a replicated cluster, and the master node fails, it might
+ take some time to promote one of the standby nodes to be the new master.
+ So clients which notice that connecting to the master fails can
+ already give up attempt to reestablish a 

Re: [HACKERS] Releasing in September

2016-01-22 Thread Jim Nasby

On 1/20/16 4:29 PM, Bruce Momjian wrote:

On Wed, Jan 20, 2016 at 09:12:07AM -0800, Joshua Drake wrote:

I just don't buy the Ubuntu release model for our database.  Ubuntu is
trying to balance hot features vs stability, while we are really focused
on stability, similar to Debian.


I understand but I think we are missing out on an opportunity here.
Notice that the shorter release cycle for STS will actually make
some things easier. Including:

  * Increased test base (just like Fedora/Ubuntu)
  * Increased early adopter testing (that is what early adopting is
really about for us anyway)
  * Decreased concerns about upgrades and ability to extend upgrade status.


I can see LTS working for plugin change, but not server binary changes.


s/LTS/STS/?

In any case, I think JD is onto something here. As someone that focuses 
more on user experience than "deep core" code, I already find yearly 
releases to be quite inconvenient. It's hard to find the motivation to 
make a minor improvement in something (especially knowing how hard it 
will be to get the patch approved) knowing that it won't see the light 
of day for a year, and realistically I won't be able to use it with any 
clients that are in production for 2-3 years.


Given the high level of extensibility that we have, maybe it would be 
good to logically segregate stuff into things that are deeply embedded 
in the "core" code (ie: on-disk format) from things that are much easier 
to change when necessary (like add-on functions or PLs). Things like new 
JSON operators could be released much more rapidly that way.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] Releasing in September

2016-01-22 Thread Simon Riggs
On 22 January 2016 at 16:34, Robert Haas  wrote:


> For my part, I am not sure the names in the release notes are actually
> all that helpful.


It has one important effect of current interest: establishing the truth
that multiple people and multiple companies are involved in producing and
maintaining PostgreSQL. Whether the names are properly attributed will
always be a time-consuming task, but I will oppose any attempt to remove or
obscure evidence of who develops PostgreSQL, wherever that occurs.

-- 
Simon Riggs                http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Releasing in September

2016-01-22 Thread Andres Freund
On 2016-01-22 08:50:15 -0600, Jim Nasby wrote:
> I think that's a great way to ensure we shrink the pool of reviewers when
> someone works on a patch and then it goes nowhere.

True, it really sucks. But what's your counter proposal? Commitfests
dragging on forever, and people burning out on continually feeling they
need to work on them? Hasn't worked out well.




Re: [HACKERS] proposal: PL/Pythonu - function ereport

2016-01-22 Thread Jim Nasby

On 1/21/16 4:57 PM, Pavel Stehule wrote:

It is not correct - outside PL/Python you get an Error (PostgreSQL errors
have no classes), and the raising class (Error or
SPIError) isn't important. Inside PL/Python you will get SPIError or its
successors (based on the SQLSTATE).


Right. The closest thing we have to error classes is SQLSTATE. If 
someone found a clever way to set up an exception inheritance tree[1] on 
that then maybe different exceptions would make sense. Short of that, I 
don't see it.


[1] There's a hierarchy to the SQL state codes, based on the first 2 
characters. So if there was...


class connection_exception(spi_exception):
    def __init__(self):
        self.str = 'Connection Exception'

class connection_does_not_exist(connection_exception):
    def __init__(self):
        self.str = 'Connection Does Not Exist'

...

to map to the small set of errors below, maybe that would make sense. 
Obviously that would need to be auto-generated (see the sketch after the 
table below). It seems more trouble than it's worth though.



Section: Class 08 - Connection Exception

08000    E    ERRCODE_CONNECTION_EXCEPTION                                 connection_exception
08003    E    ERRCODE_CONNECTION_DOES_NOT_EXIST                            connection_does_not_exist
08006    E    ERRCODE_CONNECTION_FAILURE                                   connection_failure
08001    E    ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION          sqlclient_unable_to_establish_sqlconnection
08004    E    ERRCODE_SQLSERVER_REJECTED_ESTABLISHMENT_OF_SQLCONNECTION    sqlserver_rejected_establishment_of_sqlconnection
08007    E    ERRCODE_TRANSACTION_RESOLUTION_UNKNOWN                       transaction_resolution_unknown
08P01    E    ERRCODE_PROTOCOL_VIOLATION                                   protocol_violation
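
A hedged sketch of the auto-generation idea; spi_error and the
(code, name) pairs here are placeholders, not PL/Python's actual API:

class spi_error(Exception):
    pass

codes = [
    ('08000', 'connection_exception'),
    ('08003', 'connection_does_not_exist'),
    ('08006', 'connection_failure'),
]

classes = {}
for code, name in sorted(codes):
    # a code ending in '000' roots its two-character class; the other
    # codes in that class become its subclasses
    family = code[:2] + '000'
    parent = classes.get(family, spi_error) if code != family else spi_error
    classes[code] = type(name, (parent,), {'sqlstate': code})

assert issubclass(classes['08003'], classes['08000'])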



--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] Releasing in September

2016-01-22 Thread Jim Nasby

On 1/21/16 2:29 AM, Amit Kapila wrote:

I also think there should be some way to give credit to the CFM; if it is
difficult to do anything related to money, then we can enforce that if the
CFM submits any patches for the next CF, those should be prioritised.


Personally, I don't see why we have our scarcest resource doing what is 
essentially a project management task, especially when at least one 
commercial company has offered to donate paid staff time.


I don't think the last CF of 9.6 is the time to experiment, but I think 
we should try using a PM for the first CF of 9.7.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] Releasing in September

2016-01-22 Thread Andres Freund
On 2016-01-22 08:40:28 -0600, Jim Nasby wrote:
> Ideally reviewers shouldn't be doing any testing, because the tests
> that are part of the patch should answer every question they would
> have, but I don't see that happening until we have a separate
> automation-only target that we don't care how long it takes to run.

I think that's completely wrong.

Yes, more tests are good, and we need a place for longer running
tests. But assuming that every patch author will create a testsuite that
covers every angle is just about akin to assuming every submitter will
deliver perfect, bug free code. And we know how well that turns out.

I think actively trying to break a feature, and postgres in general, is
one of the most important tasks of reviewers and testers. And with that
I don't mean trying to run "make check". Look e.g. at the tests Jeff
Janes has performed, what the recent plug tests of Tomas Vondra brought
to light, or at what the full page write checker tool of Heikki's
showed.

Andres




Re: [HACKERS] Releasing in September

2016-01-22 Thread Andres Freund
On 2016-01-22 08:18:45 -0600, Jim Nasby wrote:
> Personally, I don't see why we have our scarcest resource doing what is
> essentially a project management task, especially when at least one
> commercial company has offered to donate paid staff time.

Because so far all CFs that weren't managed by somebody involved on the
code level worked even less than the rest. You need to be able to judge
how complex and ready a patch is to some degree. You need to know who to
prod for a specific patch.

Yes, a PM can learn to do that. But it's not something a freshly hired
person can do. You need a fair bit of community involvement.

Andres




Re: [HACKERS] Releasing in September

2016-01-22 Thread Jim Nasby

On 1/20/16 11:40 AM, Tom Lane wrote:

Yeah.  It's certainly unfair if someone's patch doesn't get reviewed,
but there are only 24 hours in a day, and we have a limited pool of
reviewer and committer manpower.  I think we just have to say that
sometimes life is unfair.


I think that's a great way to ensure we shrink the pool of reviewers 
when someone works on a patch and then it goes nowhere. I find it rather 
difficult to get feedback on ideas before I spend the time to code 
something, it's got to be even worse for someone the community doesn't 
know. So if we're going to do this, I think there must be a mechanism 
for a patch idea/design to be approved.


I think we also need to be careful about -hackers being the only place 
feature desirability is measured. There's an entire world of users out 
there that aren't even on -general. If some feature doesn't really 
interest -hackers but there's 50 users that want it and someone willing 
to work on it, ISTM we should make efforts to get it committed.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] Releasing in September

2016-01-22 Thread Jim Nasby

On 1/20/16 11:49 PM, Tom Lane wrote:

Michael Paquier  writes:

On Thu, Jan 21, 2016 at 2:30 PM, Peter Geoghegan  wrote:

What benefit does porting sqlsmith for inclusion in core have? I can
only think of costs, including those that you mentioned.



We have automatic buildfarm coverage on many platforms. Perhaps we
could live without that with a buildfarm module though.


I do not think we should necessarily try to include every testing tool
in the core distribution.  What is important is that they be readily
available: easy to find, easy to use, documented, portable.  "Same
license as the PG core code" is not on that list.

An immediately relevant example is that the buildfarm server and client
code aren't in the core distribution, and AFAIR no one has suggested
that they need to be.


Right. What I think would be far more useful is making it easier to 
explicitly test things (better tools + design for test), and something 
akin to buildfarm that will run automated testing on submitted patches.


Put another way: it's stupid that we even ask reviewers to waste time 
running make check. That can be automated. Ideally reviewers shouldn't 
be doing any testing, because the tests that are part of the patch 
should answer every question they would have, but I don't see that 
happening until we have a separate automation-only target that we don't 
care how long it takes to run.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] count_nulls(VARIADIC "any")

2016-01-22 Thread Jim Nasby

On 1/21/16 1:48 PM, Pavel Stehule wrote:

the form of the regression tests is not a particularly significant
issue. Jim's design is a little more transparent, Marko's is maybe a
little more practical. Both make sense in my opinion, and neither
has a significant advantage over the other.


Is there any possible agreement on how these tests should be designed?

It's a simple patch with simple regression tests, so there is no reason for a long wait.


I don't really see how individual tests are more practical (you can 
still cut and paste a table...), but since there's no strong consensus 
either way I'd say it's up to you as author.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] Extracting fields from 'infinity'::TIMESTAMP[TZ]

2016-01-22 Thread Vik Fearing
On 01/22/2016 04:28 AM, Tom Lane wrote:
> Vik Fearing  writes:
>> I looked around for other places where this code should be used and
>> didn't find any.  I am marking this patch Ready for Committer.
> 
> I pushed this with some adjustments, mainly to make sure that the
> erroneous-units errors exactly match those that would be thrown in
> the main code line.

Thanks!
-- 
Vik Fearing  +33 6 46 75 15 36
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support




Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-01-22 Thread Aleksander Alekseev
Hi,

> First of all, why not merge both patches into one? They aren't too
> big anyway.

Agree.

> I think comments should be changed, to be more informative here.

> Add a comment here too.

> Maybe you should explain this magic number 7 in the comment above?

Done.

> Then, this thread became too tangled. I think it's worth writing a
> new message with the patch, the test script, some results and a brief
> overview of how it really works. It will make further review
> much easier.

Sure.

HASHHDR represents a hash table, which can be regular or partitioned. A
partitioned table is stored in shared memory and accessed by multiple
processes simultaneously. To prevent data corruption the hash table is
partitioned and each process has to acquire a lock for the corresponding
partition before accessing data in it. The number of the partition is
determined by the lower bits of the key's hash value. The trickiest part
is that dynahash knows nothing about these locks --- all locking is done
on the calling side.
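
Roughly like this (the real code uses e.g. LockHashPartition() in lock.h;
the partition count here is illustrative):

#define NUM_PARTITIONS  16      /* must be a power of 2 */
#define PARTITION(hashcode) ((uint32) (hashcode) % NUM_PARTITIONS)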

Since shared memory is pre-allocated and can't grow, we use a freeList to
allocate memory in a hash table. We also use the nentries field to count
the current number of elements in the hash table. Since the hash table is
used by multiple processes, these fields are protected by a spinlock ---
which, as it turned out, can cause lock contention and create a bottleneck.

After trying a few approaches I discovered that using one spinlock per
partition works quite well. Here are the latest benchmark results:

http://www.postgresql.org/message-id/20151229184851.1bb7d1bd@fujitsu
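
The idea, sketched (field names are illustrative, not necessarily what the
patch uses): each partition gets its own free list and counter, protected
by its own spinlock, instead of one global spinlock-protected pair in
HASHHDR.

typedef struct
{
    slock_t     mutex;          /* spinlock protecting this partition */
    long        nentries;       /* number of entries in this partition */
    HASHELEMENT *freeList;      /* chain of free elements */
} FreeListData;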

Note that the "no locks" solution can't be used because in some corner
cases it doesn't guarantee that all available memory will be used.

You can find a few more details and a test script in the first message
of this thread. If you have any other questions regarding this patch
please don't hesitate to ask.

Best regards,
Aleksander

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 9673fe0..0c8e4fb 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -495,7 +495,7 @@ pgss_shmem_startup(void)
 	info.hash = pgss_hash_fn;
 	info.match = pgss_match_fn;
 	pgss_hash = ShmemInitHash("pg_stat_statements hash",
-			  pgss_max, pgss_max,
+			  pgss_max,
 			  ,
 			  HASH_ELEM | HASH_FUNCTION | HASH_COMPARE);
 
diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index 39e8baf..dd5acb7 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -62,7 +62,7 @@ InitBufTable(int size)
 	info.num_partitions = NUM_BUFFER_PARTITIONS;
 
 	SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
-  size, size,
+  size,
   ,
   HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
 }
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 81506ea..4c18701 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -237,7 +237,7 @@ InitShmemIndex(void)
 	hash_flags = HASH_ELEM;
 
 	ShmemIndex = ShmemInitHash("ShmemIndex",
-			   SHMEM_INDEX_SIZE, SHMEM_INDEX_SIZE,
+			   SHMEM_INDEX_SIZE,
 			   , hash_flags);
 }
 
@@ -255,17 +255,12 @@ InitShmemIndex(void)
  * exceeded substantially (since it's used to compute directory size and
  * the hash table buckets will get overfull).
  *
- * init_size is the number of hashtable entries to preallocate.  For a table
- * whose maximum size is certain, this should be equal to max_size; that
- * ensures that no run-time out-of-shared-memory failures can occur.
- *
  * Note: before Postgres 9.0, this function returned NULL for some failure
  * cases.  Now, it always throws error instead, so callers need not check
  * for NULL.
  */
 HTAB *
 ShmemInitHash(const char *name, /* table string name for shmem index */
-			  long init_size,	/* initial table size */
 			  long max_size,	/* max size of the table */
 			  HASHCTL *infoP,	/* info about key and bucket size */
 			  int hash_flags)	/* info about infoP */
@@ -299,7 +294,7 @@ ShmemInitHash(const char *name, /* table string name for shmem index */
 	/* Pass location of hashtable header to hash_create */
 	infoP->hctl = (HASHHDR *) location;
 
-	return hash_create(name, init_size, infoP, hash_flags);
+	return hash_create(name, max_size, infoP, hash_flags);
 }
 
 /*
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 9c2e49c..8d9b36a 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -373,8 +373,7 @@ void
 InitLocks(void)
 {
 	HASHCTL		info;
-	long		init_table_size,
-max_table_size;
+	long		max_table_size;
 	bool		found;
 
 	/*
@@ -382,7 +381,6 @@ InitLocks(void)
 	 * calculations must agree with LockShmemSize!
 	 */
 	max_table_size = NLOCKENTS();
-	init_table_size = max_table_size / 2;
 
 	/*
 	 * Allocate hash table for LOCK structs.  This stores 

Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-01-22 Thread Aleksander Alekseev
This patch affects header files. Did you by any chance forget to run
`make clean` after applying it? As we discussed above, when you
change .h files autotools doesn't rebuild dependent .c files:

http://www.postgresql.org/message-id/caf4au4y4mtapp2u3ugtbqd_z0chesjwfnavrgk4umfcwa4e...@mail.gmail.com

On Thu, 21 Jan 2016 16:49:12 +0530
Dilip Kumar  wrote:

> On Tue, Jan 12, 2016 at 1:50 PM, Aleksander Alekseev <
> a.aleks...@postgrespro.ru> wrote:
> 
> >
> > increasing number of lock partitions (see columns "no locks",
> > "lwlock" and "spinlock array"). Previously it couldn't help us (see
> > "master" column) because of a bottleneck.
> >
> > If you have any other questions regarding this patch please don't
> > hesitate to ask.
> >
> 
> I have done some performance bench marking for this
> patch.(dynahash-optimization-v10-step1.patch)
> 
> Machine Detail:
> cpu   : POWER8
> cores: 24 (192 with HT)
> 
> Non Default Parameter:
> 
> Shared Buffer= 30GB
> max_wal_size= 10GB
> max_connections=500
> 
> Test1:
> *schema.sql* and *pgbench.sql* are the same files which Aleksander
> attached in the first mail of the thread.
> 
> psql -d postgres -f schema.sql
> pgbench -c$ -j$ -f pgbench.sql  postgres
> 
> client    base    patch
> 1          145      145
> 2          262      258
> 4          453      472
> 8          749      732
> 16        1114     1128
> 32        1727     1789
> 64        2729     2793
> 128       3534     3520
> 
> With this test I did not see any improvement, though I think it should
> improve performance. Do you have any suggestions for reproducing
> results like yours?
> 
> I have also dump stacks using some script and I have observed many
> stacks dumps as you mentioned in initial thread. And after patch, I
> found that number of lock waiting stacks are reduced.
> 
> Stack Dump:
> ---
> #0  0x7f25842899a7 in semop () from /lib64/libc.so.6
> #1  0x007116d0 in PGSemaphoreLock (sema=0x7f257cb170d8) at
> pg_sema.c:387
> #2  0x0078955f in LWLockAcquire (lock=0x7f247698a980,
> mode=LW_EXCLUSIVE) at lwlock.c:1028
> #3  0x007804c4 in LockAcquireExtended
> #4  0x0077fe91 in LockAcquire
> #5  0x0077e862 in LockRelationOid
> #6  0x0053bc7b in find_inheritance_children
> #7  0x0053bd4f in find_all_inheritors
> #8  0x006fc0a2 in expand_inherited_rtentry
> #9  0x006fbf91 in expand_inherited_tables
> 
> I have tried to analyze using perf also, I can see that amount of time
> taken in hash_search_with_hash_value is reduced from 3.86%(master) to
> 1.72%(patch).
> 
> I have plan to do further investigation, in different scenarios of
> dynahash.
> 





Re: [HACKERS] pg_dump fails on domain constraint comments

2016-01-22 Thread Alvaro Herrera
Michael Paquier wrote:
> On Tue, Jan 12, 2016 at 7:56 AM, Elvis Pranskevichus  
> wrote:
> > It looks like pg_dump emits incorrect text for domain constraint comments:
> >
> > Assuming the following structure,
> 
> Nice catch! qtypname already has fmtId applied to it, so quotes are
> applied twice to it in this case. I am adding an entry in the next CF
> with the patch marked as ready for committer so that this does not fall into
> oblivion. A backpatch would be welcome as well.

Yeah, evidently I didn't test that part.  Thanks Elvis, applied and
backpatched to 9.5.

... Oh, I just noticed you requested this feature back in 2013,
https://www.postgresql.org/message-id/5310157.yWWCtg2qIU%40klinga.prans.org
I hope that it's doing what you wanted!

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Proposal: Generic WAL logical messages

2016-01-22 Thread Petr Jelinek

Hi,

here is an updated version of this patch, calling the messages logical
(decoding) messages consistently everywhere and removing any connection
to standby messages. Moving this to its own module gave me a place to
write some brief explanation about this, so the code documentation has
hopefully improved as well.


The functionality itself didn't change.

--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
From c19d7274091baee4a52b1fa0ac684ace4cc9 Mon Sep 17 00:00:00 2001
From: Petr Jelinek 
Date: Fri, 22 Jan 2016 20:15:22 +0100
Subject: [PATCH] Logical Messages

---
 contrib/test_decoding/Makefile  |   2 +-
 contrib/test_decoding/test_decoding.c   |  22 +++-
 doc/src/sgml/func.sgml  |  42 
 doc/src/sgml/logicaldecoding.sgml   |  22 
 src/backend/access/rmgrdesc/Makefile|   4 +-
 src/backend/access/rmgrdesc/logicalmsgdesc.c|  33 ++
 src/backend/access/transam/rmgr.c   |   1 +
 src/backend/replication/logical/Makefile|   2 +-
 src/backend/replication/logical/decode.c|  37 +++
 src/backend/replication/logical/logical.c   |  38 +++
 src/backend/replication/logical/logicalfuncs.c  |  27 +
 src/backend/replication/logical/message.c   | 132 
 src/backend/replication/logical/reorderbuffer.c |  73 +
 src/bin/pg_xlogdump/rmgrdesc.c  |   1 +
 src/include/access/rmgrlist.h   |   1 +
 src/include/catalog/pg_proc.h   |   4 +
 src/include/replication/logicalfuncs.h  |   2 +
 src/include/replication/message.h   |  42 
 src/include/replication/output_plugin.h |  13 +++
 src/include/replication/reorderbuffer.h |  21 
 20 files changed, 514 insertions(+), 5 deletions(-)
 create mode 100644 src/backend/access/rmgrdesc/logicalmsgdesc.c
 create mode 100644 src/backend/replication/logical/message.c
 create mode 100644 src/include/replication/message.h

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index a362e69..8fdcfbc 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -38,7 +38,7 @@ submake-test_decoding:
 	$(MAKE) -C $(top_builddir)/contrib/test_decoding
 
 REGRESSCHECKS=ddl rewrite toast permissions decoding_in_xact decoding_into_rel \
-	binary prepared replorigin
+	binary prepared replorigin messages
 
 regresscheck: | submake-regress submake-test_decoding temp-install
 	$(MKDIR_P) regression_output
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 4cf808f..f655355 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -21,6 +21,7 @@
 
 #include "replication/output_plugin.h"
 #include "replication/logical.h"
+#include "replication/message.h"
 #include "replication/origin.h"
 
 #include "utils/builtins.h"
@@ -63,11 +64,15 @@ static void pg_decode_change(LogicalDecodingContext *ctx,
  ReorderBufferChange *change);
 static bool pg_decode_filter(LogicalDecodingContext *ctx,
  RepOriginId origin_id);
+static void pg_decode_message(LogicalDecodingContext *ctx,
+			  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
+			  bool transactional, const char *prefix,
+			  Size sz, const char *message);
 
 void
 _PG_init(void)
 {
-	/* other plugins can perform things here */
+	RegisterLogicalMsgPrefix("test");
 }
 
 /* specify output plugin callbacks */
@@ -82,6 +87,7 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pg_decode_commit_txn;
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
+	cb->message_cb = pg_decode_message;
 }
 
 
@@ -471,3 +477,17 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	OutputPluginWrite(ctx, true);
 }
+
+static void
+pg_decode_message(LogicalDecodingContext *ctx,
+  ReorderBufferTXN *txn, XLogRecPtr lsn,
+  bool transactional, const char *prefix,
+  Size sz, const char *message)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "message: lsn: %X/%X transactional: %d prefix: %s, sz: %zu content:",
+	 (uint32)(lsn >> 32), (uint32)lsn, transactional, prefix,
+	 sz);
+	appendBinaryStringInfo(ctx->out, message, sz);
+	OutputPluginWrite(ctx, true);
+}
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 4d2b88f..d55a2b1 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -17639,6 +17639,48 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());

   
 
+  
+   
+
+ pg_logical_send_message
+
+pg_logical_send_message(transactional bool, prefix text, content text)
+   
+   
+void
+   
+   
+Write text logical decoding message. This can be 
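
A usage sketch, based on the function signature in the docs hunk above and
the "test" prefix registered in the patch:

SELECT pg_logical_send_message(true, 'test', 'hello world');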

Re: [HACKERS] silent data loss with ext4 / all current versions

2016-01-22 Thread Tomas Vondra

Hi,

On 01/22/2016 06:45 AM, Michael Paquier wrote:


So, I have been playing with a Linux VM with VMware Fusion, and on
ext4 with data=ordered the renames are getting lost if the root
folder is not fsynced. By kill -9'ing the VM I am able to reproduce that
really easily.


Yep. Same experience here (with qemu-kvm VMs).
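
For context, the pattern under discussion is fsyncing the parent directory
after a rename, so that the rename itself survives a power loss; a minimal
sketch with all error handling omitted:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static void
durable_rename_sketch(const char *oldfile, const char *newfile,
                      const char *parentdir)
{
    int         fd;

    rename(oldfile, newfile);

    fd = open(newfile, O_RDONLY);       /* persist the file contents */
    fsync(fd);
    close(fd);

    fd = open(parentdir, O_RDONLY);     /* persist the directory entry, */
    fsync(fd);                          /* i.e. the rename itself */
    close(fd);
}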


Here are some comments about your patch after a look at the code.

Regarding the additions in fsync_fname() in xlog.c:
1) In InstallXLogFileSegment, rename() will be called only if
HAVE_WORKING_LINK is not used, which happens only on Windows and
cygwin. We could add it for consistency, but it should be within the
#else/#endif block. It is not critical as of now.
2) The call in RemoveXlogFile is not necessary, the rename happening
only on Windows.


Hmmm, OK. Are we sure HAVE_WORKING_LINK is false only on Windows, or 
could there be some other platforms? And are we sure the file systems on 
those platforms are safe without the fsync call?


That is, while the report references ext4, there may be other file 
systems with the same problem - ext4 was used mostly as it's the most 
widely used Linux file system.



3) In exitArchiveRecovery if the rename is not made durable I think
it does not matter much. Even if recovery.conf is the one present
once the node restarts node would move back again to recovery, and
actually we had better move back to recovery in this case, no?


I'm strongly against this "optimization" - I'm more than happy to 
exchange the one fsync for not having to manually fix the database after 
crash.


I don't really see why switching back to recovery should be desirable in 
this case? Imagine you have a warm/hot standby, and that you promote it 
to master. The clients connect, start issuing commands and then the 
system crashes and loses the recovery.conf rename. The system reboots, 
database performs local recovery but then starts as a standby and starts 
rejecting writes. That seems really weird to me.



4) StartupXLOG for the tablespace map. I don't think this one is
needed as well. Even if the tablespace map is not removed after a
power loss user would get an error telling that the file should be
removed.


Please no, for the same reasons as in (3).


5) For the one where haveBackupLabel || haveTblspcMap. If we do the
fsync, we guarantee that there is no need to do again the recovery.
But in case of a power loss, isn't it better to do the recovery again?


Why would it be better? Why should we do something twice when we don't 
have to? Had this not be reliable, then the whole recovery process is 
fundamentally broken and we better fix it instead of merely putting a 
band-aid on it.



6) For the one after XLogArchiveNotify() for the last partial
segment of the old timeline, it doesn't really matter to not make the
change persistent as this is mainly done because it is useful to
identify that it is a partial segment.


OK, although I still don't quite see why that should be a reason not to 
do the fsync. It's not really going to give us any measurable 
performance advantage (how often we do those fsyncs), so I'd vote to 
keep it and make sure the partial segments are named accordingly. Less 
confusion is always better.



7) In CancelBackup, this one is not needed as well I think. We would
surely want to get back to recovery if those files remain after a
power loss.


I may be missing something, but why would we switch to recovery in this 
case?




For the ones in xlogarchive.c:
1) For KeepFileRestoredFromArchive, it does not matter here, we are
renaming a file with a temporary name to a permanent name. Once the
node restarts, we would see an extra temporary file if the rename
was not effective.


So we'll lose the segment (won't have it locally under the permanent 
name), as we've already restored it and won't do that again. Is that 
really a good thing to do?



2) XLogArchiveForceDone, the only bad thing that would happen here is
to leave behind a .ready file instead of a .done file. I guess that we
could have it though as an optimization to not have to archive again
this file.


Yes.



For the one in pgarch.c:
1) In pgarch_archiveDone, we could add it as an optimization to
actually let the server know that the segment has been already
archived, preventing a retry.


Not sure what you mean by "could add it as an optimization"?


In timeline.c:
1) writeTimeLineHistoryFile, it does not matter much. In the worst
case we would have just a dummy temporary file, and startup process
would come back here (in exitArchiveRecovery() we may finish with an
unnamed backup file similarly).


OK


2) In writeTimeLineHistoryFile, similarly we don't need to care
much, in WalRcvFetchTimeLineHistoryFiles recovery would take again
the same path


OK



Thoughts?



Thanks for the review and comments. I think the question is whether we
want to do the additional fsync() only when it ultimately may lead
to data loss, or even in cases where it may cause operational issues 
(e.g. 

Re: [HACKERS] silent data loss with ext4 / all current versions

2016-01-22 Thread Magnus Hagander
On Fri, Jan 22, 2016 at 9:26 AM, Tomas Vondra 
wrote:

> Hi,
>
> On 01/22/2016 06:45 AM, Michael Paquier wrote:
>
> So, I have been playing with a Linux VM with VMware Fusion and on
>> ext4 with data=ordered the renames are getting lost if the root
>> folder is not fsync. By killing-9 the VM I am able to reproduce that
>> really easily.
>>
>
> Yep. Same experience here (with qemu-kvm VMs).
>
> Here are some comments about your patch after a look at the code.
>>
>> Regarding the additions in fsync_fname() in xlog.c:
>> 1) In InstallXLogFileSegment, rename() will be called only if
>> HAVE_WORKING_LINK is not used, which happens only on Windows and
>> cygwin. We could add it for consistency, but it should be within the
>> #else/#endif block. It is not critical as of now.
>> 2) The call in RemoveXlogFile is not necessary, the rename happening
>> only on Windows.
>>
>
> Hmmm, OK. Are we sure HAVE_WORKING_LINK is false only on Windows, or could
> there be some other platforms? And are we sure the file systems on those
> platforms are safe without the fsync call?
>
> That is, while the report references ext4, there may be other file systems
> with the same problem - ext4 was used mostly as it's the most widely used
> Linux file system.
>
> 3) In exitArchiveRecovery if the rename is not made durable I think
>> it does not matter much. Even if recovery.conf is the one present
>> once the node restarts node would move back again to recovery, and
>> actually we had better move back to recovery in this case, no?
>>
>
> I'm strongly against this "optimization" - I'm more than happy to exchange
> the one fsync for not having to manually fix the database after crash.
>
> I don't really see why switching back to recovery should be desirable in
> this case? Imagine you have a warm/hot standby, and that you promote it to
> master. The clients connect, start issuing commands and then the system
> crashes and loses the recovery.conf rename. The system reboots, database
> performs local recovery but then starts as a standby and starts rejecting
> writes. That seems really weird to me.
>
> 4) StartupXLOG for the tablespace map. I don't think this one is
>> needed as well. Even if the tablespace map is not removed after a
>> power loss user would get an error telling that the file should be
>> removed.
>>
>
> Please no, for the same reasons as in (3).
>
> 5) For the one where haveBackupLabel || haveTblspcMap. If we do the
>> fsync, we guarantee that there is no need to do again the recovery.
>> But in case of a power loss, isn't it better to do the recovery again?
>>
>
> Why would it be better? Why should we do something twice when we don't
> have to? Had this not been reliable, then the whole recovery process is
> fundamentally broken and we better fix it instead of merely putting a
> band-aid on it.
>
> 6) For the one after XLogArchiveNotify() for the last partial
>> segment of the old timeline, it doesn't really matter to not make the
>> change persistent as this is mainly done because it is useful to
>> identify that it is a partial segment.
>>
>
> OK, although I still don't quite see why that should be a reason not to do
> the fsync. It's not really going to give us any measurable performance
> advantage (how often we do those fsyncs), so I'd vote to keep it and make
> sure the partial segments are named accordingly. Less confusion is always
> better.
>
> 7) In CancelBackup, this one is not needed as well I think. We would
>> surely want to get back to recovery if those files remain after a
>> power loss.
>>
>
> I may be missing something, but why would we switch to recovery in this
> case?
>
>
>> For the ones in xlogarchive.c:
>> 1) For KeepFileRestoredFromArchive, it does not matter here, we are
>> renaming a file with a temporary name to a permanent name. Once the
>> node restarts, we would see an extra temporary file if the rename
>> was not effective.
>>
>
> So we'll lose the segment (won't have it locally under the permanent
> name), as we've already restored it and won't do that again. Is that really
> a good thing to do?
>
> 2) XLogArchiveForceDone, the only bad thing that would happen here is
>> to leave behind a .ready file instead of a .done file. I guess that we
>> could have it though as an optimization to not have to archive again
>> this file.
>>
>
> Yes.
>
>
>> For the one in pgarch.c:
>> 1) In pgarch_archiveDone, we could add it as an optimization to
>> actually let the server know that the segment has been already
>> archived, preventing a retry.
>>
>
> Not sure what you mean by "could add it as an optimization"?
>
> In timeline.c:
>> 1) writeTimeLineHistoryFile, it does not matter much. In the worst
>> case we would have just a dummy temporary file, and startup process
>> would come back here (in exitArchiveRecovery() we may finish with an
>> unnamed backup file similarly).
>>
>
> OK
>
> 2) In writeTimeLineHistoryFile, similarly we don't need to care
>> much, in 

Re: [HACKERS] Parallel Aggregate

2016-01-22 Thread David Rowley
On 22 January 2016 at 17:25, Haribabu Kommi  wrote:
> Along with these changes, I added a float8 combine function to see
> how it works under parallel aggregate, it is working fine for float4, but
> giving small data mismatch with float8 data type.
>
> postgres=# select avg(f3), avg(f4) from tbl;
>avg|   avg
> --+--
>  1.1002384186 | 100.12344879
> (1 row)
>
> postgres=# set enable_parallelagg = true;
> SET
> postgres=# select avg(f3), avg(f4) from tbl;
>avg|   avg
> --+--
>  1.1002384186 | 100.12344918
> (1 row)
>
> Column - f3 - float4
> Column - f4 - float8
>
> similar problem for all float8 var_pop, var_samp, stddev_pop and stddev_samp
> aggregates. Any special care is needed for float8 datatype?

I'm not sure if this is what's going on here, as I don't really know
the range of numbers that you've used to populate f4 with. It would be
good to know, does "f4" contain negative values too?

It's not all that hard to demonstrate the instability of addition with
float8. Take the following example:

create table d (d float8);
insert into d 
values(1223123223412324.2231),(0.23),(-1223123223412324.2231);

# select sum(d order by random()) from d;
 sum
-
   0
(1 row)

same query, once more.

# select sum(d order by random()) from d;
   sum
--
 2.3e-013
(1 row)

Here the result just depends on the order which the numbers have been
added. You may need to execute a few more times to see the result
change.
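The same instability is easy to show outside the database as well; here's
a minimal standalone C sketch (at this magnitude the spacing between
representable doubles is 0.25, so the printed values follow from IEEE 754
round-to-nearest):

#include <stdio.h>

/*
 * Double-precision addition is not associative: adding 0.23 to the big
 * value first produces a sum that cannot be represented exactly and
 * rounds to the nearest multiple of 0.25.
 */
int
main(void)
{
    double  big = 1223123223412324.2231;
    double  small = 0.23;

    printf("%g\n", (big + small) - big);    /* prints 0.25 */
    printf("%g\n", (big - big) + small);    /* prints 0.23 */
    return 0;
}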

Perhaps a good test would be to perform a sum(f4 order by random()) in
serial mode, and see if you're getting a stable result from the
numbers that you have populated the table with.

If that's the only problem at play here, then I for one am not worried
about it, as the instability already exists today depending on which
path is chosen to scan the relation. For example an index scan is
likely not to return rows in the same order as a seq scan.

We do also warn about this in the manual: "Inexact means that some
values cannot be converted exactly to the internal format and are
stored as approximations, so that storing and retrieving a value might
show slight discrepancies. Managing these errors and how they
propagate through calculations is the subject of an entire branch of
mathematics and computer science and will not be discussed here,
except for the following points:" [1]


[1] http://www.postgresql.org/docs/devel/static/datatype-numeric.html

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] silent data loss with ext4 / all current versions

2016-01-22 Thread Greg Stark
On Fri, Jan 22, 2016 at 8:26 AM, Tomas Vondra
 wrote:
> On 01/22/2016 06:45 AM, Michael Paquier wrote:
>
>> So, I have been playing with a Linux VM with VMware Fusion and on
>> ext4 with data=ordered the renames are getting lost if the root
>> folder is not fsynced. By killing-9 the VM I am able to reproduce that
>> really easily.
>
>
> Yep. Same experience here (with qemu-kvm VMs).

I still think a better approach for this is to run the database on an
LVM volume and take lots of snapshots. No VM needed, though it doesn't
hurt. LVM volumes are below the level of the filesystem and a snapshot
captures the state of the raw blocks the filesystem has written to the
block layer. The block layer does no caching though the drive may but
neither the VM solution nor LVM would capture that.

LVM snapshots would have the advantage that you can keep running the
database and you can take lots of snapshots with relatively little
overhead. Having dozens or hundreds of snapshots would be unacceptable
performance drain in production but for testing it should be practical
and they take relatively little space -- just the blocks changed since
the snapshot was taken.


-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] silent data loss with ext4 / all current versions

2016-01-22 Thread Andres Freund
On 2016-01-22 21:32:29 +0900, Michael Paquier wrote:
> Group shot with 3), 4) and 5). Well, there is no data loss here,
> putting me in the direction of considering this addition of an fsync
> as an optimization and not a bug.

I think this is an extremely weak argument. The reasoning about when
exactly the loss of a file is acceptable is complicated. In many cases
adding an additional fsync won't add measurable cost, given the frequency
of operations and/or the cost of surrounding operations.

Now, if you can make an argument why something is potentially impacting
performance *and* definitely not required: OK, then we can discuss
that.


Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] silent data loss with ext4 / all current versions

2016-01-22 Thread Michael Paquier
On Fri, Jan 22, 2016 at 5:26 PM, Tomas Vondra
 wrote:
> On 01/22/2016 06:45 AM, Michael Paquier wrote:
>> Here are some comments about your patch after a look at the code.
>>
>> Regarding the additions in fsync_fname() in xlog.c:
>> 1) In InstallXLogFileSegment, rename() will be called only if
>> HAVE_WORKING_LINK is not used, which happens only on Windows and
>> cygwin. We could add it for consistency, but it should be within the
>> #else/#endif block. It is not critical as of now.
>> 2) The call in RemoveXlogFile is not necessary, the rename happening
>> only on Windows.
>
> Hmmm, OK. Are we sure HAVE_WORKING_LINK is false only on Windows, or could
> there be some other platforms? And are we sure the file systems on those
> platforms are safe without the fsync call?
> That is, while the report references ext4, there may be other file systems
> with the same problem - ext4 was used mostly as it's the most widely used
> Linux file system.

From pg_config_manual.h:
#if !defined(WIN32) && !defined(__CYGWIN__)
#define HAVE_WORKING_LINK 1
#endif
If we want to be consistent with what Posix proposes, I am not against
adding it.
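For reference, the pattern we are discussing boils down to the following
(a simplified standalone sketch, not the patch itself; error handling
abbreviated):

#include <fcntl.h>
#include <libgen.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Make a rename durable: flush the file under its old name, rename it,
 * then fsync the parent directory so the directory entry itself reaches
 * disk.  Skipping the last step is what loses renames on ext4 with
 * data=ordered after a power failure.
 */
static int
rename_durable(const char *oldpath, const char *newpath)
{
    char        dirpath[1024];
    int         fd;

    fd = open(oldpath, O_RDWR);
    if (fd < 0 || fsync(fd) < 0)
        return -1;
    close(fd);

    if (rename(oldpath, newpath) < 0)
        return -1;

    strncpy(dirpath, newpath, sizeof(dirpath) - 1);
    dirpath[sizeof(dirpath) - 1] = '\0';
    fd = open(dirname(dirpath), O_RDONLY);  /* fsync the directory */
    if (fd < 0 || fsync(fd) < 0)
        return -1;
    close(fd);

    return 0;
}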

>> 3) In exitArchiveRecovery if the rename is not made durable I think
>> it does not matter much. Even if recovery.conf is the one present
>> once the node restarts node would move back again to recovery, and
>> actually we had better move back to recovery in this case, no?
>
> I'm strongly against this "optimization" - I'm more than happy to exchange
> the one fsync for not having to manually fix the database after crash.
>
> I don't really see why switching back to recovery should be desirable in
> this case? Imagine you have a warm/hot standby, and that you promote it to
> master. The clients connect, start issuing commands and then the system
> crashes and loses the recovery.conf rename. The system reboots, database
> performs local recovery but then starts as a standby and starts rejecting
> writes. That seems really weird to me.
>> 4) StartupXLOG for the tablespace map. I don't think this one is
>> needed as well. Even if the tablespace map is not removed after a
>> power loss user would get an error telling that the file should be
>> removed.
>
> Please no, for the same reasons as in (3).
>
>> 5) For the one where haveBackupLabel || haveTblspcMap. If we do the
>> fsync, we guarantee that there is no need to do again the recovery.
>> But in case of a power loss, isn't it better to do the recovery again?
>
> Why would it be better? Why should we do something twice when we don't have
> to? Had this not been reliable, then the whole recovery process is
> fundamentally broken and we better fix it instead of merely putting a
> band-aid on it.

Group shot with 3), 4) and 5). Well, there is no data loss here,
putting me in the direction of considering this addition of an fsync
as an optimization and not a bug.

>> 6) For the one after XLogArchiveNotify() for the last partial
>> segment of the old timeline, it doesn't really matter to not make the
>> change persistent as this is mainly done because it is useful to
>> identify that it is a partial segment.
>
> OK, although I still don't quite see why that should be a reason not to do
> the fsync. It's not really going to give us any measurable performance
> advantage (how often we do those fsyncs), so I'd vote to keep it and make
> sure the partial segments are named accordingly. Less confusion is always
> better.

Check.

>> 7) In CancelBackup, this one is not needed as well I think. We would
>> surely want to get back to recovery if those files remain after a
>> power loss.
>
> I may be missing something, but why would we switch to recovery in this
> case?



>> For the ones in xlogarchive.c:
>> 1) For KeepFileRestoredFromArchive, it does not matter here, we are
>> renaming a file with a temporary name to a permanent name. Once the
>> node restarts, we would see an extra temporary file if the rename
>> was not effective.
>
> So we'll lose the segment (won't have it locally under the permanent name),
> as we've already restored it and won't do that again. Is that really a good
> thing to do?

At this point, if a segment is restored from archive and there is a
power loss, we are going back to recovery. The advantage of having the
fsync is that it ensures the segment is not fetched twice.

>> 2) XLogArchiveForceDone, the only bad thing that would happen here is
>> to leave behind a .ready file instead of a .done file. I guess that we
>> could have it though as an optimization to not have to archive again
>> this file.
>
> Yes.
>
>>
>> For the one in pgarch.c:
>> 1) In pgarch_archiveDone, we could add it as an optimization to
>> actually let the server know that the segment has been already
>> archived, preventing a retry.
>
> Not sure what you mean by "could add it as an optimization"?

In case of a loss of the rename, the server would retry archiving it. So
adding an fsync here ensures that the segment is marked as 

Re: [HACKERS] Removing Functionally Dependent GROUP BY Columns

2016-01-22 Thread Tom Lane
David Rowley  writes:
> [ prune_group_by_clause_59f15cf_2016-01-15.patch ]

I spent some time looking through this but decided that it needs some work
to be committable.

The main thing I didn't like was that identify_key_vars seems to have a
rather poorly chosen API: it mixes up identifying a rel's pkey with
sorting through the GROUP BY list.  I think it would be better to create
a function that, given a relid, hands back the OID of the pkey constraint
and a Bitmapset of the column numbers in the pkey (or 0/NULL if there's
no pkey or it's deferrable).  The column numbers should be offset by
FirstLowInvalidHeapAttributeNumber so that a pkey on OID can be
represented correctly --- compare RelationGetIndexAttrBitmap().
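To illustrate the shape I have in mind, a hypothetical sketch (the name,
and the omitted error handling and deferrability details, are illustrative
only, not a design commitment):

static Bitmapset *
get_relation_pkey_attnos(Oid relid, Oid *constraintOid)
{
    Bitmapset  *pkattnos = NULL;
    Relation    pg_constraint;
    SysScanDesc scan;
    ScanKeyData skey;
    HeapTuple   tuple;

    *constraintOid = InvalidOid;

    ScanKeyInit(&skey,
                Anum_pg_constraint_conrelid,
                BTEqualStrategyNumber, F_OIDEQ,
                ObjectIdGetDatum(relid));

    pg_constraint = heap_open(ConstraintRelationId, AccessShareLock);
    scan = systable_beginscan(pg_constraint, ConstraintRelidIndexId,
                              true, NULL, 1, &skey);

    while (HeapTupleIsValid(tuple = systable_getnext(scan)))
    {
        Form_pg_constraint con = (Form_pg_constraint) GETSTRUCT(tuple);
        Datum       adatum;
        bool        isNull;
        ArrayType  *arr;
        int16      *attnums;
        int         numkeys, i;

        /* ignore everything but a non-deferrable primary key */
        if (con->contype != CONSTRAINT_PRIMARY || con->condeferrable)
            continue;

        adatum = heap_getattr(tuple, Anum_pg_constraint_conkey,
                              RelationGetDescr(pg_constraint), &isNull);
        arr = DatumGetArrayTypeP(adatum);
        attnums = (int16 *) ARR_DATA_PTR(arr);
        numkeys = ARR_DIMS(arr)[0];

        /* offset so that system columns (e.g. OID) can be members */
        for (i = 0; i < numkeys; i++)
            pkattnos = bms_add_member(pkattnos,
                        attnums[i] - FirstLowInvalidHeapAttributeNumber);

        *constraintOid = HeapTupleGetOid(tuple);
        break;
    }

    systable_endscan(scan);
    heap_close(pg_constraint, AccessShareLock);

    return pkattnos;
}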

The reason this seems like a more attractive solution is that the output
of the pg_constraint lookup becomes independent of the particular query
and thus is potentially cacheable (in the relcache, say).  I do not think
we need to cache it right now but I'd like to have the option to do so.

(I wonder BTW whether check_functional_grouping couldn't be refactored
to use the output of such a function, too.)

Some lesser points:

* I did not like where you plugged in the call to
remove_useless_groupby_columns; there are weird interactions with grouping
sets as to whether it will get called or not, and also that whole chunk
of code is due for refactoring.  I'd be inclined to call it from
subquery_planner instead, maybe near where the HAVING clause preprocessing
happens.

* You've got it iterating over every RTE, even those that aren't
RTE_RELATION RTEs.  It's pure luck that identify_key_vars doesn't fail
outright when passed a bogus relid, and it's certainly wasteful.

* Both of the loops iterating over the groupClause neglect to check
varlevelsup, thus leading to assert failures or worse if an outer-level
Var is present in the GROUP BY list.  (I'm pretty sure outer Vars can
still be present at this point, though I might be wrong.)

* I'd be inclined to see if the relvarlists couldn't be converted into
bitmapsets too.  Then the test to see if the pkey is a subset of the
grouping columns becomes a simple and cheap bms_is_subset test.  You don't
really need surplusvars either, because you can instead test Vars for
membership in the pkey bitmapsets that pg_constraint.c will be handing
back.

* What you did to join.sql/join.out seems a bit bizarre.  The existing
test case that you modified was meant to test something else, and
conflating this behavior with the pre-existing one (and not documenting
it) is confusing as can be.  A bit more work on regression test cases
seems indicated.

I'm going to set this back to "Waiting on Author".  I think the basic
idea is good but it needs a rewrite.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Relation extension scalability

2016-01-22 Thread Amit Kapila
On Tue, Jan 12, 2016 at 2:41 PM, Dilip Kumar  wrote:

> On Thu, Jan 7, 2016 at 4:53 PM, Andres Freund  wrote:
>
>> On 2016-01-07 16:48:53 +0530, Amit Kapila wrote:
>>
>> I think it's a worthwhile approach to pursue. But until it actually
>> fixes the problem of leaving around uninitialized pages I don't think
>> it's very meaningful to do performance comparisons.
>>
>
> Attached patch solves this issue, I am allocating the buffer for each page
> and initializing the page, only after that adding to FSM.
>

Few comments about patch:

1.
Patch is not getting compiled.

1>src/backend/access/heap/hio.c(480): error C2065: 'buf' : undeclared
identifier
1>src/backend/access/heap/hio.c(480): error C2065: 'buf' : undeclared
identifier
1>src/backend/access/heap/hio.c(480): error C2065: 'buf' : undeclared
identifier

2.
! page = BufferGetPage(buffer);
! PageInit(page, BufferGetPageSize(buf), 0);
!
! freespace = PageGetHeapFreeSpace(page);
!
! MarkBufferDirty(buffer);
! UnlockReleaseBuffer(buffer);
!
! RecordPageWithFreeSpace(relation, BufferGetBlockNumber(buffer), freespace);

What is the need to mark the page dirty here, won't it automatically
be marked dirty once the page is used?  I think it is required
if you wish to WAL-log this action.  (A sketch of the overall
extension loop appears below, after these comments.)

3. I think you don't need to multi-extend a relation if
HEAP_INSERT_SKIP_FSM is used, as in that case it anyway tries to
get a new page by extending the relation.

4. Again why do you need this multi-extend optimization for local
relations (those only accessible to current backend)?

5. Do we need this for nbtree as well?  One way to check that
is by copying large data into a table having an index.
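To make the discussion concrete, the overall shape of the extension loop
under review is roughly this (a hypothetical sketch against the existing
hio.c/FSM APIs, not the patch itself; WAL-logging and error handling
omitted):

static void
multi_extend_relation(Relation relation, BulkInsertState bistate,
                      int extend_num_pages)
{
    int         i;

    for (i = 0; i < extend_num_pages; i++)
    {
        Buffer      buffer;
        Page        page;
        Size        freespace;

        /* P_NEW extends the relation by one block */
        buffer = ReadBufferBI(relation, P_NEW, bistate);
        LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);

        page = BufferGetPage(buffer);
        PageInit(page, BufferGetPageSize(buffer), 0);
        freespace = PageGetHeapFreeSpace(page);

        /* see comment 2 above about dirtying vs. WAL-logging the init */
        MarkBufferDirty(buffer);
        UnlockReleaseBuffer(buffer);

        /* publish the new page so other backends can find the space */
        RecordPageWithFreeSpace(relation,
                                BufferGetBlockNumber(buffer), freespace);
    }
}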



> Note: Test with both data and WAL on Magnetic Disk : No significant
> improvement visible
> -- I think WAL write is becoming the bottleneck in this case.
>
>
In that case, can you try the same test with unlogged tables?

Also, it is good to check the performance of the patch with a read-write
workload to ensure that extending the relation in multiple chunks does not
regress such cases.



> Currently i have kept extend_num_page as session level parameter but i
> think later we can make this as table property.
> Any suggestion on this ?
>
>
I think we should have a new table-level storage parameter,
extend_by_blocks or something like that, instead of a GUC. The
default value of this parameter should be 1, which means the
current behaviour is retained by default.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Releasing in September

2016-01-22 Thread Amit Kapila
On Fri, Jan 22, 2016 at 11:46 PM, Simon Riggs  wrote:

> On 22 January 2016 at 16:34, Robert Haas  wrote:
>
>
>> For my part, I am not sure the names in the release notes are actually
>> all that helpful.
>
>
> It has one important effect of current interest: establishing the truth
> that multiple people and multiple companies are involved in producing and
> maintaining PostgreSQL. Whether the names are properly attributed will
> always be a time-consuming task, but I will oppose any attempt to remove or
> obscure evidence of who develops PostgreSQL, wherever that occurs.
>
>
That's a valid point, and I think one way to retain such evidence
without adding the names of the author/reviewer/committer is to add a
link to the commit(s) after each feature, something like what's done
in some other projects' release notes [1].


[1] - http://kernelnewbies.org/LinuxChanges


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


[HACKERS] insert/update performance

2016-01-22 Thread Jinhua Luo
Hi All,

Here is my test environment:

postgresql 9.4.5, Ubuntu 14.04, 8 CPU core, 8GB ram, SCSI hard disk

I have a table with 70 columns, and 6 indexes. The data flow is a
special OLTP model: frequent inserts (2000 tps), and each inserted row
would be updated very soon (i.e. the number of inserts is equal to the
number of updates).

From the start, the table (initially empty) would be extended bigger
and bigger (via mdextend(), which allocates a new page and zeroes it
out on disk). During this period, the query latency is OK, because
most fsyncs of the writes are queued up for the next checkpoint, so
the writes only copy zero-filled data into the page cache of the
Linux kernel.

After a long time, the mdextend() calls disappear; instead, mdread()
occupies almost all I/O requests, because the FSM indicates enough free
space within the old table segments. At this stage, the performance is
bad, because most of the Linux kernel's page cache is occupied by the
latest table segments and the index files (I checked this via the
mincore() syscall). The synchronous reads from disk slow down all
subsequent queries.
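Here is a minimal sketch of the kind of mincore() check I mean
(illustrative only, not necessarily the exact tool):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Report how many pages of a file are resident in the OS page cache. */
int
main(int argc, char **argv)
{
    long            pagesize = sysconf(_SC_PAGESIZE);
    struct stat     st;
    size_t          npages, resident = 0, i;
    unsigned char  *vec;
    void           *map;
    int             fd;

    if (argc != 2 || (fd = open(argv[1], O_RDONLY)) < 0 ||
        fstat(fd, &st) < 0 || st.st_size == 0)
        return 1;

    npages = (st.st_size + pagesize - 1) / pagesize;
    map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    vec = malloc(npages);
    if (map == MAP_FAILED || vec == NULL ||
        mincore(map, st.st_size, vec) < 0)
        return 1;

    for (i = 0; i < npages; i++)
        if (vec[i] & 1)
            resident++;

    printf("%zu of %zu pages resident in page cache\n", resident, npages);
    return 0;
}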

Why is the FSM updated so much more slowly than the query rate? If the
FSM were updated in time, then the free space would still possibly be
cached by the Linux kernel. I thought it was because autovacuum is not
aggressive enough, so I tried to reconfigure it, e.g. set cost_delay to
0 and lower the vacuum thresholds. But it doesn't help.

At last, I found it's not a problem with autovacuum.
I did a simple test: I truncated the table, disabled autovacuum, and
ran the application for a few minutes, then invoked VACUUM manually;
it gives a strange output:
found 598 removable, 25662 nonremovable row versions in 3476 pages
DETAIL:  0 dead row versions cannot be removed yet

As said before, the number of inserts is equal to the number of
updates. So the bloat of the table should be 100%, and the number of
removable rows should be equal to the number of nonremovable rows,
which is the real number of inserts issued by the application.

But the result shows that the number of removable rows is just a
small fraction. If that's true, then it's no wonder that the table
keeps extending for a long time, because the reclaimed free space is
so small.

Why is the free space after vacuum not 50% of the original size in my case?

Please help. Thanks!


Regards,
Jinhua Luo


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] insert/update performance

2016-01-22 Thread Amit Kapila
On Sat, Jan 23, 2016 at 12:13 PM, Jinhua Luo  wrote:
>
> Hi All,
>
> Here is my test environment:
>
> postgresql 9.4.5, Ubuntu 14.04, 8 CPU core, 8GB ram, SCSI hard disk
>
> I have a table with 70 columns, and 6 indexes. The data flow is a
> special OLTP model: frequent inserts (2000 tps), and each inserted row
> would be updated very soon (i.e. the number of inserts is equal to the
> number of updates).
>
>
> At last, I found it's not the problem of autovacuum.
> I do a simple test: I truncate the table, disable the autovacuum, and
> run the application for a few minutes, then I invokes vacuum manually,
> it gives a strange output:
> found 598 removable, 25662 nonremovable row versions in 3476 pages
> DETAIL:  0 dead row versions cannot be removed yet
>
> As said before, the number of inserts is equal to the number of
> updates. So the bloat of the table should be 100%, and the number of
> removable rows should be equal to the number of nonremovable rows,
> which is the real number of inserts issued by the application.
>
> But the result shows that the number of removable rows is just a
> small fraction. If it's true, then no wonder that the table would keep
> extending for a long time, because the free space is almost small.
>
> Why the free space after vacuum is not 50% of the original size in my
case?
>

Vacuum just removes the deleted rows (provided they are not visible to
any active transaction); it won't reduce the size to which the relation
has already been extended, unless the empty space is at the end of the
relation.

Are you updating any index column?

I think you should try with a higher fillfactor, as you have a lot
of updates.



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] insert/update performance

2016-01-22 Thread Jinhua Luo
Hi,

The vacuum doesn't recycle the rows obsoleted by updates? I don't think
so. In the above vacuum result, I did not delete any rows, but the
vacuum still recycled a fraction of the rows, so obviously they were
obsoleted by updates.

I know a plain vacuum (without the FULL option) does not reduce the
size of the whole table file/segments, but it should refresh the FSM.
In my case, the update statement did modify an index column, but is
that related to the FSM? I think that anyway the updates would obsolete
the previous versions; as long as they are not held by any active
transaction, they should be recycled and counted in the FSM, right?  I
just cannot understand why the recycle ratio is not 50%.


Regards,
Jinhua Luo

2016-01-23 15:13 GMT+08:00 Amit Kapila :
> On Sat, Jan 23, 2016 at 12:13 PM, Jinhua Luo  wrote:
>>
>> Hi All,
>>
>> Here is my test environment:
>>
>> postgresql 9.4.5, Ubuntu 14.04, 8 CPU core, 8GB ram, SCSI hard disk
>>
>> I have a table with 70 columns, and 6 indexes. The data flow is a
>> special OLTP model: frequent inserts (2000 tps), and each inserted row
>> would be updated very soon (i.e. the number of inserts is equal to the
>> number of updates).
>>
>>
>> At last, I found it's not the problem of autovacuum.
>> I do a simple test: I truncate the table, disable the autovacuum, and
>> run the application for a few minutes, then I invokes vacuum manually,
>> it gives a strange output:
>> found 598 removable, 25662 nonremovable row versions in 3476 pages
>> DETAIL:  0 dead row versions cannot be removed yet
>>
>> As said before, the number of inserts is equal to the number of
>> updates. So the bloat of the table should be 100%, and the number of
>> removable rows should be equal to the number of nonremovable rows,
>> which is the real number of inserts issued by the application.
>>
>> But the result shows that the number of removable rows is just a
>> small fraction. If it's true, then no wonder that the table would keep
>> extending for a long time, because the free space is almost small.
>>
>> Why the free space after vacuum is not 50% of the original size in my
>> case?
>>
>
> Vacuum just removes the deleted rows (provided they are not visible to
> any active transaction), it won't reduce the size which is already extended,
> unless the empty space is at end of relation.
>
> Are you updating any index column?
>
> I think if you should once try with higher fill factor as you have lot
> of updates.
>
>
>
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: function parse_ident

2016-01-22 Thread Marko Tiikkaja

Hi Pavel,

Sorry for the lack of review here.  I didn't realize I was still the 
reviewer for this after it had been moved to another commitfest.


That said, I think I've exhausted my usefulness here as a reviewer. 
Marking ready for committer.



.m


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fwd: Re: [DOCS] Document Upper Limit for NAMEDATELEN in pgsql 9.5+

2016-01-22 Thread Tom Lane
Robert Haas  writes:
> Actually, though, varstr_levenshtein_less_equal() never gets called
> from parse_relation.c with a distance bound of more than four, so it
> can't actually go quadratic.  There's another call in that file to
> varstr_levenshtein(), which in retrospect looks silly, but I'm pretty
> sure that could also be changed to use a distance bound of 4 (well,
> MAX_FUZZY_DISTANCE + 1) without changing the behavior at all.  Given a
> constant distance bound, the algorithm is, I believe, only O(max(m,
> n)) not O(mn).  Knowing that, we could let the internal code just
> bypass the length check and only enforce the length restriction when
> the code is called from SQL via fuzzystrmatch.

Done that way; thanks for the advice.

regards, tom lane
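For readers following along, here is a minimal sketch of the banded
computation Robert describes above (illustrative only; this is not the
fuzzystrmatch code):

#include <stdlib.h>
#include <string.h>

/*
 * Levenshtein distance clamped to bound+1.  Cells farther than 'bound'
 * from the diagonal can never hold a distance <= bound, so each row
 * evaluates a band of at most 2*bound+1 cells.
 */
static int
levenshtein_leq(const char *s, const char *t, int bound)
{
    int     m = (int) strlen(s);
    int     n = (int) strlen(t);
    int    *prev, *cur, *swap;
    int     i, j, result;

    if (abs(m - n) > bound)
        return bound + 1;       /* length difference alone exceeds bound */

    prev = malloc((n + 1) * sizeof(int));
    cur = malloc((n + 1) * sizeof(int));

    for (j = 0; j <= n; j++)
        prev[j] = (j <= bound) ? j : bound + 1;

    for (i = 1; i <= m; i++)
    {
        int     lo = (i - bound > 1) ? i - bound : 1;
        int     hi = (i + bound < n) ? i + bound : n;

        for (j = 0; j <= n; j++)
            cur[j] = bound + 1; /* cells outside the band stay "too far" */
        if (i <= bound)
            cur[0] = i;

        for (j = lo; j <= hi; j++)
        {
            int     d = prev[j] + 1;                            /* deletion */
            int     ins = cur[j - 1] + 1;                       /* insertion */
            int     sub = prev[j - 1] + (s[i - 1] != t[j - 1]); /* substitution */

            if (ins < d)
                d = ins;
            if (sub < d)
                d = sub;
            cur[j] = (d < bound + 1) ? d : bound + 1;
        }

        swap = prev;
        prev = cur;
        cur = swap;
    }

    result = prev[n];
    free(prev);
    free(cur);
    return result;
}

With the bound held constant, each row touches a constant number of cells,
so the total work is linear in the longer string, which is why the
internal callers can safely skip the length check.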


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Declarative partitioning

2016-01-22 Thread Tomas Vondra

Hi Amit,

thanks for working on this. Seems the last version of the patch was 
submitted more than 2 months ago and I believe large parts of it will 
get reworked based on the extensive discussion on this list, so I 
haven't looked at the code at all.


I'd like to comment on one thing, and that's the syntax. It seems to 
me we're really trying to reinvent the wheel and come up with our own 
version of the syntax. Is there a particular reason not to look at 
the syntax of the other databases and adapt as much of the existing 
syntax as possible?


I think that's desirable for a few reasons - firstly it makes life much easier 
for the DBAs and users who are either migrating to PostgreSQL or have to 
manage a mix of databases. Secondly, it serves as a valuable source of 
engineering info, preventing the "I haven't thought of this use case" 
problem.


An example of this is the proposed syntax for adding a partition

  CREATE TABLE measurement_fail
  PARTITION OF measurement
  FOR VALUES START ('2006-02-15') END ('2006-03-01');

which seems a bit awkward as both the databases I'm familiar with 
(Oracle and Sybase) use ALTER TABLE to do this


  ALTER TABLE measurement
ADD PARTITION measurement_fail VALUES LESS THAN ( ... )

And so on for the other commands.

That being said, I entirely agree with Simon (and others) that getting 
the planner part to work is the crucial part of the patch. But I also think 
that a proper abstraction (thanks to good syntax) may be a valuable hint 
for how to define the catalogs and such.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Failover Slots

2016-01-22 Thread Robert Haas
On Fri, Jan 22, 2016 at 2:46 AM, Craig Ringer  wrote:
> It's also going to be necessary to handle what happens when a new failover
> slot (physical or logical) is created on the master where it conflicts with
> the name of a non-failover physical slot that was created on the replica. In
> this case I am inclined to terminate any walsender backend for the replica's
> slot with a conflict with recovery, remove its slot and replace it with a
> failover slot. The failover slot does not permit replay while in recovery so
> if the booted-off client reconnects it'll get an ERROR saying it can't
> replay from a failover slot. It should be pretty clear to the admin what's
> happening between the conflict with recovery and the failover slot error.
> There could still be an issue if the client persistently keeps retrying and
> successfully reconnects after replica promotion but I don't see that much
> can be done about that. The documentation will need to address the need to
> try to avoid name conflicts between slots created on replicas and failover
> slots on the master.

That's not going to win any design-of-the-year awards, but maybe it's
acceptable.  It occurred to me to wonder if it might be better to
propagate logical slots partially or entirely outside the WAL stream,
because with this design you will end up with the logical slots on
every replica, including cascaded replicas, and I'm not sure that's
what we want.  Then again, I'm also not sure it isn't what we want.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Failover Slots

2016-01-22 Thread Andres Freund
On 2016-01-22 11:40:24 -0500, Robert Haas wrote:
> It occurred to me to wonder if it might be better to
> propagate logical slots partially or entirely outside the WAL stream,
> because with this design you will end up with the logical slots on
> every replica, including cascaded replicas, and I'm not sure that's
> what we want.  Then again, I'm also not sure it isn't what we want.

Not propagating them through the WAL also has the rather large advantage
of not barring the way to using such slots on standbys.

I think it's technically quite possible to maintain the required
resources on multiple nodes. The question is how would you configure on
which nodes the resources need to be maintained? I can't come up with a
satisfying scheme...


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Tsvector editing functions

2016-01-22 Thread Tomas Vondra



On 12/30/2015 06:49 PM, Stas Kelvich wrote:

Hi, Tomáš! Thanks for the comprehensive review.


On 15 Dec 2015, at 06:07, Tomas Vondra  wrote:

1) It's a bit difficult to judge the usefulness of the API, as I've
   always been a mere user of full-text search, and I never had a need
   (or courage) to mess with the tsvectors. OTOH I don't see a good
   reason not to have such an API, when there's a need for it.

   The API seems to be reasonably complete, with one exception - when
   looking at editing function we have for 'hstore', we do have these
   variants for delete()

  delete(hstore,text)
  delete(hstore,text[])
  delete(hstore,hstore)

   while this patch only adds delete(tsvector,text). Would it make
   sense to add variants similar to hstore? It probably does not make
   much sense to add delete(tsvector,tsvector), right? But being able
   to delete a bunch of lexemes in one go seems like a good thing.

   What do you think?


That’s a good idea and actually deleting tsvector from tsvector makes perfect 
sense. In delete function I used exact string match between string and lexemes in 
tsvector, but if somebody wants to delete for example “Cats” from tsvector, then 
he should downcase and singularize this word. Easiest way to do it is to just use 
to_tsvector() function. Also we can use this function to delete specific 
positions: like delete('cat:3 fat:2,4'::tsvector, 'fat:2'::tsvector) -> 'cat:3 
fat:4'::tsvector.

So in the attached patch I've implemented the following:

delete(tsin tsvector, lexarr text[]) — remove any occurrence of lexemes in lexarr 
from tsin


OK, although I do recommend using more sensible parameter names, i.e. why 
not use 'lexemes' instead of 'lexarr', for example? Similarly for the 
other functions.




delete(tsin tsvector, tsv_filter tsvector) — delete lexemes and/or positions of 
tsv_filter from tsin. When a lexeme in tsv_filter has no positions, the function 
will delete any occurrence of the same lexeme in tsin. When a tsv_filter lexeme 
has positions, the function will delete them from the positions of the matching 
lexeme in tsin. If after such removal the resulting position set is empty, the 
function will delete that lexeme from the resulting tsvector.



I can't really imagine a situation in which I'd need this, but if you do 
have a use case for it ... although in the initial paragraph you say 
"... but if somebody wants to delete for example ..." which suggests you 
may not have such a use case.


Based on bad experience with extending APIs based on vague ideas, I 
recommend only adding functions for which there is an existing need. It's 
easy to add a function later; it's much more difficult to remove one or 
change its signature.



Also, if we want some level of completeness of the API, and taking into account 
that the concat() function shifts the positions of its second argument, I thought 
it could be useful to also add a function that shifts all positions by a specific 
value. This helps to undo a concatenation: delete one of the concatenated 
tsvectors and then shift the positions in the resulting tsvector. So I also wrote 
another small function:

shift(tsin tsvector, offset int16) — shift all positions in tsin by the given offset


That seems rather too low-level. Shouldn't it be really built into 
delete() directly somehow?






4) I find it rather annoying that there are pretty much no comments in
   the code. Granted, there are pretty much no comments in the
   surrounding code, but I doubt that's a good reason for not having
   any comments in new code. It makes reviews unnecessarily difficult.



Fixed, I think.


Yep, much better now.





5) tsvector_concat() is not mentioned in docs at all



Concat is mentioned in the docs as the || operator.


Ah, OK.





6) Docs don't mention names of the new parameters in function
   signatures, just data types. The functions with a single parameter
   probably don't need to do that, but multi-parameter ones should.



Fixed.


OK, but please let's use variable names clearly identifying the meaning. 
So not 'w' but 'weight' and so on.





7) Some of the functions use an indexterm that does not match the function
   name. I see two such cases - to_tsvector and setweight. Is there a
   reason for that?



Because the SGML compiler wants a unique indexterm. Both functions that
you mentioned use overloading of arguments and have non-unique names.


As Michael pointed out, that should probably be handled by using 
<primary> and <secondary> tags.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Combining Aggregates

2016-01-22 Thread Robert Haas
On Thu, Jan 21, 2016 at 4:08 PM, David Rowley
 wrote:
> It's quite simple to test how much of a win it'll be in the serial
> case today, and yes, it's not much, but it's a bit.
>
> create table t1 as select x from generate_series(1,100) x(x);
> vacuum analyze t1;
> select count(*) from (select * from t1 union all select * from t1) t;
>   count
> -
>  200
> (1 row)
>
> Time: 185.793 ms
>
> -- Mock up pushed down aggregation by using sum() as a combine
> function for count(*)
> select sum(c) from (select count(*) c from t1 union all select
> count(*) from t1) t;
>sum
> -
>  200
> (1 row)
>
> Time: 162.076 ms
>
> Not particularly incredible, but we don't normally turn our noses up
> at a 14% improvement, so let's just see how complex it will be to
> implement, once the upper planner changes are done.

Mumble mumble.  Why is that even any faster?  Just because we avoid
the projection overhead of the Append node, which is a no-op here
anyway?  If so, a planner change is one thought, but perhaps we also
ought to look at whether we can't reduce the execution-time overhead.

> But as you mention about lack of ability to make use of pre-sorted
> Path for each branch of the UNION ALL; I was really hoping Tom's patch
> will improve that part by allowing the planner to choose a pre-sorted
> Path and perform a MergeAppend instead of an Append, which would allow
> pre-sorted input into a GroupAggregate node.

I won't hazard a guess on that point...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Releasing in September

2016-01-22 Thread Robert Haas
On Wed, Jan 20, 2016 at 11:46 PM, Peter Geoghegan  wrote:
> On Wed, Jan 20, 2016 at 8:32 PM, Michael Paquier
>  wrote:
>> Perhaps some people are more interested in implementing new
>> features than working on bugs and would just continue hacking and
>> arguing about new features, at least a stability period may attract
>> more committer attention into actual bug fixes, in short: no new
>> features can be committed until the previous version has reached at
>> least beta2, rc, whatever. This may accelerate the stability process.
>
> Or it might just accelerate the amount of time it takes to *declare*
> having reached that milestone.

Yes. I think that's been a problem already, and closing development
for prolonged periods of time will make it worse.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Releasing in September

2016-01-22 Thread Alvaro Herrera
Bruce Momjian wrote:

> FYI, the fact that feature authors appear in the release notes is an
> artifact of how we track who wrote which stuff, and is not designed for
> rewarding, though it has that effect.

I think you can claim this all you want, but I'd be surprised if anyone
actually believed it.  You're saying that we do this so that we can
properly attribute code to developers so that we can pester the right
guy when things go wrong; but in reality when we investigate a bug we
never look at the release notes; what we do is look at the Git logs,
which already contain this information.

To the public at large I'm pretty sure having the names there is a
reward for having developed stuff, not a blame sign.  Do real users go
back to 9.3 to see that "oh this Alverro dude wrote multixacts which had
so many bugs, we should really hang him by the bollocks"?  I don't think
so.  People like Robert, Andres, Noah are more likely to want me hung
because they worked so hard to get the bugs fixed.

> If we were to expand that to
> cover reviewers, we would then be overburdening that system and we
> would probably end up removing all names from the release notes.

To me, this reads just as a threat that "if you disturb the waters here
I will just remove all names and you will all be the poorer for it."
Please don't do that, it looks rude.

I think it is a disservice to the project to deny the reviewers the bit
of coolness factor derived from having their names in the release notes
-- or maybe just *somewhere*.  If you want to remove the developer names
from the release notes to avoid bloating them, perhaps what we can do is
put names of authors and reviewers in a page in the website (probably
not the wiki, though, because it's easy to vandalize that and many
people may have concerns regarding the legitimacy of anyone listed
there.)

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Releasing in September

2016-01-22 Thread Robert Haas
On Fri, Jan 22, 2016 at 10:43 AM, Alvaro Herrera
 wrote:
>> If we were to expand that to
>> cover reviewers, we would then be overburdening that system and we
>> would probably end up removing all names from the release notes.
>
> To me, this reads just as a threat that "if you disturb the waters here
> I will just remove all names and you will all be the poorer for it."
> Please don't do that, it looks rude.

For my part, I am not sure the names in the release notes are actually
all that helpful.  What does and does not get credited there and how
it gets credited is more random than I would like.  And it's often
hard to draw the line as to who contributed enough to deserve credit
and who did not.  In commit messages you can try to spell out the
details, but as Bruce says, there's no room for that in the release
notes.  If a feature is credited as (Robert Haas, X, Y, Z) it's clear
that I'm the principal author, but it's unclear whether I did 40% of
the work or 90% of the work.  If it's credited as (X, Robert Haas),
that could mean X came up with the idea and wrote a patch which I then
completely re-wrote from scratch using a different design and
committed; or it could mean X came with an idea that was basically
good and I wrote and debugged a 1000-line patch, and I rewrote 10% of
it for some reason before committing.  So even today, it's pretty hard
to use the release notes as a way to measure depth of contribution.
The problem is worse for people most of whose work is reviewing and
committing rather than authoring, but if you put everybody's name on
everything they reviewed, that wouldn't fix it.  Instead, you'd tend
to overvalue people who do a lot of shallow reviews relative to people
who do a few very deep ones.  I think we just have to resign ourselves
to the fact that judging depth of contribution from the release notes,
or even the commit log, is going to be rather difficult; and that
different people will have different opinions about whose work is most
meritorious.

In other words, I fear that putting names in the release notes is just
politicizing a process that ought to be about presenting the
community's work in the best way possible.  A new release ought to be
about putting the whole community forward in the best light.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2016-01-22 Thread Tomas Vondra

Hi,

On 12/17/2015 02:17 PM, David Rowley wrote:

On 17 December 2015 at 19:11, Simon Riggs wrote:

On 17 December 2015 at 00:17, Tomas Vondra wrote:

I'd go with match_first_tuple_only.


+1

unique_inner is a state that has been detected,
match_first_tuple_only is the action we take as a result.


Ok great. I've made it so in the attached. This means the comment in the
join code where we perform the skip can be a bit less verbose and all
the details can go in where we're actually setting the
match_first_tuple_only to true.


OK. I've looked at the patch again today, and it seems broken by 
45be99f8, as the partial paths were not passing unique_inner to the 
create_*_path() functions. The attached patch should fix that.


Otherwise I think the patch is ready for committer - I think this is a 
valuable optimization and I see nothing wrong with the code.



Any objections to marking it accordingly?

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
index be410d2..1ec2ddc 100644
--- a/src/backend/optimizer/path/joinpath.c
+++ b/src/backend/optimizer/path/joinpath.c
@@ -411,7 +411,7 @@ try_partial_nestloop_path(PlannerInfo *root,
 	 * Before creating a path, get a quick lower bound on what it is likely
 	 * to cost.  Bail out right away if it looks terrible.
 	 */
-	initial_cost_nestloop(root, &workspace, jointype,
+	initial_cost_nestloop(root, &workspace, jointype, extra->unique_inner,
 						  outer_path, inner_path,
 						  extra->sjinfo, &extra->semifactors);
 	if (!add_partial_path_precheck(joinrel, workspace.total_cost, pathkeys))
@@ -422,6 +422,7 @@ try_partial_nestloop_path(PlannerInfo *root,
 			 create_nestloop_path(root,
 								  joinrel,
 								  jointype,
+								  extra->unique_inner,
 								  &workspace,
 								  extra->sjinfo,
 								  &extra->semifactors,
@@ -622,6 +623,7 @@ try_partial_hashjoin_path(PlannerInfo *root,
 			 create_hashjoin_path(root,
 								  joinrel,
 								  jointype,
+								  extra->unique_inner,
 								  &workspace,
 								  extra->sjinfo,
 								  &extra->semifactors,

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pglogical - logical replication contrib module

2016-01-22 Thread Steve Singer
The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:   not tested
Spec compliant:   not tested
Documentation:not tested


This reply covers a 10,000-foot-level review of the feature (some of my 
other replies to the thread cover specifics that came up in testing, and a 
code-level review will come later)

1) Do we want logical replication in core/contrib

Ten years ago a popular feeling in the postgresql project was that replication 
didn't belong in core because there were too many different styles.  People 
then went on to complain that there were too many replication projects to choose 
from and that they were hard to use and had lots of corner cases.   The 
evolution of WAL based replication showed us how popular in-core replication 
is.  Users like being able to use in-core features and our community process 
tends to produce better quality in-core solutions than external projects.

I am of the opinion that if we can come up with a solution that meets some 
common use cases then it would be good to have those features in core/contrib.  
At this stage I am not going to get into a discussion of a contrib extension 
versus built in as not an extension.   I don't think a single replication 
solution
will ever meet all use-cases.   I feel that the extensible infrastructure we 
have so far built for logical replication means that people who want to develop 
solutions for use-cases not covered will be in a good position.  This doesn't 
mean we can't or shouldn't try to cover some use cases in core.


2) Does this patch provide a set of logical replication features that meet many 
popular use-cases

Below I will review some use-cases and try to assess how pglogical meets them.

 ** Streaming Postgresql Upgrade

pg_upgrade is great for many situations, but sometimes you don't want an 
in-place upgrade; you want a streaming upgrade instead.  Possibly because you 
don't want application downtime, but instead you just want to point your 
applications at the upgraded database server in a controlled manner.   Other 
times you might want the option of upgrading to a newer version of PG while 
maintaining the option of rolling back to the older version if things go badly.

I think pglogical should be able to handle this use case pretty well (assuming 
the source version of PG is actually new enough to include pglogical). 
Support for replicating sequences would need to be added before this is 
smooth, but once sequence support is added I think this would work well.
I also don't see any reason why you couldn't replicate from 9.7 -> 9.6, though, 
since the wire format is abstracted from the internal representation.  This is 
of course dependent on the application not doing anything that is inherently 
incompatible between the two versions.


** Query only replicas (with temp tables or additional indexes)

Sometimes you want a replica for long running or heavy queries.  Requirements 
for temp tables, additional indexes, or maybe the effect on vacuum mean that 
our existing WAL based replicas are unsuitable. 

I think pglogical should be able to handle this use case pretty well with the 
caveat being that your replica is an asynchronous replica and will always lag
the origin by some amount.

** Replicating a subset of tables into a different database

Sometimes you want to replicate a handful of tables from one database to another 
database.  Maybe the first database is the system of record for the data and 
the second database needs an up to date copy for querying.

Pglogical should meet this use case pretty well; it has flexible support for 
selecting which tables get replicated from which source.  Pglogical doesn't 
have any facilities for renaming the tables between the origin and replica, but 
they could be added later.

** Sharding

Systems that do application level sharding (or even sharding with a fdw) often 
have non-sharded tables that need to be available on all shards for relational 
integrity or joins.   Logical replication is one way to make sure that the 
replicated data gets to all the shards.  Sharding systems also sometimes want
to take the data from individual shards and replicate it to a consolidation 
server for reporting purposes.

Pglogical seems to meet this use case; I guess you would have a designated 
origin for the shared/global data that all shards would subscribe to,
with a set containing the designated data.  For the consolidation use case you 
would have the consolidation server subscribe to all shards.

I am less clear about how someone would want DDL changes to work for these 
cases.  The DDL support in the patch is pretty limited so I am not going to 
think much now about how we would want DDL to work.


** Schema changes involving rewriting big tables

Sometimes you have a DDL change on a large table that will involve a table 
rewrite and the best way of deploying the change is to make the DDL 

Re: [HACKERS] Patch: Implement failover on libpq connect level.

2016-01-22 Thread Thom Brown
On 22 January 2016 at 19:30, Victor Wagner  wrote:
> On Tue, 19 Jan 2016 14:34:54 +
> Thom Brown  wrote:
>
>>
>> The segfault issue I originally reported now appears to be resolved.
>>
>> But now I have another one:
>>
>> psql
>> 'postgresql://thom@127.0.0.1:5530,127.0.0.1:5531,127.0.0.1:5532,127.0.0.1:5533/postgres?hostorder=random=1_timeout=5'
>> -c 'show port'
>
> Here is a new version of the patch. Now I've reworked hostorder=random and
> it seems to work as well as sequential. failover_timeout works too.
> I've also found a case where an attempt to connect fails when receiving a
> FATAL message from a server which is not properly up yet. So, it is fixed
> too.
>
> Additionally, error messages from all failed connect attempts are not
> accumulated now. Only the last one is returned.

I can't connect to a standby with the patch applied:

thom@swift:~/Development/test$ psql -p 5531 postgres
psql: thom@swift:~/Development/test$

No error message, nothing in the logs.  I find this is the case with
any standby, but doesn't affect primaries.

So this has broken existing functionality somewhere.

Thom


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: Implement failover on libpq connect level.

2016-01-22 Thread Thom Brown
On 23 January 2016 at 03:32, Thom Brown  wrote:
> On 22 January 2016 at 19:30, Victor Wagner  wrote:
>> On Tue, 19 Jan 2016 14:34:54 +
>> Thom Brown  wrote:
>>
>>>
>>> The segfault issue I originally reported now appears to be resolved.
>>>
>>> But now I have another one:
>>>
>>> psql
>>> 'postgresql://thom@127.0.0.1:5530,127.0.0.1:5531,127.0.0.1:5532,127.0.0.1:5533/postgres?hostorder=random=1_timeout=5'
>>> -c 'show port'
>>
>> Here is a new version of the patch. Now I've reworked hostorder=random and
>> it seems to work as well as sequential. failover_timeout works too.
>> I've also found a case where an attempt to connect fails when receiving a
>> FATAL message from a server which is not properly up yet. So, it is fixed
>> too.
>>
>> Additionally, error messages from all failed connect attempts are not
>> accumulated now. Only the last one is returned.
>
> I can't connect to a standby with the patch applied:
>
> thom@swift:~/Development/test$ psql -p 5531 postgres
> psql: thom@swift:~/Development/test$
>
> No error message, nothing in the logs.  I find this is the case with
> any standby, but doesn't affect primaries.
>
> So this has broken existing functionality somewhere.

Okay, I've tested my original complaints, and they appear to be
resolved.  The timeout works fine, the sequential and random orders
behave as expected, and the readonly setting is also working.

Thom


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Aggregate

2016-01-22 Thread Haribabu Kommi
On Fri, Jan 22, 2016 at 10:13 PM, David Rowley
 wrote:
> On 22 January 2016 at 17:25, Haribabu Kommi  wrote:
>> Along with these changes, I added a float8 combine function to see
>> how it works under parallel aggregate, it is working fine for float4, but
>> giving small data mismatch with float8 data type.
>>
>> postgres=# select avg(f3), avg(f4) from tbl;
>>avg|   avg
>> --+--
>>  1.1002384186 | 100.12344879
>> (1 row)
>>
>> postgres=# set enable_parallelagg = true;
>> SET
>> postgres=# select avg(f3), avg(f4) from tbl;
>>avg|   avg
>> --+--
>>  1.1002384186 | 100.12344918
>> (1 row)
>>
>> Column - f3 - float4
>> Column - f4 - float8
>>
>> A similar problem occurs for all the float8 var_pop, var_samp, stddev_pop
>> and stddev_samp aggregates. Is any special care needed for the float8
>> datatype?
>
> I'm not sure if this is what's going on here, as I don't really know
> the range of numbers that you've used to populate f4 with. It would be
> good to know: does "f4" contain negative values too?

No negative values are present in the f4 column.
The SQL statements used are as follows:

create table tbl(f1 int, f2 char(100), f3 float4, f4 float8);
insert into tbl values(generate_series(1,100000), 'Fujitsu', 1.1, 100.12345);


> It's not all that hard to demonstrate the instability of addition with
> float8. Take the following example:
>
> create table d (d float8);
> insert into d 
> values(1223123223412324.2231),(0.00000000000023),(-1223123223412324.2231);
>
> # select sum(d order by random()) from d;
>  sum
> -
>0
> (1 row)
>
> same query, once more.
>
> # select sum(d order by random()) from d;
>sum
> --
>  2.3e-013
> (1 row)
>
> Here the result just depends on the order in which the numbers have been
> added. You may need to execute it a few more times to see the result
> change.
>
> Perhaps a good test would be to perform a sum(f4 order by random()) in
> serial mode, and see if you're getting a stable result from the
> numbers that you have populated the table with.
>
> If that's the only problem at play here, then I for one am not worried
> about it, as the instability already exists today depending on which
> path is chosen to scan the relation. For example an index scan is
> likely not to return rows in the same order as a seq scan.
>
> We do also warn about this in the manual: "Inexact means that some
> values cannot be converted exactly to the internal format and are
> stored as approximations, so that storing and retrieving a value might
> show slight discrepancies. Managing these errors and how they
> propagate through calculations is the subject of an entire branch of
> mathematics and computer science and will not be discussed here,
> except for the following points:" [1]
>
>
> [1] http://www.postgresql.org/docs/devel/static/datatype-numeric.html
>

Thanks for the detailed explanation. Now I understand.
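For the record, the order dependence is easy to reproduce outside the
database too; a standalone C sketch using the same numbers as the
example above (nothing here is from the patch, just IEEE 754 double
arithmetic):

#include <stdio.h>

int
main(void)
{
    double big = 1223123223412324.2231;
    double tiny = 0.00000000000023;

    /* tiny is far below one ulp of big, so it is lost if added first */
    printf("%g\n", (big + tiny) - big);     /* prints 0 */

    /* added after the large terms cancel, tiny survives */
    printf("%g\n", (big - big) + tiny);     /* prints 2.3e-13 */

    return 0;
}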

I've attached an updated patch with additional combine functions for
two-stage aggregates.
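For readers following the thread, the essence of a combine step for the
float8 avg/variance family is an element-wise merge of partial
transition states. A simplified sketch, with the fmgr scaffolding
omitted and assuming the usual {N, sum(X), sum(X*X)} state layout; the
names are illustrative, not the patch's code:

#include <stdio.h>

typedef struct
{
    double N;       /* number of input rows seen */
    double sumX;    /* sum of inputs */
    double sumX2;   /* sum of squares of inputs */
} Float8AvgState;

/* merge one worker's partial state into the collected state */
static void
float8_combine_states(Float8AvgState *collected, const Float8AvgState *partial)
{
    collected->N += partial->N;
    collected->sumX += partial->sumX;
    collected->sumX2 += partial->sumX2;
}

int
main(void)
{
    Float8AvgState a = {2, 3.0, 5.0};
    Float8AvgState b = {3, 7.0, 19.0};

    float8_combine_states(&a, &b);
    printf("avg = %g\n", a.sumX / a.N);     /* prints avg = 2 */
    return 0;
}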


Regards,
Hari Babu
Fujitsu Australia


additional_combine_fns_v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] silent data loss with ext4 / all current versions

2016-01-22 Thread Tomas Vondra

On 01/23/2016 02:35 AM, Michael Paquier wrote:
> On Fri, Jan 22, 2016 at 9:41 PM, Greg Stark  wrote:
>> On Fri, Jan 22, 2016 at 8:26 AM, Tomas Vondra
>>  wrote:
>>> On 01/22/2016 06:45 AM, Michael Paquier wrote:
>>>
>>>> So, I have been playing with a Linux VM with VMware Fusion and on
>>>> ext4 with data=ordered, the renames are getting lost if the root
>>>> folder is not fsync'd. By kill-9'ing the VM I am able to reproduce
>>>> that really easily.
>>>
>>> Yep. Same experience here (with qemu-kvm VMs).
>>
>> I still think a better approach for this is to run the database on an
>> LVM volume and take lots of snapshots. No VM needed, though it doesn't
>> hurt. LVM volumes are below the level of the filesystem, and a snapshot
>> captures the state of the raw blocks the filesystem has written to the
>> block layer. The block layer does no caching; the drive may, but
>> neither the VM solution nor LVM would capture that.
>>
>> LVM snapshots would have the advantage that you can keep running the
>> database, and you can take lots of snapshots with relatively little
>> overhead. Having dozens or hundreds of snapshots would be an unacceptable
>> performance drain in production, but for testing it should be practical,
>> and they take relatively little space -- just the blocks changed since
>> the snapshot was taken.
>
> Another idea: hardcode a PANIC just after rename() with
> restart_after_crash = off (this needs IsBootstrapProcess() checks).
> Once the server crashes, kill-9 the VM. Then restart the VM and the
> Postgres instance with a new binary that does not have the PANIC, and
> see how things are moving on. There is a window of up to several
> seconds after the rename() call, so I guess that this would work.


I don't see how that would improve anything, as the PANIC has no impact
on the I/O requests already issued to the system. What you need is some
sort of coordination between the database and the script that kills the
VM (or takes an LVM snapshot).

That can be done by emitting a particular log message, which the "kill
script" simply watches the log file for (for example over SSH). This has
the benefit that you can also watch for additional conditions that are
difficult to check from that particular part of the code (and only kill
the VM when all of them trigger - for example only on the third
checkpoint since start, and such).
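A minimal sketch of such a watcher, written in C for concreteness; the
log path, marker string and kill command are all placeholders, and in
practice this would more likely be a small shell script run over SSH:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    FILE *log = fopen("pg_log/postmaster.log", "r");    /* placeholder */
    char  line[4096];

    if (log == NULL)
        return 1;

    for (;;)
    {
        while (fgets(line, sizeof(line), log) != NULL)
        {
            if (strstr(line, "checkpoint complete") != NULL)
            {
                /* placeholder action: hard-kill the VM or take the snapshot */
                system("ssh vmhost 'kill -9 $(pgrep qemu)'");
                return 0;
            }
        }
        clearerr(log);      /* resume tailing as the log grows */
        sleep(1);
    }
}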


The reason why I was not particularly thrilled about the LVM snapshot
idea is that to identify this particular data loss issue, you need to be
able to reason about the expected state of the database (what
transactions are committed, how many segments are there). And my
understanding was that Greg's idea was merely "try to start the DB on a
snapshot and see if it starts / is not corrupted," which would not work
with this particular issue, as the database seemed just fine - the data
loss is silent. Adding the "last XLOG segment" into pg_controldata would
make it easier to detect without having to track details about which
transactions got committed.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Combining Aggregates

2016-01-22 Thread David Rowley
On 23 January 2016 at 09:44, David Rowley  wrote:
> On 23 January 2016 at 09:17, Jeff Janes  wrote:
>> On Wed, Jan 20, 2016 at 11:06 AM, Robert Haas  wrote:
>>> On Wed, Jan 20, 2016 at 7:38 AM, David Rowley
>>>  wrote:
 Agreed. So I've attached a version of the patch which does not have any of
 the serialise/deserialise stuff in it.
>>>
>>> I re-reviewed this and have committed most of it with only minor
>>> kibitzing.  A few notes:
>>
>>
>> This commit has broken pg_dump.  At least, I think this is the thread
>> referencing this commit:
>>
> ...
>> pg_dump: column number -1 is out of range 0..17
>> Segmentation fault (core dumped)
>>
>> Program terminated with signal 11, Segmentation fault.
>> #0  0x00416b0b in dumpAgg (fout=0x1e551e0, agginfo=0x1e65ec0)
>> at pg_dump.c:12670
>> 12670if (strcmp(aggcombinefn, "-") != 0)
>> (gdb) bt
>> #0  0x00416b0b in dumpAgg (fout=0x1e551e0, agginfo=0x1e65ec0)
>> at pg_dump.c:12670
>> #1  0x0041df7a in main (argc=,
>> argv=) at pg_dump.c:810
>
> Yikes :-( I will look at this with priority in the next few hours.
>
> Thanks for the report.

It seems that I mistakenly believed that columns which don't exist in
previous server versions were handled by the power of magic. Turns out
I was wrong: they need to be included as dummy columns in the queries
sent to those older versions.

The attached fixes the problem.
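For anyone curious about the failure mode: PQfnumber() returns -1 when
the result set lacks the named column, which is what produced the crash
above. A simplified, self-contained sketch of the dummy-column pattern,
using plain libpq; the names and version cutoff are illustrative rather
than the exact pg_dump code:

#include <stdio.h>
#include <string.h>
#include <libpq-fe.h>

static void
dump_agg_combinefns(PGconn *conn, int remoteVersion)
{
    PGresult   *res;
    const char *query;
    int         i_combinefn;
    int         i;

    if (remoteVersion >= 90600)
        query = "SELECT aggfnoid::pg_catalog.regproc, aggcombinefn "
                "FROM pg_catalog.pg_aggregate";
    else
        /* older servers lack the column; select a dummy value so the
         * column-number lookup below cannot return -1 */
        query = "SELECT aggfnoid::pg_catalog.regproc, '-' AS aggcombinefn "
                "FROM pg_catalog.pg_aggregate";

    res = PQexec(conn, query);
    if (PQresultStatus(res) != PGRES_TUPLES_OK)
    {
        PQclear(res);
        return;
    }

    i_combinefn = PQfnumber(res, "aggcombinefn");

    for (i = 0; i < PQntuples(res); i++)
    {
        const char *combinefn = PQgetvalue(res, i, i_combinefn);

        /* regproc renders InvalidOid as "-", hence the dummy value */
        if (strcmp(combinefn, "-") != 0)
            printf("aggregate %s has combine function %s\n",
                   PQgetvalue(res, i, 0), combinefn);
    }
    PQclear(res);
}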

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


combine_aggs_pg_dump_fix.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] silent data loss with ext4 / all current versions

2016-01-22 Thread Michael Paquier
On Fri, Jan 22, 2016 at 9:41 PM, Greg Stark  wrote:
> On Fri, Jan 22, 2016 at 8:26 AM, Tomas Vondra
>  wrote:
>> On 01/22/2016 06:45 AM, Michael Paquier wrote:
>>
>>> So, I have been playing with a Linux VM with VMware Fusion and on
>>> ext4 with data=ordered, the renames are getting lost if the root
>>> folder is not fsync'd. By kill-9'ing the VM I am able to reproduce that
>>> really easily.
>>
>>
>> Yep. Same experience here (with qemu-kvm VMs).
>
> I still think a better approach for this is to run the database on an
> LVM volume and take lots of snapshots. No VM needed, though it doesn't
> hurt. LVM volumes are below the level of the filesystem, and a snapshot
> captures the state of the raw blocks the filesystem has written to the
> block layer. The block layer does no caching; the drive may, but
> neither the VM solution nor LVM would capture that.
>
> LVM snapshots would have the advantage that you can keep running the
> database and you can take lots of snapshots with relatively little
> overhead. Having dozens or hundreds of snapshots would be an unacceptable
> performance drain in production, but for testing it should be practical,
> and they take relatively little space -- just the blocks changed since
> the snapshot was taken.

Another idea: hardcode a PANIC just after rename() with
restart_after_crash = off (this needs IsBootstrapProcess() checks).
Once the server crashes, kill-9 the VM. Then restart the VM and the
Postgres instance with a new binary that does not have the PANIC, and
see how things are moving on. There is a window of up to several
seconds after the rename() call, so I guess that this would work.
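For context, the fix being discussed elsewhere in this thread amounts to
making the rename durable by also fsyncing the containing directory. A
minimal sketch of that pattern with plain POSIX calls; the helper name
and paths are illustrative:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* rename oldpath to newpath and force the rename itself to disk */
static int
durable_rename(const char *oldpath, const char *newpath, const char *dir)
{
    int fd;

    if (rename(oldpath, newpath) < 0)
        return -1;

    /* fsync the parent directory, or the rename itself can be lost */
    fd = open(dir, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fsync(fd) < 0)
    {
        close(fd);
        return -1;
    }
    return close(fd);
}

int
main(void)
{
    return durable_rename("datafile.tmp", "datafile", ".") == 0 ? 0 : 1;
}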
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Trigonometric functions in degrees

2016-01-22 Thread Tom Lane
I wrote:
> Dean Rasheed  writes:
>> Attached are patches for this and the new functions in degrees, now
>> with POSIX compatible Inf/NaN handling.

> Pushed with minor, mostly cosmetic fixes.

So the early returns from the buildfarm aren't very good:

* tern/sungazer isn't getting exactly 0.5 from sind(30).

* narwhal isn't getting exactly 1 from tand(45) or cotd(45).
It also is producing "0" not "-0" from tand(180) and cotd(270).

* gaur likewise is getting "0" not "-0" from tand(180) and cotd(270).


The tern/sungazer result implies that this:

return (sin(x * (M_PI / 180.0)) / sin(30.0 * (M_PI / 180.0))) / 2.0;

is not producing exactly 0.5, which means that the two sin() calls aren't
producing identical results, which I suspect is best explained by the
theory that the compiler is rearranging 30.0 * (M_PI / 180.0) into
(30.0 * M_PI) / 180.0, and getting a slightly different number that way.

I think we could fix that by replacing (M_PI / 180.0) by a hard-wired
constant (computed to say 20 digits or so).  I'd be inclined to do that
throughout the file for consistency; however, in principle such a change
might affect existing results from the radians() and degrees() functions.
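A sketch of that approach, with the constant written out well past
double precision; the macro and function names are illustrative, and the
body follows the return statement quoted above:

#include <math.h>
#include <stdio.h>

/* pi / 180, hard-wired so the compiler cannot re-associate the division */
#define RADIANS_PER_DEGREE 0.017453292519943295769

static double
sind_0_to_30(double x)
{
    return (sin(x * RADIANS_PER_DEGREE) /
            sin(30.0 * RADIANS_PER_DEGREE)) / 2.0;
}

int
main(void)
{
    printf("%.17g\n", sind_0_to_30(30.0));  /* expect exactly 0.5 */
    return 0;
}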

The tand(45) problem doesn't seem especially surprising, because we're
really not making any effort to ensure that that result is exact.
A brute-force way to fix it would be to divide
(sind_q1(arg1) / cosd_q1(arg1)) by (sind_q1(45.0) / cosd_q1(45.0))
but that seems rather expensive, and possibly subject to the compiler
deciding to reorder the arithmetic operations.  I wonder if there's a
smarter way.

As for the minus-zero issues, I doubt that (a) there is a basis for
claiming that minus-zero is more correct than plain zero, or (b) the user
community for this feature really wants minus-zero results anyhow.
I propose getting rid of minus-zero outputs from tand and cotd by dint
of code like

    if (result == 0.0)
        result = 0.0;
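This works because IEEE 754 comparison treats minus zero as equal to
plain zero, so the branch catches -0.0 and the assignment overwrites it
with +0.0. A standalone check, purely illustrative:

#include <stdio.h>

int
main(void)
{
    double result = -0.0;

    printf("%g\n", result);     /* -0 */
    if (result == 0.0)          /* true: -0.0 compares equal to 0.0 */
        result = 0.0;
    printf("%g\n", result);     /* 0 */
    return 0;
}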

Thoughts?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers