Re: [HACKERS] pg_rewind failure by file deletion in source server

2015-06-23 Thread Michael Paquier
On Tue, Jun 23, 2015 at 9:19 PM, Heikki Linnakangas  wrote:
> On 06/23/2015 07:51 AM, Michael Paquier wrote:
>> - 0001, add if_not_exists to pg_tablespace_location, returning NULL if
>> path does not exist
>> - 0002, same with pg_stat_file, returning NULL if file does not exist
>> - 0003, same with pg_read_*file. I added them to all the existing
>> functions for consistency.
>> - 0004, pg_ls_dir extended with if_not_exists and include_dot_dirs
>> (thanks Robert for the naming!)
>> - 0005, as things get complex, a set of regression tests aimed to
>> covering those things. pg_tablespace_location is platform-dependent,
>> so there are no tests for it.
>> - 0006, the fix for pg_rewind, using what has been implemented before.
>
>
> With these patches, pg_read_file() will return NULL for any failure to open
> the file, which makes pg_rewind assume that the file doesn't exist in the
> source server and remove the file from the destination. That's
> dangerous; those functions should check specifically for ENOENT.

This makes sense. I changed all those functions to do so.
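To make the distinction concrete, here is a minimal standalone sketch (plain
POSIX C, not the code in the patches) of the rule: only ENOENT maps to "file
does not exist"; any other failure is still raised as an error.

#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Open a file for reading.  Returns the descriptor, or -1 with *missing
 * set if (and only if) the file does not exist.  Any other failure
 * (EACCES, EIO, ...) is reported, so a permission problem can never be
 * mistaken for an absent file.
 */
static int
open_if_exists(const char *path, bool *missing)
{
    int fd = open(path, O_RDONLY);

    *missing = false;
    if (fd < 0)
    {
        if (errno == ENOENT)
            *missing = true;        /* caller maps this to a NULL result */
        else
            fprintf(stderr, "could not open \"%s\": %s\n",
                    path, strerror(errno));
    }
    return fd;
}

int
main(int argc, char **argv)
{
    bool missing;
    int  fd = open_if_exists(argc > 1 ? argv[1] : "no-such-file", &missing);

    if (fd >= 0)
        close(fd);
    printf("missing = %s\n", missing ? "true" : "false");
    return 0;
}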

> There's still a small race condition with tablespaces. If you run CREATE
> TABLESPACE in the source server while pg_rewind is running, it's possible
> that the recursive query that pg_rewind uses sees the symlink in pg_tblspc/
> directory, but its snapshot doesn't see the row in pg_tablespace yet. It
> will think that the symlink is a regular file, try to read it, and fail (if
> we checked for ENOENT).
> Actually, I think we need to try to deal with symlinks a bit harder. Currently,
> pg_rewind assumes that anything in pg_tblspc that has a matching row in
> pg_tablespace is a symlink, and nothing else is. I think we also need to handle
> other symlinks to directories. I just noticed that pg_rewind fails miserably if
> pg_xlog is a symlink, because of that:
>
> 
> The servers diverged at WAL position 0/3023F08 on timeline 1.
> Rewinding from last common checkpoint at 0/260 on timeline 1
>
> "data-master//pg_xlog" is not a directory
> Failure, exiting
> 

It may be possible that in this case the path is a symlink on the
source but not on the target, or vice-versa, so this looks too
restrictive to me if we begin to use pg_readlink. Think for example of
pg_xlog being a symlink that is not included in a base backup: we
ignore it in copy_fetch.c for pg_xlog and the other cases, so I think
that here as well we should ignore those errors except for tablespaces.

> I think we need to add a column to pg_stat_file output, to indicate symbolic
> links, and add a pg_readlink() function. That still leaves a race condition
> if the type of a file changes, i.e. a file is deleted and a directory with
> the same name is created in its place, but that seems acceptable. I don't
> think PostgreSQL ever does such a thing, so that could only happen if you
> mess with the data directory manually while the server is running.

Hm. pg_stat_file currently uses stat(), not lstat(), so it cannot tell
whether a path is a link or not. I have changed pg_stat_file to use
lstat() instead to cover this case in my set of patches, but a new
function may be a better answer here. I have also added a pg_readlink()
function on the stack, and actually the if_not_exists mode of
pg_tablespace_location is not needed anymore if we rely on pg_readlink
in this case. I have left it in the set of patches though, as it still
looks useful for other utility tools.
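For reference, a standalone sketch of what the extra column and pg_readlink()
boil down to at the system-call level (plain POSIX usage, not the patch code):
lstat() reports on the link itself, and readlink() returns its target.

#include <limits.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "pg_xlog";
    struct stat st;

    /* lstat(), unlike stat(), does not follow symbolic links */
    if (lstat(path, &st) != 0)
    {
        perror("lstat");
        return 1;
    }

    if (S_ISLNK(st.st_mode))
    {
        char    target[PATH_MAX];
        ssize_t len = readlink(path, target, sizeof(target) - 1);

        if (len < 0)
        {
            perror("readlink");
            return 1;
        }
        target[len] = '\0';         /* readlink() does not NUL-terminate */
        printf("%s is a symlink to %s\n", path, target);
    }
    else
        printf("%s is %s\n", path,
               S_ISDIR(st.st_mode) ? "a directory" : "a regular file");
    return 0;
}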

> I just realized another problem: We recently learned the hard way that some
> people have files in the data directory that are not writeable by the
> 'postgres' user
> (http://www.postgresql.org/message-id/20150523172627.ga24...@msg.df7cb.de).
> pg_rewind will try to overwrite all files it doesn't recognize as relation
> files, so it's going to fail on those. A straightforward fix would be to
> first open the destination file in read-only mode, and compare its contents,
> and only open the file in write mode if it has changed. It would still fail
> when the files really differ, but I think that's acceptable.

If I am not missing anything, two code paths need to be patched here:
copy_file_range and receiveFileChunks. copy_file_range is
straightforward. For receiveFileChunks, though, wouldn't it be better to
write the contents into a temporary file, compare the contents, and then
switch the files if necessary?
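As a hedged sketch of the compare-before-write idea for the copy_file_range
side (chunk size and error handling here are illustrative, not pg_rewind's
actual code): the destination is first opened read-only, and only reopened for
writing when the bytes actually differ, so an unwritable-but-identical file is
left alone.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define CHUNK 8192              /* callers are assumed to write at most this much */

/* Returns 0 on success (including "nothing to do"), -1 on error. */
static int
write_if_changed(const char *path, const char *buf, size_t len, off_t offset)
{
    char existing[CHUNK];
    int  fd;

    if (len > CHUNK)
        return -1;

    /* Try read-only first: if the content already matches, never open for write */
    fd = open(path, O_RDONLY);
    if (fd >= 0)
    {
        ssize_t got = pread(fd, existing, len, offset);

        close(fd);
        if (got == (ssize_t) len && memcmp(existing, buf, len) == 0)
            return 0;
    }

    /* Content differs (or could not be read): now write access is really needed */
    fd = open(path, O_WRONLY | O_CREAT, 0600);
    if (fd < 0)
    {
        perror("open for write");
        return -1;
    }
    if (pwrite(fd, buf, len, offset) != (ssize_t) len)
    {
        perror("pwrite");
        close(fd);
        return -1;
    }
    return close(fd);
}

int
main(void)
{
    const char *data = "hello\n";

    return write_if_changed("demo.txt", data, strlen(data), 0) == 0 ? 0 : 1;
}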

> I note that pg_rewind doesn't need to distinguish between an empty and a
> non-existent directory, so it's quite silly for it to pass
> include_dot_dirs=true, and then filter out "." and ".." from the result set.
> The documentation should mention the main reason for including "." and "..":
> to distinguish between an empty and non-existent directory.

OK. Switched to that in the first patch for pg_rewind.

Attached is a new set of patches. Except for the last ones, which
address one issue of pg_rewind (symlink management when streaming
PGDATA), all the others introduce if_not_exists options for the
functions of genfile.c. The pg_rewind stuff could be mor

Re: [HACKERS] checkpointer continuous flushing

2015-06-23 Thread Fabien COELHO



Besides, causing additional cacheline bouncing during the
sorting process is a bad idea.


Hmmm. The impact would be to multiply the memory required by 3 or 4 (buf_id, 
relation, forknum, offset), instead of just buf_id, and I understood that 
memory was a concern.
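For concreteness, a minimal sketch of the kind of per-buffer entry and
comparator this implies (the field names are illustrative assumptions, not the
patch's actual struct): sorting on (tablespace, relation, fork, block) is what
costs the extra 3-4x memory per entry compared to a bare buf_id.

#include <stdint.h>
#include <stdlib.h>

/* Per-buffer sort entry: roughly 4x the size of a bare buf_id. */
typedef struct SortEntry
{
    uint32_t tablespace;
    uint32_t relation;
    uint32_t forknum;
    uint32_t blocknum;
    int      buf_id;
} SortEntry;

/* Order buffers by file location so neighboring blocks are written together. */
static int
sort_entry_cmp(const void *a, const void *b)
{
    const SortEntry *x = a;
    const SortEntry *y = b;

    if (x->tablespace != y->tablespace)
        return x->tablespace < y->tablespace ? -1 : 1;
    if (x->relation != y->relation)
        return x->relation < y->relation ? -1 : 1;
    if (x->forknum != y->forknum)
        return x->forknum < y->forknum ? -1 : 1;
    if (x->blocknum != y->blocknum)
        return x->blocknum < y->blocknum ? -1 : 1;
    return 0;
}

int
main(void)
{
    SortEntry entries[] = {
        {1663, 16385, 0, 42, 7},
        {1663, 16384, 0,  7, 3},
        {1663, 16384, 0,  1, 9},
    };

    /* O(n log n), as discussed below for the 128000-buffer case */
    qsort(entries, 3, sizeof(SortEntry), sort_entry_cmp);
    return entries[0].buf_id == 9 ? 0 : 1;
}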


Moreover, once the sort process gets the lines which contain the sorting data
from the buffer descriptors into its cache, I think that it should be pretty
much okay. Incidentally, they would probably have been brought into cache by
the scan that collects them. Also, I do not think that the sorting time for
128000 buffers, and the possible cache misses, was a big issue, but I do not have
a measurement to defend that. I could try to collect some data about that.


I've collected some data by adding a "sort time" measure, with 
checkpoint_sort_size=1000 so that sorting is in one chunk, and done 
some large checkpoints:


LOG:  checkpoint complete: wrote 41091 buffers (6.3%);
  0 transaction log file(s) added, 0 removed, 0 recycled;
  sort=0.024 s, write=0.488 s, sync=8.790 s, total=9.837 s;
  sync files=41, longest=8.717 s, average=0.214 s;
  distance=404972 kB, estimate=404972 kB

LOG:  checkpoint complete: wrote 212124 buffers (32.4%);
  0 transaction log file(s) added, 0 removed, 0 recycled;
  sort=0.078 s, write=128.885 s, sync=1.269 s, total=131.646 s;
  sync files=43, longest=1.155 s, average=0.029 s;
  distance=2102950 kB, estimate=2102950 kB

LOG:  checkpoint complete: wrote 384427 buffers (36.7%);
  0 transaction log file(s) added, 0 removed, 1 recycled;
  sort=0.120 s, write=83.995 s, sync=13.944 s, total=98.035 s;
  sync files=9, longest=13.724 s, average=1.549 s;
  distance=3783305 kB, estimate=3783305 kB

LOG:  checkpoint complete: wrote 809211 buffers (77.2%);
  0 transaction log file(s) added, 0 removed, 1 recycled;
  sort=0.358 s, write=138.146 s, sync=14.943 s, total=153.124 s;
  sync files=13, longest=14.871 s, average=1.149 s;
  distance=8075338 kB, estimate=8075338 kB

Summary of these checkpoints:

  #buffers   size   sort
 41091  328MB  0.024
212124  1.7GB  0.078
384427  2.9GB  0.120
809211  6.2GB  0.358

Sort times are pretty negligible compared to the whole checkpoint time,
and under 0.1 s/GB of buffers sorted.

On a 512 GB server with shared_buffers=128GB (25%), this suggests a worst-case
checkpoint sort of a few seconds, and then you have a hundred GB to write
anyway. If we project a 1 TB checkpoint for the next decade, that would put
sorting at under a minute... but then you have 1 TB of data to dump.


As a comparison point, I've done the large checkpoint with the default 
sort size of 131072:


LOG:  checkpoint complete: wrote 809211 buffers (77.2%);
  0 transaction log file(s) added, 0 removed, 1 recycled;
  sort=0.251 s, write=152.377 s, sync=15.062 s, total=167.453 s;
  sync files=13, longest=14.974 s, average=1.158 s;
  distance=8075338 kB, estimate=8075338 kB

The 0.251 sort time is to be compared to 0.358. Well, n.log(n) is not too 
bad, as expected.



These figures suggest that sorting time and associated cache misses are 
not a significant issue and thus are not worth bothering much about, and 
also that probably a simple boolean option would be quite acceptable 
instead of the chunk approach.


Attached is an updated version of the patch which turns the sort option 
into a boolean, and also includes the sort time in the checkpoint log.


There is still an open question about whether the sorting buffer 
allocation is lost on some signals and should be reallocated in such 
an event.


--
Fabien.

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 1da7dfb..d7c1ff8 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2474,6 +2474,28 @@ include_dir 'conf.d'
   
  
 
+ 
+  checkpoint_sort (bool)
+  
+   checkpoint_sort configuration parameter
+  
+  
+  
+   
+Whether to sort buffers before writing them out to disk on checkpoint.
+For HDD storage, this setting allows grouping together
+neighboring pages written to disk, thus improving performance by
+reducing random write activity.
+This sorting should have limited performance effects on SSD backends
+as such storage has good random write performance, but it may
+help with wear-leveling, so it may be worth keeping anyway.
+The default is on.
+This parameter can only be set in the postgresql.conf
+file or on the server command line.
+   
+  
+ 
+
  
   checkpoint_warning (integer)
   
@@ -2495,6 +2517,24 @@ include_dir 'conf.d'
   
  
 
+ 
+  checkpoint_flush_to_disk (bool)
+  
+   checkpoint_flush_to_disk configuration parameter
+  
+  
+  
+   
+When writing data for a checkpoint, hint the underlying OS that the
+data must be sent to disk as soon as possible.  This may help smooth
+disk I/O writes and avoid a stall 
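As a hedged illustration only (the mechanism here is an assumption on my part,
since the patch may do this differently), the kind of OS hint that
checkpoint_flush_to_disk describes is what Linux's sync_file_range() provides:
start writeback of a byte range now, without waiting for it to complete.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Ask the kernel to begin writeback of a byte range, without blocking on it. */
static void
hint_flush(int fd, off_t offset, off_t nbytes)
{
    /* Linux-specific; other platforms would need a different hint, or none */
    if (sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE) != 0)
        perror("sync_file_range");
}

int
main(void)
{
    int fd = open("demo.dat", O_RDWR | O_CREAT, 0600);

    if (fd < 0)
        return 1;
    if (write(fd, "x", 1) == 1)
        hint_flush(fd, 0, 1);
    close(fd);
    return 0;
}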

Re: [HACKERS] checkpointer continuous flushing

2015-06-23 Thread Fabien COELHO



I do not see how to do both, as these two orders seem more or less
unrelated?  The traditional assumption is that the I/Os are very slow
and they are to be optimized first, so going for buffer ordering to be
nice to the disk looks like the priority.


The point is that it's already expensive for backends to advance the clock; 
if they then have to wait on IO as well it gets REALLY expensive. So we want 
to avoid that.


I do not know what this clock stuff does. Note that the checkpoint buffer 
scan is done once at the beginning of the checkpoint and its time is 
relatively small compared to everything else in the checkpoint.


If this scan is an issue, it can be done in reverse order, or in some 
other order, but I think it is better to do it in order for better cache 
behavior, although the effect should be marginal.


--
Fabien.




Re: [HACKERS] checkpointer continuous flushing

2015-06-23 Thread Fabien COELHO



  flsh |  full speed tps   | percent of late tx, 4 clients
  /srt |  1 client   |  4 clients  |   100 |   200 |   400 |
   N/N | 173 +- 289* | 198 +- 531* | 27.61 | 43.92 | 61.16 |
   N/Y | 458 +- 327* | 743 +- 920* |  7.05 | 14.24 | 24.07 |
   Y/N | 169 +- 166* | 187 +- 302* |  4.01 | 39.84 | 65.70 |
   Y/Y | 546 +- 143  | 681 +- 459  |  1.55 |  3.51 |  2.84 |

The effect of sorting is very positive (+150% to 270% tps). On this run,


flushing has a positive (+20% with 1 client) or negative (-8% with 4
clients) effect on throughput, and late transactions are reduced by 92-95% when
both options are activated.

Why is there a dip in performance with multiple clients,


I'm not sure I see the "dip". The performance is better with 4 clients
than with 1 client?


What do you mean by "negative (-8% with 4 clients) on throughput" in the
above sentence? I thought by that you meant that there is a dip in TPS with 
the patch as compared to HEAD at 4 clients.


Ok, I misunderstood your question. I thought you meant a dip between 1 
client and 4 clients. I meant that when flush is turned on, tps goes down 
by 8% (743 to 681 tps) on this particular run. Basically the tps improvements 
mostly come from "sort", while "flush" has uncertain effects on tps 
(throughput), but much more effect on latency and performance stability 
(lower late rate, lower standard deviation).


Note that I'm not comparing to HEAD in the above tests, but to the new 
options deactivated, which should be more or less comparable to current 
HEAD, i.e. there is no sorting or flushing done, but this is not strictly 
speaking HEAD behavior. Probably I should get some figures with HEAD as 
well to check the "more or less" assumption.



Also, I am not completely sure what "+-" means in your data above?


The figure before "+-" is the tps; the figure after it is its standard 
deviation, computed from per-second traces. Some runs are very bad, with 
pgbench stuck at times, resulting in a stddev larger than the average; they 
are noted with "*".



I understand your point and I also don't have any specific answer
for it at this moment; the point of worry is that it should not lead
to degradation of certain cases as compared to the current algorithm.
The workload where it could have an effect is when your data doesn't fit
in shared buffers, but can fit in RAM.


Hmmm. My point of view is still that the logical priority is to optimize 
for disk IO first, then look for compatible RAM optimisations later.


I can run tests with a small shared_buffers, but probably it would just 
trigger a lot of checkpoints, or worse rely on the bgwriter to find space, 
which would generate random IOs.


--
Fabien.




Re: [HACKERS] pg_rewind failure by file deletion in source server

2015-06-23 Thread Fujii Masao
On Wed, Jun 24, 2015 at 3:36 AM, Michael Paquier
 wrote:
> On Wed, Jun 24, 2015 at 1:40 AM, Fujii Masao  wrote:
>> On Tue, Jun 23, 2015 at 11:21 PM, Heikki Linnakangas  wrote:
>>> On 06/23/2015 05:03 PM, Fujii Masao wrote:

 On Tue, Jun 23, 2015 at 9:19 PM, Heikki Linnakangas 
 wrote:
>
> On 06/23/2015 07:51 AM, Michael Paquier wrote:
>>
>>
>> So... Attached are a set of patches dedicated at fixing this issue:
>
>
>
> Thanks for working on this!
>
>> - 0001, add if_not_exists to pg_tablespace_location, returning NULL if
>> path does not exist
>> - 0002, same with pg_stat_file, returning NULL if file does not exist
>> - 0003, same with pg_read_*file. I added them to all the existing
>> functions for consistency.
>> - 0004, pg_ls_dir extended with if_not_exists and include_dot_dirs
>> (thanks Robert for the naming!)
>> - 0005, as things get complex, a set of regression tests aimed to
>> covering those things. pg_tablespace_location is platform-dependent,
>> so there are no tests for it.
>> - 0006, the fix for pg_rewind, using what has been implemented before.
>
>
>
> With these patches, pg_read_file() will return NULL for any failure to
> open
> the file, which makes pg_rewind assume that the file doesn't exist in
> the
> source server and remove the file from the destination. That's
> dangerous; those functions should check specifically for ENOENT.


 I'm wondering if using pg_read_file() to copy the file from the source server
 is reasonable. ISTM that it has two problems, as follows.

 1. It cannot read a very large file, like a 1GB file. So if such a large file
  was created in the source server after failover, pg_rewind would not be able
  to copy the file. No?
>>>
>>>
>>> pg_read_binary_file() handles large files just fine. It cannot return more
>>> than 1GB in one call, but you can call it several times and retrieve the
>>> file in chunks. That's what pg_rewind does, except for reading the control
>>> file, which is known to be small.
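A minimal sketch of that chunked retrieval using libpq directly (the connection
string, file name, and chunk size are illustrative; pg_rewind's own fetch logic
is more elaborate than this):

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn     *conn = PQconnectdb("dbname=postgres");
    long        offset = 0;
    const long  chunk = 1000000;        /* well under the 1GB-per-call limit */

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "%s", PQerrorMessage(conn));
        return 1;
    }

    for (;;)
    {
        char        sql[256];
        PGresult   *res;
        int         len;

        snprintf(sql, sizeof(sql),
                 "SELECT pg_read_binary_file('postgresql.conf', %ld, %ld)",
                 offset, chunk);
        /* request binary output so the bytea arrives as raw bytes */
        res = PQexecParams(conn, sql, 0, NULL, NULL, NULL, NULL, 1);
        if (PQresultStatus(res) != PGRES_TUPLES_OK)
        {
            fprintf(stderr, "%s", PQerrorMessage(conn));
            PQclear(res);
            PQfinish(conn);
            return 1;
        }
        len = PQgetlength(res, 0, 0);
        /* ... append the len bytes at PQgetvalue(res, 0, 0) to the local file ... */
        PQclear(res);
        if (len < chunk)
            break;                      /* short read: end of file */
        offset += chunk;
    }
    PQfinish(conn);
    return 0;
}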
>>
>> Yeah, you're right.
>>
>> I found that pg_rewind creates a temporary table to fetch the file in chunks.
>> This would prevent pg_rewind from using the *hot standby* server as a source
>> server at all. This is of course a limitation of pg_rewind, but we might want
>> to alleviate it in the future.
>
> This is something that a replication command could address properly.
>
 2. Many users may not allow a remote client to connect to the
  PostgreSQL server as a superuser for some security reasons. IOW,
  there would be no entry in pg_hba.conf for such a connection.
  In this case, pg_rewind always fails because pg_read_file() needs
  superuser privileges. No?

 I'm tempted to implement the replication command version of
  pg_read_file(). That is, it would read and send the data like the BASE_BACKUP
  replication command does...
>>>
>>>
>>> Yeah, that would definitely be nice. Peter suggested it back in January
>>> (http://www.postgresql.org/message-id/54ac4801.7050...@gmx.net). I think
>>> it's way too late to do that for 9.5, however. I'm particularly worried that
>>> if we design the required API in a rush, we're not going to get it right,
>>> and will have to change it again soon. That might be difficult in a minor
>>> release. Using pg_read_file() and friends is quite flexible, even though we
>>> just found out that they're not quite flexible enough right now (the ENOENT
>>> problem).
>>
>> I agree that it's too late to do what I said...
>>
>> But just using pg_read_file() cannot address the #2 problem that I pointed
>> out in my previous email. Also, requiring superuser privileges for pg_rewind
>> really conflicts with the motivation for adding the replication privilege.
>>
>> So should we change pg_read_file() so that even a replication user can
>> read the file?
>
> From the security perspective, a replication user can take a base
> backup, so it can already easily retrieve the contents of PGDATA. Hence
> I guess that it would be fine. However, what about cases where
> pg_hba.conf authorizes access for a given replication user via psql and
> blocks it for the replication protocol? We could say that the OP should
> not give out replication access that easily, but in this case the user
> would have access to the contents of PGDATA even if he should not. Is
> that unrealistic?

The most realistic case is that both the source and target servers have
a pg_hba.conf containing the following authentication setting
regarding replication. That is, each server allows the other to use the
replication user to connect via the replication protocol.

# TYPE  DATABASE  USER ADDRESSMETHOD
host  replication  repuser  X.X.X.X/Y  md5

This case makes me think that allowing even a replication user to
call pg_read_file() may not be a good solution for us. Because
in that case the replication user

Re: [HACKERS] checkpointer continuous flushing

2015-06-23 Thread Amit Kapila
On Tue, Jun 23, 2015 at 10:29 AM, Fabien COELHO  wrote:

>
> Hello Amit,
>
>  medium2: scale=300 shared_buffers=5GB checkpoint_timeout=30min
>>>   max_wal_size=4GB warmup=1200 time=7500
>>>
>>>   flsh |  full speed tps   | percent of late tx, 4 clients
>>>   /srt |  1 client   |  4 clients  |   100 |   200 |   400 |
>>>N/N | 173 +- 289* | 198 +- 531* | 27.61 | 43.92 | 61.16 |
>>>N/Y | 458 +- 327* | 743 +- 920* |  7.05 | 14.24 | 24.07 |
>>>Y/N | 169 +- 166* | 187 +- 302* |  4.01 | 39.84 | 65.70 |
>>>Y/Y | 546 +- 143  | 681 +- 459  |  1.55 |  3.51 |  2.84 |
>>>
>>> The effect of sorting is very positive (+150% to 270% tps). On this run,
>>>
>> flushing has a positive (+20% with 1 client) or negative (-8% with 4
>> clients) effect on throughput, and late transactions are reduced by 92-95% when
>> both options are activated.
>>
>> Why is there a dip in performance with multiple clients,
>>
>
> I'm not sure I see the "dip". The performance is better with 4 clients
> than with 1 client?
>

What do you mean by "negative (-8% with 4 clients) on throughput"
in the above sentence?  I thought by that you meant that there is a dip
in TPS with the patch as compared to HEAD at 4 clients.

Also, I am not completely sure what "+-" means in your data above?


>  can it be due to the reason that we started doing more stuff after holding
>> the bufhdr lock in the code below?
>>
>
> I think it is very unlikely that the buffer being locked would be
> simultaneously requested by one of the 4 clients for an UPDATE, so I do not
> think it should have a significant impact.
>
>  BufferSync() [...]
>>
>
>  BufferSync()
>> {
>> ..
>> - buf_id = StrategySyncStart(NULL, NULL);
>> - num_to_scan = NBuffers;
>> + active_spaces = nb_spaces;
>> + space = 0;
>>  num_written = 0;
>> - while (num_to_scan-- > 0)
>> +
>> + while (active_spaces != 0)
>> ..
>> }
>>
>> The changed code doesn't seem to give any consideration to the
>> clock-sweep point
>>
>
> Indeed.
>
>  which might not be helpful for cases when the checkpoint could have flushed
>> soon-to-be-recycled buffers. I think flushing the sorted buffers w.r.t.
>> tablespaces is a good idea, but not giving any preference to the clock-sweep
>> point seems to me to mean that we would lose in some cases with this new change.
>>
>
> I do not see how to do both, as these two orders seem more or less
> unrelated?


I understand your point and I also don't have any specific answer
for it at this moment; the point of worry is that it should not lead
to degradation of certain cases as compared to the current algorithm.
The workload where it could have an effect is when your data doesn't fit
in shared buffers, but can fit in RAM.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


[HACKERS] Foreign join pushdown vs EvalPlanQual

2015-06-23 Thread Etsuro Fujita
Hi,

While reviewing the foreign join pushdown core patch, I noticed that the
patch doesn't perform an EvalPlanQual recheck properly.  An example
that crashes the server is shown below (it uses the postgres_fdw
patch [1]).  I think the reason for that is that the ForeignScan node
performing the foreign join remotely has scanrelid = 0, while
ExecScanFetch assumes that its scan node has scanrelid > 0.

I think this is a bug.  I've not figured out how to fix this yet, but I
think we would also need another plan that evaluates the join locally
for the test tuples for EvalPlanQual.  Maybe I'm missing something, though.
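To illustrate the failure mode in isolation (this is a simplified standalone
model, not the executor source; all the names here are stand-ins): an EPQ-style
lookup keyed on scanrelid - 1 has no valid slot when a pushed-down join uses
scanrelid = 0.

#include <stdbool.h>
#include <stdio.h>

#define NUM_RELS 3

/* Simplified stand-in for the executor state involved in an EPQ recheck. */
typedef struct FakeEPQState
{
    bool tuple_set[NUM_RELS];   /* one test-tuple flag per base relation (1-based scanrelid) */
} FakeEPQState;

static bool
epq_has_test_tuple(FakeEPQState *epq, unsigned scanrelid)
{
    /* A plain base scan has scanrelid >= 1; a pushed-down foreign join has 0 */
    if (scanrelid == 0)
    {
        fprintf(stderr, "no per-relation slot for a joined ForeignScan\n");
        return false;           /* the real code has no such guard, hence the crash */
    }
    return epq->tuple_set[scanrelid - 1];
}

int
main(void)
{
    FakeEPQState epq = {{true, false, true}};

    printf("%d\n", epq_has_test_tuple(&epq, 1));   /* fine */
    printf("%d\n", epq_has_test_tuple(&epq, 0));   /* the problematic case */
    return 0;
}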

Create an environment:

postgres=# create table tab (a int, b int);
CREATE TABLE
postgres=# create foreign table foo (a int) server myserver options
(table_name 'foo');
CREATE FOREIGN TABLE
postgres=# create foreign table bar (a int) server myserver options
(table_name 'bar');
CREATE FOREIGN TABLE
postgres=# insert into tab values (1, 1);
INSERT 0 1
postgres=# insert into foo values (1);
INSERT 0 1
postgres=# insert into bar values (1);
INSERT 0 1
postgres=# analyze tab;
ANALYZE
postgres=# analyze foo;
ANALYZE
postgres=# analyze bar;
ANALYZE

Run the example:

[Terminal 1]
postgres=# begin;
BEGIN
postgres=# update tab set b = b + 1 where a = 1;
UPDATE 1

[Terminal 2]
postgres=# explain verbose select tab.* from tab, foo, bar where tab.a =
foo.a and foo.a = bar.a for update;

 QUERY PLAN



 LockRows  (cost=100.00..101.18 rows=4 width=70)
   Output: tab.a, tab.b, tab.ctid, foo.*, bar.*
   ->  Nested Loop  (cost=100.00..101.14 rows=4 width=70)
 Output: tab.a, tab.b, tab.ctid, foo.*, bar.*
 Join Filter: (foo.a = tab.a)
 ->  Seq Scan on public.tab  (cost=0.00..1.01 rows=1 width=14)
   Output: tab.a, tab.b, tab.ctid
 ->  Foreign Scan  (cost=100.00..100.08 rows=4 width=64)
   Output: foo.*, foo.a, bar.*, bar.a
   Relations: (public.foo) INNER JOIN (public.bar)
   Remote SQL: SELECT l.a1, l.a2, r.a1, r.a2 FROM (SELECT
ROW(l.a9), l.a9 FROM (SELECT a a9 FROM public.foo FOR UPDATE) l) l (a1,
a2) INNER
JOIN (SELECT ROW(r.a9), r.a9 FROM (SELECT a a9 FROM public.bar FOR
UPDATE) r) r (a1, a2) ON ((l.a2 = r.a2))
(11 rows)

postgres=# select tab.* from tab, foo, bar where tab.a = foo.a and foo.a
= bar.a for update;

[Terminal 1]
postgres=# commit;
COMMIT

[Terminal 2]
(After the commit in Terminal 1, Terminal 2 will show the following.)
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>

Best regards,
Etsuro Fujita

[1]
http://www.postgresql.org/message-id/CAEZqfEe9KGy=1_wagh2rgzpg0o4pqgd+iauyaj8wtze+cyj...@mail.gmail.com




Re: [HACKERS] git push hook to check for outdated timestamps

2015-06-23 Thread Noah Misch
On Sun, Jun 14, 2015 at 12:37:00PM +0200, Magnus Hagander wrote:
> On Fri, Jun 12, 2015 at 4:33 PM, Tom Lane  wrote:
> > Andrew Dunstan  writes:
> > > On 06/12/2015 09:31 AM, Robert Haas wrote:
> > >> Could we update our git hook to refuse a push of a new commit whose
> > >> timestamp is more than, say, 24 hours in the past?  Our commit history
> > >> has some timestamps in it now that are over a month off, and it's
> > >> really easy to do, because when you rebase a commit, it keeps the old
> > >> timestamp.  If you then merge or cherry-pick that into an official
> > >> branch rather than patch + commit, you end up with this problem unless
> > >> you are careful to fix it by hand.  It would be nice to prevent
> > >> further mistakes of this sort, as they create confusion.
> >
> > > I think 24 hours is probably fairly generous,
> >
> > Yeah ... if we're going to try to enforce a linear-looking history, ISTM
> > the requirement ought to be "newer than the latest commit on the same
> > branch".  Perhaps that would be unduly hard to implement though.
> >
> 
> From a quick look at our existing script, I think that's doable, but I'd
> have to do some more detailed verification before I'm sure. And we'd have
> to figure out some way to deal with a push with multiple commits in it, but
> it should certainly be doable if the first one is.
> 
> Would we in that case want to enforce linearity *and* recent-ness, or just
> linearity? as in, do we care about the commit time even if it doesn't
> change the order?

If a recency check is essentially free, let's check both.  Otherwise,
enforcing linearity alone is a 95% solution that implicitly bounds recency.

> > FWIW, our git_changelog script tries to avoid this problem by paying
> > attention to CommitDate not Date.  But I agree that it's confusing when
> > those fields are far apart.
> >
> 
> That brings it back to the enforcing - would we want to enforce both those?

May as well.  AuthorDate is the main source of trouble.  You would need to go
out of your way (e.g. git filter-branch) to push an old CommitDate, but let's
check it just the same.




Re: [HACKERS] how is a query passed between a coordinator and a datanode

2015-06-23 Thread Robert Haas
On Tue, Jun 23, 2015 at 5:07 AM, Rui Hai Jiang  wrote:
> I'm trying to figure out how a query and its result are passed between a 
> coordinator and a datanode. I know there are many messages passed between 
> them to finish a query.
>
>
> I did a test against the coordinator by adding a row to a table and the sql 
> was, insert into hg1(id, name) values(1,'tom').
>
> I found a command 'P' was sent from the coordinator to a datanode and there 
> was a remote statement as following,
>
>
> stmt_name=p_1_25af_f
> query_string=Remote Subplan
> plan_string={REMOTESTMT :commandType 3 :hasReturning false ...}
>
>
> My questions are,
>  1-does the coordinator use the remote statement to tell a datanode what to 
> do? If so, how is the plan string created by the coordinator and how is the 
> plan_string parsed by the datanode?
>
>  2-if there are multiple rows in the result of the query, how are the rows of 
> data passed from the datanode to the coordinator? Does the datanode just send 
> all the rows of data to the coordinator? Or does the coordinator get each row of 
> data by sending a query?
>
>
>  Thank you very much!

This is probably an appropriate question for postgres-xl-developers,
but not pgsql-hackers.  Those concepts do not exist in PostgreSQL,
only Postgres-XC or one of its proliferating set of forks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] less log level for success dynamic background workers for 9.5

2015-06-23 Thread Craig Ringer
On 24 June 2015 at 03:23, Jim Nasby  wrote:
> On 6/23/15 12:21 PM, Tom Lane wrote:
>>
>> I concur: if we're to have a flag at all, it should work as Alvaro says.
>>
>> However, I'm not real sure we need a flag.  I think the use-case of
>> wanting extra logging for a bgworker under development is unlikely to be
>> satisfied very well by just causing existing start/stop logging messages
>> to come out at higher priority.  You're likely to be wanting to log other,
>> bgworker-specific, events, and so you'll probably end up writing a bunch
>> of your own elog calls anyway (which you'll eventually remove, #ifdef out,
>> or decrease the log levels of).
>
>
> FWIW, I have this problem *constantly* with plpgsql. I put RAISE DEBUGs in,
> but once you have those in enough places SET client_min_messages=debug
> becomes next to useless because of the huge volume of spew.
>
> What I'd like is a way to add an identifier to ereport/RAISE so you could
> turn on individual reports. If we had that we'd just make these particular
> ereports DEBUG1 and not worry about it because you could easily turn them on
> when needed.

So, log identifiers/classifiers, essentially.

Message numbers have been discussed before with regards to core and
rejected consistently. I don't think it's really come up in terms of
PLs and user-defined functions.

I've certainly had similar issues to you w.r.t. debug messages from
user-level code, and wanted to be able to enable one particular log
line, all log output from a particular function, or all log output
from a particular extension / all functions in a schema.

I think it's a whole separate topic to Pavel's original proposal
though, and it really merits a separate thread. For Pavel's issue I'm all
in favour of just changing the log message; I think LOG is way too
high for something that's an internal implementation detail.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] 9.5 release notes

2015-06-23 Thread Robert Haas
On Tue, Jun 23, 2015 at 5:48 PM, Kevin Grittner  wrote:
> Andres Freund  wrote:
>>>  
>>>   
>>>Improve concurrent locking and buffer scan performance (Andres
>>>Freund, Kevin Grittner)
>>>   
>>>  
>>
>> If this is ab5194e6f, I don't think it makes sense to mention "buffer
>> scan" - it's just any lwlock, and buffer locks aren't the primary
>> benefit (ProcArrayLock, buffer mapping lock probably are that). I also
>
>> don't think Kevin was involved?
>
> It seems likely that 2ed5b87f9 was combined with something else in
> this reference.  By reducing buffer pins and buffer content locking
> during btree index scans it shows a slight performance gain in
> btree scans and avoids some blocking of btree index vacuuming.

I think maybe we should separate that back out.  The list needs to be
user-accessible, but if it's hard to understand what it's referring
to, that's not good either.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] upper planner path-ification

2015-06-23 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas
> Sent: Tuesday, June 23, 2015 10:18 PM
> To: Kaigai Kouhei(海外 浩平)
> Cc: David Rowley; pgsql-hackers@postgresql.org; Tom Lane
> Subject: Re: [HACKERS] upper planner path-ification
> 
> On Tue, Jun 23, 2015 at 4:41 AM, Kouhei Kaigai  wrote:
> > So, unless we find a solution around the planner, 2-phase aggregation 
> > is
> > like a curry without rice
> 
> Simon and I spoke with Tom about this upper planner path-ification
> problem at PGCon, and he indicated that he intended to work on it
> soon, and that we should bug him about it if he doesn't.
> 
> I'm pleased to hear this, as I seem to have a special talent in that area.
>
It's fantastic support arm.

So, it may be more productive to put my efforts into investigating
David Rowley's patch, based on the assumption that we can add
partial/final-aggregation path nodes just like scan/join paths.

My interest also rests on his infrastructure.
  https://cs.uwaterloo.ca/research/tr/1993/46/file.pdf
It is a somewhat old paper, but a good starting point.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 



[HACKERS] Oh, this is embarrassing: init file logic is still broken

2015-06-23 Thread Tom Lane
Chasing a problem identified by my Salesforce colleagues led me to the
conclusion that my commit f3b5565dd ("Use a safer method for determining
whether relcache init file is stale") is rather borked.  It causes
pg_trigger_tgrelid_tgname_index to be omitted from the relcache init file,
because that index is not used by any syscache.  I had been aware of that
actually, but considered it a minor issue.  It's not so minor though,
because RelationCacheInitializePhase3 marks that index as nailed for
performance reasons, and includes it in NUM_CRITICAL_LOCAL_INDEXES.
That means that load_relcache_init_file *always* decides that the init
file is busted and silently(!) ignores it.  So we're taking a nontrivial
hit in backend startup speed as of the last set of minor releases.

Clearly, my shortcut of defining the init file contents as exactly the set
of syscache-supporting rels was too short.  I think the cleanest fix is
to create a wrapper function in relcache.c that encapsulates the knowledge
that pg_trigger_tgrelid_tgname_index is also to be included in the init
file.  But to prevent this type of problem from recurring, we need some
defenses against the wrapper getting out of sync with reality elsewhere.
I suggest that:

1. write_relcache_init_file ought to bitch if it concludes that a nailed
relation is not to be written to the init file.  Probably an Assert is
enough.

2. load_relcache_init_file ought to complain if it takes the "wrong number
of nailed relations" exit path.  Here, it seems like there might be a
chance of the path being taken validly, for example if a minor release
were to decide to nail a rel that wasn't nailed before, or vice versa.
So what I'm thinking about here is an elog(WARNING) complaint, which would
make logic bugs like this pretty visible in development builds.  If we
ever did make a change like that in a minor release, people might get a
one-time WARNING when upgrading, but I think that would be acceptable.
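As a hedged sketch of the wrapper function suggested above (written as
backend-style code; the name and exact shape are assumptions, not a committed
fix), relcache.c could centralize the knowledge like this, and
write_relcache_init_file could then Assert against it per point 1:

#include "postgres.h"

#include "catalog/indexing.h"
#include "utils/syscache.h"

/*
 * Would this relation be written to the relcache init file?  Everything
 * that backs a syscache, plus pg_trigger_tgrelid_tgname_index, which is
 * nailed for performance but not used by any syscache.
 */
static bool
RelationIdIsInInitFile(Oid relationId)
{
    if (relationId == TriggerRelidNameIndexId)
        return true;
    return RelationSupportsSysCache(relationId);
}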

Thoughts, better ideas?

regards, tom lane




Re: [HACKERS] proposal: row_to_array function

2015-06-23 Thread Merlin Moncure
On Tue, Jun 23, 2015 at 3:45 PM, Jim Nasby  wrote:
> On 6/23/15 3:22 PM, Merlin Moncure wrote:
>>
>> I would rephrase that to: "do X to all fields of an object".
>> Array handling is pretty good now (minus arrays of arrays, but arrays
>
>
> Except that still won't make it easy to do something to each element of an
> array in SQL, which I think would be nice to have.

Maybe, or maybe we're framing the problem incorrectly.  To me, it's
not really all that difficult to do:
select foo(x) from unnest(bar) x;

Unless you have to maintain state inside of foo(), in which case I'd
probably use the highly underutilized 'window over custom aggregate'
technique.

merlin




Re: [HACKERS] 9.5 release notes

2015-06-23 Thread Kevin Grittner
Andres Freund  wrote:
>>  
>>   
>>Improve concurrent locking and buffer scan performance (Andres
>>Freund, Kevin Grittner)
>>   
>>  
>
> If this is ab5194e6f, I don't think it makes sense to mention "buffer
> scan" - it's just any lwlock, and buffer locks aren't the primary
> benefit (ProcArrayLock, buffer mapping lock probably are that). I also

> don't think Kevin was involved?

It seems likely that 2ed5b87f9 was combined with something else in
this reference.  By reducing buffer pins and buffer content locking
during btree index scans it shows a slight performance gain in
btree scans and avoids some blocking of btree index vacuuming.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] Making sure this isn't a new recovery bug ...

2015-06-23 Thread Josh Berkus
Hackers,

At one site, I have some duplicate row corruption in a staging database.
 This database was created by a binary backup from 9.3.5, which was
restored via PITR with a timestamp target to 9.3.5, so known-bad
versions.  The strange thing about the duplicate rows is that they were
all frozen prior to the original binary backup, as far as I can tell.
Here's a sample:

id| b_xmax | b_xmin |b_ctid
--+++--
 48871010 |  0 |  2 | (1529664,1)
 48871010 |  0 |  2 | (1529696,1)
 48871024 |  0 |  2 | (1529697,1)
 48871024 |  0 |  2 | (1529665,1)
 48871033 |  0 |  2 | (1529666,1)
 48871033 |  0 |  2 | (1529698,1)
 48871041 |  0 |  2 | (1529697,2)
 48871041 |  0 |  2 | (1529665,2)
 48871043 |  0 |  2 | (1529698,2)
 48871043 |  0 |  2 | (1529666,2)
 48871049 |  0 |  2 | (1529667,1)
 48871049 |  0 |  2 | (1529699,1)

This is about 1000 rows of a 100m row table.

So my question is, is this still likely to be one of the known multixact
issues?  I'm asking because this whole system will get scrubbed and
replaced with 9.3.9 in a couple days, so if it's a bug we want to
investigate, I need to do forensics now.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] proposal: row_to_array function

2015-06-23 Thread Jim Nasby

On 6/23/15 3:40 PM, Pavel Stehule wrote:

BTW, I think this relates to the desire to be able to do more OO-ish
things in the database. Like "do X to all elements in this array".
And to have actual classes, private members, real arrays of arrays.
It seems like there's a bigger need here that's only being addressed
piecemeal. :/


I would not open this box - and I would not throw away or redesign
almost all of the PostgreSQL type handling system. I am sure that it is not
necessary. A PL can be relatively static if the dynamic part is covered by the
query language. The few features can be implemented without the necessity of
redesigning everything. Still, there are other PLs - and we are not forced to
design a new Perl, JavaScript, ...


By that argument why are we putting it into plpgsql either? You can 
easily do the stuff we've been talking about in plperl (and presumably 
most other pl's). So why mess around with adding it to plpgsql?


More importantly, these are things that would be extremely useful at the 
SQL level. When it comes to records for example, we frequently know 
exactly what's in them, so why do we force users to statically specify 
that at the SQL level? This is why we don't support pivot tables (which 
in the BI world is a Big Deal).


I think it's a mistake to try and solve this strictly through plpgsql 
without recognizing the larger desire and trying to move the ball in that 
direction. I'm not saying a first effort should boil the ocean, but if 
we keep piecemealing this without more thought we're going to keep 
getting more warts (like a lot of the gotchas we have with arrays).

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] proposal: row_to_array function

2015-06-23 Thread Jim Nasby

On 6/23/15 3:22 PM, Merlin Moncure wrote:

I would rephrase that to: "do X to all fields of an object".
Array handling is pretty good now (minus arrays of arrays, but arrays


Except that still won't make it easy to do something to each element of 
an array in SQL, which I think would be nice to have.



of objects containing arrays is 'good enough' for most real world
cases).  We've suffered for a while now with hstore/json as a
temporary container to handle operations that are not well supported
by postgres's particularly strongly typed flavor of SQL.   The "OO" of
postgres has been gradually diluting away; it's not an 'object
relational' database anymore and the OO features, very much a product
of the silly 90's OO hysteria, have been recast into more useful
features like inheritance and/or pruned back.


Admittedly I've never played with an OO database, but I think our data 
features are pretty good [1]. Where I do think we can improve though is 
developing/coding things in the database. For example, I'd love to have 
the equivalent of a class. Perhaps that could be accomplished by 
allowing multiple instances of an extension. I'd also like stronger 
support for private objects (permissions don't really fit that bill).



I don't mind having to push everything to jsonb and back for tuple
manipulation and I expect that's how these types of things are going
to be done moving forwards. jsonb has clearly caught a bid judging by
what I'm reading in the blogosphere and will continue to accrete
features like this.


I think it's unfortunate to lose the strong typing that we have. That 
can be especially important for something like numbers (was it 
originally a float or a numeric?). But maybe JSON is good enough.



[1] The one OO-ish data feature I'd like is the ability to de-reference 
a foreign key "pointer". So if


CREATE TABLE b( a_id int REFERENCES a);

then have

SELECT a_id.some_field FROM b;

transform to

SELECT a.some_field FROM b JOIN a ...;
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] proposal: row_to_array function

2015-06-23 Thread Pavel Stehule
2015-06-23 21:57 GMT+02:00 Jim Nasby :

> On 6/23/15 9:45 AM, Pavel Stehule wrote:
>
>>
>> 2015-06-23 1:56 GMT+02:00 Jim Nasby:
>>
>>
>> On 6/22/15 2:46 AM, Pavel Stehule wrote:
>>
>>
>> FOREACH key, val IN RECORD myrow
>> LOOP
>> IF pg_typeof(val) IN ('int4', 'double precision', 'numeric')
>> THEN
>>   val := val + 1; -- these variables can be mutable
>>   -- or maybe in futore
>>  myrow[key] := val + 1;
>> END IF;
>> END LOOP;
>>
>> What is important - "val" is an automatic variable, and it can have a
>> different type in each step.
>>
>> It is a little bit strange, but impossible to solve, so we cannot
>> support row[var] as a right value (without immutable casting). But
>> we can
>> do it as a left value.
>>
>>
>> Actually, you can (theoretically) solve it for the right value as
>> well if val is an actual type and you have operators on that
>> type that know to search for a specific operator given the actual
>> types that are involved. So if val is int4, val + 1 becomes int4 +
>> int4.
>>
>> The problem I've run into with this is by the time you've added
>> enough casts to make this workable you've probably created a
>> situation where val + something is going to recurse back to itself.
>> I've partially solved this in [1], and intend to finish it by
>> calling back in via SPI to do the final resolution, the same way the
>> RI triggers do.
>>
>> What would be a lot better is if we had better control over function
>> and operator resolution.
>>
>> [1]
>>
>> https://github.com/decibel/variant/commit/2b99067744a405da8a325de1ebabd213106f794f#diff-8aa2db4a577ee4201d6eb9041c2a457eR846
>>
>>
>> The solution of dynamic operators changes philosophy about 180° - and I
>> am afraid about performance.
>>
>> Now if I am thinking about possibilities - probably it is solvable on
>> the right side too. It needs two steps to be solved:
>>
>> 1. parametrized record reference syntax - something like SELECT $1[$]
>> 2. the possibility to throw away the plan cache, if the result has a different
>> type than is expected in the cache.
>>
>
> Well, the other option is we allow for cases where we don't know in
> advance what the type will be. That would handle this, JSON, variant, and
> possibly some other scenarios.
>
> BTW, I think this relates to the desire to be able to do more OO-ish
> things in the database. Like "do X to all elements in this array". And to
> have actual classes, private members, real arrays of arrays. It seems like
> there's a bigger need here that's only being addressed piecemeal. :/


I would not open this box - and I would not throw away or redesign almost
all of the PostgreSQL type handling system. I am sure that it is not necessary. A PL
can be relatively static if the dynamic part is covered by the query language. The few
features can be implemented without the necessity of redesigning everything. Still, there
are other PLs - and we are not forced to design a new Perl, JavaScript, ...


> --
> Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
> Data in Trouble? Get it in Treble! http://BlueTreble.com
>


Re: [HACKERS] pg_stat_*_columns?

2015-06-23 Thread Robert Haas
On Tue, Jun 23, 2015 at 3:47 PM, Andres Freund  wrote:
> On 2015-06-22 21:05:52 -0400, Robert Haas wrote:
>> Interesting idea.  We could consider the set of stats files a database
>> unto itself and reserve a low-numbered OID for it.  The obvious thing
>> to do is use the database's OID as the relfilenode, but then how do
>> you replace the stats file?
>
> The relmapper infrastructure should be able to take care of that.

How?  I think it assumes that the number of mapped entries is pretty small.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Should we back-patch SSL renegotiation fixes?

2015-06-23 Thread Robert Haas
On Tue, Jun 23, 2015 at 3:48 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Tue, Jun 23, 2015 at 2:33 PM, Tom Lane  wrote:
>>> I do not know at this point whether these behaviors are really the same
>>> bug or not, but I wonder whether it's time to consider back-patching the
>>> renegotiation fixes we did in 9.4.  Specifically, I think maybe we should
>>> back-patch 31cf1a1a4, 86029b31e, and 36a3be654.  (There are more changes
>>> in master, but since those haven't yet shipped in any released branch,
>>> and there's been a lot of other rework in the same area, those probably
>>> are not back-patch candidates.)
>>>
>>> Thoughts?
>
>> I have no clear idea how safe it is to back-port these fixes.
>
> Well, it would mean that pre-9.5 branches all behave the same, which
> would be an improvement in my book.  Also, ISTM that the 9.4 code
> for renegotiation assumes a whole lot less than prior branches about
> OpenSSL's internal behavior; so it ought to be more robust, even if
> some bugs remain.
>
>> Just as a point of reference, we had a customer hit a problem similar
>> to bug #12769 on 9.3.x.  I think (but am not sure) that 272923a0a may
>> have been intended to fix that issue.  In a quick search, I didn't
>> find any other complaints about renegotiation-related issues from our
>> customers.
>
> The problem with trying to adopt code from HEAD is that it probably
> depends on the rather invasive changes explained here:
> http://www.postgresql.org/message-id/20150126101405.ga31...@awork2.anarazel.de
> Even assuming that there's no dependency on the immediate-interrupt
> changes, I'm afraid to back-patch anything that invasive.

What commits actually resulted from that?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] proposal: row_to_array function

2015-06-23 Thread Merlin Moncure
On Tue, Jun 23, 2015 at 2:57 PM, Jim Nasby  wrote:
> On 6/23/15 9:45 AM, Pavel Stehule wrote:
>> 1. parametrized record reference syntax - something like SELECT $1[$]
>> 2. the possibility to throw away the plan cache, if the result has a different
>> type than is expected in the cache.
>
>
> Well, the other option is we allow for cases where we don't know in advance
> what the type will be. That would handle this, JSON, variant, and possibly
> some other scenarios.
>
> BTW, I think this relates to the desire to be able to do more OO-ish things
> in the database. Like "do X to all elements in this array". And to have
> actual classes, private members, real arrays of arrays. It seems like
> there's a bigger need here that's only being addressed piecemeal. :/

I would rephrase that to: "do X to all fields of an object".
Array handling is pretty good now (minus arrays of arrays, but arrays
of objects containing arrays is 'good enough' for most real world
cases).  We've suffered for a while now with hstore/json as a
temporary container to handle operations that are not well supported
by postgres's particularly strongly typed flavor of SQL.   The "OO" of
postgres has been gradually diluting away; it's not an 'object
relational' database anymore and the OO features, very much a product
of the silly 90's OO hysteria, have been recast into more useful
features like inheritance and/or pruned back.

I don't mind having to push everything to jsonb and back for tuple
manipulation and I expect that's how these types of things are going
to be done moving forwards. jsonb has clearly caught a bid judging by
what I'm reading in the blogosphere and will continue to accrete
features like this.

merlin




Re: [HACKERS] checkpointer continuous flushing

2015-06-23 Thread Jim Nasby

On 6/22/15 11:59 PM, Fabien COELHO wrote:

which might not be helpful for cases when the checkpoint could have
flushed soon-to-be-recycled buffers. I think flushing the sorted
buffers w.r.t. tablespaces is a good idea, but not giving any
preference to the clock-sweep point seems to me to mean that we would
lose in some cases with this new change.


I do not see how to do both, as these two orders seem more or less
unrelated?  The traditional assumption is that the I/Os are very slow
and they are to be optimized first, so going for buffer ordering to be
nice to the disk looks like the priority.


The point is that it's already expensive for backends to advance the 
clock; if they then have to wait on IO as well it gets REALLY expensive. 
So we want to avoid that.


Other than that though, it is pretty orthogonal, so perhaps another 
indication that the clock should be handled separately from both 
backends and bgwriter...

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] proposal: row_to_array function

2015-06-23 Thread Jim Nasby

On 6/23/15 9:45 AM, Pavel Stehule wrote:


2015-06-23 1:56 GMT+02:00 Jim Nasby <jim.na...@bluetreble.com>:

On 6/22/15 2:46 AM, Pavel Stehule wrote:


FOREACH key, val IN RECORD myrow
LOOP
IF pg_typeof(val) IN ('int4', 'double precision', 'numeric')
THEN
  val := val + 1; -- these variables can be mutable
  -- or maybe in futore
 myrow[key] := val + 1;
END IF;
END LOOP;

What is important - "val" is an automatic variable, and it can have a
different type in each step.

It is a little bit strange, but impossible to solve, so we cannot
support row[var] as a right value (without immutable casting). But
we can
do it as a left value.


Actually, you can (theoretically) solve it for the right value as
well if val is an actual type and you have operators on that
type that know to search for a specific operator given the actual
types that are involved. So if val is int4, val + 1 becomes int4 + int4.

The problem I've run into with this is by the time you've added
enough casts to make this workable you've probably created a
situation where val + something is going to recurse back to itself.
I've partially solved this in [1], and intend to finish it by
calling back in via SPI to do the final resolution, the same way the
RI triggers do.

What would be a lot better is if we had better control over function
and operator resolution.

[1]

https://github.com/decibel/variant/commit/2b99067744a405da8a325de1ebabd213106f794f#diff-8aa2db4a577ee4201d6eb9041c2a457eR846


The solution of dynamic operators changes philosophy about 180° - and I
am afraid about performance.

Now if I am thinking about possibilities - probably it is solvable on
the right side too. It needs two steps to be solved:

1. parametrized record reference syntax - something like SELECT $1[$]
2. the possibility to throw away the plan cache, if the result has a different
type than is expected in the cache.


Well, the other option is we allow for cases where we don't know in 
advance what the type will be. That would handle this, JSON, variant, 
and possibly some other scenarios.


BTW, I think this relates to the desire to be able to do more OO-ish 
things in the database. Like "do X to all elements in this array". And 
to have actual classes, private members, real arrays of arrays. It seems 
like there's a bigger need here that's only being addressed piecemeal. :/

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] Should we back-patch SSL renegotiation fixes?

2015-06-23 Thread Tom Lane
Alvaro Herrera  writes:
>> On Tue, Jun 23, 2015 at 2:33 PM, Tom Lane  wrote:
>>> I do not know at this point whether these behaviors are really the same
>>> bug or not, but I wonder whether it's time to consider back-patching the
>>> renegotiation fixes we did in 9.4.  Specifically, I think maybe we should
>>> back-patch 31cf1a1a4, 86029b31e, and 36a3be654.

> Yes, +1 for backpatching.  Don't forget 5674460b and b1aebbb6.

Huh?  5674460b is ancient, and we concluded that b1aebbb6 didn't represent
anything much more than cosmetic fixes.

regards, tom lane




Re: [HACKERS] Should we back-patch SSL renegotiation fixes?

2015-06-23 Thread Alvaro Herrera
Robert Haas wrote:
> On Tue, Jun 23, 2015 at 2:33 PM, Tom Lane  wrote:

> > I do not know at this point whether these behaviors are really the same
> > bug or not, but I wonder whether it's time to consider back-patching the
> > renegotiation fixes we did in 9.4.  Specifically, I think maybe we should
> > back-patch 31cf1a1a4, 86029b31e, and 36a3be654.  (There are more changes
> > in master, but since those haven't yet shipped in any released branch,
> > and there's been a lot of other rework in the same area, those probably
> > are not back-patch candidates.)

Yes, +1 for backpatching.  Don't forget 5674460b and b1aebbb6.

I could reproduce problems trivially with COPY in psql without that and
a small renegotiation limit, as I recall.

> > Thoughts?
> 
> I have no clear idea how safe it is to back-port these fixes.
> 
> Just as a point of reference, we had a customer hit a problem similar
> to bug #12769 on 9.3.x.  I think (but am not sure) that 272923a0a may
> have been intended to fix that issue.

Maybe we should *also* backpatch that, then (and if so, then also its
fixup 1c2b7c087).  I do not think that the failure was introduced by
the fixes cited above.

> In a quick search, I didn't find any other complaints about
> renegotiation-related issues from our customers.

Other issues Andres was investigating seemed related to nonblocking
connections (which as I recall are used mostly by replication code).
Bug #12769 contained a link to the previous discussion.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Should we back-patch SSL renegotiation fixes?

2015-06-23 Thread Tom Lane
Robert Haas  writes:
> On Tue, Jun 23, 2015 at 2:33 PM, Tom Lane  wrote:
>> I do not know at this point whether these behaviors are really the same
>> bug or not, but I wonder whether it's time to consider back-patching the
>> renegotiation fixes we did in 9.4.  Specifically, I think maybe we should
>> back-patch 31cf1a1a4, 86029b31e, and 36a3be654.  (There are more changes
>> in master, but since those haven't yet shipped in any released branch,
>> and there's been a lot of other rework in the same area, those probably
>> are not back-patch candidates.)
>> 
>> Thoughts?

> I have no clear idea how safe it is to back-port these fixes.

Well, it would mean that pre-9.5 branches all behave the same, which
would be an improvement in my book.  Also, ISTM that the 9.4 code
for renegotiation assumes a whole lot less than prior branches about
OpenSSL's internal behavior; so it ought to be more robust, even if
some bugs remain.

> Just as a point of reference, we had a customer hit a problem similar
> to bug #12769 on 9.3.x.  I think (but am not sure) that 272923a0a may
> have been intended to fix that issue.  In a quick search, I didn't
> find any other complaints about renegotiation-related issues from our
> customers.

The problem with trying to adopt code from HEAD is that it probably
depends on the rather invasive changes explained here:
http://www.postgresql.org/message-id/20150126101405.ga31...@awork2.anarazel.de
Even assuming that there's no dependency on the immediate-interrupt
changes, I'm afraid to back-patch anything that invasive.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_stat_*_columns?

2015-06-23 Thread Andres Freund
On 2015-06-22 21:05:52 -0400, Robert Haas wrote:
> Interesting idea.  We could consider the set of stats files a database
> unto itself and reserve a low-numbered OID for it.  The obvious thing
> to do is use the database's OID as the relfilenode, but then how do
> you replace the stats file?

The relmapper infrastructure should be able to take care of that.

Regards,

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_stat_*_columns?

2015-06-23 Thread Jim Nasby

On 6/22/15 8:05 PM, Robert Haas wrote:

>> In totally different crazy way we could just use the existing buffer
>> manager we have and simply put the stats file in
>> shared_buffers. Inventing a per-database relfilenode that doesn't
>> conflict doesn't seem impossible. With some care it shouldn't be hard to
>> make that stats file readable from all sessions, even if they're not
>> connected to the database (e.g. autovacuum launcher).
>
> Interesting idea.  We could consider the set of stats files a database
> unto itself and reserve a low-numbered OID for it.  The obvious thing
> to do is use the database's OID as the relfilenode, but then how do
> you replace the stats file?


I think one of the biggest use cases for the stats is to collect them 
over time and put them in a database. It's rather tempting to just say 
that's what we should be doing, and provide a knob for how often to 
record the values and how long to keep the data for. That would 
eliminate a lot of these problems.


The one part I don't see how to solve is the case where bad stuff is 
happening *right now* and you don't/can't wait for the next reporting 
interval. Presumably handling that requires all the stuff that's 
discussed already and you might as well just handle the recording in 
user space. But maybe someone has some clever ideas there...

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] less log level for success dynamic background workers for 9.5

2015-06-23 Thread Jim Nasby

On 6/23/15 12:21 PM, Tom Lane wrote:

I concur: if we're to have a flag at all, it should work as Alvaro says.

However, I'm not real sure we need a flag.  I think the use-case of
wanting extra logging for a bgworker under development is unlikely to be
satisfied very well by just causing existing start/stop logging messages
to come out at higher priority.  You're likely to be wanting to log other,
bgworker-specific, events, and so you'll probably end up writing a bunch
of your own elog calls anyway (which you'll eventually remove, #ifdef out,
or decrease the log levels of).


FWIW, I have this problem *constantly* with plpgsql. I put RAISE DEBUGs 
in, but once you have those in enough places SET 
client_min_messages=debug becomes next to useless because of the huge 
volume of spew.


What I'd like is a way to add an identifier to ereport/RAISE so you 
could turn on individual reports. If we had that we'd just make these 
particular ereports DEBUG1 and not worry about it because you could 
easily turn them on when needed.
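
For C-level code there is already a rough equivalent of that selective
switch: an extension can define its own boolean GUC and gate its chatty
elog() calls on it. A minimal sketch, assuming a hypothetical extension
called my_ext (all names here are illustrative, not an existing API):

#include "postgres.h"
#include "fmgr.h"
#include "utils/guc.h"

PG_MODULE_MAGIC;

void _PG_init(void);
void my_ext_do_work(void);

/* Hypothetical switch for this extension's extra chatter. */
static bool my_ext_debug = false;

void
_PG_init(void)
{
	DefineCustomBoolVariable("my_ext.debug",
							 "Emit my_ext's detailed debug messages.",
							 NULL,
							 &my_ext_debug,
							 false,
							 PGC_USERSET,
							 0,
							 NULL, NULL, NULL);
}

/* ... later, at the spot that is currently too chatty ... */
void
my_ext_do_work(void)
{
	if (my_ext_debug)
		elog(LOG, "my_ext: reached the interesting code path");
}

With that in place, SET my_ext.debug = on turns on just those messages
for one session, without raising client_min_messages globally.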

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] btree_gin and BETWEEN

2015-06-23 Thread Tom Lane
Jeff Janes  writes:
> If I use the btree_gin extension to build a gin index on a scalar value, it
> doesn't work well with BETWEEN queries.  It looks like it scans the whole
> index, with the part of the index between the endpoints getting scanned
> twice.  It is basically executed as if "col1 between x and y" were "col1
> between -inf and y and col1 between x and +inf".

> It puts the correct tuples into the bitmap, because whichever inequality is
> not being used to set the query endpoint currently is used as a filter
> instead.

> So I could just not build that index.  But I want it for other reasons, and
> the problem is that the planner thinks the index can implement the BETWEEN
> query efficiently.  So even if it has truly better options available, it
> switches to using a falsely attractive btree_gin index.

> create table foo as select random() as btree, random() as gin from
> generate_series(1,300);
> create index on foo using gin (gin);
> create index on foo using btree (btree);
> explain ( analyze, buffers) select count(*) from foo where btree between
> 0.001 and 0.00105;
> explain ( analyze, buffers) select count(*) from foo where gin between
> 0.001 and 0.00105;

> It would be nice if btree_gin supported BETWEEN and other range queries
> efficiently, or at least if the planner knew it couldn't support them
> efficiently.  But I don't see where to begin on either one of these tasks.
> Is either one of them plausible?

ISTM this is a bug/oversight in the core GIN code: it has multiple
GinScanEntry's for the same index column, but fails to realize that it
could start the index scans at the largest match value of any of those
GinScanEntry's.  This did not matter so much before the partial-match
feature went in; but with a partial match we might be talking about
scanning a large fraction of the index, and it's important to optimize
the start point if possible.  I think there is also a failure to notice
when we could stop the scans because one of a group of AND'ed entries
says no more matches are possible.

I'm not sure where the best place to hack in this consideration would
be, though.  Oleg, Teodor, any thoughts?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Should we back-patch SSL renegotiation fixes?

2015-06-23 Thread Robert Haas
On Tue, Jun 23, 2015 at 2:33 PM, Tom Lane  wrote:
> Those of you who have been following
> http://www.postgresql.org/message-id/flat/1d3bc192-970d-4b70-a5fe-38d2a9f76...@me.com
> are aware that Red Hat shipped a rather broken version of openssl last
> week.  While waiting for them to fix it, I've been poking at the behavior,
> and have found out that PG 9.4 and later are much less badly broken than
> older branches.  In the newer branches you'll see a failure only after
> transmitting 2GB within a session, whereas the older branches fail at
> the second renegotiation attempt, which would typically be 1GB of data
> and could be a lot less.
>
> I do not know at this point whether these behaviors are really the same
> bug or not, but I wonder whether it's time to consider back-patching the
> renegotiation fixes we did in 9.4.  Specifically, I think maybe we should
> back-patch 31cf1a1a4, 86029b31e, and 36a3be654.  (There are more changes
> in master, but since those haven't yet shipped in any released branch,
> and there's been a lot of other rework in the same area, those probably
> are not back-patch candidates.)
>
> Thoughts?

I have no clear idea how safe it is to back-port these fixes.

Just as a point of reference, we had a customer hit a problem similar
to bug #12769 on 9.3.x.  I think (but am not sure) that 272923a0a may
have been intended to fix that issue.  In a quick search, I didn't
find any other complaints about renegotiation-related issues from our
customers.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_rewind failure by file deletion in source server

2015-06-23 Thread Michael Paquier
On Wed, Jun 24, 2015 at 1:40 AM, Fujii Masao  wrote:
> On Tue, Jun 23, 2015 at 11:21 PM, Heikki Linnakangas  wrote:
>> On 06/23/2015 05:03 PM, Fujii Masao wrote:
>>>
>>> On Tue, Jun 23, 2015 at 9:19 PM, Heikki Linnakangas 
>>> wrote:

 On 06/23/2015 07:51 AM, Michael Paquier wrote:
>
>
> So... Attached are a set of patches dedicated at fixing this issue:



 Thanks for working on this!

> - 0001, add if_not_exists to pg_tablespace_location, returning NULL if
> path does not exist
> - 0002, same with pg_stat_file, returning NULL if file does not exist
> - 0003, same with pg_read_*file. I added them to all the existing
> functions for consistency.
> - 0004, pg_ls_dir extended with if_not_exists and include_dot_dirs
> (thanks Robert for the naming!)
> - 0005, as things get complex, a set of regression tests aimed to
> covering those things. pg_tablespace_location is platform-dependent,
> so there are no tests for it.
> - 0006, the fix for pg_rewind, using what has been implemented before.



 With these patches, pg_read_file() will return NULL for any failure to
 open the file, which makes pg_rewind assume that the file doesn't exist
 in the source server, and will remove the file from the destination.
 That's dangerous, those functions should check specifically for ENOENT.
>>>
>>>
>>> I'm wondering if using pg_read_file() to copy the file from source server
>>> is reasonable. ISTM that it has two problems as follows.
>>>
>>> 1. It cannot read very large file like 1GB file. So if such large file was
>>>  created in source server after failover, pg_rewind would not be able
>>>  to copy the file. No?
>>
>>
>> pg_read_binary_file() handles large files just fine. It cannot return more
>> than 1GB in one call, but you can call it several times and retrieve the
>> file in chunks. That's what pg_rewind does, except for reading the control
>> file, which is known to be small.
>
> Yeah, you're right.
>
> I found that pg_rewind creates a temporary table to fetch the file in chunks.
> This would prevent pg_rewind from using the *hot standby* server as a source
> server at all. This is of course a limitation of pg_rewind, but we might want
> to alleviate it in the future.

This is something that a replication command could address properly.

>>> 2. Many users may not allow a remote client to connect to the
>>>  PostgreSQL server as a superuser for some security reasons. IOW,
>>>  there would be no entry in pg_hba.conf for such connection.
>>>  In this case, pg_rewind always fails because pg_read_file() needs
>>>  superuser privilege. No?
>>>
>>> I'm tempted to implement the replication command version of
>>> pg_read_file(). That is, it reads and sends the data like BASE_BACKUP
>>> replication command does...
>>
>>
>> Yeah, that would definitely be nice. Peter suggested it back in January
>> (http://www.postgresql.org/message-id/54ac4801.7050...@gmx.net). I think
>> it's way too late to do that for 9.5, however. I'm particularly worried that
>> if we design the required API in a rush, we're not going to get it right,
>> and will have to change it again soon. That might be difficult in a minor
>> release. Using pg_read_file() and friends is quite flexible, even though we
>> just find out that they're not quite flexible enough right now (the ENOENT
>> problem).
>
> I agree that it's too late to do what I said...
>
> But just using pg_read_file() cannot address the #2 problem that I pointed
> in my previous email. Also requiring a superuser privilege on pg_rewind
> really conflicts with the motivation why we added replication privilege.
>
> So we should change pg_read_file() so that even replication user can
> read the file?

From the security perspective, a replication user can take a base
backup, so it can already easily retrieve the contents of PGDATA. Hence
I guess that it would be fine. However, what about cases where
pg_hba.conf authorizes access to a given replication user via psql and
blocks it for the replication protocol? We could say that the OP should
not give out replication access that easily, but in this case the user
would have access to the contents of PGDATA even though he should not.
Is that unrealistic?

> Or replication user version of pg_read_file() should be implemented?

You mean a new function? How would that differ from authorizing
pg_read_file usage for a replication user?

Honestly, I can live with this superuser restriction in 9.5. And come
back to the replication user restriction in 9.6 once things cool down
a bit.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Should we back-patch SSL renegotiation fixes?

2015-06-23 Thread Tom Lane
Those of you who have been following
http://www.postgresql.org/message-id/flat/1d3bc192-970d-4b70-a5fe-38d2a9f76...@me.com
are aware that Red Hat shipped a rather broken version of openssl last
week.  While waiting for them to fix it, I've been poking at the behavior,
and have found out that PG 9.4 and later are much less badly broken than
older branches.  In the newer branches you'll see a failure only after
transmitting 2GB within a session, whereas the older branches fail at
the second renegotiation attempt, which would typically be 1GB of data
and could be a lot less.

I do not know at this point whether these behaviors are really the same
bug or not, but I wonder whether it's time to consider back-patching the
renegotiation fixes we did in 9.4.  Specifically, I think maybe we should
back-patch 31cf1a1a4, 86029b31e, and 36a3be654.  (There are more changes
in master, but since those haven't yet shipped in any released branch,
and there's been a lot of other rework in the same area, those probably
are not back-patch candidates.)

Thoughts?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_stat_*_columns?

2015-06-23 Thread Magnus Hagander
On Tue, Jun 23, 2015 at 3:01 AM, Robert Haas  wrote:

> On Sun, Jun 21, 2015 at 11:43 AM, Magnus Hagander 
> wrote:
> > On Sat, Jun 20, 2015 at 11:55 PM, Robert Haas 
> wrote:
> >> On Sat, Jun 20, 2015 at 7:01 PM, Tom Lane  wrote:
> >> >> But if the structure
> >> >> got too big to map (on a 32-bit system), then you'd be sort of hosed,
> >> >> because there's no way to attach just part of it.  That might not be
> >> >> worth worrying about, but it depends on how big it's likely to get -
> a
> >> >> 32-bit system is very likely to choke on a 1GB mapping, and maybe
> even
> >> >> on a much smaller one.
> >> >
> >> > Yeah, I'm quite worried about assuming that we can map a data
> structure
> >> > that might be of very significant size into shared memory on 32-bit
> >> > machines.  The address space just isn't there.
> >>
> >> Considering the advantages of avoiding message queues, I think we
> >> should think a little bit harder about whether we can't find some way
> >> to skin this cat.  As I think about this a little more, I'm not sure
> >> there's really a problem with one stats DSM per database.  Sure, the
> >> system might have 100,000 databases in some crazy pathological case,
> >> but the maximum number of those that can be in use is bounded by
> >> max_connections, which means the maximum number of stats file DSMs we
> >> could ever need at one time is also bounded by max_connections.  There
> >> are a few corner cases to think about, like if the user writes a
> >> client that connects to all 100,000 databases in very quick
> >> succession, we've got to jettison the old DSMs fast enough to make
> >> room for the new DSMs before we run out of slots, but that doesn't
> >> seem like a particularly tough nut to crack.  If the stats collector
> >> ensures that it never attaches to more than MaxBackends stats DSMs at
> >> a time, and each backend ensures that it never attaches to more than
> >> one stats DSM at a time, then 2 * MaxBackends stats DSMs is always
> >> enough.  And that's just a matter of bumping
> >> PG_DYNSHMEM_SLOTS_PER_BACKEND from 2 to 4.
> >>
> >> In more realistic cases, it will probably be normal for many or all
> >> backends to be connected to the same database, and the number of stats
> >> DSMs required will be far smaller.
> >
> > What about a combination in the line of something like this: the stats
> > collector keeps the statistics in local memory as before. But when a
> > backend needs to get a snapshot of its data, it uses a shared memory
> > queue to request it. What the stats collector does in this case is
> > allocate a new DSM, copy the data into that DSM, and hand the DSM over
> > to the backend. At this point the stats collector can forget about it,
> > and it's up to the backend to get rid of it when it's done with it.
>
> Well, there seems to be little point in having the stats collector
> forget about a DSM that it could equally well have shared with the
> next guy who wants a stats snapshot for the same database.  That case
> is surely *plenty* common enough to be worth optimizing for.
>
>
Right, we only need to drop it once we have received a stats message for it,
meaning something has changed. And possibly combine that with a minimum time
as well, as we have now, if we want to limit the potential churn.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] less log level for success dynamic background workers for 9.5

2015-06-23 Thread Alvaro Herrera
Robert Haas wrote:
> On Tue, Jun 23, 2015 at 1:21 PM, Tom Lane  wrote:

> > However, I'm not real sure we need a flag.  I think the use-case of
> > wanting extra logging for a bgworker under development is unlikely to be
> > satisfied very well by just causing existing start/stop logging messages
> > to come out at higher priority.  You're likely to be wanting to log other,
> > bgworker-specific, events, and so you'll probably end up writing a bunch
> > of your own elog calls anyway (which you'll eventually remove, #ifdef out,
> > or decrease the log levels of).
> 
> Yeah.  So let's just change it.

+1

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] less log level for success dynamic background workers for 9.5

2015-06-23 Thread Robert Haas
On Tue, Jun 23, 2015 at 1:21 PM, Tom Lane  wrote:
> Alvaro Herrera  writes:
>> Robert Haas wrote:
>>> Well, if the flag is BGWORKER_QUIET, then the default behavior remains
>>> unchanged, but when that flag is used, the log level is reduced to
>>> DEBUG1.  That has the advantage of not breaking backward
>>> compatibility.  But I'm not sure whether anyone cares if we just break
>>> it, and it's certainly simpler without the flag.
>
>> I vote we do it the other way around, that is have a flag BGWORKER_VERBOSE.
>> This breaks backwards compatibility (I don't think there's too much
>> value in that in this case), but it copes with the more common use case
>> that you want to have the flag while the worker is being developed; and
>> things that are already working don't need to change in order to get the
>> natural behavior.
>
> I concur: if we're to have a flag at all, it should work as Alvaro says.
>
> However, I'm not real sure we need a flag.  I think the use-case of
> wanting extra logging for a bgworker under development is unlikely to be
> satisfied very well by just causing existing start/stop logging messages
> to come out at higher priority.  You're likely to be wanting to log other,
> bgworker-specific, events, and so you'll probably end up writing a bunch
> of your own elog calls anyway (which you'll eventually remove, #ifdef out,
> or decrease the log levels of).

Yeah.  So let's just change it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] less log level for success dynamic background workers for 9.5

2015-06-23 Thread Tom Lane
Alvaro Herrera  writes:
> Robert Haas wrote:
>> Well, if the flag is BGWORKER_QUIET, then the default behavior remains
>> unchanged, but when that flag is used, the log level is reduced to
>> DEBUG1.  That has the advantage of not breaking backward
>> compatibility.  But I'm not sure whether anyone cares if we just break
>> it, and it's certainly simpler without the flag.

> I vote we do it the other way around, that is have a flag BGWORKER_VERBOSE.
> This breaks backwards compatibility (I don't think there's too much
> value in that in this case), but it copes with the more common use case
> that you want to have the flag while the worker is being developed; and
> things that are already working don't need to change in order to get the
> natural behavior.

I concur: if we're to have a flag at all, it should work as Alvaro says.

However, I'm not real sure we need a flag.  I think the use-case of
wanting extra logging for a bgworker under development is unlikely to be
satisfied very well by just causing existing start/stop logging messages
to come out at higher priority.  You're likely to be wanting to log other,
bgworker-specific, events, and so you'll probably end up writing a bunch
of your own elog calls anyway (which you'll eventually remove, #ifdef out,
or decrease the log levels of).

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] less log level for success dynamic background workers for 9.5

2015-06-23 Thread Alvaro Herrera
Robert Haas wrote:
> On Tue, Jun 23, 2015 at 10:53 AM, Pavel Stehule  
> wrote:
> > 2015-06-23 15:20 GMT+02:00 Robert Haas :
> >> I was thinking of a background worker flag, not a GUC.
> >> BGWORKER_QUIET, or something like that.  But I guess we ought to just
> >> change it.
> >
> > I don't have any problem with a bgworker flag. The only question is what
> > should be the default.
> 
> Well, if the flag is BGWORKER_QUIET, then the default behavior remains
> unchanged, but when that flag is used, the log level is reduced to
> DEBUG1.  That has the advantage of not breaking backward
> compatibility.  But I'm not sure whether anyone cares if we just break
> it, and it's certainly simpler without the flag.

I vote we do it the other way around, that is have a flag BGWORKER_VERBOSE.
This breaks backwards compatibility (I don't think there's too much
value in that in this case), but it copes with the more common use case
that you want to have the flag while the worker is being developed; and
things that are already working don't need to change in order to get the
natural behavior.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] NULL passed as an argument to memcmp() in parse_func.c

2015-06-23 Thread Tom Lane
Glen Knowles  writes:
> It appears that, according to the standard, passing NULL to memcmp is
> undefined behavior, even if the count is 0. See
> http://stackoverflow.com/questions/16362925/can-i-pass-a-null-pointer-to-memcmp
> for C99 and C++ standard references.

Hmm ... looks like that's correct.  I had not noticed the introductory
paragraphs.  For those following along at home, the relevant text in
C99 is in "7.21.1 String function conventions":

   [#2]  Where  an  argument declared as size_t n specifies the
   length of the array for a function, n  can  have  the  value
   zero  on  a call to that function.  Unless explicitly stated
   otherwise in the description of  a  particular  function  in
   this subclause, pointer arguments on such a call shall still
   have valid values, as described in 7.1.4.  On such a call, a
   function  that  locates  a  character finds no occurrence, a
   function that compares two character sequences returns zero,
   and   a   function   that   copies  characters  copies  zero
   characters.

and the relevant text from 7.1.4 is

   [#1]   Each  of  the  following  statements  applies  unless
   explicitly stated otherwise  in  the  detailed  descriptions |
   that  follow:  If  an  argument to a function has an invalid
   value (such as a value outside the domain of  the  function,
   or  a pointer outside the address space of the program, or a
   null pointer) or a type (after promotion) not expected by  a
   function  with variable number of arguments, the behavior is
   undefined.

So it looks like we'd better change it.

I am not sure whether to put in the nargs == 0 test suggested yesterday
or to just insist that callers not pass NULL.  A quick grep suggests that
there is only one such caller right now, namely this bit in ruleutils.c:

appendStringInfo(&buf, "EXECUTE PROCEDURE %s(",
 generate_function_name(trigrec->tgfoid, 0,
NIL, NULL,
false, NULL, EXPR_KIND_NONE));

You could certainly argue that that's taking an unwarranted shortcut.
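
For what it's worth, the nargs == 0 test amounts to a short-circuit in
front of the memcmp() call. A standalone sketch of the idea (this is not
the actual parse_func.c code; Oid is stood in by a local typedef):

#include <stdbool.h>
#include <string.h>

typedef unsigned int Oid;	/* stand-in for the backend's Oid typedef */

/*
 * With nargs == 0 the two argument-type arrays are vacuously equal and
 * memcmp() is never reached, so callers (like the ruleutils.c call quoted
 * above) may legitimately pass NULL pointers.
 */
static bool
arg_types_equal(const Oid *a, const Oid *b, int nargs)
{
	if (nargs == 0)
		return true;
	return memcmp(a, b, (size_t) nargs * sizeof(Oid)) == 0;
}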

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_rewind failure by file deletion in source server

2015-06-23 Thread Fujii Masao
On Tue, Jun 23, 2015 at 11:21 PM, Heikki Linnakangas  wrote:
> On 06/23/2015 05:03 PM, Fujii Masao wrote:
>>
>> On Tue, Jun 23, 2015 at 9:19 PM, Heikki Linnakangas 
>> wrote:
>>>
>>> On 06/23/2015 07:51 AM, Michael Paquier wrote:


 So... Attached are a set of patches dedicated at fixing this issue:
>>>
>>>
>>>
>>> Thanks for working on this!
>>>
 - 0001, add if_not_exists to pg_tablespace_location, returning NULL if
 path does not exist
 - 0002, same with pg_stat_file, returning NULL if file does not exist
 - 0003, same with pg_read_*file. I added them to all the existing
 functions for consistency.
 - 0004, pg_ls_dir extended with if_not_exists and include_dot_dirs
 (thanks Robert for the naming!)
 - 0005, as things get complex, a set of regression tests aimed to
 covering those things. pg_tablespace_location is platform-dependent,
 so there are no tests for it.
 - 0006, the fix for pg_rewind, using what has been implemented before.
>>>
>>>
>>>
>>> With these patches, pg_read_file() will return NULL for any failure to
>>> open the file, which makes pg_rewind assume that the file doesn't exist
>>> in the source server, and will remove the file from the destination.
>>> That's dangerous, those functions should check specifically for ENOENT.
>>
>>
>> I'm wondering if using pg_read_file() to copy the file from source server
>> is reasonable. ISTM that it has two problems as follows.
>>
>> 1. It cannot read very large file like 1GB file. So if such large file was
>>  created in source server after failover, pg_rewind would not be able
>>  to copy the file. No?
>
>
> pg_read_binary_file() handles large files just fine. It cannot return more
> than 1GB in one call, but you can call it several times and retrieve the
> file in chunks. That's what pg_rewind does, except for reading the control
> file, which is known to be small.

Yeah, you're right.

I found that pg_rewind creates a temporary table to fetch the file in chunks.
This would prevent pg_rewind from using the *hot standby* server as a source
server at all. This is of course a limitation of pg_rewind, but we might want
to alleviate it in the future.

>> 2. Many users may not allow a remote client to connect to the
>>  PostgreSQL server as a superuser for some security reasons. IOW,
>>  there would be no entry in pg_hba.conf for such connection.
>>  In this case, pg_rewind always fails because pg_read_file() needs
>>  superuser privilege. No?
>>
>> I'm tempted to implement the replication command version of
>> pg_read_file(). That is, it reads and sends the data like BASE_BACKUP
>> replication command does...
>
>
> Yeah, that would definitely be nice. Peter suggested it back in January
> (http://www.postgresql.org/message-id/54ac4801.7050...@gmx.net). I think
> it's way too late to do that for 9.5, however. I'm particularly worried that
> if we design the required API in a rush, we're not going to get it right,
> and will have to change it again soon. That might be difficult in a minor
> release. Using pg_read_file() and friends is quite flexible, even though we
> just find out that they're not quite flexible enough right now (the ENOENT
> problem).

I agree that it's too late to do what I said...

But just using pg_read_file() cannot address the #2 problem that I pointed
in my previous email. Also requiring a superuser privilege on pg_rewind
really conflicts with the motivation why we added replication privilege.

So we should change pg_read_file() so that even replication user can
read the file?
Or replication user version of pg_read_file() should be implemented?

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] btree_gin and BETWEEN

2015-06-23 Thread Jeff Janes
If I use the btree_gin extension to build a gin index on a scalar value, it
doesn't work well with BETWEEN queries.  It looks like it scans the whole
index, with the part of the index between the endpoints getting scanned
twice.  It is basically executed as if "col1 between x and y" were "col1
between -inf and y and col1 between x and +inf".

It puts the correct tuples into the bitmap, because whichever inequality is
not being used to set the query endpoint currently is used as a filter
instead.

So I could just not build that index.  But I want it for other reasons, and
the problem is that the planner thinks the index can implement the BETWEEN
query efficiently.  So even if it has truly better options available, it
switches to using a falsely attractive btree_gin index.

create table foo as select random() as btree, random() as gin from
generate_series(1,300);
create index on foo using gin (gin);
create index on foo using btree (btree);
explain ( analyze, buffers) select count(*) from foo where btree between
0.001 and 0.00105;
explain ( analyze, buffers) select count(*) from foo where gin between
0.001 and 0.00105;

It would be nice if btree_gin supported BETWEEN and other range queries
efficiently, or at least if the planner knew it couldn't support them
efficiently.  But I don't see where to begin on either one of these tasks.
Is either one of them plausible?

Cheers,

Jeff


Re: [HACKERS] checkpointer continuous flushing

2015-06-23 Thread Fabien COELHO



It'd be interesting to see numbers for tiny, without the overly small
checkpoint timeout value. 30s is below the OS's writeback time.


Here are some tests with longer timeout:

tiny2: scale=10 shared_buffers=1GB checkpoint_timeout=5min
max_wal_size=1GB warmup=600 time=4000

 flsh |  full speed tps  | percent of late tx, 4 clients, for tps:
 /srt |  1 client  |  4 clients  |  100 |  200 |  400 |  800 | 1200 | 1600
 N/N  | 930 +- 124 | 2560 +- 394 | 0.70 | 1.03 | 1.27 | 1.56 | 2.02 | 2.38
 N/Y  | 924 +- 122 | 2612 +- 326 | 0.63 | 0.79 | 0.94 | 1.15 | 1.45 | 1.67
 Y/N  | 907 +- 112 | 2590 +- 315 | 0.58 | 0.83 | 0.68 | 0.71 | 0.81 | 1.26
 Y/Y  | 915 +- 114 | 2590 +- 317 | 0.60 | 0.68 | 0.70 | 0.78 | 0.88 | 1.13

There seems to be a small 1-2% performance benefit with 4 clients, which is
reversed for 1 client. There are significantly and consistently fewer late
transactions when the options are activated, and the performance is more
stable (standard deviation reduced by 10-18%).

The db is about 200 MB (~25000 pages); at 2500+ tps it is written 40 times
over in 5 minutes, so the checkpoint basically writes everything over 220
seconds, at about 0.9 MB/s. Given the preload phase the buffers may be more
or less in order in memory, so they would be written out in order.



medium2: scale=300 shared_buffers=5GB checkpoint_timeout=30min
 max_wal_size=4GB warmup=1200 time=7500

 flsh |  full speed tps   | percent of late tx, 4 clients
 /srt |  1 client   |  4 clients  |   100 |   200 |   400 |
  N/N | 173 +- 289* | 198 +- 531* | 27.61 | 43.92 | 61.16 |
  N/Y | 458 +- 327* | 743 +- 920* |  7.05 | 14.24 | 24.07 |
  Y/N | 169 +- 166* | 187 +- 302* |  4.01 | 39.84 | 65.70 |
  Y/Y | 546 +- 143  | 681 +- 459  |  1.55 |  3.51 |  2.84 |

The effect of sorting is very positive (+150% to 270% tps). On this run,
flushing has a positive (+20% with 1 client) or negative (-8% with 4
clients) effect on throughput, and late transactions are reduced by 92-95%
when both options are activated.


At 550 tps checkpoints are xlog-triggered and write about 1/3 of the
database (17 buffers to write every 220-260 seconds, 4 MB/s).


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash index creation warning

2015-06-23 Thread Robert Haas
On Mon, Jun 22, 2015 at 8:46 PM, Michael Paquier
 wrote:
> On Tue, Jun 23, 2015 at 9:06 AM, Jim Nasby  wrote:
>> On 6/12/15 5:00 PM, Thom Brown wrote:
>>>
>>> On 18 October 2014 at 15:36, Bruce Momjian  wrote:

 On Fri, Oct 17, 2014 at 02:36:55PM -0400, Bruce Momjian wrote:
>
> On Fri, Oct 17, 2014 at 12:56:52PM -0400, Tom Lane wrote:
>>
>> David G Johnston  writes:
>>>
>>> The question is whether we explain the implications of not being
>>> WAL-logged
>>> in an error message or simply state the fact and let the documentation
>>> explain the hazards - basically just output:
>>> "hash indexes are not WAL-logged and their use is discouraged"
>>
>>
>> +1.  The warning message is not the place to be trying to explain all
>> the
>> details.
>
>
> OK, updated patch attached.


 Patch applied.
>>>
>>>
>>> I only just noticed this item when I read the release notes.  Should
>>> we bother warning when used on an unlogged table?
>>
>>
>> Not really; but I think the bigger question at this point is if we want to
>> change it this late in the game.
>
> Changing it even during beta looks acceptable to me. I think that it
> is mainly a matter of having a patch (here is one), and someone to push
> it, as everybody here seems to agree that for UNLOGGED tables this
> warning makes little sense.

I think you should be testing RelationNeedsWAL(), not the
relpersistence directly.  The same point applies for temporary
indexes.
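
A sketch of that suggestion, as a fragment only -- accessMethodName and
rel are assumed to be the local variables in scope where the existing
warning is emitted, and the exact placement is illustrative:

	/*
	 * Fragment of the suggested test.  RelationNeedsWAL() is false for
	 * unlogged and temporary relations alike, so one check covers both
	 * cases instead of looking at relpersistence directly.
	 */
	if (strcmp(accessMethodName, "hash") == 0 &&
		RelationNeedsWAL(rel))
		ereport(WARNING,
				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
				 errmsg("hash indexes are not WAL-logged and their use is discouraged")));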

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] less log level for success dynamic background workers for 9.5

2015-06-23 Thread Robert Haas
On Tue, Jun 23, 2015 at 10:53 AM, Pavel Stehule  wrote:
> 2015-06-23 15:20 GMT+02:00 Robert Haas :
>> I was thinking of a background worker flag, not a GUC.
>> BGWORKER_QUIET, or something like that.  But I guess we ought to just
>> change it.
>
> I don't have any problem with a bgworker flag. The only question is what
> should be the default.

Well, if the flag is BGWORKER_QUIET, then the default behavior remains
unchanged, but when that flag is used, the log level is reduced to
DEBUG1.  That has the advantage of not breaking backward
compatibility.  But I'm not sure whether anyone cares if we just break
it, and it's certainly simpler without the flag.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] less log level for success dynamic background workers for 9.5

2015-06-23 Thread Pavel Stehule
2015-06-23 15:20 GMT+02:00 Robert Haas :

> On Mon, Jun 22, 2015 at 9:47 PM, Tom Lane  wrote:
> > Michael Paquier  writes:
> >> On Tue, Jun 23, 2015 at 10:07 AM, Robert Haas wrote:
> >>> On Mon, Jun 22, 2015 at 8:19 PM, Jim Nasby wrote:
>  Anything ever happen with this? I agree that LOG is too high for
>  reporting most (if not all) of these things.
> >
> >>> I think we should consider having a flag for this behavior rather than
> >>> changing the behavior across the board.
> >>> But then again, maybe we should just change it.
> >>>
> >>> What do others think?
> >
> >> A GUC just for that looks like overkill to me; this log is useful
> >> when debugging. And one could always have one's bgworker call elog by
> >> itself at startup and before leaving to provide more or less similar
> >> information.
> >
> > I agree that we don't need YAGUC here, particularly not one that applies
> > indiscriminately to all bgworkers.  I'd vote for just decreasing the log
> > level.  The current coding is appropriate for a facility that's basically
> > experimental; but as it moves towards being something that would be used
> > routinely in production, the argument for being noisy in the log gets
> > weaker and weaker.
>
> I was thinking of a background worker flag, not a GUC.
> BGWORKER_QUIET, or something like that.  But I guess we ought to just
> change it.
>

I don't have any problem with a bgworker flag. The only question is what
should be the default.

Pavel


>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


Re: [HACKERS] proposal: row_to_array function

2015-06-23 Thread Pavel Stehule
2015-06-23 1:56 GMT+02:00 Jim Nasby :

> On 6/22/15 2:46 AM, Pavel Stehule wrote:
>
>>
>> FOREACH key, val IN RECORD myrow
>> LOOP
>>IF pg_typeof(val) IN ('int4', 'double precision', 'numeric') THEN
>>  val := val + 1; -- these variables can be mutable
>>  -- or maybe in future
>> myrow[key] := val + 1;
>>END IF;
>> END LOOP;
>>
>> What is important - "val" is an automatic variable, and it can have a
>> different type in any step.
>>
>> It is a little bit strange, but impossible to solve, so we cannot
>> support row[var] as a right value (without immutable casting). But we can
>> do it with a left value.
>>
>
> Actually, you can (theoretically) solve it for the right value as well
> if val is an actual type and you have operators on that type that know
> to search for a specific operator given the actual types that are involved.
> So if val is int4, val + 1 becomes int4 + int4.
>
> The problem I've run into with this is by the time you've added enough
> casts to make this workable you've probably created a situation where val +
> something is going to recurse back to itself. I've partially solved this in
> [1], and intend to finish it by calling back in via SPI to do the final
> resolution, the same way the RI triggers do.
>
> What would be a lot better is if we had better control over function and
> operator resolution.
>
> [1]
> https://github.com/decibel/variant/commit/2b99067744a405da8a325de1ebabd213106f794f#diff-8aa2db4a577ee4201d6eb9041c2a457eR846
>

The solution of dynamic operators changes the philosophy by about 180° - and I
am afraid about the performance.

Now, thinking about the possibilities - it is probably solvable on the right
side too. It needs two steps:

1. a parametrized record reference syntax - something like SELECT $1[$]
2. the possibility to throw away the plan cache if the result has a different
type than is expected in the cache.


Pavel



> --
> Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
>
> Data in Trouble? Get it in Treble! http://BlueTreble.com
>


Re: [HACKERS] SSL TAP tests and chmod

2015-06-23 Thread Robert Haas
On Sun, Jun 21, 2015 at 9:50 PM, Michael Paquier
 wrote:
> On Fri, Jun 19, 2015 at 2:49 PM, Michael Paquier wrote:
>> The TAP tests in src/test/ssl are using system() in combination with
>> chmod, but perl has a command chmod integrated into it, and it would
>> work better on Windows as well.
>> The attached patch is aimed at fixing that.
>
> For the sake of the archives, this has been committed by Robert (thanks!):
> commit: ca3f43aa48a83013ea50aeee7cd193a5859c4587
> author: Robert Haas 
> date: Fri, 19 Jun 2015 10:46:30 -0400
> Change TAP test framework to not rely on having a chmod executable.
>
> This might not work at all on Windows, and is not very efficient.

Whoops, sorry, I thought I had replied to this thread, but I guess I forgot.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_rewind failure by file deletion in source server

2015-06-23 Thread Heikki Linnakangas

On 06/23/2015 05:03 PM, Fujii Masao wrote:

On Tue, Jun 23, 2015 at 9:19 PM, Heikki Linnakangas  wrote:

On 06/23/2015 07:51 AM, Michael Paquier wrote:


So... Attached are a set of patches dedicated at fixing this issue:



Thanks for working on this!


- 0001, add if_not_exists to pg_tablespace_location, returning NULL if
path does not exist
- 0002, same with pg_stat_file, returning NULL if file does not exist
- 0003, same with pg_read_*file. I added them to all the existing
functions for consistency.
- 0004, pg_ls_dir extended with if_not_exists and include_dot_dirs
(thanks Robert for the naming!)
- 0005, as things get complex, a set of regression tests aimed to
covering those things. pg_tablespace_location is platform-dependent,
so there are no tests for it.
- 0006, the fix for pg_rewind, using what has been implemented before.



With these patches, pg_read_file() will return NULL for any failure to
open the file, which makes pg_rewind assume that the file doesn't
exist in the source server, and will remove the file from the destination.
That's dangerous, those functions should check specifically for ENOENT.

I'm wondering if using pg_read_file() to copy the file from source server
is reasonable. ISTM that it has two problems as follows.

1. It cannot read very large file like 1GB file. So if such large file was
 created in source server after failover, pg_rewind would not be able
 to copy the file. No?


pg_read_binary_file() handles large files just fine. It cannot return 
more than 1GB in one call, but you can call it several times and 
retrieve the file in chunks. That's what pg_rewind does, except for 
reading the control file, which is known to be small.
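
For reference, the chunked retrieval can be driven entirely from libpq.
A simplified sketch (this is not pg_rewind's actual libpq_fetch.c; the
helper name, chunk size, and error handling are illustrative):

#include <stdio.h>
#include "libpq-fe.h"

/*
 * Fetch a file through pg_read_binary_file() one slice at a time, so no
 * single call has to return more than 1GB.
 */
static int
fetch_file_in_chunks(PGconn *conn, const char *path)
{
	const long	chunk = 1024 * 1024;	/* 1 MB per round trip */
	long		offset = 0;

	for (;;)
	{
		char		offsetbuf[32];
		char		lenbuf[32];
		const char *values[3];
		PGresult   *res;
		int			got;

		snprintf(offsetbuf, sizeof(offsetbuf), "%ld", offset);
		snprintf(lenbuf, sizeof(lenbuf), "%ld", chunk);
		values[0] = path;
		values[1] = offsetbuf;
		values[2] = lenbuf;

		/* Ask for a binary result so the bytea arrives unescaped. */
		res = PQexecParams(conn,
						   "SELECT pg_read_binary_file($1, $2, $3)",
						   3, NULL, values, NULL, NULL, 1);
		if (PQresultStatus(res) != PGRES_TUPLES_OK)
		{
			fprintf(stderr, "fetch failed: %s", PQerrorMessage(conn));
			PQclear(res);
			return -1;
		}

		got = PQgetlength(res, 0, 0);
		/* ... write PQgetvalue(res, 0, 0) to the local copy here ... */
		PQclear(res);

		if (got < chunk)
			break;				/* short read means end of file */
		offset += got;
	}
	return 0;
}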



2. Many users may not allow a remote client to connect to the
 PostgreSQL server as a superuser for some security reasons. IOW,
 there would be no entry in pg_hba.conf for such connection.
 In this case, pg_rewind always fails because pg_read_file() needs
 superuser privilege. No?

I'm tempted to implement the replication command version of
pg_read_file(). That is, it reads and sends the data like BASE_BACKUP
replication command does...


Yeah, that would definitely be nice. Peter suggested it back in January 
(http://www.postgresql.org/message-id/54ac4801.7050...@gmx.net). I think 
it's way too late to do that for 9.5, however. I'm particularly worried 
that if we design the required API in a rush, we're not going to get it 
right, and will have to change it again soon. That might be difficult in 
a minor release. Using pg_read_file() and friends is quite flexible, 
even though we just find out that they're not quite flexible enough 
right now (the ENOENT problem).


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] get_relation_info comment out of sync

2015-06-23 Thread Robert Haas
On Sun, Jun 21, 2015 at 8:10 AM, Thomas Munro
 wrote:
> The comment for get_relation_info should probably include serverid in
> the list of rel members that it can update (see attached).

Committed, thanks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_rewind failure by file deletion in source server

2015-06-23 Thread Fujii Masao
On Tue, Jun 23, 2015 at 9:19 PM, Heikki Linnakangas  wrote:
> On 06/23/2015 07:51 AM, Michael Paquier wrote:
>>
>> So... Attached are a set of patches dedicated at fixing this issue:
>
>
> Thanks for working on this!
>
>> - 0001, add if_not_exists to pg_tablespace_location, returning NULL if
>> path does not exist
>> - 0002, same with pg_stat_file, returning NULL if file does not exist
>> - 0003, same with pg_read_*file. I added them to all the existing
>> functions for consistency.
>> - 0004, pg_ls_dir extended with if_not_exists and include_dot_dirs
>> (thanks Robert for the naming!)
>> - 0005, as things get complex, a set of regression tests aimed to
>> covering those things. pg_tablespace_location is platform-dependent,
>> so there are no tests for it.
>> - 0006, the fix for pg_rewind, using what has been implemented before.
>
>
> With these patches, pg_read_file() will return NULL for any failure to open
> the file, which makes pg_rewind assume that the file doesn't exist in the
> source server, and will remove the file from the destination. That's
> dangerous, those functions should check specifically for ENOENT.

I'm wondering if using pg_read_file() to copy the file from source server
is reasonable. ISTM that it has two problems as follows.

1. It cannot read very large file like 1GB file. So if such large file was
created in source server after failover, pg_rewind would not be able
to copy the file. No?

2. Many users may not allow a remote client to connect to the
PostgreSQL server as a superuser for some security reasons. IOW,
there would be no entry in pg_hba.conf for such connection.
In this case, pg_rewind always fails because pg_read_file() needs
superuser privilege. No?

I'm tempted to implement the replication command version of
pg_read_file(). That is, it reads and sends the data like BASE_BACKUP
replication command does...

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Insufficient locking for ALTER DEFAULT PRIVILEGES

2015-06-23 Thread Robert Haas
On Sun, Jun 21, 2015 at 11:11 AM, Andres Freund  wrote:
> On 2015-06-21 11:45:24 -0300, Alvaro Herrera wrote:
>> Alvaro Herrera wrote:
>> > Now that I actually check with a non-relation object, I see pretty much
>> > the same error.  So probably if instead of some narrow bug fix what we
>> > need is some general solution for all object types.  I know this has
>> > been discussed a number of times ...  Anyway I see now that we should
>> > not consider this a backpatchable bug fix, and I'm not doing the coding
>> > either, at least not now.
>>
>> Discussed this with a couple of 2ndQ colleagues and it became evident
>> that MVCC catalog scans probably make this problem much more prominent.
>> So historical branches are not affected all that much, but it's a real
>> issue on 9.4+.
>
> Hm. I don't see how those would make a marked difference. The snapshot
> for catalogs scan are taken afresh for each scan (unless
> cached). There'll probably be some difference, but it'll be small.

Yeah, I think the same.  If those changes introduced a problem we
didn't have before, I'd like to see a reproducible test case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Memory context at PG_init call ?

2015-06-23 Thread Robert Haas
On Tue, Jun 23, 2015 at 7:41 AM, Sandro Santilli  wrote:
> Empirically, I seem to be getting the _PG_init call for a module while
> the active memory context lifetime is that of the function call which
> first needed to load the shared object.
>
> Is this the case? Is it documented anywhere?
> Initializing memory meant to be alive for the whole lifetime of a backend
> in that function is a bit complex if that's confirmed.

If you want something that lasts for the lifetime of the backend, just do

MemoryContext oldctx = MemoryContextSwitchTo(TopMemoryContext);
...
MemoryContextSwitchTo(oldctx);

I'm not sure what context you should expect in _PG_init(), although
what you mention doesn't sound unreasonable.  But generally if you
want a long-lived context you should establish that explicitly
yourself.  We try to keep short-lived contexts active whenever
possible because that avoids long-term leaks.
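
A minimal sketch of that pattern in a module's _PG_init() (the module
state being saved here is hypothetical):

#include "postgres.h"
#include "fmgr.h"
#include "utils/memutils.h"

PG_MODULE_MAGIC;

void _PG_init(void);

/* Hypothetical module state that must survive for the whole backend. */
static char *my_module_state = NULL;

void
_PG_init(void)
{
	/*
	 * Whatever context happens to be active when the library gets loaded
	 * (possibly the per-call context of the function that triggered the
	 * load, per the observation above), switch to TopMemoryContext for
	 * allocations that must outlive it, then switch back.
	 */
	MemoryContext oldctx = MemoryContextSwitchTo(TopMemoryContext);

	my_module_state = pstrdup("long-lived module state");

	MemoryContextSwitchTo(oldctx);
}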

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] less log level for success dynamic background workers for 9.5

2015-06-23 Thread Robert Haas
On Mon, Jun 22, 2015 at 9:47 PM, Tom Lane  wrote:
> Michael Paquier  writes:
>> On Tue, Jun 23, 2015 at 10:07 AM, Robert Haas wrote:
>>> On Mon, Jun 22, 2015 at 8:19 PM, Jim Nasby wrote:
 Anything ever happen with this? I agree that LOG is too high for reporting
 most (if not all) of these things.
>
>>> I think we should consider having a flag for this behavior rather than
>>> changing the behavior across the board.
>>> But then again, maybe we should just change it.
>>>
>>> What do others think?
>
>> A GUC just for that looks like overkill to me; this log is useful
>> when debugging. And one could always have one's bgworker call elog by
>> itself at startup and before leaving to provide more or less similar
>> information.
>
> I agree that we don't need YAGUC here, particularly not one that applies
> indiscriminately to all bgworkers.  I'd vote for just decreasing the log
> level.  The current coding is appropriate for a facility that's basically
> experimental; but as it moves towards being something that would be used
> routinely in production, the argument for being noisy in the log gets
> weaker and weaker.

I was thinking of a background worker flag, not a GUC.
BGWORKER_QUIET, or something like that.  But I guess we ought to just
change it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] upper planner path-ification

2015-06-23 Thread Robert Haas
On Tue, Jun 23, 2015 at 4:41 AM, Kouhei Kaigai  wrote:
> So, unless we find a solution around the planner, 2-phase aggregation is
> like a curry without rice

Simon and I spoke with Tom about this upper planner path-ification
problem at PGCon, and he indicated that he intended to work on it
soon, and that we should bug him about it if he doesn't.

I'm pleased to hear this, as I seem to have a special talent in that area.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] introduce XLogLockBlockRangeForCleanup()

2015-06-23 Thread Abhijit Menon-Sen
At 2014-08-20 11:07:44 +0300, hlinnakan...@vmware.com wrote:
>
> I don't think the new GetBufferWithoutRelcache function is in line
> with the existing ReadBuffer API. I think it would be better to add a
> new ReadBufferMode - RBM_CACHE_ONLY? - that only returns the buffer if
> it's already in cache, and InvalidBuffer otherwise.

Hi Heikki.

I liked the idea of having a new ReadBufferMode of RBM_CACHE_ONLY, but I
wasn't happy with the result when I tried to modify the code to suit. It
didn't make the code easier for me to follow.

(I'll say straight up that it can be done the way you said, and if the
consensus is that it would be an improvement, I'll do it that way. I'm
just not convinced about it, hence this mail.)

For example:

> static Buffer
> ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
>   BlockNumber blockNum, ReadBufferMode mode,
>   BufferAccessStrategy strategy, bool *hit)
> {
> volatile BufferDesc *bufHdr;
> Block   bufBlock;
> boolfound;
> boolisExtend;
> boolisLocalBuf = SmgrIsTemp(smgr);
> 
> *hit = false;
> 
> /* Make sure we will have room to remember the buffer pin */
> ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
> 
> isExtend = (blockNum == P_NEW);

isLocalBuf and isExtend will both be false for the RBM_CACHE_ONLY case,
which is good for us. But that probably needs to be mentioned in a
comment here.

> TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
>smgr->smgr_rnode.node.spcNode,
>smgr->smgr_rnode.node.dbNode,
>smgr->smgr_rnode.node.relNode,
>smgr->smgr_rnode.backend,
>isExtend);

We know we're not going to be doing IO in the RBM_CACHE_ONLY case, but
that's probably OK? I don't understand this TRACE_* stuff. But we need
to either skip this, or also do the corresponding TRACE_*_DONE later.

> if (isLocalBuf)
> {
> bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
> if (found)
> pgBufferUsage.local_blks_hit++;
> else
> pgBufferUsage.local_blks_read++;
> }
> else
> {
> /*
>  * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
>  * not currently in memory.
>  */
> bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
>  strategy, &found);
> if (found)
> pgBufferUsage.shared_blks_hit++;
> else
> pgBufferUsage.shared_blks_read++;
> }

The nicest option I could think of here was to copy the shared_buffers
lookup from BufferAlloc into a new BufferAlloc_shared function, and add
a new branch for RBM_CACHE_ONLY. That would look like:

if (isLocalBuf)
…
else if (mode == RBM_CACHE_ONLY)
{
/*
 * We check to see if the buffer is already cached, and if it's
 * not, we return InvalidBuffer because we know it's not pinned.
 */
bufHdr = BufferAlloc_shared(…, &found);
if (!found)
return InvalidBuffer;
LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
return BufferDescriptorGetBuffer(bufHdr);
}
else
{
/*
 * lookup the buffer …
}

…or if we went with the rest of the code and didn't do the lock/return
immediately, we would fall through to this code:

> /* At this point we do NOT hold any locks. */
> 
> /* if it was already in the buffer pool, we're done */
> if (found)
> {
> if (!isExtend)
> {
> /* Just need to update stats before we exit */
> *hit = true;
> VacuumPageHit++;
> 
> if (VacuumCostActive)
> VacuumCostBalance += VacuumCostPageHit;
> 
> TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
>   smgr->smgr_rnode.node.spcNode,
>   smgr->smgr_rnode.node.dbNode,
>   smgr->smgr_rnode.node.relNode,
>   smgr->smgr_rnode.backend,
>   isExtend,
>   found);
> 
> /*
>  * In RBM_ZERO_AND_LOCK mode the caller expects the page to be
>  * locked on return.
>  */
> if (!isLocalBuf)
> {
> if (mode == RBM_ZERO_AND_LOCK)
> LWLockAcquire(bufHdr->content_lock, LW_EXCLUSIVE);
> else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
> LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
> }
> 
> return BufferDescript

Re: [HACKERS] A couple of newlines missing in pg_rewind log entries

2015-06-23 Thread Heikki Linnakangas

On 06/23/2015 07:39 AM, Michael Paquier wrote:

Hi all,

Some grepping is showing up that a couple of newlines are missing in
pg_rewind, leading to unreadable log entries:
libpq_fetch.c:pg_log(PG_DEBUG, "getting file chunks");


Fixed.


logging.c:pg_log(PG_PROGRESS, "%*s/%s kB (%d%%) copied",


This one was on purpose; note the printf("\r") after that line. That's 
supposed to be a progress indicator that is updated in place on a single line.
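
(Roughly this idiom, as a sketch rather than the actual pg_rewind code:)

#include <stdio.h>

/*
 * Sketch of a self-overwriting progress line: print the status with no
 * trailing newline, then "\r" so the next call starts at column 0 again;
 * the caller prints one final "\n" when everything is done.
 */
static void
report_progress(long done_kb, long total_kb)
{
    int     percent = (total_kb > 0) ? (int) (done_kb * 100 / total_kb) : 100;

    fprintf(stderr, "%ld/%ld kB (%d%%) copied\r", done_kb, total_kb, percent);
    fflush(stderr);
}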



filemap.c:pg_fatal("could not stat file \"%s\": %s",


Peter fixed this already.

- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_rewind failure by file deletion in source server

2015-06-23 Thread Heikki Linnakangas

On 06/23/2015 07:51 AM, Michael Paquier wrote:

So... Attached are a set of patches dedicated at fixing this issue:


Thanks for working on this!


- 0001, add if_not_exists to pg_tablespace_location, returning NULL if
path does not exist
- 0002, same with pg_stat_file, returning NULL if file does not exist
- 0003, same with pg_read_*file. I added them to all the existing
functions for consistency.
- 0004, pg_ls_dir extended with if_not_exists and include_dot_dirs
(thanks Robert for the naming!)
- 0005, as things get complex, a set of regression tests aimed to
covering those things. pg_tablespace_location is platform-dependent,
so there are no tests for it.
- 0006, the fix for pg_rewind, using what has been implemented before.


With these patches, pg_read_file() will return NULL for any failure to 
open the file, which makes pg_rewind assume that the file doesn't 
exist in the source server, so it will remove the file from the 
destination. That's dangerous; those functions should check specifically 
for ENOENT.
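
Something of roughly this shape, I mean (a fragment to show the idea, not the 
actual genfile.c code; missing_ok stands in for the new if_not_exists argument):

/* Only a genuinely missing file maps to NULL; anything else still errors. */
FILE       *file = fopen(filename, "rb");

if (file == NULL)
{
    if (errno == ENOENT && missing_ok)
        PG_RETURN_NULL();
    ereport(ERROR,
            (errcode_for_file_access(),
             errmsg("could not open file \"%s\" for reading: %m",
                    filename)));
}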


There's still a small race condition with tablespaces. If you run CREATE 
TABLESPACE in the source server while pg_rewind is running, it's 
possible that the recursive query that pg_rewind uses sees the symlink 
in pg_tblspc/ directory, but its snapshot doesn't see the row in 
pg_tablespace yet. It will think that the symlink is a regular file, try 
to read it, and fail (if we checked for ENOENT).


Actually, I think we need to try to deal with symlinks a bit harder. 
Currently, pg_rewind assumes that anything in pg_tblspc that has a 
matching row in pg_tablespace is a symlink, and nothing else is. I think 
symlinks to directories. I just noticed that pg_rewind fails miserably 
if pg_xlog is a symlink, because of that:



The servers diverged at WAL position 0/3023F08 on timeline 1.
Rewinding from last common checkpoint at 0/260 on timeline 1

"data-master//pg_xlog" is not a directory
Failure, exiting


I think we need to add a column to pg_stat_file output, to indicate 
symbolic links, and add a pg_readlink() function. That still leaves a 
race condition if the type of a file changes, i.e. a file is deleted and 
a directory with the same name is created in its place, but that seems 
acceptable. I don't think PostgreSQL ever does such a thing, so that 
could only happen if you mess with the data directory manually while the 
server is running.


I just realized another problem: We recently learned the hard way that 
some people have files in the data directory that are not writeable by 
the 'postgres' user 
(http://www.postgresql.org/message-id/20150523172627.ga24...@msg.df7cb.de). 
pg_rewind will try to overwrite all files it doesn't recognize as 
relation files, so it's going to fail on those. A straightforward fix 
would be to first open the destination file in read-only mode, and 
compare its contents, and only open the file in write mode if it has 
changed. It would still fail when the files really differ, but I think 
that's acceptable.
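
Something along these lines, as a plain-POSIX sketch rather than pg_rewind's 
actual copy path:

#include <sys/types.h>
#include <fcntl.h>
#include <stdbool.h>
#include <string.h>
#include <unistd.h>

/*
 * Sketch: return true if the destination file's existing contents at
 * 'offset' differ from the chunk we are about to write, so the caller
 * only needs to open the file for writing when something actually changed.
 */
static bool
chunk_differs(const char *path, off_t offset, const char *newbuf, size_t len)
{
    char        existing[8192];
    ssize_t     nread;
    int         fd;

    if (len > sizeof(existing))
        return true;            /* sketch only handles small chunks */

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return true;            /* cannot read it; assume it needs writing */

    nread = pread(fd, existing, len, offset);
    close(fd);

    return nread != (ssize_t) len || memcmp(existing, newbuf, len) != 0;
}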


I note that pg_rewind doesn't need to distinguish between an empty and a 
non-existent directory, so it's quite silly for it to pass 
include_dot_dirs=true, and then filter out "." and ".." from the result 
set. The documentation should mention the main reason for including "." 
and "..": to distinguish between an empty and non-existent directory.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Memory context at PG_init call ?

2015-06-23 Thread Sandro Santilli
Empirically, I seem to be getting the _PG_init call for a module while
the active memory context is one whose lifetime is that of the function
call which first needed to load the shared object.

Is this the case? Is it documented anywhere?
If that's confirmed, initializing memory meant to stay alive for the whole
lifetime of a backend in that function is a bit complex.
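
For concreteness, the kind of thing I mean, written defensively with an
explicit switch to TopMemoryContext (just a sketch; the names are made up):

#include "postgres.h"
#include "fmgr.h"
#include "utils/memutils.h"

PG_MODULE_MAGIC;

static char *module_state = NULL;   /* made-up module-lifetime data */

void
_PG_init(void)
{
    /* Don't trust whichever context happens to be current at load time. */
    MemoryContext oldctx = MemoryContextSwitchTo(TopMemoryContext);

    module_state = pstrdup("must survive until backend exit");
    MemoryContextSwitchTo(oldctx);
}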

--strk;


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5 make world failing due to sgml tools missing

2015-06-23 Thread Peter Eisentraut
On 6/23/15 1:06 AM, Keith Fiske wrote:
> On Sun, Jun 21, 2015 at 10:56 AM, Peter Eisentraut  wrote:
> 
> On 6/18/15 8:54 AM, Tom Lane wrote:
> > Sure; the point is that libxml2 has suddenly been reclassified as a
> > documentation build tool, which is at least a surprising categorization.
> 
> libxml2 has been a required documentation build tool since PostgreSQL
> 9.0.  The only thing that's new is that xmllint is in a different
> subpackage on some systems.  So just install that and you're all set for
> the foreseeable future.
> 
> 
> Well, something is different in 9.5. On this same system (Linux Mint
> 17.1) I can build all previous versions with "make world" and I do not
> get this error.

Hence the use of the phrase "The only thing that's new ...". ;-)


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PGXS "check" target forcing an install ?

2015-06-23 Thread Sandro Santilli
On Tue, Jun 23, 2015 at 02:31:03PM +0900, Michael Paquier wrote:
> On Tue, Jun 23, 2015 at 12:11 AM, Sandro Santilli  wrote:
> > I've noted that upgrading from PostgreSQL 9.3 to 9.5 I'm suddenly
> > unable to specify a "check" rule in the Makefile that includes the
> > PGXS one. The error is:
> >
> >  $ make check
> >  rm -rf ''/tmp_install
> >  make -C '/home/postgresql-9.5/lib/pgxs/src/makefiles/../..' 
> > DESTDIR=''/tmp_install install
> >  make[1]: Entering directory `/home/postgresql-9.5/lib/pgxs'
> >  make[1]: *** No rule to make target `install'.  Stop.
> >  make[1]: Leaving directory `/home/postgresql-9.5/lib/pgxs'
> >  make: *** [temp-install] Error 2
> >
> > I tracked the dangerous -rf to come from Makefile.global, and its empty
> > string is due to abs_top_builddir not being defined in my own Makefile.
> > But besides that, which I can probably fix, it doesn't sound correct
> > that a "check" rule insists on finding an "install" rule.
> 
> Oops, this is a regression, and a dangerous one indeed. This is caused
> by dcae5fac.
> 
> One fix is to use NO_TEMP_INSTALL=yes in Makefile.global in the
> context of PGXS, as in the attached patch; this variable needs to
> be set before Makefile.global is loaded. We could also use PGXS
> directly in the "Testing" section, but that does not sound appealing for
> Makefile.global's readability.

Thanks, setting NO_TEMP_INSTALL=yes in the including Makefile fixes
this issue.

> > I'm also
> > surprised that there's no warning coming out from the "make" invocation
> > given I'm defining a "check" rule myself (after inclusion).
> 
> Why? It looks perfectly normal to me to be able to define your own
> check rule. That's more flexible this way.

I'm surprised because I used to get warnings on overrides, and I actually
still get them for other rules. For example:

Makefile:192: warning: overriding commands for target `install'
/home/postgresql-9.3.4/lib/pgxs/src/makefiles/pgxs.mk:120: warning: ignoring 
old commands for target `install'

The same warning isn't raised for the "check" rule, even though it is clearly
defined in some upper Makefile (as shown by the forced-install bug).

> Thoughts?

Mixed... On one hand I'm happy to implement my own rules, but in this
specific case the lack of a warning left me with no hint about where
the offending "check" rule was defined.

--strk;


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] how is a query passed between a coordinator and a datanode

2015-06-23 Thread Rui Hai Jiang


Hello, 

I'm trying to figure out how a query and its result are passed between a 
coordinator and a datanode. I know there are many messages passed between them 
to finish a query.


I did a test against the coordinator by adding a row to a table; the SQL 
was: insert into hg1(id, name) values(1,'tom').

I found that a command 'P' was sent from the coordinator to a datanode, and 
there was a remote statement as follows:


stmt_name=p_1_25af_f
query_string=Remote Subplan
plan_string={REMOTESTMT :commandType 3 :hasReturning false ...}


My questions are:

 1. Does the coordinator use the remote statement to tell a datanode what to do?
If so, how is the plan string created by the coordinator, and how is the
plan_string parsed by the datanode?

 2. If there are multiple rows in the result of the query, how are the rows of
data passed from the datanode to the coordinator? Does the datanode just send
all the rows to the coordinator, or does the coordinator get each row by
sending a query?


 Thank you very much!

 Rui Hai  
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] upper planner path-ification

2015-06-23 Thread Kouhei Kaigai
> -Original Message-
> From: David Rowley [mailto:david.row...@2ndquadrant.com]
> Sent: Tuesday, June 23, 2015 2:06 PM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Robert Haas; pgsql-hackers@postgresql.org; Tom Lane
> Subject: Re: [HACKERS] upper planner path-ification
> 
> 
> On 23 June 2015 at 13:55, Kouhei Kaigai  wrote:
> 
> 
>   Once we support adding aggregation paths during path consideration,
>   we need to pay attention to morphing the final target-list according
>   to the intermediate path combination tentatively chosen.
>   For example, if partial aggregation makes sense from a cost perspective,
>   like SUM(NRows) over a partial COUNT(*) AS NRows instead of COUNT(*) on a
>   billion rows, the planner also has to adjust the final target-list according
>   to the interim paths. In this case, the final output shall be SUM() instead
>   of COUNT().
> 
> 
> 
> 
> This sounds very much like what's been discussed here:
> 
> http://www.postgresql.org/message-id/CA+U5nMJ92azm0Yt8TT=hNxFP=VjFhDqFpaWfmj
> +66-4zvcg...@mail.gmail.com
> 
> 
> The basic concept is that we add another function set to aggregates that allow
> the combination of 2 states. For the case of MIN() and MAX() this will just be
> the same as the transfn. SUM() is similar for many types, more complex for 
> others.
> I've quite likely just borrowed SUM(BIGINT)'s transfer functions to allow
> COUNT()'s to be combined.
>
STDDEV, VARIANCE and related aggregates can be constructed using nrows, sum(X)
and sum(X^2).
REGR_*, COVAR_* and related aggregates can be constructed using nrows, sum(X),
sum(Y), sum(X^2), sum(Y^2) and sum(X*Y).
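
For instance, a quick sketch of the final step for the sample variance/stddev
case (plain C, not the actual numeric.c code), assuming the partial state
carries n, sum(X) and sum(X^2):

#include <math.h>

/*
 * Sketch only: reconstruct sample variance and stddev from a partially
 * aggregated state of (n, sum(X), sum(X^2)).
 */
static double
var_samp_from_partial(double n, double sum_x, double sum_xx)
{
    if (n < 2.0)
        return 0.0;             /* not defined for fewer than two rows */
    return (n * sum_xx - sum_x * sum_x) / (n * (n - 1.0));
}

static double
stddev_samp_from_partial(double n, double sum_x, double sum_xx)
{
    return sqrt(var_samp_from_partial(n, sum_x, sum_xx));
}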

Let me introduce a positive side effect of this approach.
Because the final aggregate function processes values that have already been
partially aggregated, the difference between the state value and the
transition value stays relatively small.
That reduces accidental error in the floating-point calculations. :-)

> More time does need spent inventing the new combining functions that don't
> currently exist, but that shouldn't matter as it can be done later.
> 
> Commitfest link to patch here https://commitfest.postgresql.org/5/131/
> 
> I see you've signed up to review it!
>
Yes, we are all looking in the same direction.

The problem is that we have to cross the mountain of planner enhancement to
reach all the valuable features:
 - parallel aggregation
 - aggregation before join
 - remote aggregation via FDW

So, unless we find a solution around the planner, 2-phase aggregation is
like a curry without rice.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] upper planner path-ification

2015-06-23 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Tom Lane
> Sent: Monday, May 18, 2015 1:12 AM
> To: Robert Haas
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] upper planner path-ification
> 
> Robert Haas  writes:
> > So, getting back to this part, what's the value of returning a list of
> > Paths rather than a list of Plans?
> 
> (1) less work, since we don't have to fill in details not needed for
> costing purposes;
> (2) paths carry info that the planner wants but the executor doesn't,
> notably sort-order annotations.
> 
> > target lists are normally computed when paths are converted to plans,
> > but for the higher-level plan nodes adding by grouping_planner, the
> > path list is typically just passed in.  So would the new path types be
> > expected to carry target lists of their own, or would they need to
> > figure out the target list on the fly at plan generation time?
> 
> Yeah, that is something I've been struggling with while thinking about
> this.  I don't really want to add tlists as such to Paths, but it's
> not very clear how else to annotate a Path as to what it computes,
> and that seems like an annotation we have to have in some form in order
> to convert these planning steps into a Path universe.
> 
> There are other cases where it would be useful to have some notion of
> this kind.  An example is that right now, if you have an expression index
> on an expensive function and a query that wants the value of that function,
> the planner is able to extract the value from the index --- but there is
> nothing that gives any cost benefit to doing so, so it's just as likely
> to choose some other index and eat the cost of recalculating the function.
> It seems like the only way to fix that in a principled fashion is to have
> some concept that the index-scan Path can produce the function value,
> and then when we come to some appropriate costing step, penalize the other
> paths for having to compute the value that's available for free from this
> one.
> 
> Rather than adding tlists per se to Paths, I've been vaguely toying with
> a notion of identifying all the "interesting" subexpressions in a query
> (expensive functions, aggregates, etc), giving them indexes 1..n, and then
> marking Paths with bitmapsets showing which interesting subexpressions
> they can produce values for.  This would make things like "does this Path
> compute all the needed aggregates" much cheaper to deal with than a raw
> tlist representation would do.  But maybe that's still not the best way.
>
Hmm, that seems to me a bit more complicated than a flat expression node.

> Another point is that a Path that computes aggregates is fundamentally
> different from a Path that doesn't, because it doesn't even produce the
> same number of rows.  So I'm not at all sure how to visualize the idea
> of a Path that computes only some aggregates, or whether it's even a
> sensible thing to worry about supporting.
>
I expect partial aggregation to be done per input stream, not per
aggregate function. In other words, once the planner determines that a
relation scan/join benefits from partial aggregation, all the aggregate
functions that consume the rows produced by that scan/join have to take
the partial aggregate / final aggregate form, don't they?
If so, the number of rows to be returned is associated with a Path.

For example, when we break down the query below using 2-phase aggregation,

  SELECT t1.cat, avg(t2.x) FROM t1 JOIN t2 ON t1.id_1 = t2.id_2 GROUP BY t1.cat;

the expected plan is as shown below, isn't it?

  FinalAgg (nrows=100)
    tlist: t1.cat, avg(nrows, sum(t2.x))
    grouping key: t1.cat
    -> HashJoin (nrows=1000)
         tlist: t1.cat, count(t2.x) nrows, sum(t2.x)
         -> PartialAgg (nrows=1000)
              tlist: count(t2.x) nrows, sum(t2.x), t2.id_2
              grouping key: t2.id_2
              -> SeqScan on t2 (nrows=1)
         -> Hash
              -> SeqScan on t1 (nrows=100)

It is clear that PartialAgg consumes 1 rows, then outputs 1000
rows because of partial reduction. All the partial aggregations on this
node will work in a coordinated manner.

Do you have another vision for the partial aggregation behavior?


> > One thing that seems like it might complicate things here is that a
> > lot of planner functions take PlannerInfo *root as an argument.  But
> > if we generate only paths in grouping_planner() and path-ify them
> > later, the subquery's root will not be available when we're trying to
> > do the Path -> Plan transformation.
> 
> Ah, you're wrong there, because we hang onto the subquery's root already
> (I forget why exactly, but see PlannerGlobal.subroots for SubPlans, and
> RelOptInfo.subroot for subquery-in-FROM).  So it would not be a
> fundamental problem to postpone create_plan() for a subquery.
> 
> > I think grouping_planner() is badly in need of some refactoring just
> > to make it shor

Re: [HACKERS] Time to get rid of PQnoPasswordSupplied?

2015-06-23 Thread Craig Ringer
On 22 June 2015 at 22:00, Tom Lane  wrote:

> I do not follow Craig's argument that this is somehow connected to the
> wire protocol version.

Upon revisiting it, neither do I. You know when you read code and
think "what idiot wrote this" ... then "git blame" says it was you? I,
at least, know that feeling... and that's what reading that part of
that email was like.

Tom's right, of course. It's libpq-to-client. The string "fe_sendauth:
no password supplied" never goes over the wire; it's a message libpq
generates in response to an auth failure ErrorResponse message from
the server. It's only relevant for libpq-based clients.

Let's change it. PgAdmin-III will need a patch, but that's about the
only client I found that would care.
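
For a libpq client, the cleaner test doesn't involve the string at all; a
minimal sketch (the conninfo is made up):

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn     *conn = PQconnectdb("host=localhost dbname=postgres");

    if (PQstatus(conn) == CONNECTION_BAD)
    {
        /* Ask libpq directly instead of matching PQnoPasswordSupplied. */
        if (PQconnectionNeedsPassword(conn))
            fprintf(stderr, "server requires a password; prompt and retry\n");
        else
            fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return 1;
    }

    PQfinish(conn);
    return 0;
}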

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpointer continuous flushing

2015-06-23 Thread Fabien COELHO



I'd also like to see concurrent workloads with synchronous_commit=off -
I've seen absolutely horrible latency behaviour for that, and I'm hoping
this will help. It's also a good way to simulate faster hardware than
you have.


It helps. I've done a few runs, where the very-very-bad situation is 
improved to... I would say very-bad:


medium3: scale=200 shared_buffers=4GB checkpoint_timeout=15min
max_wal_size=4GB warmup=1200 time=6000 clients=4
synchronous_commit=off

 flush sort |  tps | percent of seconds offline
 off   off  |  296 | 83% offline
 off   on   | 1496 | 33% offline
 off   on   | 1641 | 59% offline
 on    on   | 1515 | 31% offline

The offline figure is the percentage of seconds in the 6000-second run 
where 0.0 tps is reported, or where nothing is reported because pgbench 
is stuck.


It is somewhat better... on an abysmal scale: sorting and flushing reduced 
the offline time by a factor of 2.6. Too bad it is so high to begin with. 
The tps is improved by a factor of 5 with either option.


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers