Re: [HACKERS] pg_dump getBlobs query broken for 7.3 servers

2016-10-06 Thread Amit Langote
On 2016/10/07 11:47, Amit Langote wrote:
> Just noticed that the getBlobs() query does not work for a 7.3 server
> (maybe <= 7.3) due to the following change in commit 23f34fa4 [1]:
> 
>  else if (fout->remoteVersion >= 70100)
>  appendPQExpBufferStr(blobQry,
> - "SELECT DISTINCT loid, NULL::oid, NULL::oid"
> + "SELECT DISTINCT loid, NULL::oid, NULL, "
> + "NULL AS rlomacl, NULL AS initlomacl, "
> + "NULL AS initrlomacl "
>   " FROM pg_largeobject");
>  else
>  appendPQExpBufferStr(blobQry,
> - "SELECT oid, NULL::oid, NULL::oid"
> + "SELECT oid, NULL::oid, NULL, "
> + "NULL AS rlomacl, NULL AS initlomacl, "
> + "NULL AS initrlomacl "
>   " FROM pg_class WHERE relkind = 'l'");
> 
> The following error is reported by the server:
> 
> pg_dump: [archiver (db)] query failed: ERROR:  Unable to identify an
> ordering operator '<' for type '"unknown"'
>   Use an explicit ordering operator or modify the query
> pg_dump: [archiver (db)] query was: SELECT DISTINCT loid, NULL::oid, NULL,
> NULL AS rlomacl, NULL AS initlomacl, NULL AS initrlomacl  FROM pg_largeobject
> 
> I could fix that using the attached patch.

Forgot to mention that it needs to be fixed in both HEAD and 9.6.

Thanks,
Amit






Re: [HACKERS] pgbench vs. wait events

2016-10-06 Thread Alfred Perlstein

Robert,

This contention on WAL reminds me of another scenario I've heard about 
that was similar.


To fix things, what happened was that the first thread to block would 
become responsible for writing out the buffers of everyone blocked 
behind it.


So, for example, suppose you have several threads: A, B, C, and D.

While A is writing to WAL and holding the lock, B arrives and of 
course blocks; then C comes along and blocks as well, and then D.


Finally A finishes its write.

Now you have two options for resolving this:

1) A drops its lock, B picks up the lock... B writes its buffer and then 
drops the lock.  Then C gets the lock, writes its buffer, drops the 
lock, then finally D gets the lock, writes its buffer and then drops the 
lock.


2) A writes out B's, C's, and D's buffers as well, then A drops the lock; 
B, C, and D wake up, note that their respective buffers are already 
written, and just return.  This greatly speeds up the system.  (Just be 
careful to make sure A doesn't do "too much work", otherwise you can get 
a sort of livelock if too many threads are blocked behind it; generally, 
only issue one additional flush on behalf of other threads, and do not 
"loop until the queue is empty".)


I'm not sure if this is actually possible with the way WAL is 
implemented (or perhaps this strategy is already implemented), but if 
not, it's definitely worth looking into, as it can speed things up 
enormously; a rough sketch follows below.
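
[Editorial note: a minimal sketch of option 2 in generic pthreads C, not
PostgreSQL's WAL code; the queueing scheme and every name here are
hypothetical.]

/*
 * Hedged sketch of option (2): whichever thread finds no flush in progress
 * becomes the "leader" and flushes the buffers of everyone queued behind it
 * in a single extra pass, so the followers can simply return.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

typedef struct FlushRequest
{
    struct FlushRequest *next;
    const void *buf;            /* this thread's buffer */
    size_t      len;
    bool        done;           /* set by the leader once flushed */
} FlushRequest;

static pthread_mutex_t flush_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  flush_done = PTHREAD_COND_INITIALIZER;
static FlushRequest *queue_head = NULL;
static bool flush_in_progress = false;

/* Stand-in for the real durable write (e.g. write() followed by fsync()). */
static void
write_to_log(const void *buf, size_t len)
{
    fwrite(buf, 1, len, stderr);
}

void
flush_buffer(const void *buf, size_t len)
{
    FlushRequest me = { NULL, buf, len, false };

    pthread_mutex_lock(&flush_lock);
    me.next = queue_head;           /* join the queue of pending requests */
    queue_head = &me;

    while (!me.done)
    {
        if (flush_in_progress)
        {
            /* Someone else is the leader; sleep until it has flushed us. */
            pthread_cond_wait(&flush_done, &flush_lock);
            continue;
        }

        /*
         * We are the leader ("A" in the example above): take the queue as
         * one batch and flush it.  Only this one batch is taken -- we do
         * not "loop until the queue is empty" -- to avoid the livelock the
         * text warns about.
         */
        FlushRequest *batch = queue_head;

        queue_head = NULL;
        flush_in_progress = true;
        pthread_mutex_unlock(&flush_lock);

        for (FlushRequest *req = batch; req != NULL; req = req->next)
            write_to_log(req->buf, req->len);   /* includes our own buffer */

        pthread_mutex_lock(&flush_lock);
        for (FlushRequest *req = batch; req != NULL; req = req->next)
            req->done = true;
        flush_in_progress = false;
        pthread_cond_broadcast(&flush_done);
    }
    pthread_mutex_unlock(&flush_lock);
}

The only point of the sketch is that each buffer is written exactly once,
by whichever thread happens to be the leader, and that a leader never takes
more than the one batch that had accumulated when it started.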



On 10/6/16 11:38 AM, Robert Haas wrote:

Hi,

I decided to do some testing on hydra (IBM-provided community
resource, POWER, 16 cores/64 threads, kernel 3.2.6-3.fc16.ppc64) using
the newly-enhanced wait event stuff to try to get an idea of what
we're waiting for during pgbench.  I did 30-minute pgbench runs with
various configurations, but all had max_connections = 200,
shared_buffers = 8GB, maintenance_work_mem = 4GB, synchronous_commit =
off, checkpoint_timeout = 15min, checkpoint_completion_target = 0.9,
log_line_prefix = '%t [%p] ', max_wal_size = 40GB, log_checkpoints =
on.  During each run, I ran this psql script in another window and
captured the output:

\t
select wait_event_type, wait_event from pg_stat_activity where pid !=
pg_backend_pid()
\watch 0.5

Then, I used a little shell-scripting to count up the number of times
each wait event occurred in the output.  First, I tried scale factor
3000 with 32 clients and got these results:

   1  LWLockTranche   | buffer_mapping
   9  LWLockNamed | CLogControlLock
  14  LWLockNamed | ProcArrayLock
  16  Lock| tuple
  25  LWLockNamed | CheckpointerCommLock
  49  LWLockNamed | WALBufMappingLock
 122  LWLockTranche   | clog
 182  Lock| transactionid
 287  LWLockNamed | XidGenLock
1300  Client  | ClientRead
1375  LWLockTranche   | buffer_content
3990  Lock| extend
   21014  LWLockNamed | WALWriteLock
   28497  |
   58279  LWLockTranche   | wal_insert

tps = 1150.803133 (including connections establishing)

What jumps out here, at least to me, is that there is furious
contention on both the wal_insert locks and on WALWriteLock.
Apparently, the system simply can't get WAL on disk fast enough to
keep up with this workload.  Relation extension locks and
buffer_content locks are also pretty common, both ahead of
ClientRead, a relatively uncommon wait event on this test.  The load
average on the system was only about 3 during this test, indicating
that most processes are in fact spending most of their time off-CPU.
The first thing I tried was switching to unlogged tables, which
produces these results:

   1  BufferPin   | BufferPin
   1  LWLockTranche   | lock_manager
   2  LWLockTranche   | buffer_mapping
   8  LWLockNamed | ProcArrayLock
   9  LWLockNamed | CheckpointerCommLock
   9  LWLockNamed | CLogControlLock
  11  LWLockTranche   | buffer_content
  37  LWLockTranche   | clog
 153  Lock| tuple
 388  LWLockNamed | XidGenLock
 827  Lock| transactionid
1267  Client  | ClientRead
   20631  Lock| extend
   91767  |

tps = 1223.239416 (including connections establishing)

If you don't look at the TPS number, these results look like a vast
improvement.  The overall amount of time spent not waiting for
anything is now much higher, and the problematic locks have largely
disappeared from the picture.  However, the load average now shoots up
to about 30, because most of the time that the backends are "not
waiting for anything" they are in fact in kernel wait state D; that
is, they're stuck doing I/O.  This suggests that we might want to
consider advertising a wait state when a backend is doing I/O, so we
can measure this sort of thing.

Next, I tried lowering the scale factor to something that fits in
shared buffers.  Here are the results at scale factor 300:

  14  Lock| 

[HACKERS] pg_dump getBlobs query broken for 7.3 servers

2016-10-06 Thread Amit Langote
Just noticed that the getBlobs() query does not work for a 7.3 server
(maybe <= 7.3) due to the following change in commit 23f34fa4 [1]:

 else if (fout->remoteVersion >= 70100)
 appendPQExpBufferStr(blobQry,
- "SELECT DISTINCT loid, NULL::oid, NULL::oid"
+ "SELECT DISTINCT loid, NULL::oid, NULL, "
+ "NULL AS rlomacl, NULL AS initlomacl, "
+ "NULL AS initrlomacl "
  " FROM pg_largeobject");
 else
 appendPQExpBufferStr(blobQry,
- "SELECT oid, NULL::oid, NULL::oid"
+ "SELECT oid, NULL::oid, NULL, "
+ "NULL AS rlomacl, NULL AS initlomacl, "
+ "NULL AS initrlomacl "
  " FROM pg_class WHERE relkind = 'l'");

The following error is reported by the server:

pg_dump: [archiver (db)] query failed: ERROR:  Unable to identify an
ordering operator '<' for type '"unknown"'
Use an explicit ordering operator or modify the query
pg_dump: [archiver (db)] query was: SELECT DISTINCT loid, NULL::oid, NULL,
NULL AS rlomacl, NULL AS initlomacl, NULL AS initrlomacl  FROM pg_largeobject

I could fix that using the attached patch.

Thanks,
Amit

[1]
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=23f34fa4ba358671adab16773e79c17c92cbc870
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 299e887..a1165fa 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -2881,15 +2881,15 @@ getBlobs(Archive *fout)
 		  username_subquery);
 	else if (fout->remoteVersion >= 70100)
 		appendPQExpBufferStr(blobQry,
-			 "SELECT DISTINCT loid, NULL::oid, NULL, "
-			 "NULL AS rlomacl, NULL AS initlomacl, "
-			 "NULL AS initrlomacl "
+			 "SELECT DISTINCT loid, NULL::oid, NULL::oid, "
+			 "NULL::oid AS rlomacl, NULL::oid AS initlomacl, "
+			 "NULL::oid AS initrlomacl "
 			 " FROM pg_largeobject");
 	else
 		appendPQExpBufferStr(blobQry,
-			 "SELECT oid, NULL::oid, NULL, "
-			 "NULL AS rlomacl, NULL AS initlomacl, "
-			 "NULL AS initrlomacl "
+			 "SELECT oid, NULL::oid, NULL::oid, "
+			 "NULL::oid AS rlomacl, NULL::oid AS initlomacl, "
+			 "NULL::oid AS initrlomacl "
 			 " FROM pg_class WHERE relkind = 'l'");
 
 	res = ExecuteSqlQuery(fout, blobQry->data, PGRES_TUPLES_OK);



Re: [HACKERS] postgres_fdw : altering foreign table not invalidating prepare statement execution plan.

2016-10-06 Thread Etsuro Fujita

On 2016/10/07 10:26, Amit Langote wrote:

On 2016/10/06 21:55, Etsuro Fujita wrote:

On 2016/10/06 20:17, Amit Langote wrote:

On 2016/10/05 20:45, Etsuro Fujita wrote:



I noticed that we were wrong.  Your patch was modified so that
dependencies on FDW-related objects would be extracted from a given plan
in create_foreignscan_plan (by Ashutosh) and then in
set_foreignscan_references by me, but that wouldn't work well for INSERT
cases.  To fix that, I'd like to propose that we collect the dependencies
from the given rte in add_rte_to_flat_rtable, instead.



I see.  So, doing it from set_foreignscan_references() fails to capture
plan dependencies in case of INSERT because it won't be invoked at all
unlike the UPDATE/DELETE case.


Right.


If some writable FDW consulted foreign table/server/FDW options in
AddForeignUpdateTarget, which adds the extra junk columns for
UPDATE/DELETE to the targetList of the given query tree, the rewritten
query tree would also need to be invalidated.  But I don't think such an
FDW really exists because that routine in a practical FDW wouldn't change
such columns depending on those options.



I had second thoughts about that; since the possibility wouldn't be zero,
I added to extract_query_dependencies_walker the same code I added to
add_rte_to_flat_rtable.



And here, AddForeignUpdateTargets() could possibly utilize foreign
options, which would create *query tree* dependencies.  It's possible that
add_rte_to_flat_rtable is not called before an option change, so
invalidation of cached objects created based on the changed options
would be missed.  So, we must record dependencies from
extract_query_dependencies as well.


Right.


I think this (v4) patch is in the best shape so far.


Thanks for the review!

Best regards,
Etsuro Fujita






Re: [HACKERS] postgres_fdw : altering foreign table not invalidating prepare statement execution plan.

2016-10-06 Thread Amit Langote
On 2016/10/06 21:55, Etsuro Fujita wrote:
> On 2016/10/06 20:17, Amit Langote wrote:
>> On 2016/10/05 20:45, Etsuro Fujita wrote:
>>> On 2016/10/05 14:09, Ashutosh Bapat wrote:
 IMO, maintaining that extra function and
 the risk of bugs because of not keeping those two functions in sync
 outweigh the small not-so-frequent gain.
>>> The inefficiency wouldn't be negligible in the case where there are large
>>> numbers of cached plans.  So, I'd like to introduce a helper function that
>>> checks the dependency list for the generic plan, to eliminate most of the
>>> duplication.
> 
>> I too am inclined to have a PlanCacheForeignCallback().
>>
>> Just one minor comment:
> 
> Thanks for the comments!
> 
> I noticed that we were wrong.  Your patch was modified so that
> dependencies on FDW-related objects would be extracted from a given plan
> in create_foreignscan_plan (by Ashutosh) and then in
> set_foreignscan_references by me, but that wouldn't work well for INSERT
> cases.  To fix that, I'd like to propose that we collect the dependencies
> from the given rte in add_rte_to_flat_rtable, instead.

I see.  So, doing it from set_foreignscan_references() fails to capture
plan dependencies in case of INSERT because it won't be invoked at all
unlike the UPDATE/DELETE case.

> Attached is an updated version, in which I removed the
> PlanCacheForeignCallback and adjusted regression tests a bit.
> 
>>> If some writable FDW consulted foreign table/server/FDW options in
>>> AddForeignUpdateTarget, which adds the extra junk columns for
>>> UPDATE/DELETE to the targetList of the given query tree, the rewritten
>>> query tree would also need to be invalidated.  But I don't think such an
>>> FDW really exists because that routine in a practical FDW wouldn't change
>>> such columns depending on those options.
> 
> I had second thoughts about that; since the possibility wouldn't be zero,
> I added to extract_query_dependencies_walker the same code I added to
> add_rte_to_flat_rtable.

And here, AddForeignUpdateTargets() could possibly utilize foreign
options, which would create *query tree* dependencies.  It's possible that
add_rte_to_flat_rtable is not called before an option change, so
invalidation of cached objects created based on the changed options
would be missed.  So, we must record dependencies from
extract_query_dependencies as well.

> After all, the updated patch is much like your version, but I think your
> patch, which modified extract_query_dependencies_walker only, is not
> enough because a generic plan can have more dependencies than its query
> tree (eg, consider inheritance).

Agreed. I didn't know at the time that extract_query_dependencies is only
able to capture dependencies using the RTEs in the *rewritten* query tree;
it wouldn't have gone through the planner at that point.

I think this (v4) patch is in the best shape so far.

Thanks,
Amit






Re: [HACKERS] VACUUM's ancillary tasks

2016-10-06 Thread Jeff Janes
On Thu, Oct 6, 2016 at 2:46 PM, Robert Haas  wrote:

>
> > Also, I think that doing more counts which get amalgamated into the same
> > threshold, rather than introducing another parameter, is a bad thing.  I
> > have insert-mostly, most of the time, tables which are never going to
> > benefit from index-only-scans, and I don't want to pay the cost of them
> > getting vacuumed just because of some inserts, when I will never get a
> > benefit out of it.  I could turn autovacuum off for those tables, but then I
> > would have to remember to manually intervene on the rare occasion they do
> > get updates or deletes.  I want to be able to manually pick which tables I
> > sign up for this feature (and then forget it), not manually pick which ones
> > to exempt from it and have to constantly review that.
>
> Of course, if you do that, then what will happen is eventually it will
> be time to advance relfrozenxid for that relation, and you'll get a
> giant soul-crushing VACUUM of doom, rather than a bunch of small,
> hopefully-innocuous VACUUMs.


I think I will get the soul-crushing vacuum of doom anyway, if the database
lasts that long.  For one reason, in my case while updates and deletes are
rare, they are common enough that the frozen vm bits from early vacuums
will be mostly cleared before the vacuum of doom comes around.

For a second reason, I don't think the frozen bit in the VM actually gets
set by much of anything other than an autovacuum-for-wraparound or a manual
VACUUM FREEZE.

In commit 37484ad2aacef5ec7, you changed the way that frozen tuples were
represented, so that we could make freezing more aggressive without losing
forensic evidence.  But I don't think we ever did anything to actually make
the freezing more aggressive.

When I applied the up-thread patch so that pgbench_history gets
autovacuumed, those autovacuums don't actually cause any pages to get
frozen until the wraparound kicks in, even when all the tuples on the
early pages should be well beyond vacuum_freeze_min_age.

Cheers,

Jeff


Re: [HACKERS] pgbench vs. wait events

2016-10-06 Thread Michael Paquier
On Fri, Oct 7, 2016 at 3:38 AM, Robert Haas  wrote:
> I decided to do some testing on hydra (IBM-provided community
> resource, POWER, 16 cores/64 threads, kernel 3.2.6-3.fc16.ppc64) using
> the newly-enhanced wait event stuff to try to get an idea of what
> we're waiting for during pgbench.  I did 30-minute pgbench runs with
> various configurations, but all had max_connections = 200,
> shared_buffers = 8GB, maintenance_work_mem = 4GB, synchronous_commit =
> off, checkpoint_timeout = 15min, checkpoint_completion_target = 0.9,
> log_line_prefix = '%t [%p] ', max_wal_size = 40GB, log_checkpoints =
> on.  During each run, I ran this psql script in another window and
> captured the output:

Nice. Thanks for sharing.

> \t
> select wait_event_type, wait_event from pg_stat_activity where pid !=
> pg_backend_pid()
> \watch 0.5
>
> Then, I used a little shell-scripting to count up the number of times
> each wait event occurred in the output.

Or an INSERT ... SELECT into an unlogged table? No need for extra maths then.

> First, I tried scale factor
> 3000 with 32 clients and got these results:
>
>   1  LWLockTranche   | buffer_mapping
> [...]
>   21014  LWLockNamed | WALWriteLock
>   28497  |
>   58279  LWLockTranche   | wal_insert
>
> tps = 1150.803133 (including connections establishing)
>
> What jumps out here, at least to me, is that there is furious
> contention on both the wal_insert locks and on WALWriteLock.
> Apparently, the system simply can't get WAL on disk fast enough to
> keep up with this workload.  Relation extension locks and
> buffer_content locks are also pretty common, both ahead of
> ClientRead, a relatively uncommon wait event on this test.  The load
> average on the system was only about 3 during this test, indicating
> that most processes are in fact spending most of their time off-CPU.

Would increasing the number of WAL insert slots help? With your tests
this is at 8 all the time.

> The first thing I tried was switching to unlogged tables, which
> produces these results:
>
>   1  BufferPin   | BufferPin
>   1  LWLockTranche   | lock_manager
>   2  LWLockTranche   | buffer_mapping
>   8  LWLockNamed | ProcArrayLock
>   9  LWLockNamed | CheckpointerCommLock
>   9  LWLockNamed | CLogControlLock
>  11  LWLockTranche   | buffer_content
>  37  LWLockTranche   | clog
> 153  Lock| tuple
> 388  LWLockNamed | XidGenLock
> 827  Lock| transactionid
>1267  Client  | ClientRead
>   20631  Lock| extend
>   91767  |
>
> tps = 1223.239416 (including connections establishing)
>
> If you don't look at the TPS number, these results look like a vast
> improvement.  The overall amount of time spent not waiting for
> anything is now much higher, and the problematic locks have largely
> disappeared from the picture.  However, the load average now shoots up
> to about 30, because most of the time that the backends are "not
> waiting for anything" they are in fact in kernel wait state D; that
> is, they're stuck doing I/O.  This suggests that we might want to
> consider advertising a wait state when a backend is doing I/O, so we
> can measure this sort of thing.

Maybe something in fd.c.  It may be a good idea to actually have a
look at perf output on Linux to see where this contention is
happening, use that to decide where to put a wait point,
and then see if other OSes share a similar pattern.
-- 
Michael




Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-10-06 Thread Michael Paquier
On Fri, Oct 7, 2016 at 8:05 AM, Peter Geoghegan  wrote:
> Your customer
> databases might feature far more use of Japanese collations, for
> example, which might be an important factor.

Not really :)
-- 
Michael




Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-10-06 Thread Peter Geoghegan
On Sun, Oct 2, 2016 at 6:48 PM, Michael Paquier
 wrote:
> Okay, moved to next CF. I may look at it; I finally got some use-cases
> for it, similar to yours, I guess.

Let me know how that goes.

One thing I've definitely noticed is that the tool is good at finding
corner-case bugs, so even if you can only test a small fraction of the
number of databases that I've been able to test, there could still be
significant value in your performing your own exercise. Your customer
databases might feature far more use of Japanese collations, for
example, which might be an important factor. (Well, the use of Arabic
code points turned out to be an important factor in the question of
whether or not en_US.utf8 could have been affected by the Glibc +
abbreviated keys bug.)

-- 
Peter Geoghegan




Re: [HACKERS] WIP: About CMake v2

2016-10-06 Thread Yury Zhuravlev

Stas Kelvich wrote:
>> On 17 Sep 2016, at 20:21, Yury Zhuravlev  wrote:
>>> Michael Paquier wrote: ...
>
> Tried to generate Xcode project out of cmake, build fails on
> genbki.pl: can't locate Catalog.pm (which itself lives in
> src/backend/catalog/Catalog.pm)

Can you try again? On my Xcode 7.3 / El Capitan everything is working
correctly.

--
Yury Zhuravlev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [HACKERS] pgbench vs. wait events

2016-10-06 Thread Robert Haas
On Thu, Oct 6, 2016 at 4:40 PM, Jeff Janes  wrote:
> Scale factor 3000 obviously doesn't fit in shared_buffers.  But does it fit
> in RAM?  That is, are the backends doing real IO, or are they just doing
> fake IO to the kernel's fs cache?

That's a good question.

[rhaas@hydra ~]$ free -g
             total       used       free     shared    buffers     cached
Mem:            61         26         34          0          0         24
-/+ buffers/cache:          2         58
Swap:           19          4         15

rhaas=# select pg_size_pretty(pg_relation_size('pgbench_accounts'));
 pg_size_pretty
----------------
 38 GB
(1 row)

rhaas=# select pg_size_pretty(pg_database_size(current_database()));
 pg_size_pretty
----------------
 44 GB
(1 row)

That's pretty tight, especially since I now notice Andres also left a
postmaster running on this machine back in April, with
shared_buffers=8GB.  44GB for this database plus 8GB for
shared_buffers plus 8GB for the other postmaster's shared_buffers
leaves basically no slack, so it was probably doing quite a bit of
real I/O, especially after the database got a bit of bloat.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] VACUUM's ancillary tasks

2016-10-06 Thread Robert Haas
On Thu, Oct 6, 2016 at 3:56 PM, Jeff Janes  wrote:
>> Sure, I could handle each case separately, but the goal of this patch
>> (as hinted at in the Subject) is to generalize all the different tasks
>> we've been giving to VACUUM.  The only missing piece is what the first
>> patch addresses; which is insert-mostly tables never getting vacuumed.
>
> I don't buy the argument that we should do this in order to be "general".
> Those other pieces are present to achieve a specific job, not out of
> generality.

+1.

> If we want to have something to vacuum insert-mostly tables for the sake of
> the index-only-scans, then I don't think we can ignore the fact that the
> visibility map is page based, not tuple based.  If we insert 10,000 tuples
> into a full table and they all go into 100 pages at the end, that is very
> different than inserting 10,000 tuples which each go into a separate page
> (because each page has that amount of freespace).  In one case you have
> 10,000 tuples not marked as all visible, in the other case you have
> 1,000,000 not marked as all visible.

+1.

> Also, I think that doing more counts which get amalgamated into the same
> threshold, rather than introducing another parameter, is a bad thing.  I
> have insert-mostly, most of the time, tables which are never going to
> benefit from index-only-scans, and I don't want to pay the cost of them
> getting vacuumed just because of some inserts, when I will never get a
> benefit out of it.  I could turn autovacuum off for those tables, but then I
> would have to remember to manually intervene on the rare occasion they do
> get updates or deletes.  I want to be able to manually pick which tables I
> sign up for this feature (and then forget it), not manually pick which ones
> to exempt from it and have to constantly review that.

Of course, if you do that, then what will happen is eventually it will
be time to advance relfrozenxid for that relation, and you'll get a
giant soul-crushing VACUUM of doom, rather than a bunch of small,
hopefully-innocuous VACUUMs.  I've been wondering if it would make sense
to trigger insert-driven vacuums based on a *fixed* number of
pages, rather than a percentage of the table.  Say, for example,
whenever we have 64MB of not-all-visible pages, we VACUUM.

There are some difficulties:

1. We don't want to vacuum too early.  For example, a bulk load may
vastly inflate the size of a table, but vacuuming it while the load is
in progress will be useless.  You can maybe avoid that problem by
basing this on statistics only reported at the end of the transaction,
but there's another, closely-related issue: vacuuming right after the
transaction completes may be useless, too, if there are old snapshots
still in existence that render the newly-inserted tuples
not-all-visible.

2. We don't want to trigger index vacuuming for a handful of dead
tuples, or at least not too often.  I've previously suggested
requiring a certain minimum number of dead index tuples that would be
required before index vacuuming occurs; prior to that, we'd just
truncate to dead line pointers and leave those for the next vacuum
cycle.  This might need some refinement - should it be page-based? -
but the basic idea still seems sound.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Switch to unnamed POSIX semaphores as our preferred sema code?

2016-10-06 Thread Tom Lane
Robert Haas  writes:
> Alternatively, get a bigger box.  :-)

So what's it take to get access to hydra?

regards, tom lane




Re: [HACKERS] autonomous transactions

2016-10-06 Thread Simon Riggs
On 6 October 2016 at 21:27, Robert Haas  wrote:

>> * The labelling "Autonomous Transaction" is a simple coat of paint,
>> which can easily be transferred to a better implementation if one
>> comes. If one doesn't then its better to have something than nothing.
>> So I suggest we commit Background Transactions first and then in a
>> fairly thin commit, implement Autonomous Transactions on top of it for
>> now and if we get a better one, switch it over.
>
> I think we should implement background transactions and call them
> background transactions.  That allows us to expose additional
> functionality which is useful, like the ability to kick something off
> and check back later for the results.  There's no reason to call it
> background transactions and also call it autonomous transactions: one
> feature doesn't need two names.

For myself, I don't care what you call it.

I just want to be able to invoke it by saying PRAGMA AUTONOMOUS_TRANSACTION;
and I know many others do also.
If a better implementation emerges I would happily replace this one
with another.

I'm happy to also invoke it via an alternate mechanism or API, so that
it can continue to be used even if the above mechanism changes.

We have no need to wait for the perfect solution, even assuming we
would ever agree that just one exists.

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] pgbench vs. wait events

2016-10-06 Thread Jeff Janes
On Thu, Oct 6, 2016 at 11:38 AM, Robert Haas  wrote:

> Hi,
>
> I decided to do some testing on hydra (IBM-provided community
> resource, POWER, 16 cores/64 threads, kernel 3.2.6-3.fc16.ppc64) using
> the newly-enhanced wait event stuff to try to get an idea of what
> we're waiting for during pgbench.  I did 30-minute pgbench runs with
> various configurations, but all had max_connections = 200,
> shared_buffers = 8GB, maintenance_work_mem = 4GB, synchronous_commit =
> off, checkpoint_timeout = 15min, checkpoint_completion_target = 0.9,
> log_line_prefix = '%t [%p] ', max_wal_size = 40GB, log_checkpoints =
> on.  During each run, I ran this psql script in another window and
> captured the output:
>
> \t
> select wait_event_type, wait_event from pg_stat_activity where pid !=
> pg_backend_pid()
> \watch 0.5
>
> Then, I used a little shell-scripting to count up the number of times
> each wait event occurred in the output.  First, I tried scale factor
> 3000 with 32 clients and got these results:
>

Scale factor 3000 obviously doesn't fit in shared_buffers.  But does it fit
in RAM?  That is, are the backends doing real IO, or are they just doing
fake IO to the kernel's fs cache?

Cheers,

Jeff


Re: [HACKERS] autonomous transactions

2016-10-06 Thread Robert Haas
On Thu, Oct 6, 2016 at 5:56 AM, Simon Riggs  wrote:
> Just to point out that I've actually written this approach already.
> The patch is available, Autonomous Subtransactions.
> We discussed it in Ottawa and it was rejected. (I thought Robert was
> there, but Serge and Tom definitely were).

Where is the patch?

> See other posts in this thread by Serge and Craig to explain why.

I don't think the posts by Craig and Serge explain why that approach
was rejected or would be a bad idea.

> We have various approaches... summarised in chronological order of
> their suggestion
>
> 1. Use additional PGXACTs - rejected because it wouldn't provide enough room

Of course, a background worker uses a PGXACT too and a lot more, so if
you think extra PGXACTs are bad, you should *really* think background
workers are bad.

> 2. Use Autonomous SubTransactions - rejected because the semantics are
> different to what we might expect from ATs

In what way?  I think the questions of how you implement it and what
the semantics are are largely orthogonal questions.  To which proposal
is this referring?

> 3. Use background transactions (this thread)

Sure.

> 4. Use pause and resume so we don't use up too many PGXACTs

I don't know what "pause and resume" means.

> * The labelling "Autonomous Transaction" is a simple coat of paint,
> which can easily be transferred to a better implementation if one
> comes. If one doesn't then its better to have something than nothing.
> So I suggest we commit Background Transactions first and then in a
> fairly thin commit, implement Autonomous Transactions on top of it for
> now and if we get a better one, switch it over.

I think we should implement background transactions and call them
background transactions.  That allows us to expose additional
functionality which is useful, like the ability to kick something off
and check back later for the results.  There's no reason to call it
background transactions and also call it autonomous transactions: one
feature doesn't need two names.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] VACUUM's ancillary tasks

2016-10-06 Thread Jeff Janes
On Sat, Oct 1, 2016 at 1:34 PM, Vik Fearing  wrote:

>
> Sure, I could handle each case separately, but the goal of this patch
> (as hinted at in the Subject) is to generalize all the different tasks
> we've been giving to VACUUM.  The only missing piece is what the first
> patch addresses; which is insert-mostly tables never getting vacuumed.
>

I don't buy the argument that we should do this in order to be "general".
Those other pieces are present to achieve a specific job, not out of
generality.

If we want to have something to vacuum insert-mostly tables for the sake of
the index-only-scans, then I don't think we can ignore the fact that the
visibility map is page based, not tuple based.  If we insert 10,000 tuples
into a full table and they all go into 100 pages at the end, that is very
different than inserting 10,000 tuples which each go into a separate page
(because each page has that amount of freespace).  In one case you have
10,000 tuples not marked as all visible, in the other case you have
1,000,000 not marked as all visible.

Also, I think that doing more counts which get amalgamated into the same
threshold, rather than introducing another parameter, is a bad thing.  I
have insert-mostly, most of the time, tables which are never going to
benefit from index-only-scans, and I don't want to pay the cost of them
getting vacuumed just because of some inserts, when I will never get a
benefit out of it.  I could turn autovacuum off for those tables, but then
I would have to remember to manually intervene on the rare occasion they do
get updates or deletes.  I want to be able to manually pick which tables I
sign up for this feature (and then forget it), not manually pick which ones
to exempt from it and have to constantly review that.

Cheers,

Jeff


Re: [HACKERS] PostgreSQL - Weak DH group

2016-10-06 Thread Christoph Berg
Re: Heikki Linnakangas 2016-10-06 
> I propose the attached patch. It gives up on trying to deal with multiple
> key lengths (as noted earlier, OpenSSL just always passed keylength=1024, so
> that was useless). Instead of using the callback, it just sets fixed DH
> parameters with SSL_CTX_set_tmp_dh(), like we do for the ECDH curve. The DH
> parameters are loaded from a file called "dh_params.pem" (instead of
> "dh1024.pem"), if present, otherwise the built-in 2048 bit parameters are
> used.
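
[Editorial note: for illustration only, the OpenSSL side of loading
parameters from such a file might look roughly like the following hedged
sketch; it is not the actual patch, and the function name is made up.]

#include <stdio.h>
#include <openssl/dh.h>
#include <openssl/pem.h>
#include <openssl/ssl.h>

/*
 * Hedged sketch: try to load DH parameters from a PEM file and install them
 * on the SSL context with SSL_CTX_set_tmp_dh(); return 0 so the caller can
 * fall back to the built-in 2048-bit parameters.
 */
static int
load_dh_params_file(SSL_CTX *ctx, const char *path)
{
    FILE   *fp;
    DH     *dh;
    int     ok;

    if ((fp = fopen(path, "r")) == NULL)
        return 0;               /* no file: use built-in parameters */
    dh = PEM_read_DHparams(fp, NULL, NULL, NULL);
    fclose(fp);
    if (dh == NULL)
        return 0;               /* unparsable file: use built-in parameters */
    ok = (SSL_CTX_set_tmp_dh(ctx, dh) == 1);
    DH_free(dh);                /* SSL_CTX_set_tmp_dh() keeps its own copy */
    return ok;
}

Called as load_dh_params_file(context, "dh_params.pem") during SSL setup,
with the built-in parameters used whenever it returns 0.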

Shouldn't this be a GUC pointing to a configurable location like
ssl_cert_file? This way, people reading the security section of the
default postgresql.conf would notice that there's something (new) to
configure. (And I wouldn't want to start creating symlinks from PGDATA
to /etc/ssl/something again...)

Thanks,
Christoph




Re: [HACKERS] Switch to unnamed POSIX semaphores as our preferred sema code?

2016-10-06 Thread Robert Haas
On Thu, Oct 6, 2016 at 9:46 AM, Tom Lane  wrote:
> Can anyone think of a test case that would stress semaphore operations
> more heavily, without being unrealistic?

I think it's going to be pretty hard to come up with a non-artificial
test case that exhibits meaningful lwlock contention on an 8-core
system.  If you go back to 9.1, before we had fast-path locking, you
can do it, because the relation locks and vxid locks do cause
noticeable contention on the lock manager locks in that version.
Alternatively, you might try something like "pgbench -n -S -c $N -j
$N" with a scale factor that doesn't fit in shared buffers.  This
probably won't produce significant contention because there are 128
LWLocks and only 8 cores, but you could reduce the number of buffer
mapping LWLocks to, say, 4 and then you'd probably hit it fairly hard.

Alternatively, get a bigger box.  :-)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] pgbench vs. wait events

2016-10-06 Thread Robert Haas
Hi,

I decided to do some testing on hydra (IBM-provided community
resource, POWER, 16 cores/64 threads, kernel 3.2.6-3.fc16.ppc64) using
the newly-enhanced wait event stuff to try to get an idea of what
we're waiting for during pgbench.  I did 30-minute pgbench runs with
various configurations, but all had max_connections = 200,
shared_buffers = 8GB, maintenance_work_mem = 4GB, synchronous_commit =
off, checkpoint_timeout = 15min, checkpoint_completion_target = 0.9,
log_line_prefix = '%t [%p] ', max_wal_size = 40GB, log_checkpoints =
on.  During each run, I ran this psql script in another window and
captured the output:

\t
select wait_event_type, wait_event from pg_stat_activity where pid !=
pg_backend_pid()
\watch 0.5

Then, I used a little shell-scripting to count up the number of times
each wait event occurred in the output.  First, I tried scale factor
3000 with 32 clients and got these results:

  1  LWLockTranche   | buffer_mapping
  9  LWLockNamed | CLogControlLock
 14  LWLockNamed | ProcArrayLock
 16  Lock| tuple
 25  LWLockNamed | CheckpointerCommLock
 49  LWLockNamed | WALBufMappingLock
122  LWLockTranche   | clog
182  Lock| transactionid
287  LWLockNamed | XidGenLock
   1300  Client  | ClientRead
   1375  LWLockTranche   | buffer_content
   3990  Lock| extend
  21014  LWLockNamed | WALWriteLock
  28497  |
  58279  LWLockTranche   | wal_insert

tps = 1150.803133 (including connections establishing)

What jumps out here, at least to me, is that there is furious
contention on both the wal_insert locks and on WALWriteLock.
Apparently, the system simply can't get WAL on disk fast enough to
keep up with this workload.  Relation extension locks and
buffer_content locks are also pretty common, both ahead of
ClientRead, a relatively uncommon wait event on this test.  The load
average on the system was only about 3 during this test, indicating
that most processes are in fact spending most of their time off-CPU.
The first thing I tried was switching to unlogged tables, which
produces these results:

  1  BufferPin   | BufferPin
  1  LWLockTranche   | lock_manager
  2  LWLockTranche   | buffer_mapping
  8  LWLockNamed | ProcArrayLock
  9  LWLockNamed | CheckpointerCommLock
  9  LWLockNamed | CLogControlLock
 11  LWLockTranche   | buffer_content
 37  LWLockTranche   | clog
153  Lock| tuple
388  LWLockNamed | XidGenLock
827  Lock| transactionid
   1267  Client  | ClientRead
  20631  Lock| extend
  91767  |

tps = 1223.239416 (including connections establishing)

If you don't look at the TPS number, these results look like a vast
improvement.  The overall amount of time spent not waiting for
anything is now much higher, and the problematic locks have largely
disappeared from the picture.  However, the load average now shoots up
to about 30, because most of the time that the backends are "not
waiting for anything" they are in fact in kernel wait state D; that
is, they're stuck doing I/O.  This suggests that we might want to
consider advertising a wait state when a backend is doing I/O, so we
can measure this sort of thing.

Next, I tried lowering the scale factor to something that fits in
shared buffers.  Here are the results at scale factor 300:

 14  Lock| tuple
 22  LWLockTranche   | lock_manager
 39  LWLockNamed | WALBufMappingLock
331  LWLockNamed | CLogControlLock
461  LWLockNamed | ProcArrayLock
536  Lock| transactionid
552  Lock| extend
716  LWLockTranche   | buffer_content
763  LWLockNamed | XidGenLock
   2113  LWLockNamed | WALWriteLock
   6190  LWLockTranche   | wal_insert
  25002  Client  | ClientRead
  78466  |

tps = 27651.562835 (including connections establishing)

Obviously, there's a vast increase in TPS, and the backends seem to
spend most of their time actually doing work.  ClientRead is now the
overwhelmingly dominant wait event, although wal_insert and
WALWriteLock contention is clearly still a significant problem.
Contention on other locks is apparently quite rare.  Notice that
client reads are really significant here - more than 20% of the time
we sample what a backend is doing, it's waiting for the next query.
It seems unlikely that performance on this workload can be improved
very much by optimizing anything other than WAL writing, because no
other wait event appears in more than 1% of the samples.  It's not
clear how much of the WAL-related stuff is artificial lock contention
and how much of it is the limited speed at which the disk can rotate.

However, I was curious about what's going on with CLogControlLock,
persuant to the other thread on that topic, so I reran this with
unlogged tables.  At scale factor 300, 32 

[HACKERS] FSM corruption leading to errors

2016-10-06 Thread Pavan Deolasee
I investigated a bug report from one of our customers and it looked very
similar to previous bug reports here [1], [2], [3] (and probably more). In
these reports, the error looks something like this:

ERROR:  could not read block 28991 in file "base/16390/572026": read only 0
of 8192 bytes

I traced it to the following code in MarkBufferDirtyHint(). The function
returns without setting the DIRTY bit on the standby:

3413     /*
3414      * If we're in recovery we cannot dirty a page because of a hint.
3415      * We can set the hint, just not dirty the page as a result so the
3416      * hint is lost when we evict the page or shutdown.
3417      *
3418      * See src/backend/storage/page/README for longer discussion.
3419      */
3420     if (RecoveryInProgress())
3421         return;
3422

freespace.c freely uses MarkBufferDirtyHint() whenever changes are made to
the FSM. I think that's usually alright because FSM changes are not WAL
logged and if FSM ever returns a block with less free space than the caller
needs, the caller is usually prepared to update the FSM and request a
new block. But if it returns a block that is beyond the end of the
relation, then we're in trouble. The very next ReadBuffer() fails to handle
such a block and throws the error shown above.

When a relation is truncated, the FSM is truncated too to remove references
to the heap blocks that are being truncated. But since the FSM buffer may
not be marked DIRTY on the standby, if the buffer gets evicted from the
buffer cache, the on-disk copy of the FSM page may be left with references
to the truncated heap pages. When the standby is later promoted to be the
master, and an insert/update is attempted to the table, the FSM may return
a block that is outside the valid range of the relation. That results in
the said error.

Once this was clear, it was easy to put together a fully reproducible test
case. See the attached script; you'll need to adjust to your environment.
This affects all releases starting 9.3 and the script can reproduce the
problem on all these releases.

I believe the fix is very simple. The FSM change during truncation is
critical and the buffer must be marked with MarkBufferDirty(), i.e. those
changes must make it to disk. I think it's alright not to WAL-log them
because XLOG_SMGR_TRUNCATE will redo() them if a crash occurs. But they must
not be lost across a checkpoint. Also, since this happens only during
relation truncation, I don't see any problem from a performance perspective.

What bothers me is how to fix the problem for already affected standbys. If
the FSM for some table is already corrupted at the standby, users won't
notice it until the standby is promoted to be the new master. If the
standby starts throwing errors suddenly after failover, it will be a very
bad situation for the users, like we noticed with our customers. The fix is
simple and users can just delete the FSM (and VACUUM the table), but that
doesn't sound nice and they would not know until they see the problem.

One idea is to always check if the block returned by the FSM is outside the
range and discard such blocks after setting the FSM (attached patch does
that). The problem with that approach is that RelationGetNumberOfBlocks()
is not really cheap, and invoking it every time the FSM is consulted may not
be a bright idea. Can we cache that value in the RelationData or some such place
(BulkInsertState?) and use that as a hint for quickly checking if the block
is (potentially) outside the range and discard it? Any other ideas?
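
[Editorial note: for discussion, a rough sketch of that guard follows; it is
not the attached patch, the wrapper name is made up, and whether/where to
cache the relation size is exactly the open question above.]

#include "postgres.h"

#include "storage/bufmgr.h"
#include "storage/freespace.h"
#include "utils/rel.h"

/*
 * Hedged sketch: ask the FSM for a page, but discard (and zero out in the
 * FSM) any answer that lies beyond the relation's current end, as can
 * happen with a stale FSM inherited by a promoted standby.  Calling
 * RelationGetNumberOfBlocks() on every request is the expensive part
 * being discussed above.
 */
static BlockNumber
GetCheckedPageWithFreeSpace(Relation rel, Size spaceNeeded)
{
    BlockNumber targetBlock = GetPageWithFreeSpace(rel, spaceNeeded);

    if (targetBlock != InvalidBlockNumber &&
        targetBlock >= RelationGetNumberOfBlocks(rel))
    {
        /* Stale FSM entry: forget it so we don't keep getting it back. */
        RecordPageWithFreeSpace(rel, targetBlock, 0);
        targetBlock = InvalidBlockNumber;
    }

    return targetBlock;
}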

The other concern I have, and TBH that's what I initially thought was the
real problem until I saw the RecoveryInProgress()-specific code, is: can
this also affect stand-alone masters? The comments at MarkBufferDirtyHint()
made me think so:

3358  * 3. This function does not guarantee that the buffer is always marked dirty
3359  *    (due to a race condition), so it cannot be used for important changes.

So I was working with a theory that somehow updates to the FSM page are
lost because the race mentioned in the comment actually kicks in. But I'm
not sure if the race is only possible when the caller is holding a SHARE
lock on the buffer. When the FSM is truncated, the caller holds an
EXCLUSIVE lock on the FSM buffer. So probably we're safe. I could not
reproduce the issue on a stand-alone master. But probably worth checking.

It might also be a good idea to inspect other callers of
MarkBufferDirtyHint() and see if any of them is vulnerable, especially from
standby perspective. I did one round, and couldn't see another problem.

Thanks,
Pavan

[1]
https://www.postgresql.org/message-id/CAJakt-8%3DaXa-F7uFeLAeSYhQ4wFuaX3%2BytDuDj9c8Gx6S_ou%3Dw%40mail.gmail.com
[2]
https://www.postgresql.org/message-id/20160601134819.30392.85...@wrigleys.postgresql.org
[3]
https://www.postgresql.org/message-id/AMSPR06MB504CD8FE8AA30D4B7C958AAE39E0%40AMSPR06MB504.eurprd06.prod.outlook.com


-- 
 

Re: [HACKERS] Fast AT ADD COLUMN with DEFAULTs

2016-10-06 Thread Vitaly Burovoy
On 10/6/16, Serge Rielau  wrote:
>> On Oct 6, 2016, at 9:20 AM, Tom Lane  wrote:
>> Vitaly Burovoy  writes:
>>> But what I discover for myself is that we have pg_attrdef separately
>>> from the pg_attribute. Why?
>>
>> The core reason for that is that the default expression needs to be
>> a separate object from the column for purposes of dependency analysis.
>> For example, if you have a column whose default is "foo()", then the
>> default expression depends on the function foo(), but the column should
>> not: if you drop the function, only the default expression ought to
>> be dropped, not the column.
>>
>> Because of this, the default expression needs to have its own OID
>> (to be stored in pg_depend) and it's convenient to store it in a
>> separate catalog so that the classoid can identify it as being a
>> default expression rather than some other kind of object.
>
> Good to know.
>
>> If we were going to allow these missing_values or creation_defaults
>> or whatever they're called to be general expressions, then they would
>> need
>> to have their own OIDs for dependency purposes.  That would lead me to
>> think that the best representation is to put them in their own rows in
>> pg_attrdef, perhaps adding a boolean column to pg_attrdef to distinguish
>> regular defaults from these things.  Or maybe they even need their own
>> catalog, depending on whether you think dependency analysis would want
>> to distinguish them from regular defaults using just the classoid.
>>
>> Now, as I just pointed out in another mail, realistically we're probably
>> going to restrict the feature to simple constants, which'd mean they will
>> depend only on the column's type and can never need any dependencies of
>> their own.  So we could take the shortcut of just storing them in a new
>> column in pg_attribute.

I agree with you.

>> But maybe that's shortsighted and we'll
>> eventually wish we'd done them as full-fledged separate objects.

I don't think so. If we try to implement non-blocking adding columns
with volatile defaults (and for instance update old rows in the
background), we can end up with the following situation:

CREATE TABLE a(i bigint PRIMARY KEY);
INSERT INTO a SELECT generate_series(1,100);

ALTER TABLE a ADD COLUMN b bigserial CHECK (b BETWEEN 1 AND 100);

For indexes (even unique ones) created concurrently, similar troubles are
solved with a "not valid" mark, but what do we do with the heap if we try
to do it in the background?

>> But on the third hand ... once one of these is in place, how could you
>> drop it separately from the column?  That would amount to a change in the
>> column's stored data, which is not what one would expect from dropping
>> a separate object.  So maybe it's senseless to think that these things
>> could ever be distinct objects.  But that definitely leads to the
>> conclusion that they're constants and nothing else.

> I cannot follow this reasoning.
> Let’s look past what PG does today:
> For each row (whether that’s necessary or not) we evaluate the expression,
> compute the value and
> store it in the rewritten table.
> We do not record dependencies on the “pedigree” of the value.
> It happened to originate from the DEFAULT expression provided with the ADD
> COLUMN,
> but that is not remembered anywhere.
> All we remember is the value - in each row.
> So the only change that is proposed here - when it comes right down to it -
> is to remember the value once only (IFF it is provably the same for each row)
> and thus avoid the need to rewrite the table.
> So I see no reason to impose any restriction other than “evaluated value is
> provably the same for every row”.

Tom says the same thing. The expression should end up as a value
if that allows us to avoid rewriting the table.

> Regarding the location of storage.
> I did start off using pg_attrdef, but ran into some snags.
> My approach was to add the value as an extra column (rather than an extra
> row).
> That caused trouble since a SET DEFAULT operation is decomposed into a DROP
> and a SET and
> preserving the value across such operations did not come naturally.

I'm sorry for confusing you. The best way is to use an extra
column in pg_attribute to store the serialized value.

> If we were to use extra rows instead that issue would be solved, assuming we
> add a “default kind” sort of column.
> It would dictate the storage format, though, which may be considered overkill
> for a constant.

-- 
Best regards,
Vitaly Burovoy




Re: [HACKERS] Fast AT ADD COLUMN with DEFAULTs

2016-10-06 Thread Serge Rielau

> On Oct 6, 2016, at 9:20 AM, Tom Lane  wrote:
> 
> Vitaly Burovoy  writes:
>> But what I discover for myself is that we have pg_attrdef separately
>> from the pg_attribute. Why?
> 
> The core reason for that is that the default expression needs to be
> a separate object from the column for purposes of dependency analysis.
> For example, if you have a column whose default is "foo()", then the
> default expression depends on the function foo(), but the column should
> not: if you drop the function, only the default expression ought to
> be dropped, not the column.
> 
> Because of this, the default expression needs to have its own OID
> (to be stored in pg_depend) and it's convenient to store it in a
> separate catalog so that the classoid can identify it as being a
> default expression rather than some other kind of object.
Good to know.
> 
> If we were going to allow these missing_values or creation_defaults
> or whatever they're called to be general expressions, then they would need
> to have their own OIDs for dependency purposes.  That would lead me to
> think that the best representation is to put them in their own rows in
> pg_attrdef, perhaps adding a boolean column to pg_attrdef to distinguish
> regular defaults from these things.  Or maybe they even need their own
> catalog, depending on whether you think dependency analysis would want
> to distinguish them from regular defaults using just the classoid.
> 
> Now, as I just pointed out in another mail, realistically we're probably
> going to restrict the feature to simple constants, which'd mean they will
> depend only on the column's type and can never need any dependencies of
> their own.  So we could take the shortcut of just storing them in a new
> column in pg_attribute.  But maybe that's shortsighted and we'll
> eventually wish we'd done them as full-fledged separate objects.
> 
> But on the third hand ... once one of these is in place, how could you
> drop it separately from the column?  That would amount to a change in the
> column's stored data, which is not what one would expect from dropping
> a separate object.  So maybe it's senseless to think that these things
> could ever be distinct objects.  But that definitely leads to the
> conclusion that they're constants and nothing else.
I cannot follow this reasoning.
Let’s look past what PG does today:
For each row (whether that’s necessary or not) we evaluate the expression,
compute the value, and store it in the rewritten table.
We do not record dependencies on the “pedigree” of the value.
It happened to originate from the DEFAULT expression provided with the ADD
COLUMN, but that is not remembered anywhere.
All we remember is the value - in each row.

So the only change that is proposed here - when it comes right down to it -
is to remember the value once only (IFF it is provably the same for each row)
and thus avoid the need to rewrite the table.
So I see no reason to impose any restriction other than “evaluated value is
provably the same for every row”.
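
[Editorial note: as a rough illustration of what "provably the same for
every row" could reduce to in code, here is a hedged sketch; the function
name is made up and the real rule may well be stricter.]

#include "postgres.h"

#include "nodes/nodes.h"
#include "optimizer/clauses.h"

/*
 * Hedged sketch: a default expression can be folded into a single stored
 * per-column value (avoiding the table rewrite) only if it cannot produce
 * a different value for different pre-existing rows.  A volatile function
 * such as nextval() or random() rules that out; anything stable or
 * immutable evaluates to one value at ALTER TABLE time.
 */
static bool
default_foldable_to_single_value(Node *defexpr)
{
    if (contain_volatile_functions(defexpr))
        return false;
    return true;
}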

Regarding the location of storage.
I did start off using pg_attrdef, but ran into some snags.
My approach was to add the value as an extra column (rather than an extra row).
That caused trouble since a SET DEFAULT operation is decomposed into a DROP
and a SET, and preserving the value across such operations did not come
naturally.

If we were to use extra rows instead, that issue would be solved, assuming we
add a “default kind” sort of column.
It would dictate the storage format though, which may be considered overkill
for a constant.

Cheers
Serge



Re: [HACKERS] Fast AT ADD COLUMN with DEFAULTs

2016-10-06 Thread Vitaly Burovoy
On 10/6/16, Vitaly Burovoy  wrote:
> Ough. I made a mistake about pg_attribute because I forgot about the
> pg_attrdef.
> If we do not merge these tables, the pg_attrdef is the best place to
> store evaluated expression as a constant the same way defaults are
> stored in adbin.

Oops. While I was writing the previous email, Tom explained the necessity
of pg_attrdef.
With that explanation it is obvious that I was wrong and the value for
pre-existing rows should be in a new column in pg_attribute.
All the other thoughts from my previous email still stand.

-- 
Best regards,
Vitaly Burovoy




Re: [HACKERS] Fast AT ADD COLUMN with DEFAULTs

2016-10-06 Thread Vitaly Burovoy
On 10/6/16, Serge Rielau  wrote:
>> On Oct 6, 2016, at 9:01 AM, Tom Lane  wrote:
>> BTW, it also occurs to me that there are going to be good implementation
>> reasons for restricting it to be a hard constant, not any sort of
>> expression.  We are likely to need to be able to insert the value in
>> low-level code where general expression evaluation is impractical.
>>
> Yes, the padding must happen primarily in the getAttr() routines.
> Clearly we do not want to evaluate an expression there.
> But what speaks against evaluating the expression before we store it?
> After all we seem to all agree that this only works if the expression
> computes to the same constant all the time.
>
> If we do not want to store an “untyped” datum straight in pg_attribute as a
> BYTEA (my current approach)

Ough. I made a mistake about pg_attribute because I forgot about the pg_attrdef.
If we do not merge these tables, pg_attrdef is the best place to
store the evaluated expression as a constant, the same way defaults are
stored in adbin.

> we could store the pretty printed version of the constant

That is the wrong way. It ruins commands like "ALTER COLUMN ... TYPE ... USING".

> and evaluate that when we build the tuple descriptor.
> This happens when we load the relation into the relcache.
>
> Anyway, I’m jumping ahead and it’s perhaps best to let the code speak for
> itself once I have the WIP patch ready so we have something concrete to
> discuss

-- 
Best regards,
Vitaly Burovoy




Re: [HACKERS] Fast AT ADD COLUMN with DEFAULTs

2016-10-06 Thread Tom Lane
Vitaly Burovoy  writes:
> But what I discover for myself is that we have pg_attrdef separately
> from the pg_attribute. Why?

The core reason for that is that the default expression needs to be
a separate object from the column for purposes of dependency analysis.
For example, if you have a column whose default is "foo()", then the
default expression depends on the function foo(), but the column should
not: if you drop the function, only the default expression ought to
be dropped, not the column.

Because of this, the default expression needs to have its own OID
(to be stored in pg_depend) and it's convenient to store it in a
separate catalog so that the classoid can identify it as being a
default expression rather than some other kind of object.
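To make that concrete, here is a minimal sketch of how that dependency
behaves today (hypothetical function and table names):

    CREATE FUNCTION foo() RETURNS int LANGUAGE sql AS 'SELECT 42';
    CREATE TABLE t (c int DEFAULT foo());

    DROP FUNCTION foo();          -- rejected: the default expression depends on foo()
    DROP FUNCTION foo() CASCADE;  -- drops only the default expression; column c survives
    INSERT INTO t DEFAULT VALUES; -- c is NULL now that its default is gone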

If we were going to allow these missing_values or creation_defaults
or whatever they're called to be general expressions, then they would need
to have their own OIDs for dependency purposes.  That would lead me to
think that the best representation is to put them in their own rows in
pg_attrdef, perhaps adding a boolean column to pg_attrdef to distinguish
regular defaults from these things.  Or maybe they even need their own
catalog, depending on whether you think dependency analysis would want
to distinguish them from regular defaults using just the classoid.

Now, as I just pointed out in another mail, realistically we're probably
going to restrict the feature to simple constants, which'd mean they will
depend only on the column's type and can never need any dependencies of
their own.  So we could take the shortcut of just storing them in a new
column in pg_attribute.  But maybe that's shortsighted and we'll
eventually wish we'd done them as full-fledged separate objects.

But on the third hand ... once one of these is in place, how could you
drop it separately from the column?  That would amount to a change in the
column's stored data, which is not what one would expect from dropping
a separate object.  So maybe it's senseless to think that these things
could ever be distinct objects.  But that definitely leads to the
conclusion that they're constants and nothing else.

regards, tom lane




Re: [HACKERS] Fast AT ADD COLUMN with DEFAULTs

2016-10-06 Thread Vitaly Burovoy
On 10/6/16, Tom Lane  wrote:
> Serge Rielau  writes:
>>> On Oct 6, 2016, at 5:25 AM, Vitaly Burovoy 
>>> wrote:
>>>> Which makes me think we should call this missing_value or absent_value

To be honest, Simon Riggs wrote those words.

>>>> so its clear that it is not a "default" it is the value we use for
>>>> rows that do not have any value stored for them.
>
>> I like Tom’s “creation default”. Another one could be “initial default”.
>> But that, too, can be misread.
>
> Something based on missing_value/absent_value could work for me too.
>
> If we name it something involving "default", that definitely increases
> the possibility for confusion with the regular user-settable default.
>
> Also worth thinking about here is that the regular default expression
> affects what will be put into future inserted rows, whereas this thing
> affects the interpretation of past rows.  So it's really quite a different
> animal.  That's kind of leading me away from calling it creation_default.
>
> BTW, it also occurs to me that there are going to be good implementation
> reasons for restricting it to be a hard constant, not any sort of
> expression.  We are likely to need to be able to insert the value in
> low-level code where general expression evaluation is impractical.

Yes, I mentioned that it should be evaluated and stored as a value
because user functions can be changed (besides the speed reason);
that's why I like the "value" in its name. The "default" is usually
identified with expressions, not values (which are particular cases of
expressions).

Serge mentioned the phrase "pre-existing rows", which makes me think
of something like "pre_existing_value".

-- 
Best regards,
Vitaly Burovoy




Re: [HACKERS] Fast AT ADD COLUMN with DEFAULTs

2016-10-06 Thread Serge Rielau

> On Oct 6, 2016, at 9:01 AM, Tom Lane  wrote:
> 
> BTW, it also occurs to me that there are going to be good implementation
> reasons for restricting it to be a hard constant, not any sort of
> expression.  We are likely to need to be able to insert the value in
> low-level code where general expression evaluation is impractical.
> 
Yes, the padding must happen primarily in the getAttr() routines.
Clearly we do not want to evaluate an expression there.
But what speaks against evaluating the expression before we store it?
After all we seem to all agree that this only works if the expression computes 
to the same constant all the time.

If we do not want to store an “untyped” datum straight in pg_attribute as a 
BYTEA (my current approach) we could store the pretty printed version of the 
constant
and evaluate that when we build the tuple descriptor.
This happens when we load the relation into the relcache.

Anyway, I’m jumping ahead and it’s perhaps best to let the code speak for 
itself once I have the WIP patch ready so we have something concrete to discuss

Cheers
Serge






Re: [HACKERS] memory leak in e94568ecc10 (pre-reading in external sort)

2016-10-06 Thread Peter Geoghegan
On Thu, Oct 6, 2016 at 8:44 AM, Peter Geoghegan  wrote:
> Besides, what I propose to do is really exactly the same as what you
> also want to do, except it avoids actually changing state->maxTapes.
> We'd just pass down what you propose to assign to state->maxTapes
> directly, which differs (and not just in the common case where there
> are inactive tapes -- it's always at least off-by-one). Right?

What I mean is that you should pass down numTapes alongside
numInputTapes. The function init_tape_buffers() could either have an
additional argument (numTapes), or derive numTapes itself.


-- 
Peter Geoghegan




Re: [HACKERS] Fast AT ADD COLUMN with DEFAULTs

2016-10-06 Thread Tom Lane
Serge Rielau  writes:
>> On Oct 6, 2016, at 5:25 AM, Vitaly Burovoy  wrote:
>>> Which makes me think we should call this missing_value or absent_value
>>> so its clear that it is not a "default" it is the value we use for
>>> rows that do not have any value stored for them.

> I like Tom’s “creation default”. Another one could be “initial 
> default”.
> But that, too, can be misread.

Something based on missing_value/absent_value could work for me too.
If we name it something involving "default", that definitely increases
the possibility for confusion with the regular user-settable default.

Also worth thinking about here is that the regular default expression
affects what will be put into future inserted rows, whereas this thing
affects the interpretation of past rows.  So it's really quite a different
animal.  That's kind of leading me away from calling it creation_default.

BTW, it also occurs to me that there are going to be good implementation
reasons for restricting it to be a hard constant, not any sort of
expression.  We are likely to need to be able to insert the value in
low-level code where general expression evaluation is impractical.

regards, tom lane




Re: [HACKERS] memory leak in e94568ecc10 (pre-reading in external sort)

2016-10-06 Thread Peter Geoghegan
On Thu, Oct 6, 2016 at 12:00 AM, Heikki Linnakangas  wrote:
> This is related to the earlier discussion with Peter G, on whether we should
> change state->maxTapes to reflect the actual number of tapes that were used,
> when that's less than maxTapes. I think his confusion about the loop in
> init_tape_buffers(), in
> CAM3SWZQ7=fcy1iuz6jnzuunnktag6uitc1i-donxscp-9zs...@mail.gmail.com would
> also have been avoided, if we had done that. So I think we should reconsider
> that.

-1 on that from me. I don't think that you should modify a variable
that is directly linkable to Knuth's description of polyphase merge --
doing so seems like a bad idea. state->maxTapes (Knuth's T) really is
something that is established pretty early on, and doesn't change.

While the fix you pushed was probably a good idea anyway, I still
think you should not use state->maxTapes to exhaustively call
LogicalTapeAssignReadBufferSize() on every tape, even non-active
tapes. That's the confusing part. It's not as if your need for the
number of input tapes (the number of maybe-active tapes) is long
lived; you just need to instruct logtape.c on buffer sizing once, at
the start of mergeruns().

Besides, what I propose to do is really exactly the same as what you
also want to do, except it avoids actually changing state->maxTapes.
We'd just pass down what you propose to assign to state->maxTapes
directly, which differs (and not just in the common case where there
are inactive tapes -- it's always at least off-by-one). Right?

-- 
Peter Geoghegan




Re: [HACKERS] Fast AT ADD COLUMN with DEFAULTs

2016-10-06 Thread Serge Rielau

> On Oct 6, 2016, at 5:25 AM, Vitaly Burovoy  wrote:
> 
> On 10/6/16, Simon Riggs  wrote:
>> On 6 October 2016 at 04:43, Serge Rielau  wrote:
> Or should I compose some sort of a design document?
>> 
>> Having read this thread, I'm a little unclear as to what you're
>> writing now, though there's definitely good ideas here.
>> 
>> I think it would be beneficial to write up a single coherent
>> description of this, including behaviour and a small sketch of
>> implementation, just so everyone knows what this is. No design doc,
>> but a summary.
> 
> At the moment I think it can also be a good idea to post the current
> patch as a Proposal or a WIP to get initial feedback.
I can do that - accompanied by a posting-sized overview.

> 
> Yes, it works for the stable "now()" but does not work for volatile
> functions like "random()", "uuid_generate_v4()" or defaults for serial
> columns. The only possible way I can see is to check that an expression has
> only "T_Const"s and immutable or stable functions. In such a case the
> expression can be evaluated and the result saved as a value for
> absent attributes of a tuple. Otherwise, save NULL there and
> rewrite the table.
Agreed. I think DEFAULT as-is does the job nicely function-wise.
One can always decompose the ADD COLUMN into two steps within the same
transaction
if the initial column value for pre-existing rows does not match the default
for new or updated rows.

AT just needs a performance boost for large tables where that’s reasonably
possible.
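For illustration, a sketch of such a two-step decomposition (hypothetical
table and column names; the rewrite is only avoided once the proposed
optimization is in place):

    BEGIN;
    -- pre-existing rows all get this one constant, so no rewrite is needed
    ALTER TABLE big_table ADD COLUMN created_at timestamptz
        DEFAULT '2016-10-06 00:00:00+00';
    -- newly inserted rows get the volatile default from here on
    ALTER TABLE big_table ALTER COLUMN created_at SET DEFAULT clock_timestamp();
    COMMIT;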

>> Which makes me think we should call this missing_value or absent_value
>> so its clear that it is not a "default" it is the value we use for
>> rows that do not have any value stored for them.
> 
> It is definitely a default for a user, it is not a regular default internally.
> I'm not a native speaker, "absent_value" can be mixed up with a NULL.
> As for me the best phrase is "pre-add-column-default", but it is
> impossible to use it as a column name. :-(
> It is still an open question.
I like Tom’s “creation default”. Another one could be “initial default”.
But that, too, can be misread.

Cheers
Serge Rielau
Salesforce.com



Re: [HACKERS] Hash Indexes

2016-10-06 Thread Amit Kapila
On Wed, Oct 5, 2016 at 10:22 PM, Robert Haas  wrote:
> On Tue, Oct 4, 2016 at 12:36 AM, Amit Kapila  wrote:
>> I think one way to avoid the risk of deadlock in above scenario is to
>> take the cleanup lock conditionally, if we get the cleanup lock then
>> we will delete the items as we are doing in patch now, else it will
>> just mark the tuples as dead and ensure that it won't try to remove
>> tuples that are moved-by-split.  Now, I think the question is how will
>> these dead tuples be removed.  We anyway need a separate mechanism to
>> clear dead tuples for hash indexes as during scans we are marking the
>> tuples as dead if corresponding tuple in heap is dead which are not
>> removed later.  This is already taken care in btree code via
>> kill_prior_tuple optimization.  So I think clearing of dead tuples can
>> be handled by a separate patch.
>
> That seems like it could work.  The hash scan code will need to be
> made smart enough to ignore any tuples marked dead, if it isn't
> already.
>

It already takes care of ignoring killed tuples in the code below, though
the way it does so is much less efficient compared to btree.  Basically, it
fetches the matched tuple and then checks if it is dead, whereas
btree does the same check while matching the key.  It might be more
efficient to do it before matching the hashkey, but I think that is a
matter for a separate patch.
hashgettuple()
{
..
	/*
	 * Skip killed tuples if asked to.
	 */
	if (scan->ignore_killed_tuples)
}



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] PostgreSQL - Weak DH group

2016-10-06 Thread Heikki Linnakangas

On 10/05/2016 09:57 PM, Heikki Linnakangas wrote:

On 10/05/2016 05:15 PM, Nicolas Guini wrote:

We are working with Postgres 9.3.14 and executing nmap we
found that it is using “weak DH group” (nmap --script ssl-dh-params). Weak =
1024 bits.


Yeah, it seems that we're a bit behind the times on this...


This issue is similar to what this post explains about using weak DH
parameters: http://www.usefuljs.net/2016/09/29/imperfect-forward-secrecy/


The blog post points out that, as counterintuitive as it sounds, the
SSL_CTX_set_tmp_dh_callback() callback should ignore the keylength
argument, and always return a DH group with 2048 bits, or stronger. As
you pointed out, that's not what our callback does.


I propose the attached patch. It gives up on trying to deal with 
multiple key lengths (as noted earlier, OpenSSL just always passed 
keylength=1024, so that was useless). Instead of using the callback, it 
just sets fixed DH parameters with SSL_CTX_set_tmp_dh(), like we do for 
the ECDH curve. The DH parameters are loaded from a file called 
"dh_params.pem" (instead of "dh1024.pem"), if present, otherwise the 
built-in 2048 bit parameters are used.


I removed the code for generating DH parameters on-the-fly altogether. 
The OpenSSL man page clearly says that that should not be done, because 
generating the parameters takes a long time. And because OpenSSL always 
passed keylength=1024, AFAICS that had always been dead code.


If we want to get really fancy, we could generate the parameters the 
first time you turn ssl=on, and store them in $PGDATA. But the 
generation takes a very long time, so the admin might think it's stuck.


I note that supplying custom DH parameters in the file is completely 
undocumented :-(. We should add a note in the docs on how to generate 
the custom DH parameters with openssl. Also, there's no easy way of
telling if your custom parameters are actually being used. If you copy
the params file into $PGDATA, but you have a typo in the name, you won't
notice. Perhaps we should print a line in the log when the file is loaded.


I'm afraid of back-patching this, out of fear that older clients would 
stop working, or would downgrade to an even weaker cipher. You could fix 
it by putting weaker 1024 bit params in dh_params.pem, so maybe we could 
do it if we put a warning and instructions for doing that in the release 
notes. Thoughts?


- Heikki

>From 69f74e3cced093c9ae2cce830e31c3fd76a8a06b Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Thu, 6 Oct 2016 16:22:29 +0300
Subject: [PATCH 1/1] Always use 2048 bit DH parameters for OpenSSL ephemeral
 DH ciphers.

1024 bits is considered weak these days, but OpenSSL always passes 1024 as
the key length to the tmp_dh callback. All the code to handle other key
lengths is in fact dead.

To remedy those issues:

* Only include hard-coded 2048-bit parameters.
* Set the parameters directly with SSL_CTX_set_tmp_dh(), without the
  callback
* The file for the DH parameters is now called "dh_params.pem", instead of
  "dh1024.pem" (the files for other key lengths, dh512.pem, dh2048.pem, etc.
  were never actually used)

Per report by Nicolas Guini

Discussion: 
---
 src/backend/libpq/be-secure-openssl.c | 198 +-
 1 file changed, 48 insertions(+), 150 deletions(-)

diff --git a/src/backend/libpq/be-secure-openssl.c b/src/backend/libpq/be-secure-openssl.c
index 668f217..fc7dd6a 100644
--- a/src/backend/libpq/be-secure-openssl.c
+++ b/src/backend/libpq/be-secure-openssl.c
@@ -65,18 +65,19 @@
 #include "tcop/tcopprot.h"
 #include "utils/memutils.h"
 
+/* name of the file containing DH parameters */
+#define DH_PARAMS_FILE	"dh_params.pem"
 
 static int	my_sock_read(BIO *h, char *buf, int size);
 static int	my_sock_write(BIO *h, const char *buf, int size);
 static BIO_METHOD *my_BIO_s_socket(void);
 static int	my_SSL_set_fd(Port *port, int fd);
 
-static DH  *load_dh_file(int keylength);
+static DH  *load_dh_file(void);
 static DH  *load_dh_buffer(const char *, size_t);
-static DH  *generate_dh_parameters(int prime_len, int generator);
-static DH  *tmp_dh_cb(SSL *s, int is_export, int keylength);
 static int	verify_cb(int, X509_STORE_CTX *);
 static void info_cb(const SSL *ssl, int type, int args);
+static void initialize_dh(void);
 static void initialize_ecdh(void);
 static const char *SSLerrmessage(unsigned long ecode);
 
@@ -94,14 +95,14 @@ static SSL_CTX *SSL_context = NULL;
  *	sessions even if the static private key is compromised,
  *	so we are *highly* motivated to ensure that we can use
  *	EDH even if the DBA... or an attacker... deletes the
- *	$DataDir/dh*.pem files.
+ *	$DataDir/dh_params.pem file.
  *
  *	We could refuse SSL connections unless a good DH parameter
  *	file exists, but some clients may quietly renegotiate an
  *	unsecured connection without fully informing the user.
  *	

Re: [HACKERS] Switch to unnamed POSIX semaphores as our preferred sema code?

2016-10-06 Thread Tom Lane
I wrote:
> Although in normal cases the semaphore code paths aren't very heavily
> exercised in our code, I was able to get a measurable performance
> difference by building with --disable-spinlocks, so that spinlocks are
> emulated with semaphores.  On an 8-core RHEL6 machine, "pgbench -S -c 20
> -j 20" seems to be about 4% faster with unnamed semaphores than SysV
> semaphores.  It'd be good to replicate that test on some higher-end
> hardware, but provisionally I'd say unnamed semaphores are faster.

I realized that the above test is probably bogus, or at least not very
relevant to real-world Postgres performance.  A key performance aspect of
Linux futexes is that uncontended lock acquisitions, as well as releases
that don't need to wake anyone, don't enter the kernel at all.  However,
in PG's normal use of semaphores, neither scenario occurs very often;
processes lock their semaphores only after determining that they need to
wait, and release semaphores only when it's known they'll waken a sleeper.
The futex fast-path cases can occur only in the race condition that
someone else awakens a would-be waiter before it actually reaches its
semop call.  However, uncontended locks and releases *are* very common
for spinlocks.  This means that testing with --disable-spinlocks will show
a futex performance benefit that's totally irrelevant for real cases.

Based on that analysis, I abandoned testing with --disable-spinlocks
and instead tried to measure the actual speed of contended heavyweight
lock acquisition/release.  I used
pgbench -f lockscript.sql -c 20 -j 20 -T 60 bench
with the script being
begin; lock table pgbench_accounts; commit;
I got speeds between 10500 and 10800 TPS with either semaphore API;
if there's any difference at all, it's below the noise level for
this test scenario.

So I'm now thinking there's basically no performance consideration
here, and the point of switching would just be to get out from
under SysV kernel resource limits.  (Again though, this applies
only to Linux --- the other thread I cited suggests things might
be quite different on FreeBSD for instance.)

Can anyone think of a test case that would stress semaphore operations
more heavily, without being unrealistic?

regards, tom lane




Re: [HACKERS] postgres_fdw : altering foreign table not invalidating prepare statement execution plan.

2016-10-06 Thread Etsuro Fujita

On 2016/10/06 20:17, Amit Langote wrote:

On 2016/10/05 20:45, Etsuro Fujita wrote:

On 2016/10/05 14:09, Ashutosh Bapat wrote:


I wrote:

So, I added a new callback function for those caches
that is much like PlanCacheFuncCallback but skips checking the list for
the
query tree.



IMO, maintaining that extra function and
the risk of bugs because of not keeping those two functions in sync
outweigh the small not-so-frequent gain.



The inefficiency wouldn't be negligible in the case where there are large
numbers of cached plans.  So, I'd like to introduce a helper function that
checks the dependency list for the generic plan, to eliminate most of the
duplication.



I too am inclined to have a PlanCacheForeignCallback().

Just one minor comment:


Thanks for the comments!

I noticed that we were wrong.  Your patch was modified so that 
dependencies on FDW-related objects would be extracted from a given plan 
in create_foreignscan_plan (by Ashutosh) and then in 
set_foreignscan_references by me, but that wouldn't work well for INSERT 
cases.  To fix that, I'd like to propose that we collect the 
dependencies from the given rte in add_rte_to_flat_rtable, instead.


Attached is an updated version, in which I removed the 
PlanCacheForeignCallback and adjusted regression tests a bit.



If some writable FDW consulted foreign table/server/FDW options in
AddForeignUpdateTarget, which adds the extra junk columns for
UPDATE/DELETE to the targetList of the given query tree, the rewritten
query tree would also need to be invalidated.  But I don't think such an
FDW really exists because that routine in a practical FDW wouldn't change
such columns depending on those options.


I had second thoughts about that; since the possibility wouldn't be 
zero, I added to extract_query_dependencies_walker the same code I added 
to add_rte_to_flat_rtable.


After all, the updated patch is much like your version, but I think your 
patch, which modified extract_query_dependencies_walker only, is not 
enough because a generic plan can have more dependencies than its query 
tree (eg, consider inheritance).
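For reference, the user-visible problem being fixed here boils down to
something like the following sketch (hypothetical server and table names; a
prepared statement typically switches to a generic plan only after several
executions):

    CREATE SERVER s1 FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'remotehost', dbname 'remotedb');
    CREATE USER MAPPING FOR CURRENT_USER SERVER s1;
    CREATE FOREIGN TABLE ft (a int) SERVER s1 OPTIONS (table_name 'orders');

    PREPARE q AS SELECT * FROM ft;
    EXECUTE q;    -- plan referencing the remote table gets cached

    ALTER FOREIGN TABLE ft OPTIONS (SET table_name 'orders_new');
    EXECUTE q;    -- without invalidation, the stale cached plan keeps using
                  -- the old table_name option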


Best regards,
Etsuro Fujita


foreign_plan_cache_inval_v4.patch
Description: binary/octet-stream



Re: [HACKERS] psql autocomplete works not good in USER command in 9.6

2016-10-06 Thread Michael Paquier
On Thu, Oct 6, 2016 at 9:17 PM, Дмитрий Воронин
 wrote:
> I find that autocomplete does not work well when I type some USER
> commands in psql:
>
> CREATE USER  ...
> ALTER USER  ...
>
> psql's autocomplete returns database users and the "pg_signal_backend"
> function, which I suppose is not correct.
>
> I will try to understand the reasons and make a patch if this problem has
> not been solved yet.

I don't see any problems here. CREATE/ALTER USER have for a pretty
long time tab-completed with the existing role names. 9.6 is not
really new to that. What's new though in 9.6 is the introduction of
pg_signal_backend in the list as a system role.
-- 
Michael




Re: [HACKERS] pg_rewind hangs if --source-server is used and syncrep is enabled

2016-10-06 Thread Michael Paquier
On Thu, Oct 6, 2016 at 7:37 PM, Heikki Linnakangas  wrote:
> Committed, thanks! I moved the call to where we establish the connection,
> that felt slightly more natural.

Thanks for the commit. Indeed that's better with the other sanity checks.
-- 
Michael




Re: [HACKERS] Fast AT ADD COLUMN with DEFAULTs

2016-10-06 Thread Vitaly Burovoy
On 10/6/16, Simon Riggs  wrote:
> On 6 October 2016 at 04:43, Serge Rielau  wrote:
>>>> Or should I compose some sort of a design document?
>
> Having read this thread, I'm a little unclear as to what you're
> writing now, though there's definitely good ideas here.
>
> I think it would be beneficial to write up a single coherent
> description of this, including behaviour and a small sketch of
> implementation, just so everyone knows what this is. No design doc,
> but a summary.

At the moment I think it can also be a good idea to post the current
patch as a Proposal or a WIP to get initial feedback.

> It would be very useful to be able to do this...
> ALTER TABLE foo ADD last_updated_timestamp timestamp default
> current_timestamp
> so that it generates a constant value and stores that for all prior
> rows, but then generates a new value for future rows.

Yes, it works for the stable "now()" but does not work for volatile
functions like "random()", "uuid_generate_v4()" or defaults for serial
columns. The only possible way I can see is to check that an expression has
only "T_Const"s and immutable or stable functions. In such a case the
expression can be evaluated and the result saved as a value for
absent attributes of a tuple. Otherwise, save NULL there and
rewrite the table.
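For instance, under that rule (hypothetical table, and assuming the
optimization behaves as proposed):

    -- now() is stable: it can be folded to a single constant at ALTER time
    -- and stored once for all pre-existing rows, so no table rewrite
    ALTER TABLE t ADD COLUMN added_at timestamptz DEFAULT now();

    -- random() is volatile: each pre-existing row would need its own value,
    -- so the table still has to be rewritten
    ALTER TABLE t ADD COLUMN noise float8 DEFAULT random();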

> Which makes me think we should call this missing_value or absent_value
> so its clear that it is not a "default" it is the value we use for
> rows that do not have any value stored for them.

It is definitely a default for a user; it is not a regular default internally.
I'm not a native speaker, but "absent_value" can be mixed up with a NULL.
To me the best phrase is "pre-add-column-default", but it is
impossible to use it as a column name. :-(
It is still an open question.

(I remember funny versions in a discussion[1] when people tried to
choose a name for a function that is the reverse of pg_size_pretty...)

[1] 
https://www.postgresql.org/message-id/flat/cafj8prd-tgodknxdygecza4on01_urqprwf-8ldkse-6bdh...@mail.gmail.com

--
Best regards,
Vitaly Burovoy




[HACKERS] psql autocomplete works not good in USER command in 9.6

2016-10-06 Thread Дмитрий Воронин
Hello, 

I find that autocomplete does not work well when I type some USER
commands in psql:

CREATE USER  ...
ALTER USER  ...

psql's autocomplete returns database users and the "pg_signal_backend"
function, which I suppose is not correct.

I will try to understand the reasons and make a patch if this problem has not
been solved yet.


-- 
Best regards, Dmitry Voronin




Re: [HACKERS] autonomous transactions

2016-10-06 Thread Petr Jelinek
On 06/10/16 11:56, Simon Riggs wrote:
> 
> * The labelling "Autonomous Transaction" is a simple coat of paint,
> which can easily be transferred to a better implementation if one
> comes. If one doesn't then its better to have something than nothing.
> So I suggest we commit Background Transactions first and then in a
> fairly thin commit, implement Autonomous Transactions on top of it for
> now and if we get a better one, switch it over.
> 

+1 to this

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Fast AT ADD COLUMN with DEFAULTs

2016-10-06 Thread Simon Riggs
On 6 October 2016 at 04:43, Serge Rielau  wrote:

>>> Or should I compose some sort of a design document?

Having read this thread, I'm a little unclear as to what you're
writing now, though there's definitely good ideas here.

I think it would be beneficial to write up a single coherent
description of this, including behaviour and a small sketch of
implementation, just so everyone knows what this is. No design doc,
but a summary.


It would be very useful to be able to do this...
ALTER TABLE foo ADD last_updated_timestamp timestamp default current_timestamp
so that it generates a constant value and stores that for all prior
rows, but then generates a new value for future rows.
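That is, roughly (hypothetical data; today this already behaves this way, but
only at the cost of a full table rewrite at ADD COLUMN time):

    CREATE TABLE foo (id int);
    INSERT INTO foo VALUES (1), (2);

    ALTER TABLE foo ADD last_updated_timestamp timestamp DEFAULT current_timestamp;
    -- rows 1 and 2 both carry the single timestamp captured at ALTER time

    INSERT INTO foo (id) VALUES (3);
    -- row 3 gets the current_timestamp of its own, later transaction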

Which makes me think we should call this missing_value or absent_value
so it's clear that it is not a "default"; it is the value we use for
rows that do not have any value stored for them.

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] postgres_fdw : altering foreign table not invalidating prepare statement execution plan.

2016-10-06 Thread Amit Langote

Thanks to both of you for taking this up and sorry about the delay in
responding.

On 2016/10/05 20:45, Etsuro Fujita wrote:
> On 2016/10/05 14:09, Ashutosh Bapat wrote:
>>> I think it would be a bit inefficient to use PlanCacheFuncCallback as the
>>> inval callback function for those caches, because that checks the
>>> inval-item
>>> list for the rewritten query tree, but any updates on such catalogs
>>> wouldn't
>>> affect the query tree.
> 
>> I am not sure about that. Right now it seems that only the plans are
>> affected, but can we say that for all FDWs?
> 
> If some writable FDW consulted foreign table/server/FDW options in
> AddForeignUpdateTarget, which adds the extra junk columns for
> UPDATE/DELETE to the targetList of the given query tree, the rewritten
> query tree would also need to be invalidated.  But I don't think such an
> FDW really exists because that routine in a practical FDW wouldn't change
> such columns depending on those options.
> 
>>> So, I added a new callback function for those caches
>>> that is much like PlanCacheFuncCallback but skips checking the list for
>>> the
>>> query tree.
> 
>> I am not sure that the inefficiency because of an extra loop on
>> plansource->invalItems is a good reason to duplicate most of the code
>> in PlanCacheFuncCallback(). IMO, maintaining that extra function and
>> the risk of bugs because of not keeping those two functions in sync
>> outweigh the small not-so-frequent gain. The name of function may be
>> changed to be more generic, instead of current one referring Func.
> 
> The inefficiency wouldn't be negligible in the case where there are large
> numbers of cached plans.  So, I'd like to introduce a helper function that
> checks the dependency list for the generic plan, to eliminate most of the
> duplication.

I too am inclined to have a PlanCacheForeignCallback().

Just one minor comment:

+static void
+CheckGenericPlanDependencies(CachedPlanSource *plansource,
+							 int cacheid, uint32 hashvalue)
+{
+	if (plansource->gplan && plansource->gplan->is_valid)
+	{

How about moving the if() to the callers so that:

+	/*
+	 * Check the dependency list for the generic plan.
+	 */
+	if (plansource->gplan && plansource->gplan->is_valid)
+		CheckGenericPlanDependencies(plansource, cacheid, hashvalue);

That will reduce the indentation level within the function definition.


By the way, one wild idea may be to have a fixed number of callbacks based
on their behavior, rather than which catalogs they are meant for.  For
example, have 2 suitably named callbacks: first that invalidates both
CachedPlanSource and CachedPlan, second that invalidates only CachedPlan.
Use the appropriate one whenever a new case arises where we decide to mark
plans dependent on still other types of objects.  Just an idea...

Thanks,
Amit






Re: [HACKERS] pg_rewind hangs if --source-server is used and syncrep is enabled

2016-10-06 Thread Heikki Linnakangas

On 10/06/2016 02:24 AM, Michael Paquier wrote:

On Wed, Oct 5, 2016 at 11:53 PM, Michael Banck
 wrote:

My colleague Christoph Berg pointed out that pg_rewind could just set
synchronous_commit = local before creating the temporary table, which
indeed works, proof-of-concept patch attached


Even synchronous_commit = off would not matter much, and we could just
use that for performance reasons. The temporary table used in this
context is just used to track the delta chunks of blocks, so this
solution sounds better to me. I'll patch 9.4's pg_rewind similarly to
what is decided here, and we could as well use an extra PQexec call to
bring more clarity for the code, now an extra round-trip could be a
big deal where network lag matters, but compared to the COPY chunks
sent out that would not matter much I guess.


Committed, thanks! I moved the call to where we establish the 
connection, that felt slightly more natural.


- Heikki





Re: [HACKERS] autonomous transactions

2016-10-06 Thread Simon Riggs
On 7 September 2016 at 20:46, Robert Haas  wrote:
> On Sat, Sep 3, 2016 at 7:09 AM, Simon Riggs  wrote:
>> On 2 September 2016 at 09:45, Robert Haas  wrote:
>>> On Wed, Aug 31, 2016 at 7:20 AM, Peter Eisentraut
>>>  wrote:
 I would like to propose the attached patch implementing autonomous
 transactions for discussion and review.
>>>
>>> I'm pretty skeptical of this approach.  Like Greg Stark, Serge Rielau,
>>> and Constantin Pan, I had expected that autonomous transactions should
>>> be implemented inside of a single backend, without relying on workers.
>>
>> The problem is that we have limits on the number of concurrent
>> transactions, which is currently limited by the number of procs.
>
> True.  I believe this issue has been discussed before, and I think
> it's a solvable issue.  I believe that an autonomous transaction could
> be made to appear to the rest of the system (outside the affected
> backend) as if it were a subtransaction, but then get committed before
> the containing transaction gets committed.  This would avoid needing
> more PGPROCs (but getting deadlock detection to work is hard).

Just to point out that I've actually written this approach already.
The patch is available, Autonomous Subtransactions.
We discussed it in Ottawa and it was rejected. (I thought Robert was
there, but Serge and Tom definitely were).

See other posts in this thread by Serge and Craig to explain why.


I take your point that startup time may not be as good as it can be.
But what Peter has works and is useful, whatever we call it.

We have various approaches... summarised in chronological order of
their suggestion

1. Use additional PGXACTs - rejected because it wouldn't provide enough room
2. Use Autonomous SubTransactions - rejected because the semantics are
different to what we might expect from ATs
3. Use background transactions (this thread)
4. Use pause and resume so we don't use up too many PGXACTs

What I see is that there are some valid and useful features here and
we should be developing them further because they have other benefits.

* Autonomous Subtransactions might not work for Autonomous
Transactions, but they are useful for incremental loading of large
data, which then allows easier logical decoding. So that sounds like
we should do that anyway, even if they are not ATs. They can also be
used within parallel tasks.

* Background Transactions are useful and we should have them.

* The labelling "Autonomous Transaction" is a simple coat of paint,
which can easily be transferred to a better implementation if one
comes. If one doesn't then its better to have something than nothing.
So I suggest we commit Background Transactions first and then in a
fairly thin commit, implement Autonomous Transactions on top of it for
now and if we get a better one, switch it over.

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Transactions involving multiple postgres foreign servers

2016-10-06 Thread Amit Langote
On 2016/10/06 17:45, Ashutosh Bapat wrote:
> On Thu, Oct 6, 2016 at 1:34 PM, Masahiko Sawada  wrote:
>> On Thu, Oct 6, 2016 at 1:41 PM, Ashutosh Bapat 
>>  wrote:
>>>> My understanding is that basically the local server can not return
>>>> COMMIT to the client until 2nd phase is completed.
>>>
>>> If we do that, the local server may not return to the client at all,
>>> if the foreign server crashes and never comes up. Practically, it may
>>> take much longer to finish a COMMIT, depending upon how long it takes
>>> for the foreign server to reply to a COMMIT message.
>>
>> Yes, I think 2PC behaves so, please refer to [1].
>> To prevent the local server from waiting forever due to a communication failure,
>> we could provide a timeout on the coordinator side or on the participant
>> side.
> 
> This too, looks like a heuristic and shouldn't be the default
> behaviour and hence not part of the first version of this feature.

At any rate, the coordinator should not return to the client until after
the 2nd phase is completed, which was the original point.  If COMMIT
taking longer is an issue, then it could be handled with one of the
approaches mentioned so far (even if not in the first version), but no
version of this feature should really return COMMIT to the client after
finishing only the first phase.  Am I missing something?

I am saying this because I am assuming that this feature means the client
itself does not invoke 2PC, even knowing that there are multiple servers
involved, but rather relies on the involved FDW drivers and related core
code handling it transparently.  I may have misunderstood the feature
though, apologies if so.

Thanks,
Amit






Re: [HACKERS] Transactions involving multiple postgres foreign servers

2016-10-06 Thread Ashutosh Bapat
On Thu, Oct 6, 2016 at 1:34 PM, Masahiko Sawada  wrote:
> On Thu, Oct 6, 2016 at 1:41 PM, Ashutosh Bapat
>  wrote:

>>>> No, the COMMIT returns after the first phase. It can not wait for all
>>>> the foreign servers to complete their second phase
>>>
>>> Hm, it sounds like it's same as normal commit (not 2PC).
>>> What's the difference?
>>>
>>> My understanding is that basically the local server can not return
>>> COMMIT to the client until 2nd phase is completed.
>>
>>
>> If we do that, the local server may not return to the client at all,
>> if the foreign server crashes and never comes up. Practically, it may
>> take much longer to finish a COMMIT, depending upon how long it takes
>> for the foreign server to reply to a COMMIT message.
>
> Yes, I think 2PC behaves so, please refer to [1].
> To prevent the local server from waiting forever due to a communication failure,
> we could provide a timeout on the coordinator side or on the participant
> side.
>

This too, looks like a heuristic and shouldn't be the default
behaviour and hence not part of the first version of this feature.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Transactions involving multiple postgres foreign servers

2016-10-06 Thread Masahiko Sawada
On Thu, Oct 6, 2016 at 1:41 PM, Ashutosh Bapat
 wrote:
>>>
>>> No, the COMMIT returns after the first phase. It can not wait for all
>>> the foreign servers to complete their second phase
>>
>> Hm, it sounds like it's same as normal commit (not 2PC).
>> What's the difference?
>>
>> My understanding is that basically the local server can not return
>> COMMIT to the client until 2nd phase is completed.
>
>
> If we do that, the local server may not return to the client at all,
> if the foreign server crashes and never comes up. Practically, it may
> take much longer to finish a COMMIT, depending upon how long it takes
> for the foreign server to reply to a COMMIT message.

Yes, I think 2PC behaves so, please refer to [1].
To prevent the local server from waiting forever due to a communication failure,
we could provide a timeout on the coordinator side or on the participant
side.

>
>> Otherwise the next transaction can see data that is not committed yet
>> on remote server.
>
> 2PC doesn't guarantee transactional consistency all by itself. It only
> guarantees that all legs of a distributed transaction are either all
> rolled back or all committed. IOW, it guarantees that a distributed
> transaction is not rolled back on some nodes and committed on the
> other node.
> Providing a transactionally consistent view is a very hard problem.
> Trying to solve all those problems in a single patch would be very
> difficult and the amount of changes required may be really huge. Then
> there are many possible consistency definitions when it comes to
> consistency of distributed system. I have not seen a consensus on what
> kind of consistency model/s we want to support in PostgreSQL. That's
> another large debate. We have had previous attempts where people have
> tried to complete everything in one go and nothing has been completed
> yet.

Yes, providing atomic visibility is a hard problem, and it's a
separate issue [2].

> 2PC implementation OR guaranteeing that all the legs of a transaction
> commit or roll back, is an essential block of any kind of distributed
> transaction manager. So, we should at least support that one, before
> attacking further problems.

I agree.
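At the SQL level, the building block each participating server already
exposes is the prepared transaction, roughly like this (hypothetical GID and
table; requires max_prepared_transactions > 0 on the participant):

    BEGIN;
    UPDATE accounts SET balance = balance - 100 WHERE id = 1;
    PREPARE TRANSACTION 'fdw_txn_0001';  -- phase 1: persist the transaction, keep its locks
    -- ... once every participant has prepared successfully ...
    COMMIT PREPARED 'fdw_txn_0001';      -- phase 2: commit (or ROLLBACK PREPARED on failure)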

[1]https://en.wikipedia.org/wiki/Two-phase_commit_protocol
[2]http://www.bailis.org/papers/ramp-sigmod2014.pdf

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: [HACKERS] Feature suggestion: ON CONFLICT DO NOTHING RETURNING which returns existing row data

2016-10-06 Thread Tom Dunstan

> On 5 Oct 2016, at 8:11 PM, Pantelis Theodosiou  wrote:
> 
> This can be solved by chaining modifying CTEs. 
> 
> Something like this (not tested)  that can work with multiple rows inserted:

Thanks for the suggestion, but it was actually slower than our current 
implementation, I believe due to always looking up t1’s id in that join rather 
than only doing it when we didn’t get an id back from the insert. My hope with 
this feature suggestion / request was that we wouldn’t have to do that 
subsequent lookup at all, as pg would just give it back to us.
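For reference, one common shape of such a chained-CTE insert-or-return is
roughly the following (hypothetical table and column names, not necessarily
the exact query proposed upthread):

    WITH ins AS (
        INSERT INTO t1 (val) VALUES ('x')
        ON CONFLICT (val) DO NOTHING
        RETURNING id               -- produces a row only when we actually inserted
    )
    SELECT id FROM ins
    UNION ALL
    SELECT id FROM t1 WHERE val = 'x';  -- falls back to the pre-existing row; rows
                                        -- inserted by the CTE itself are not visible
                                        -- to this SELECT, so there are no duplicates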

Maybe it would be a win if we were inserting multiple rows, but this code is 
actually in a trigger on a dummy table that we COPY data in to - thus it can’t 
be rewritten as a rule or a multi-row insert like that.

Thanks

Tom





Re: [HACKERS] memory leak in e94568ecc10 (pre-reading in external sort)

2016-10-06 Thread Heikki Linnakangas

On 10/06/2016 07:50 AM, Tomas Vondra wrote:

it seems e94568ecc10 has a pretty bad memory leak. A simple


Oops, fixed, thanks for the report!

To be precise, this wasn't a memory leak, just a gross overallocation of 
memory. The new code in tuplesort.c assumes that it's harmless to  call 
LogicalTapeRewind() on never-used tapes, but in fact, it allocated the 
read buffer for the tape. I fixed that by changing LogicalTapeRewind() 
to not allocate it, if the tape was completely empty.


This is related to the earlier discussion with Peter G, on whether we
should change state->maxTapes to reflect the actual number of tapes that
were used, when that's less than maxTapes. I think his confusion about
the loop in init_tape_buffers(), in 
CAM3SWZQ7=fcy1iuz6jnzuunnktag6uitc1i-donxscp-9zs...@mail.gmail.com would 
also have been avoided, if we had done that. So I think we should 
reconsider that.


- Heikki





Re: [HACKERS] Relids in upper relations

2016-10-06 Thread Ashutosh Bapat
On Wed, Oct 5, 2016 at 7:12 PM, Tom Lane  wrote:
> Ashutosh Bapat  writes:
>> While reviewing aggregate pushdown patch [1] we noticed that
>> RelOptInfos for upper relations do not have relids set.
>
> Indeed, because they don't correspond to any particular scan relation or
> set of scan relations.  I have in mind that in future releases, any
> particular upperrel might have its own definition of what the relids
> mean --- for example, in UPPERREL_SETOP it would likely be useful for
> the relids to represent the set of leaf SELECTs that have been merged
> in a particular path.
> You could imagine UPPERREL_WINDOW using the
> relids to track which window functions have been implemented in a path,
> whenever we get around to considering multiple window function orderings.
> None of that's there yet.

Why do we want to save something in a Relids object which is not relids?
Wouldn't this overloading create confusion? I don't know much about
window function ordering, but purely from the above description it
looks like we are saving something in RelOptInfo which is specific to
a path. That sounds odd. But most probably, I misunderstood you.

>
>> create_foreignscan_plan() copies the relids from RelOptInfo into
>> ForeignScan::fs_relids. That field is used to identify the RTIs
>> covered by the ForeignScan.
>
> That's fine for scan/join paths.  If you want to pay attention to
> relids for an upper rel, it's up to you to know what they mean.
> I would not counsel assuming that they have anything at all to do
> with baserel RTIs.
>
>> We could prevent the crash by passing input_rel->relids to
>> fetch_upper_rel() in create_grouping_path() as seen in the attached
>> patch.
>
> I think this is fundamentally wrongheaded.  If we go that route,
> the only valid relids for any upper path would be the union of all
> baserel RTIs, making it rather pointless to carry the value around
> at all,

That won't be true if we start pushing upper operations under Append
or even Join. An aggregate of append may be calculated as aggregate of
append or append of aggregates. In the later case, we will need to
differentiate between upper relations from each segment being appended
and also the upper relation representing the overall result. A natural
way is to set relids of that upper relation with the relids of
underlying scan relations. In this case, setting relids to underlying
scan relations isn't pointless at all. And then this may apply to all
kinds of upper relations, not just aggregate/grouping

> and definitely making it impossible to use the field to
> distinguish different partial implementations of the same upperrel.

Probably, we should add another set of fields exclusive to be used for
upper relations, like some members which are exclusively used for base
relations. Storing something in relids, which is not relids looks
confusing, unless we change the name (and type) of that field or what
we store has a meaning similar to relids.

>
> You should look to root->all_baserels, instead, if that's the value
> you want when considering an upperrel Path.
>

Thanks. It looks like, for now, aggregate pushdown should use
all_baserels for an upper relation. Although, that will need to change
if we push down aggregate to the foreign server as part of converting
aggregate of append to append of aggregates.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

