Re: Skylake-S warning

2018-10-03 Thread Adrien Nayrat
On 10/3/18 11:29 PM, Daniel Wood wrote:
> If you are running benchmarks, or are a customer currently impacted by
> GetSnapshotData() on high-end multisocket systems, be wary of Skylake-S.
> 
> 
> Performance differences of nearly 2X can be seen on select-only pgbench due to
> nothing else but unlucky choices for max_connections.  Scale 1000, 192 local
> clients on a 2-socket 48-core Skylake-S (Xeon Platinum 8175M @ 2.50 GHz)
> system.
> pgbench -S
> 

Hi,

Could it be related to:
https://www.postgresql.org/message-id/D2B9F2A20670C84685EF7D183F2949E2373E66%40gigant.nidsa.net
?







RE: Protect syscache from bloating with negative cache entries

2018-10-03 Thread Ideriha, Takeshi
Hi, thank you for the explanation.

>From: Kyotaro HORIGUCHI [mailto:horiguchi.kyot...@lab.ntt.co.jp]
>>
>> Can I confirm about catcache pruning?
>> syscache_memory_target is the max figure per CatCache.
>> (Any CatCache has the same max value.) So the total max size of
>> catalog caches is estimated around or slightly more than # of SysCache
>> array times syscache_memory_target.
>
>Right.
>
>> If correct, I'm thinking writing down the above estimation to the
>> document would help db administrators with estimation of memory usage.
>> Current description might lead misunderstanding that
>> syscache_memory_target is the total size of catalog cache in my impression.
>
>Honestly I'm not sure that is the right design. However, I don't think
>providing such a formula helps users, since they don't know exactly how many
>CatCaches and their brethren live in their server, it is a soft limit, and in
>the end only a few or just one catalog can reach the limit.

Yeah, I agree that that kind of formula is not suited for the document.
But if users don't know how many catcaches and related caches are used in
postgres, then how about making syscache_memory_target the total soft limit of
the catcaches, rather than the size limit of an individual catcache? Internally,
syscache_memory_target could be divided by the number of SysCaches, with each
cache working within its share. A total amount would be easier to understand
for users who don't know the detailed contents of the catalog caches.

Or, if users can tell how many catcaches exist and of what kind, for instance
by using the system view you provided in the previous email, then the current
design looks good to me.

>The current design based on the assumption that we would have only one
>extremely-growable cache in one use case.
>
>> Related to the above I just thought changing sysycache_memory_target
>> per CatCache would make memory usage more efficient.
>
>We could easily have per-cache settings in CatCache, but how do we provide the
>knobs for them? I can only think of overly complicated solutions for that.
Agreed.

>> Though I haven't checked whether each system catalog cache's memory usage
>> varies largely, the pg_class cache might need more memory than the others,
>> and others might need less.
>> But it would be difficult for users to check each CatCache's memory usage
>> and tune it, because right now postgresql doesn't provide a handy way to
>> check them.
>
>I supposed that this would be used without such a means. Someone who suffers
>syscache bloat can just set this GUC to avoid the bloat. That's all.
Yeah, I misunderstood the purpose.

>Apart from that, in the current patch syscache_memory_target is not exact at
>all in the first place, to avoid the overhead of counting the correct size.
>The major difference comes from the size of the cache tuples themselves. But I
>have come to think that is too much to omit.
>
>As a *PoC*, in the attached patch (which applies to current master), the size
>of the CTups is counted as the catcache size.
>
>It also provides a pg_catcache_size system view, just to give a rough idea of
>how such a view would look. I'll think about it more, but do you have any
>opinion on this?
>
>=# select relid::regclass, indid::regclass, size from pg_syscache_sizes
>   order by size desc;
>    relid     |               indid               |  size
>--------------+-----------------------------------+--------
> pg_class     | pg_class_oid_index                | 131072
> pg_class     | pg_class_relname_nsp_index        | 131072
> pg_cast      | pg_cast_source_target_index       |   5504
> pg_operator  | pg_operator_oprname_l_r_n_index   |   4096
> pg_statistic | pg_statistic_relid_att_inh_index  |   2048
> pg_proc      | pg_proc_proname_args_nsp_index    |   2048
>...

Great! I like this view.
One extreme idea would be to add all the members printed by
CatCachePrintStats(), which is only enabled with -DCATCACHE_STATS at the
moment. All of the members would be too much for customers who are trying to
tune the cache limit size, but some of them could be useful; for example,
cc_hits would indicate that the current cache limit size is too small.

>> Another option is that users only specify the total memory target size,
>> and postgres dynamically changes each CatCache's memory target size
>> according to a certain metric
>> (which still seems difficult and expensive to develop relative to the benefit).
>> What do you think about this?
>
>Given that few caches bloat at once, its effect is not so different from the
>current design.
Yes agreed.

>> As you commented here, guc variable syscache_memory_target and
>> syscache_prune_min_age are used for both syscache and relcache (HTAB), r

Re: Shouldn't ExecShutdownNode() be called from standard_ExecutorFinish()?

2018-10-03 Thread Amit Kapila
On Thu, Oct 4, 2018 at 12:33 AM Andres Freund  wrote:
>
> Hi,
>
> There's a few cases where by the time ExecutorFinish() is called,
> ExecShutdownNode() has not yet been called.  As, among other reasons,
> ExecShutdownNode() also collects instrumentation, I think that's
> problematic.
>
> In ExplainOnePlan() we call
>
> /* run cleanup too */
> ExecutorFinish(queryDesc);
>
> and then print the majority of the explain data:
>
> /* Create textual dump of plan tree */
> ExplainPrintPlan(es, queryDesc);
>
> and only then shut down the entire query:
>
> ExecutorEnd(queryDesc);
>
> which seems to mean that if a node hasn't yet been shut down for some
> reason, we'll not have information that's normally collected in
> ExecShutdownNode().
>

Ideally, ExecShutdownNode would have been called by the end of the
ExecutorRun phase.  Can you tell which particular case you are
bothered about?

> ISTM, we should have a new EState member that runs ExecShutdownNode() if
> in standard_ExecutorEnd() if not already done.
>

Even if it wasn't called earlier for some reason, I think the
relevant work should be done via
standard_ExecutorEnd->ExecEndPlan->ExecEndNode.  For example, for a
Gather node, it will call ExecEndGather to perform the necessary work.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Secondary index access optimizations

2018-10-03 Thread David Rowley
On 12 September 2018 at 08:32, Konstantin Knizhnik
 wrote:
> Also the patch proposed by you is much simpler and does mostly the same. Yes,
> it does not cover CHECK constraints,

I started to look at this and found a problem with regard to varno
during the predicate_implied_by() test.  The problem is that the
partition bound is always stored as varno=1 (for example, see how
get_qual_for_list() calls makeVar()). This causes the patch to fail in
cases where the partitioned table is not varno=1. You're also
incorrectly passing rinfo->clause to predicate_implied_by().
This is a problem because the clauses stored there have not been
translated to have the child varattnos.  childqual is the correct thing
to use, as that's just been translated. You may not have used it because
the varnos will have been converted to the child's varno, which will
never be varno=1, so you might have found that not to work due to the
missing code to change the varnos to 1.

I've attached the diff for allpaths.c (only) that I ended up with to
make it work. This causes the output of many other regression tests to
change, so you'll need to go over these and verify everything is
correct again.

Please, can you also add a test for this code which has a
partition with columns in a different order than its parent? Having
an INT and a TEXT column is best, as if the translations are done
incorrectly it's likely to result in a crash, which will alert us to
the issue. It would be good to also verify that the test causes a crash
if you temporarily put the code back to using the untranslated qual.

Thanks for working on this.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


skip_implied_child_quals_allpaths.diff
Description: Binary data


Re: SerializeParamList vs machines with strict alignment

2018-10-03 Thread Amit Kapila
On Wed, Oct 3, 2018 at 10:02 AM Amit Kapila  wrote:
> >
> > Now chipmunk also failed for the same test.
> >
> > >  What I find more interesting is
> > > that both of the live Sparc critters are happy --- so despite
> > > Thomas' statements upthread, they're coping with unaligned accesses.
> > > Maybe you should have back-patched the test to older branches so
> > > we could see what castoroides and protosciurus would do.  But it's
> > > probably not worth additional delay.
> > >
> >
> > Agreed, I will push the code-fix on HEAD and code+test in back-branches.
> >
>
> Pushed, let's wait and see if this can make all the failing buildfarm
> members (due to this issue) happy.
>

All the affected members (gharial, chipmunk, anole) are happy.  It
feels good to see chipmunk becoming green after so many days.  Thanks
a lot, Tom, for your assistance.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Add SKIP LOCKED to VACUUM and ANALYZE

2018-10-03 Thread Bossart, Nathan
On 10/3/18, 7:10 PM, "Michael Paquier"  wrote:
> Thanks, I have committed the patch, after minor tweaks to the docs.

Thanks!

> Mainly, in the VACUUM page, I have added a mention that VACUUM ANALYZE
> can block for row sampling.  I have also removed for now the set of
> recommendations the patch added, as we still gather statistics on the
> partitioned tables from the row samples gathered from the partitions, so
> recommending listing only the leaves is not completely correct all the
> time either.

Good catch.  The committed version of the documentation looks good to
me.

> We could perhaps add something about such recommendations.  A separate
> section covering more general things may be interesting.

Agreed, this could be a good addition to the "Notes" sections.

Nathan



RE: Function for listing archive_status directory

2018-10-03 Thread Iwata, Aya
Hi Christoph,

I think it is convenient to be able to check the contents of the
archive_status directory.

I reviewed the patch. It applies and passes the regression tests.

I checked the code.
It follows the pattern of the patch which added pg_ls_waldir() and
pg_ls_logdir(), so I think it is good.

There is one point I care about.
All similar functions are named pg_ls_***dir, making it clear that these
functions return directory contents.
If the new function intends to display the contents of the directory, the
pg_ls_***dir style might be better (e.g. pg_ls_archive_statusdir).
But everyone knows archive_status is a directory...
If you want to follow the standard naming, you may add the "dir" suffix.

Are you watching this thread?
https://www.postgresql.org/message-id/flat/92f458a2-6459-44b8-a7f2-2add32250...@amazon.com
It also discusses a generic pg_ls function.

Regards, 
Aya Iwata




Re: Skylake-S warning

2018-10-03 Thread Andres Freund
On 2018-10-03 17:01:44 -0700, Daniel Wood wrote:
> 
> > On October 3, 2018 at 3:55 PM Andres Freund  wrote:
> 
> > In the thread around 
> > https://www.postgresql.org/message-id/20160411214029.ce3fw6zxim5k6...@alap3.anarazel.de
> > I'd found doing more aggressive padding helped a lot.  Unfortunately I
> > didn't pursue this further :(
> 
> Interesting.  Looks like you saw the same thing I see now.
> 
> > I really suspect we're going to have to change the layout of PGXACT data
> > in a way that makes a contiguous scan possible. That'll probably require
> > some uglyness because a naive "use first free slot" scheme obviously is
> > sensitive to there being holes.  But I suspect that a scheme where
> > backends would occasionally try to move themselves to an earlier slot
> > wouldn't be too hard to implement.
> 
> I've also thought of this.  But someone could create a thousand connections
> and then disconnect all but the first and last, creating a huge hole.

Yea, that's what I was suggesting to address with "a scheme where
backends would occasionally try to move themselves to an earlier
slot". I was basically thinking that on occasions where a backend holds
ProcArrayLock exclusively, it could just compare the current number of
connections with its slot and move itself earlier if it's more than, say,
10% past the #num-connections slot.  Or something.

The advantage of that approach would be that the size of the change is
probably fairly manageable.
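
Purely for illustration, such self-relocation could look roughly like this
(the helper names and the threshold are made up for the sketch, not existing
ProcArray code):

/* Illustrative declarations; not PostgreSQL's real ProcArray API. */
extern int  proc_array_num_procs(void);     /* current number of live entries */
extern int  proc_array_lowest_free_slot(void);
extern void proc_array_swap_slots(int from, int to);   /* caller holds the lock */

/*
 * Called while this backend already holds ProcArrayLock exclusively for
 * some other reason: if our slot sits more than ~10% beyond the dense
 * region of live entries, move to the lowest free slot so that
 * GetSnapshotData() scans a more contiguous array.
 */
static void
maybe_compact_my_slot(int my_slot)
{
    int     nlive = proc_array_num_procs();

    if (my_slot > nlive + nlive / 10)
    {
        int     target = proc_array_lowest_free_slot();

        if (target >= 0 && target < my_slot)
            proc_array_swap_slots(my_slot, target);
    }
}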

Greetings,

Andres Freund



Re: Skylake-S warning

2018-10-03 Thread Daniel Wood
One other thought.  Could we update pgxact->xmin less often?  What would be the
impact of this lower bound being lower than it would normally be with the
existing scheme?  Yes, it needs to be moved forward "occasionally".

FYI, be careful with padding PGXACTs to a full cache line.  With 1024 clients
you'd completely blow out the L1 cache.  The benefit with fewer than 200
clients is dubious.  One problem with multiple (5) PGXACTs in a single cache
line is that you may get a miss fetching xmin and then a second miss fetching
xid, because an invalidation occurs between the two fetches when any of the
other PGXACTs on the same cache line is updated.  I've found some improvement
fetching both xmin and xid with a single 64-bit fetch.  This prevents the
invalidation between the two fetches.  Similarly, updating them with a single
64-bit write helps.
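
To illustrate the single-load trick (this is not the real PGXACT layout or
PostgreSQL code, just an assumed pair of adjacent uint32 fields; the field
extraction below also assumes little-endian):

#include <stdint.h>
#include <string.h>

/* Assumed layout for illustration only. */
typedef struct PgXactExample
{
    uint32_t    xid;
    uint32_t    xmin;
} PgXactExample;

/*
 * Read xid and xmin with one 8-byte load, so a cache-line invalidation
 * cannot slip in between two separate 4-byte reads.
 */
static inline void
fetch_xid_xmin(const PgXactExample *pgxact, uint32_t *xid, uint32_t *xmin)
{
    uint64_t    both;

    memcpy(&both, pgxact, sizeof(both));
    *xid = (uint32_t) both;            /* low 32 bits: first field   */
    *xmin = (uint32_t) (both >> 32);   /* high 32 bits: second field */
}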

Yes, this is uber tweaky.



Re: partition tree inspection functions

2018-10-03 Thread Michael Paquier
On Wed, Oct 03, 2018 at 08:12:59AM -0400, Jesper Pedersen wrote:
> Removing isleaf would require extra round trips to the server to get
> that information. So, I think we should keep it.

I don't really get your point about extra round trips to the server,
and getting the same level of information is as simple as a join between
the result set of pg_partition_tree() and pg_class (better to add schema
qualification and aliases to relations by the way):
=# SELECT relid::regclass,
  parentrelid::regclass, level,
  relkind != 'p' AS isleaf
 FROM pg_partition_tree('ptif_test'::regclass), pg_class
 WHERE oid = relid;
    relid    | parentrelid | level | isleaf
-------------+-------------+-------+--------
 ptif_test   | null        |     0 | f
 ptif_test0  | ptif_test   |     1 | f
 ptif_test1  | ptif_test   |     1 | f
 ptif_test2  | ptif_test   |     1 | t
 ptif_test01 | ptif_test0  |     2 | t
 ptif_test11 | ptif_test1  |     2 | t
(6 rows)
--
Michael




Re: [RFC] Removing "magic" oids

2018-10-03 Thread Vik Fearing
On 30/09/18 05:48, Andres Freund wrote:
> 5a) The fact that system table oids don't show up in selects by default
>makes it more work than necessary to look at catalogs.

As a user, this is a big pain point for me.  +1 for getting rid of it.
-- 
Vik Fearing  +33 6 46 75 15 36
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support



Re: Add SKIP LOCKED to VACUUM and ANALYZE

2018-10-03 Thread Michael Paquier
On Wed, Oct 03, 2018 at 04:20:42PM +, Bossart, Nathan wrote:
> Here is what I have so far for the docs:
>
> [snip]

Thanks, I have committed the patch, after minor tweaks to the docs.
Mainly, in the VACUUM page, I have added a mention that VACUUM ANALYZE
can block for row sampling.  I have also removed for now the set of
recommendations the patch added, as we still gather statistics on the
partitioned tables from the row samples gathered from the partitions, so
recommending listing only the leaves is not completely correct all the
time either.

We could perhaps add something about such recommendations.  A separate
section covering more general things may be interesting.
--
Michael




Re: Skylake-S warning

2018-10-03 Thread Daniel Wood


> On October 3, 2018 at 3:55 PM Andres Freund  wrote:

> In the thread around 
> https://www.postgresql.org/message-id/20160411214029.ce3fw6zxim5k6...@alap3.anarazel.de
> I'd found doing more aggressive padding helped a lot.  Unfortunately I
> didn't pursue this further :(

Interesting.  Looks like you saw the same thing I see now.

> I really suspect we're going to have to change the layout of PGXACT data
> in a way that makes a contiguous scan possible. That'll probably require
> some uglyness because a naive "use first free slot" scheme obviously is
> sensitive to there being holes.  But I suspect that a scheme where
> backends would occasionally try to move themselves to an earlier slot
> wouldn't be too hard to implement.

I've also thought of this.  But someone could create a thousand connections and
then disconnect all but the first and last, creating a huge hole.  One thought
would be to have a bit array of used entries and to bias toward using free,
lower-indexed entries.  Only two cache lines of 1024 bits would need to be
scanned to handle 1024 connections.  ProcArrayAdd() would just do an atomic
xchg to set the used bit.

I was even thinking of decomposing PGXACT into separate arrays of XIDs, XMINs,
etc.  Then have a bit map of those entries whose XID is valid.  When you
set/clear your XID you'd also set/clear this bit.  With the select-only workload
the XIDs are all "Invalid", so scanning this bit array of zeroed 64-bit-long
entries would be quite efficient.  The vacuum and logical decoding flags could
be done directly as a bit map.  Having an array of 1-byte subtxn counts creates
another opportunity for improvement.  Scanning for a non-zero entry in
environments which use few subtxns is efficient.  If subtxns are used a lot,
then the repeated PGXACT cache-line invalidations when setting nxids are offset
by the fact that you can grab 8 of them in a single fetch.
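
Roughly, the bitmap scan would look like this (standalone illustration only;
names and layout are assumptions, and __builtin_ctzll is the gcc/clang
lowest-set-bit builtin):

#include <stdint.h>

/*
 * Return the first slot whose "XID valid" bit is set, or -1 if none.
 * With a select-only workload every word is zero, so the loop skips 64
 * slots per iteration and touches only two cache lines for 1024 slots.
 */
static int
first_valid_xid_slot(const uint64_t *valid_words, int nslots)
{
    int     nwords = (nslots + 63) / 64;

    for (int w = 0; w < nwords; w++)
    {
        uint64_t    word = valid_words[w];

        if (word != 0)
            return w * 64 + __builtin_ctzll(word);
    }
    return -1;
}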

- Dan



Re: Skylake-S warning

2018-10-03 Thread Andres Freund
Hi,


On 2018-10-03 14:29:39 -0700, Daniel Wood wrote:
> If you are running benchmarks, or are a customer currently
> impacted by GetSnapshotData() on high-end multisocket systems,
> be wary of Skylake-S.

> Performance differences of nearly 2X can be seen on select-only
> pgbench due to nothing else but unlucky choices for max_connections.
> Scale 1000, 192 local clients on a 2-socket 48-core Skylake-S (Xeon
> Platinum 8175M @ 2.50 GHz) system.  pgbench -S

FWIW, I've seen performance differences of that magnitude going back to
at least Nehalem. But not on every larger system, interestingly.


> If this is indeed just disadvantageous placement of structures/arrays
> in memory then you might also find that after upgrading a previous
> good choice for max_connections becomes a bad choice if things move
> around.

In the thread around 
https://www.postgresql.org/message-id/20160411214029.ce3fw6zxim5k6...@alap3.anarazel.de
I'd found doing more aggressive padding helped a lot.  Unfortunately I
didn't pursue this further :(


> NOTE2: It is unclear why PG needs to support over 64K sessions.  At
> about 10MB per backend (at the low end) the empty backends alone would
> consume 640GB of memory!  Changing pgprocnos from int to short gives
> me the following results.

I've argued this before. After that we reduced MAX_BACKENDS to 0x3 -
I personally think we could go to 16 bits without it being a practical
problem.

I really suspect we're going to have to change the layout of PGXACT data
in a way that makes a contiguous scan possible. That'll probably require
some uglyness because a naive "use first free slot" scheme obviously is
sensitive to there being holes.  But I suspect that a scheme where
backends would occasionally try to move themselves to an earlier slot
wouldn't be too hard to implement.

Greetings,

Andres Freund



DROP DATABASE doesn't force other backends to close FDs

2018-10-03 Thread Andres Freund
Hi,

Joerg reported on IRC that after a DROP DATABASE the space of the
dropped database wasn't available to the OS until he killed a session
that was active from before the drop. I was kinda incredulous at first,
the code looks sane...  Thomas figured out significant parts of this.

dropdb() does a fairly careful dance to ensure processes close FDs for
about-to-be-removed files:
/*
 * Drop pages for this database that are in the shared buffer cache.  This
 * is important to ensure that no remaining backend tries to write out a
 * dirty buffer to the dead database later...
 */
DropDatabaseBuffers(db_id);

/*
 * Tell the stats collector to forget it immediately, too.
 */
pgstat_drop_database(db_id);

/*
 * Tell checkpointer to forget any pending fsync and unlink requests for
 * files in the database; else the fsyncs will fail at next checkpoint, or
 * worse, it will delete files that belong to a newly created database
 * with the same OID.
 */
ForgetDatabaseFsyncRequests(db_id);

/*
 * Force a checkpoint to make sure the checkpointer has received the
 * message sent by ForgetDatabaseFsyncRequests. On Windows, this also
 * ensures that background procs don't hold any open files, which would
 * cause rmdir() to fail.
 */
RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);

/*
 * Remove all tablespace subdirs belonging to the database.
 */
remove_dbtablespaces(db_id);


By my reading of the code, that ensures that FDs held open by checkpointer and
bgwriter are closed in a relatively timely fashion (although I don't
think there's a nice time-bound guarantee for bgwriter, which could be
problematic for Windows).

That unfortunately disregards the fact that normal backends could have plenty
of files open for the to-be-dropped database, if there's any sort of
shared buffer pressure. Whenever a backend writes out a dirty victim
page belonging to another database, it'll also open files therein.

I thought that'd not be a problem because surely we do a smgrclose() at
some proper point to avoid that issue.  But we don't.

I don't quite know how to trivially [1] guarantee that an invalidation
we send out is processed by all backends before continuing. But even if
we just send out an invalidation to all backends that instructs them to
do an smgrcloseall() the next time they do invalidation processing, we'd
be much better off than we are now (where we'll never close those
files until there's a lot of fd pressure or the backend exits).


I wonder if this is responsible for some issues we had with Windows in
the past...

[1] I think we need something like an acknowledge protocol for a few
    things, most recently the online enabling of checksums anyway.  What I
    had in mind is simple (a rough sketch follows below):
    1) one global 64bit request counter in shared memory
    2) per-backend counter in shared memory, that acknowledges up to
       which point requests have been processed.
    3) per-purpose procsignal that requests e.g. that checksums get
       enabled, that smgrcloseall has been processed, etc.
    4) Function that sends out such a signal to all backends and then
       waits till all such requests have been processed.

    This obviously shouldn't be used for very frequent occurrences, but
    the use-cases I have in mind should never be that.
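
    As a rough, purely illustrative sketch of 1)-4) (none of these names
    are existing PostgreSQL APIs; close_all_smgr_files() merely stands in
    for smgrcloseall(), and sizing/locking are hand-waved):

#include <stdint.h>

#define EXAMPLE_MAX_BACKENDS 1024          /* placeholder sizing */

/* Hypothetical helpers, declared only so the sketch is self-contained. */
extern void send_example_procsignal_to_all(void);
extern void wait_a_bit(void);
extern void close_all_smgr_files(void);   /* stand-in for smgrcloseall() */

typedef struct AckShared
{
    uint64_t    request_counter;                /* 1) global request counter */
    uint64_t    acked[EXAMPLE_MAX_BACKENDS];    /* 2) per-backend progress   */
} AckShared;

/* 4) Requester: bump the counter, signal everyone, wait for all acks. */
static void
request_smgrcloseall(AckShared *shared, int nbackends)
{
    uint64_t    target = ++shared->request_counter;     /* assume lock held */

    send_example_procsignal_to_all();                    /* 3) procsignal */

    for (int i = 0; i < nbackends; i++)
        while (shared->acked[i] < target)
            wait_a_bit();
}

/* Each backend, at its next invalidation-processing point: */
static void
handle_smgrcloseall_request(AckShared *shared, int mybackend)
{
    close_all_smgr_files();
    shared->acked[mybackend] = shared->request_counter;
}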

Greetings,

Andres Freund



Skylake-S warning

2018-10-03 Thread Daniel Wood
If you are running benchmarks, or are a customer currently impacted by
GetSnapshotData() on high-end multisocket systems, be wary of Skylake-S.


Performance differences of nearly 2X can be seen on select-only pgbench due to
nothing else but unlucky choices for max_connections.  Scale 1000, 192 local
clients on a 2-socket 48-core Skylake-S (Xeon Platinum 8175M @ 2.50 GHz) system.
pgbench -S


Results from 5 runs varying max_connections from 400 to 405:


max_conn      TPS
     400   677639
     401  1146776
     402  1122140
     403   765664
     404   671455
     405  1190277

...


perf top shows about 21% GetSnapshotData() with the good numbers and 48% with 
the bad numbers.


This problem is not seen on a 2-socket 32-core Haswell system.  Being a one-man
show, I lack some of the diagnostic tools to drill down further.  My suspicion
is that the fact that Intel has lowered the L2 associativity from 8 (Haswell) to
4 (Skylake-S) may be the cause.  The other possibility is that at higher core
counts the shared inclusive 16-way associative L3 cache becomes insufficient.
Perhaps that is why Intel has moved to an exclusive L3 cache on Skylake-SP.


If this is indeed just disadvantageous placement of structures/arrays in memory,
then you might also find that, after upgrading, a previously good choice for
max_connections becomes a bad choice if things move around.


NOTE: int pgprocno = pgprocnos[index];

is where the big increase in time occurs in GetSnapshotData().

This array is largely read-only once all connections are established, easily
fits in the L1 cache, and is not next to anything else causing invalidations.
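
For context, the loop in question has roughly this shape (paraphrased and
simplified from GetSnapshotData(), not a verbatim excerpt):

for (int index = 0; index < numProcs; index++)
{
    int         pgprocno = arrayP->pgprocnos[index];   /* dense, read-mostly */
    PGXACT     *pgxact = &allPgXact[pgprocno];         /* scattered accesses */

    /* ... examine pgxact->xid and pgxact->xmin for the snapshot ... */
}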


NOTE2: It is unclear why PG needs to support over 64K sessions.  At about 10MB
per backend (at the low end), the empty backends alone would consume 640GB of
memory!  Changing pgprocnos from int to short gives me the following results.


max_conn      TPS
     400   780119
     401  1129286
     402  1263093
     403   887021
     404   679891
     405  1218118


While this change is significant on large Skylake systems, it is likely just a
trivial improvement on other systems or workloads.

Re: executor relation handling

2018-10-03 Thread Tom Lane
I wrote:
> Amit Langote  writes:
>> Should this check that we're not in a parallel worker process?

> Hmm.  I've not seen any failures in the parallel parts of the regular
> regression tests, but maybe I'd better do a force_parallel_mode
> run before committing.
> In general, I'm not on board with the idea that parallel workers don't
> need to get their own locks, so I don't really want to exclude parallel
> workers from this check.  But if it's not safe for that today, fixing it
> is beyond the scope of this particular patch.

So the place where that came out in the wash is the commit I just made
(9a3cebeaa) to change the executor from taking table locks to asserting
that somebody else took them already.  To make that work, I had to make
both ExecOpenScanRelation and relation_open skip checking for lock-held
if IsParallelWorker().

This makes me entirely uncomfortable with the idea that parallel workers
can be allowed to not take any locks of their own.  There is no basis
for arguing that we have field proof that that's safe, because *up to
now, parallel workers in fact did take their own locks*.  And it seems
unsafe on its face, because there's nothing that really guarantees that
the parent process won't go away while children are still running.
(elog(FATAL) seems like a counterexample, for instance.)

I think that we ought to adjust parallel query to insist that children
do take locks, and then revert the IsParallelWorker() exceptions I made
here.  I plan to leave that point in abeyance till we've got the rest
of these changes in place, though.  The easiest way to do it will
doubtless change once we've centralized the executor's table-opening
logic, so trying to code it right now seems like a waste of effort.

regards, tom lane



Re: Query is over 2x slower with jit=on

2018-10-03 Thread Andres Freund
Hi,

On 2018-10-01 00:32:18 -0700, Lukas Fittl wrote:
> On Tue, Sep 25, 2018 at 1:17 PM, Andres Freund  wrote:
> >
> > I've pushed the change without that bit - it's just a few additional
> > lines if we want to change that.
> >
> 
> It seems that since this commit, JIT statistics are now only being printed
> for parallel query plans. This is due to ExplainPrintJIT being called
> before ExecutorEnd, so in a non-parallel query,
> queryDesc->estate->es_jit_combined_instr will never get set.

Ugh.


> Attached a patch that instead introduces a new ExplainPrintJITSummary
> method that summarizes the statistics before they get printed.

Yea, I had something like that, and somehow concluded that it wasn't
needed, and moved it to the wrong place :/.


> I've also removed an (I believe) unnecessary "if (estate->es_instrument)"
> check that prevented EXPLAIN without ANALYZE from showing whether JIT would
> be used or not.
> 
> In addition this also updates a missed section in the documentation with
> the new stats output format.

Thanks a lot for the patch!  I've pushed it with some mild changes
(renamed es_jit_combined_stats to worker_stats; changed
ExplainPrintJITSummary to not look at es_jit, but es_jit_flags as
theoretically es_jit might not be allocated; very minor comment
changes).


- Andres



Shouldn't ExecShutdownNode() be called from standard_ExecutorFinish()?

2018-10-03 Thread Andres Freund
Hi,

There's a few cases where by the time ExecutorFinish() is called,
ExecShutdownNode() has not yet been called.  As, among other reasons,
ExecShutdownNode() also collects instrumentation, I think that's
problematic.

In ExplainOnePlan() we call

/* run cleanup too */
ExecutorFinish(queryDesc);

and then print the majority of the explain data:

/* Create textual dump of plan tree */
ExplainPrintPlan(es, queryDesc);

and only then shut down the entire query:

ExecutorEnd(queryDesc);

which seems to mean that if a node hasn't yet been shut down for some
reason, we'll not have information that's normally collected in
ExecShutdownNode().

ISTM, we should have a new EState member that runs ExecShutdownNode() if
in standard_ExecutorEnd() if not already done.

Am I missing something?

Greetings,

Andres Freund



Re: Performance improvements for src/port/snprintf.c

2018-10-03 Thread Tom Lane
Andres Freund  writes:
> On 2018-10-03 14:01:35 -0400, Tom Lane wrote:
>> BTW, so far as I can tell on F28, strfromd isn't exposed without
>> "-D__STDC_WANT_IEC_60559_BFP_EXT__", which seems fairly scary;
>> what else does that affect?

> So I don't think anything's needed to enable that in pg, given that we
> define _GNU_SOURCE

Ah, OK.  I thought my test case had _GNU_SOURCE defined already,
but it didn't.  You might want to do something like what I stuck
in for strchrnul, though:

/*
 * glibc's <string.h> declares strchrnul only if _GNU_SOURCE is defined.
 * While we typically use that on glibc platforms, configure will set
 * HAVE_STRCHRNUL whether it's used or not.  Fill in the missing declaration
 * so that this file will compile cleanly with or without _GNU_SOURCE.
 */
#ifndef _GNU_SOURCE
extern char *strchrnul(const char *s, int c);
#endif


regards, tom lane



Re: Performance improvements for src/port/snprintf.c

2018-10-03 Thread Tom Lane
I wrote:
>> Hm. I guess that'd be a bit better, but I'm not sure it's worth it. How
>> about we simply add a static assert that long long isn't bigger than
>> int64?

> WFM, I'll make it happen.

Actually, while writing a comment to go with that assertion, I decided
this was dumb.  If we're expecting the compiler to have "long long",
and if we're convinced that no platforms define "long long" as wider
than 64 bits, we may as well go with the s/int64/long long/g solution.
That should result in no code change on any platform today.  And it
will still work correctly, if maybe a bit inefficiently, on some
hypothetical future platform where long long is wider.  We (or our
successors) can worry about optimizing that when the time comes.

regards, tom lane



Re: Performance improvements for src/port/snprintf.c

2018-10-03 Thread Andres Freund
Hi,

On 2018-10-03 14:01:35 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > So when using pg's snprintf() to print a single floating point number
> > with precision, we get nearly a 10% boost.
> 
> I just tested that using my little standalone testbed, and I failed
> to replicate the result.  I do see that strfromd is slightly faster,
> but it's just a few percent measuring snprintf.c in isolation --- in
> the overall context of COPY, I don't see how you get to 10% net savings.

I just tested your patch, and I see (best of three):

master:
16224.727 ms
hack-use-of-strfromd.patch:
14944.927 ms

So not quite 10%, but pretty close.


CREATE TABLE somefloats(id serial, data1 float8, data2 float8, data3 float8);
INSERT INTO somefloats(data1, data2, data3) SELECT random(), random(), random() 
FROM generate_series(1, 1000);
VACUUM FREEZE somefloats;

COPY somefloats TO '/dev/null';

What difference do you see?


> So I continue to think there's something fishy about your test case.
> 
> BTW, so far as I can tell on F28, strfromd isn't exposed without
> "-D__STDC_WANT_IEC_60559_BFP_EXT__", which seems fairly scary;
> what else does that affect?

My copy says:

#undef __GLIBC_USE_IEC_60559_BFP_EXT
#if defined __USE_GNU || defined __STDC_WANT_IEC_60559_BFP_EXT__
# define __GLIBC_USE_IEC_60559_BFP_EXT 1
#else
# define __GLIBC_USE_IEC_60559_BFP_EXT 0
#endif

And __USE_GNU is enabled by
#ifdef  _GNU_SOURCE
# define __USE_GNU  1
#endif

So I don't think anything's needed to enable that in pg, given that we
define _GNU_SOURCE


Greetings,

Andres Freund



Re: Performance improvements for src/port/snprintf.c

2018-10-03 Thread Alvaro Herrera
On 2018-Oct-03, Tom Lane wrote:

> Andres Freund  writes:

> BTW, so far as I can tell on F28, strfromd isn't exposed without
> "-D__STDC_WANT_IEC_60559_BFP_EXT__", which seems fairly scary;
> what else does that affect?

https://en.cppreference.com/w/c/experimental/fpext1

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Performance improvements for src/port/snprintf.c

2018-10-03 Thread Tom Lane
Andres Freund  writes:
> So when using pg's snprintf() to print a single floating point number
> with precision, we get nearly a 10% boost.

I just tested that using my little standalone testbed, and I failed
to replicate the result.  I do see that strfromd is slightly faster,
but it's just a few percent measuring snprintf.c in isolation --- in
the overall context of COPY, I don't see how you get to 10% net savings.

So I continue to think there's something fishy about your test case.

BTW, so far as I can tell on F28, strfromd isn't exposed without
"-D__STDC_WANT_IEC_60559_BFP_EXT__", which seems fairly scary;
what else does that affect?

regards, tom lane

diff --git a/src/port/snprintf.c b/src/port/snprintf.c
index b9b6add..f75369c 100644
--- a/src/port/snprintf.c
+++ b/src/port/snprintf.c
@@ -1137,17 +1137,19 @@ fmtfloat(double value, char type, int forcesign, int leftjust,
 		zeropadlen = precision - prec;
 		fmt[0] = '%';
 		fmt[1] = '.';
-		fmt[2] = '*';
-		fmt[3] = type;
-		fmt[4] = '\0';
-		vallen = sprintf(convert, fmt, prec, value);
+		fmt[2] = (prec / 100) + '0';
+		fmt[3] = ((prec % 100) / 10) + '0';
+		fmt[4] = (prec % 10) + '0';
+		fmt[5] = type;
+		fmt[6] = '\0';
+		vallen = strfromd(convert, sizeof(convert), fmt, value);
 	}
 	else
 	{
 		fmt[0] = '%';
 		fmt[1] = type;
 		fmt[2] = '\0';
-		vallen = sprintf(convert, fmt, value);
+		vallen = strfromd(convert, sizeof(convert), fmt, value);
 	}
 	if (vallen < 0)
 		goto fail;


Re: Performance improvements for src/port/snprintf.c

2018-10-03 Thread Tom Lane
Andres Freund  writes:
> On 2018-10-03 13:18:35 -0400, Tom Lane wrote:
>> Having said that, maybe there's a case for changing the type spec
>> in only the va_arg() call, and leaving snprintf's related local
>> variables as int64.  (Is that what you actually meant?)  Then,
>> if long long really is different from int64, at least we have
>> predictable truncation behavior after fetching the value, rather
>> than undefined behavior while fetching it.

> Hm. I guess that'd be a bit better, but I'm not sure it's worth it. How
> about we simply add a static assert that long long isn't bigger than
> int64?

WFM, I'll make it happen.

regards, tom lane



Re: Performance improvements for src/port/snprintf.c

2018-10-03 Thread Andres Freund
On 2018-10-03 13:40:03 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2018-10-03 13:31:09 -0400, Tom Lane wrote:
> >> I do not see the point of messing with snprintf.c here.  I doubt that
> >> strfromd is faster than the existing sprintf call (because the latter
> >> can use ".*" instead of serializing and deserializing the precision).
>
> > I'm confused, the numbers I posted clearly show that it's faster?
>
> Those were in the context of whether float8out went through snprintf.c
> or directly to strfromd, no?

I measured both, changing float8out directly, and just adapting
snprintf.c:

> snprintf using sprintf via pg_double_to_string:
> 16195.787
>
> snprintf using strfromd via pg_double_to_string:
> 14856.974 ms
>
> float8out using sprintf via pg_double_to_string:
> 16176.169
>
> float8out using strfromd via pg_double_to_string:
> 13532.698

So when using pg's snprintf() to print a single floating point number
with precision, we get nearly a 10% boost. The win unsurprisingly is
bigger when not going through snprintf.c.

Greetings,

Andres Freund



Re: Performance improvements for src/port/snprintf.c

2018-10-03 Thread Andres Freund
Hi,

On 2018-10-03 13:18:35 -0400, Tom Lane wrote:
> I wrote:
> > Andres Freund  writes:
> >> - I know it's not new, but is it actually correct to use va_arg(args, 
> >> int64)
> >> for ATYPE_LONGLONG?
> 
> > Well, the problem with just doing s/int64/long long/g is that the
> > code would then fail on compilers without a "long long" type.
> > We could ifdef our way around that, but I don't think the code would
> > end up prettier.
> 
> I spent a bit more time thinking about that point.  My complaint about
> lack of long long should be moot given that we're now requiring C99.

True, I didn't think of that.


> So the two cases we need to worry about are (1) long long exists and
> is 64 bits, and (2) long long exists and is wider than 64 bits.  In
> case (1) there's nothing actively wrong with the code as it stands.
> In case (2), if we were to fix the problem by s/int64/long long/g,
> the result would be that we'd be doing the arithmetic for all
> integer-to-text conversions in 128 bits, which seems likely to be
> pretty expensive.

Yea, that seems quite undesirable.


> So a "real" fix would probably require having separate versions of
> fmtint for long and long long.  I'm not terribly excited about
> going there.  I can see it happening some day when/if we need to
> use 128-bit math more extensively than today, but I do not think
> that day is close.

Right, that seems a bit off.


> (Are there *any* platforms where "long long" is 128 bits today?)

Not that I'm aware off.


> Having said that, maybe there's a case for changing the type spec
> in only the va_arg() call, and leaving snprintf's related local
> variables as int64.  (Is that what you actually meant?)  Then,
> if long long really is different from int64, at least we have
> predictable truncation behavior after fetching the value, rather
> than undefined behavior while fetching it.

Hm. I guess that'd be a bit better, but I'm not sure it's worth it. How
about we simply add a static assert that long long isn't bigger than
int64?
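
Something along these lines, say (just a sketch; StaticAssertStmt and int64
come from c.h, and the exact placement in snprintf.c is up for grabs):

    /* would sit near the va_arg(args, int64) fetch in src/port/snprintf.c */
    StaticAssertStmt(sizeof(long long) <= sizeof(int64),
                     "snprintf.c assumes long long is no wider than int64");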

Greetings,

Andres Freund



Re: Performance improvements for src/port/snprintf.c

2018-10-03 Thread Tom Lane
Andres Freund  writes:
> On 2018-10-03 13:31:09 -0400, Tom Lane wrote:
>> I do not see the point of messing with snprintf.c here.  I doubt that
>> strfromd is faster than the existing sprintf call (because the latter
>> can use ".*" instead of serializing and deserializing the precision).

> I'm confused, the numbers I posted clearly show that it's faster?

Those were in the context of whether float8out went through snprintf.c
or directly to strfromd, no?

regards, tom lane



Re: Performance improvements for src/port/snprintf.c

2018-10-03 Thread Andres Freund
Hi,

On 2018-10-03 13:31:09 -0400, Tom Lane wrote:
> I do not see the point of messing with snprintf.c here.  I doubt that
> strfromd is faster than the existing sprintf call (because the latter
> can use ".*" instead of serializing and deserializing the precision).

I'm confused, the numbers I posted clearly show that it's faster?

Greetings,

Andres Freund



Re: Performance improvements for src/port/snprintf.c

2018-10-03 Thread Tom Lane
Andres Freund  writes:
> On 2018-10-03 12:54:52 -0400, Tom Lane wrote:
>> (1) The need to use sprintf for portability means that we need very
>> tight constraints on the precision spec *and* the buffer size *and*
>> the format type (%f pretty much destroys certainty about how long the
>> output string is).  So this isn't going to be general purpose code.
>> I think just writing it into float[48]out is sufficient.

> Well, the numbers suggest it's also useful to do so from snprintf - it's
> not that rare that we output floating point numbers from semi
> performance critical code, even leaving aside float[48]out.  So I'm not
> convinced that we shouldn't do this from within snprintf.c too. Now we
> could open-code it twice, but I'm not sure I see the point.

I do not see the point of messing with snprintf.c here.  I doubt that
strfromd is faster than the existing sprintf call (because the latter
can use ".*" instead of serializing and deserializing the precision).
Even if it is, I do not want to expose an attractive-nuisance API
in a header, and I think this would be exactly that.

> If we just define the API as having to guarantee there's enough space
> for the output format, I think it'll work well enough for now?

No, because that's a recipe for buffer-overflow bugs.  It's *hard*
to be sure the buffer is big enough, and easy to make breakable
assumptions.

> snprintf.c already assumes everything floating point can be output in
> 1024 chars, no?

Indeed, and it's got hacks like a forced limit to precision 350 in order
to make that safe.  I don't want to be repeating the reasoning in
fmtfloat() in a bunch of other places.

regards, tom lane



Re: Performance improvements for src/port/snprintf.c

2018-10-03 Thread Tom Lane
I wrote:
> Andres Freund  writes:
>> - I know it's not new, but is it actually correct to use va_arg(args, int64)
>> for ATYPE_LONGLONG?

> Well, the problem with just doing s/int64/long long/g is that the
> code would then fail on compilers without a "long long" type.
> We could ifdef our way around that, but I don't think the code would
> end up prettier.

I spent a bit more time thinking about that point.  My complaint about
lack of long long should be moot given that we're now requiring C99.
So the two cases we need to worry about are (1) long long exists and
is 64 bits, and (2) long long exists and is wider than 64 bits.  In
case (1) there's nothing actively wrong with the code as it stands.
In case (2), if we were to fix the problem by s/int64/long long/g,
the result would be that we'd be doing the arithmetic for all
integer-to-text conversions in 128 bits, which seems likely to be
pretty expensive.

So a "real" fix would probably require having separate versions of
fmtint for long and long long.  I'm not terribly excited about
going there.  I can see it happening some day when/if we need to
use 128-bit math more extensively than today, but I do not think
that day is close.  (Are there *any* platforms where "long long"
is 128 bits today?)

Having said that, maybe there's a case for changing the type spec
in only the va_arg() call, and leaving snprintf's related local
variables as int64.  (Is that what you actually meant?)  Then,
if long long really is different from int64, at least we have
predictable truncation behavior after fetching the value, rather
than undefined behavior while fetching it.

regards, tom lane



Re: Performance improvements for src/port/snprintf.c

2018-10-03 Thread Andres Freund
Hi,

On 2018-10-03 12:54:52 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > It seems the general "use strfromd if available" approach is generally
> > useful, even if we need to serialize the precision.
> 
> Agreed.
> 
> > Putting it into an
> > inline appears to be helpful, avoids some of the otherwise precision
> > related branches.  Do you have any feelings about which header to put
> > the code in?  I used common/string.h so far.
> 
> I do not think it should be in a header, for two reasons:
> 
> (1) The need to use sprintf for portability means that we need very
> tight constraints on the precision spec *and* the buffer size *and*
> the format type (%f pretty much destroys certainty about how long the
> output string is).  So this isn't going to be general purpose code.
> I think just writing it into float[48]out is sufficient.

Well, the numbers suggest it's also useful to do so from snprintf - it's
not that rare that we output floating point numbers from semi
performance critical code, even leaving aside float[48]out.  So I'm not
convinced that we shouldn't do this from within snprintf.c too. Now we
could open-code it twice, but I'm not sure I see the point.

If we just define the API as having to guarantee there's enough space
for the output format, I think it'll work well enough for now?
snprintf.c already assumes everything floating point can be output in
1024 chars, no?

Greetings,

Andres Freund



Re: Performance improvements for src/port/snprintf.c

2018-10-03 Thread Tom Lane
Andres Freund  writes:
> It seems the general "use strfromd if available" approach is generally
> useful, even if we need to serialize the precision.

Agreed.

> Putting it into an
> inline appears to be helpful, avoids some of the otherwise precision
> related branches.  Do you have any feelings about which header to put
> the code in?  I used common/string.h so far.

I do not think it should be in a header, for two reasons:

(1) The need to use sprintf for portability means that we need very
tight constraints on the precision spec *and* the buffer size *and*
the format type (%f pretty much destroys certainty about how long the
output string is).  So this isn't going to be general purpose code.
I think just writing it into float[48]out is sufficient.

(2) It's already the case that most code trying to emit floats ought
to go through float[48]out, in order to have standardized treatment
of Inf and NaN.  Providing some other API in a common header would
just create a temptation to break that policy.

Now, if we did write our own float output code then we would standardize
Inf/NaN outputs inside that, and both of these issues would go away ...
but I think what we'd do is provide something strfromd-like as an
alternate API for that code, so we still won't need a wrapper.
And anyway it doesn't sound like either of us care to jump that hurdle
right now.

regards, tom lane



Re: Performance improvements for src/port/snprintf.c

2018-10-03 Thread Tom Lane
Andres Freund  writes:
> On 2018-10-03 11:59:27 -0400, Tom Lane wrote:
>> I experimented with adding an initial check for "format is exactly %s"
>> at the top of dopr(), and couldn't get excited about that.  Instrumenting
>> things showed that the optimization fired in only 1.8% of the calls
>> during a run of our core regression tests.  Now, that might not count
>> as a really representative workload, but it doesn't make me think that
>> the case is worth optimizing for us.

> Seems right.  I also have a hard time believing that any of those "%s"
> printfs are performance critical - we'd hopefully just have avoided the
> sprintf in that case.

Yup, that's probably a good chunk of the reason why there aren't very
many.  But we *do* use %s a lot to assemble multiple strings or combine
them with fixed text, which is why the other form of the optimization
is useful.

>> But then it occurred to me that there's more than one way to skin this
>> cat.  We could, for an even cheaper extra test, detect that any one
>> format specifier is just "%s", and use the same kind of fast-path
>> within the loop.  With the same sort of instrumentation, I found that
>> a full 45% of the format specs executed in the core regression tests
>> are just %s.  That makes me think that a patch along the lines of the
>> attached is a good win for our use-cases.  Comparing to Fedora 28's
>> glibc, this gets us to

> Hm, especially if we special case the float->string conversions directly
> at the hot callsites, that seems reasonable.  I kinda wish we could just
> easily move the format string processing to compile-time, but given
> translatability that won't be widely possible even if it were otherwise
> feasible.

Yeah, there's a mighty big pile of infrastructure that depends on the
way *printf works.  I agree that one way or another we're going to be
special-casing float8out and float4out.

BTW, I poked around in the related glibc sources the other day, and
it seemed like they are doing some sort of quasi-compilation of format
strings.  I couldn't figure out how they made it pay, though --- without
some way to avoid re-compiling the same format string over and over,
seems like it couldn't net out as a win.  But if they are avoiding
that, I didn't find where.

regards, tom lane



Re: Performance improvements for src/port/snprintf.c

2018-10-03 Thread Andres Freund
Hi,

On 2018-10-03 12:22:13 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2018-10-03 12:07:32 -0400, Tom Lane wrote:
> >> [ scratches head ... ]  How would that work?  Seems like it necessarily
> >> adds a strlen() call to whatever we'd be doing otherwise.  palloc isn't
> >> going to be any faster just from asking it for slightly fewer bytes.
> >> I think there might be something wrong with your test scenario ...
> >> or there's more noise in the numbers than you thought.
> 
> > I guess the difference is that we're more likely to find reusable chunks
> > in aset.c and/or need fewer OS allocations.  As the memory is going to
> > be touched again very shortly afterwards, the cache effects probably are
> negligible.
> 
> > The strlen definitely shows up in profiles, it just seems to save at
> > least as much as it costs.
> 
> > Doesn't strike me as THAT odd?
> 
> What it strikes me as is excessively dependent on one particular test
> scenario.  I don't mind optimizations that are tradeoffs between
> well-understood costs, but this smells like handwaving that's going to
> lose as much or more often than winning, once it hits the real world.

I'm not particularly wedded to doing the allocation differently - I was
just mildly wondering if the increased size of the allocations could be
problematic. And that led me to testing that. And reporting it.   I
don't think the real-world test differences are that large in this
specific case, but whatever.

It seems the general "use strfromd if available" approach is generally
useful, even if we need to serialize the precision.  Putting it into an
inline appears to be helpful, avoids some of the otherwise precision
related branches.  Do you have any feelings about which header to put
the code in?  I used common/string.h so far.
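
For concreteness, the inline could look roughly like this (the name
pg_double_to_string matches the labels in my earlier numbers; the exact
signature, the HAVE_STRFROMD symbol, and the sprintf fallback are assumptions
of this sketch rather than the actual patch):

#include <stdio.h>
#include <stdlib.h>     /* strfromd(), with glibc >= 2.25 and _GNU_SOURCE */

static inline int
pg_double_to_string(char *buf, size_t bufsize, char type, int precision,
                    double value)
{
#ifdef HAVE_STRFROMD                        /* assumed configure symbol */
    char        fmt[8];

    /* strfromd() has no ".*", so serialize the precision into the format */
    snprintf(fmt, sizeof(fmt), "%%.%d%c", precision, type);
    return strfromd(buf, bufsize, fmt, value);
#else
    char        fmt[8];

    snprintf(fmt, sizeof(fmt), "%%.*%c", type);
    return sprintf(buf, fmt, precision, value);
#endif
}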

Greetings,

Andres Freund



Re: Performance improvements for src/port/snprintf.c

2018-10-03 Thread Andres Freund
Hi,

On 2018-10-03 11:59:27 -0400, Tom Lane wrote:
> I wrote:
> > ... However, I did add recent glibc (Fedora 28)
> > to the mix, and I was interested to discover that they seem to have
> > added a fast-path for format strings that are exactly "%s", just as
> > NetBSD did.  I wonder if we should reconsider our position on doing
> > that.  It'd be a simple enough addition...
> 
> I experimented with adding an initial check for "format is exactly %s"
> at the top of dopr(), and couldn't get excited about that.  Instrumenting
> things showed that the optimization fired in only 1.8% of the calls
> during a run of our core regression tests.  Now, that might not count
> as a really representative workload, but it doesn't make me think that
> the case is worth optimizing for us.

Seems right.  I also have a hard time believing that any of those "%s"
printfs are performance critical - we'd hopefully just have avoided the
sprintf in that case.


> But then it occurred to me that there's more than one way to skin this
> cat.  We could, for an even cheaper extra test, detect that any one
> format specifier is just "%s", and use the same kind of fast-path
> within the loop.  With the same sort of instrumentation, I found that
> a full 45% of the format specs executed in the core regression tests
> are just %s.  That makes me think that a patch along the lines of the
> attached is a good win for our use-cases.  Comparing to Fedora 28's
> glibc, this gets us to

Hm, especially if we special case the float->string conversions directly
at the hot callsites, that seems reasonable.  I kinda wish we could just
easily move the format string processing to compile-time, but given
translatability that won't be widely possible even if it were otherwise
feasible.
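
To make the shape of that per-specifier fast path concrete, here's a toy
standalone illustration (not snprintf.c's actual code; dopr() also has to
track argument types, %n$ references, output targets, and so on, all ignored
here):

#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/*
 * Whenever a specifier is exactly "%s", copy the argument straight
 * through instead of running the full conversion machinery.
 */
static void
emit_with_s_fastpath(const char *format, ...)
{
    va_list     args;

    va_start(args, format);
    while (*format)
    {
        if (format[0] == '%' && format[1] == 's')
        {
            const char *s = va_arg(args, const char *);

            fwrite(s, 1, strlen(s), stdout);    /* fast path: plain copy */
            format += 2;
        }
        else if (format[0] == '%')
        {
            /* full conversion handling (flags, width, other types) would
             * go here; omitted in this toy */
            format++;
        }
        else
            putchar(*format++);
    }
    va_end(args);
}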

Greetings,

Andres Freund



Re: libpq host/hostaddr/conninfo inconsistencies

2018-10-03 Thread Arthur Zakirov

Sorry for late answer.

On 9/30/18 10:21 AM, Fabien COELHO wrote:

sh> psql "host=127.0.0.2 hostaddr=127.0.0.1"


I'm not sure that this is the issue. The user defined the host name and psql
shows it.


The issue is that "host" is an IP: "\conninfo" will wrongly inform you that
you are connected to "127.0.0.2", but the actual connection is really to
"127.0.0.1". This is plainly misleading, and I consider this level of
unhelpfulness more a bug than a feature.


I didn't think this was an issue, because I took "host" as just a
display name for the host when "hostaddr" is defined. So a user may
label 127.0.0.1 (hostaddr) as "happy_host", for example; it doesn't
have to be a real host.


I searched for other use cases of PQhost(). In the PostgreSQL source code
I found that it is used in pg_dump and psql to connect to an instance.


There is another issue with PQhost() and psql (pg_dump could have it
too; see CloneArchive() in pg_backup_archiver.c and _connectDB() in
pg_backup_db.c):


$ psql "host=host_1,host_2 hostaddr=127.0.0.1,127.0.0.3 dbname=postgres"
=# \conninfo
You are connected to database "postgres" as user "artur" on host 
"host_1" at port "5432".

=# \connect test
could not translate host name "host_1" to address: Unknown name or service

Previous connection kept

So in the example above you cannot reuse the connection string with
\connect. What do you think?



I cannot agree with you. When I was learning libpq, I found the
host/hostaddr rules description useful. And I disagree that it is good
to remove it (as the patch does).
Of course this is only my point of view, and others may have a different
opinion.


I'm not sure I understand your concern.

Do you mean that you would prefer the document to keep describing that 
host/hostaddr/port accepts one value, and then have in some other place 
or at the end of the option documentation a line that says, "by the way, 
we really accept lists, and they must be somehow consistent between 
host/hostaddr/port"?


I wrote about the following part of the documentation:


-Using hostaddr instead of host 
allows the
-application to avoid a host name look-up, which might be important
-in applications with time constraints. However, a host name is
-required for GSSAPI or SSPI authentication
-methods, as well as for verify-full SSL
-certificate verification.  The following rules are used:
-

> ...

So I think the description of these rules is useful here and shouldn't be
removed. Your patch removes it, and maybe it shouldn't do that. But now I
realise that the patch changes this behavior, and backward compatibility
is broken.



The patch gives me an error if I specify only hostaddr:

psql -d "hostaddr=127.0.0.1"
psql: host "/tmp" cannot have an hostaddr "127.0.0.1"


This is the expected modified behavior: hostaddr can only be specified 
on a host when it is a name, which is not the case here.


See the comment above about backward compatibility. psql without the 
patch can connect to an instance if I specify only hostaddr.


--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



Re: Performance improvements for src/port/snprintf.c

2018-10-03 Thread Tom Lane
Andres Freund  writes:
> On 2018-10-03 12:07:32 -0400, Tom Lane wrote:
>> [ scratches head ... ]  How would that work?  Seems like it necessarily
>> adds a strlen() call to whatever we'd be doing otherwise.  palloc isn't
>> going to be any faster just from asking it for slightly fewer bytes.
>> I think there might be something wrong with your test scenario ...
>> or there's more noise in the numbers than you thought.

> I guess the difference is that we're more likely to find reusable chunks
> in aset.c and/or need fewer OS allocations.  As the memory is going to
> be touched again very shortly afterwards, the cache effects probably are
> negligible.

> The strlen definitely shows up in profiles, it just seems to save at
> least as much as it costs.

> Doesn't strike me as THAT odd?

What it strikes me as is excessively dependent on one particular test
scenario.  I don't mind optimizations that are tradeoffs between
well-understood costs, but this smells like handwaving that's going to
lose as much or more often than winning, once it hits the real world.

regards, tom lane



Re: Performance improvements for src/port/snprintf.c

2018-10-03 Thread Andres Freund
Hi,

On 2018-10-03 12:07:32 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > FWIW, it seems that using a local buffer and then pstrdup'ing that in
> > float8out_internal is a bit faster, and would probably save a bit of
> > memory on average:
> > float8out using sprintf via pg_double_to_string, pstrdup:
> > 15370.774
> > float8out using strfromd via pg_double_to_string, pstrdup:
> > 13498.331
> 
> [ scratches head ... ]  How would that work?  Seems like it necessarily
> adds a strlen() call to whatever we'd be doing otherwise.  palloc isn't
> going to be any faster just from asking it for slightly fewer bytes.
> I think there might be something wrong with your test scenario ...
> or there's more noise in the numbers than you thought.

I guess the difference is that we're more likely to find reusable chunks
in aset.c and/or need fewer OS allocations.  As the memory is going to
be touched again very shortly afterwards, the cache effects probably are
negligible.

The strlen definitely shows up in profiles, it just seems to save at
least as much as it costs.

Doesn't strike me as THAT odd?

Greetings,

Andres Freund



Re: Performance improvements for src/port/snprintf.c

2018-10-03 Thread Tom Lane
Andres Freund  writes:
> FWIW, it seems that using a local buffer and then pstrdup'ing that in
> float8out_internal is a bit faster, and would probably save a bit of
> memory on average:
> float8out using sprintf via pg_double_to_string, pstrdup:
> 15370.774
> float8out using strfromd via pg_double_to_string, pstrdup:
> 13498.331

[ scratches head ... ]  How would that work?  Seems like it necessarily
adds a strlen() call to whatever we'd be doing otherwise.  palloc isn't
going to be any faster just from asking it for slightly fewer bytes.
I think there might be something wrong with your test scenario ...
or there's more noise in the numbers than you thought.

regards, tom lane



Re: Performance improvements for src/port/snprintf.c

2018-10-03 Thread Tom Lane
I wrote:
> ... However, I did add recent glibc (Fedora 28)
> to the mix, and I was interested to discover that they seem to have
> added a fast-path for format strings that are exactly "%s", just as
> NetBSD did.  I wonder if we should reconsider our position on doing
> that.  It'd be a simple enough addition...

I experimented with adding an initial check for "format is exactly %s"
at the top of dopr(), and couldn't get excited about that.  Instrumenting
things showed that the optimization fired in only 1.8% of the calls
during a run of our core regression tests.  Now, that might not count
as a really representative workload, but it doesn't make me think that
the case is worth optimizing for us.

But then it occurred to me that there's more than one way to skin this
cat.  We could, for an even cheaper extra test, detect that any one
format specifier is just "%s", and use the same kind of fast-path
within the loop.  With the same sort of instrumentation, I found that
a full 45% of the format specs executed in the core regression tests
are just %s.  That makes me think that a patch along the lines of the
attached is a good win for our use-cases.  Comparing to Fedora 28's
glibc, this gets us to

Test case: %s
snprintf time = 8.83615 ms total, 8.83615e-06 ms per iteration
pg_snprintf time = 23.9372 ms total, 2.39372e-05 ms per iteration
ratio = 2.709

Test case: %sx
snprintf time = 59.4481 ms total, 5.94481e-05 ms per iteration
pg_snprintf time = 29.8983 ms total, 2.98983e-05 ms per iteration
ratio = 0.503

versus what we have as of this morning's commit:

Test case: %s
snprintf time = 7.7427 ms total, 7.7427e-06 ms per iteration
pg_snprintf time = 26.2439 ms total, 2.62439e-05 ms per iteration
ratio = 3.390

Test case: %sx
snprintf time = 61.4452 ms total, 6.14452e-05 ms per iteration
pg_snprintf time = 32.7516 ms total, 3.27516e-05 ms per iteration
ratio = 0.533

The penalty for non-%s cases seems to be a percent or so, although
it's barely above the noise floor in my tests.

regards, tom lane

diff --git a/src/port/snprintf.c b/src/port/snprintf.c
index cad7345..b9b6add 100644
*** a/src/port/snprintf.c
--- b/src/port/snprintf.c
*** dopr(PrintfTarget *target, const char *f
*** 431,436 
--- 431,449 
  
  		/* Process conversion spec starting at *format */
  		format++;
+ 
+ 		/* Fast path for conversion spec that is exactly %s */
+ 		if (*format == 's')
+ 		{
+ 			format++;
+ 			strvalue = va_arg(args, char *);
+ 			Assert(strvalue != NULL);
+ 			dostr(strvalue, strlen(strvalue), target);
+ 			if (target->failed)
+ break;
+ 			continue;
+ 		}
+ 
  		fieldwidth = precision = zpad = leftjust = forcesign = 0;
  		longflag = longlongflag = pointflag = 0;
  		fmtpos = accum = 0;


Re: Early WIP/PoC for inlining CTEs

2018-10-03 Thread David Fetter
On Tue, Oct 02, 2018 at 12:03:06PM +0200, Andreas Karlsson wrote:
> Hi,
> 
> Here is an updated patch which adds some simple syntax for adding the
> optimization barrier. For example:
> 
> WITH x AS MATERIALIZED (
>SELECT 1
> )
> SELECT * FROM x;
> 
> Andreas

This is great!

Is there any meaningful distinction between "inlining," by which I
mean converting to a subquery, and predicate pushdown, which would
happen, at least for a first cut, at the rewrite stage?

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: Performance improvements for src/port/snprintf.c

2018-10-03 Thread Andres Freund
Hi,

On 2018-10-03 08:20:14 -0400, Tom Lane wrote:
> Andres Freund  writes:
> >> While there might be value in implementing our own float printing code,
> >> I have a pretty hard time getting excited about the cost/benefit ratio
> >> of that.  I think that what we probably really ought to do here is hack
> >> float4out/float8out to bypass the extra overhead, as in the 0002 patch
> >> below.
> 
> > I'm thinking we should do a bit more than just that hack. I'm thinking
> > of something (barely tested) like
> 
> Meh.  The trouble with that is that it relies on the platform's snprintf,
> not sprintf, and that brings us right back into a world of portability
> hurt.  I don't feel that the move to C99 gets us out of worrying about
> noncompliant snprintfs --- we're only requiring a C99 *compiler*, not
> libc.  See buildfarm member gharial for a counterexample.

Oh, we could just use sprintf() and tell strfromd the buffer is large
enough. I only used snprintf because it seemed more symmetric, and
because I was at most 1/3 awake.


> I'm happy to look into whether using strfromd when available buys us
> anything over using sprintf.  I'm not entirely convinced that it will,
> because of the need to ASCII-ize and de-ASCII-ize the precision, but
> it's worth checking.

It's definitely faster.  It's not a full-blown format parser, so I guess
the cost of the conversion isn't too bad:
https://sourceware.org/git/?p=glibc.git;a=blob;f=stdlib/strfrom-skeleton.c;hb=HEAD#l68
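
For reference, a rough sketch of the sort of helper I tested as
pg_double_to_string (not the actual patch; HAVE_STRFROMD stands in for an
assumed configure test):

/* On glibc, _GNU_SOURCE (or the TS 18661-1 feature macro) exposes strfromd() */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>

static int
pg_double_to_string(char *buf, size_t bufsize, int precision, double value)
{
#ifdef HAVE_STRFROMD
	/* strfromd() wants the precision ASCII-ized into the format string */
	char		fmt[16];

	snprintf(fmt, sizeof(fmt), "%%.%dg", precision);
	return strfromd(buf, bufsize, fmt, value);
#else
	/* fallback: plain sprintf(), which takes the precision as an argument */
	return sprintf(buf, "%.*g", precision, value);
#endif
}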

CREATE TABLE somefloats(id serial, data1 float8, data2 float8, data3 float8);
INSERT INTO somefloats(data1, data2, data3) SELECT random(), random(), random() 
FROM generate_series(1, 1000);
VACUUM FREEZE somefloats;

I'm comparing the times of:
COPY somefloats TO '/dev/null';

master (including your commit):
16177.202 ms

snprintf using sprintf via pg_double_to_string:
16195.787

snprintf using strfromd via pg_double_to_string:
14856.974 ms

float8out using sprintf via pg_double_to_string:
16176.169

float8out using strfromd via pg_double_to_string:
13532.698



FWIW, it seems that using a local buffer and then pstrdup'ing that in
float8out_internal is a bit faster, and would probably save a bit of
memory on average:

float8out using sprintf via pg_double_to_string, pstrdup:
15370.774

float8out using strfromd via pg_double_to_string, pstrdup:
13498.331
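
For illustration, the two shapes being compared look roughly like this
(simplified sketch, not the actual patch; MAXDOUBLEWIDTH and ndig stand in
for what float8out_internal really uses):

#include "postgres.h"

#define MAXDOUBLEWIDTH	128		/* worst-case output width, as in float.c */

/* (a) current shape: palloc a worst-case buffer and print into it */
static char *
double_out_palloc_first(double num, int ndig)
{
	char	   *result = palloc(MAXDOUBLEWIDTH + 1);

	snprintf(result, MAXDOUBLEWIDTH + 1, "%.*g", ndig, num);
	return result;
}

/* (b) shape measured above: print into a stack buffer, then pstrdup() the
 * exact-length string -- an extra strlen(), but a smaller allocation */
static char *
double_out_pstrdup(double num, int ndig)
{
	char		buf[MAXDOUBLEWIDTH + 1];

	snprintf(buf, sizeof(buf), "%.*g", ndig, num);
	return pstrdup(buf);
}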


Greetings,

Andres Freund



Re: BUG #15307: Low numerical precision of (Co-) Variance

2018-10-03 Thread Madeleine Thompson
On Wed, Oct 3, 2018 at 9:04 AM Dean Rasheed  wrote:
>
> On Thu, 27 Sep 2018 at 14:22, Dean Rasheed  wrote:
> > I'll post an updated patch in a while.
> >
>
> I finally got round to looking at this again. Here is an update, based
> on the review comments.
>
> The next question is whether or not to back-patch this. Although this
> was reported as a bug, my inclination is to apply this to HEAD only,
> based on the lack of prior complaints. That also matches our decision
> for other similar patches, e.g., 7d9a4737c2.

This diff looks good to me. Also, it applies cleanly against
abd9ca377d669a6e0560e854d7e987438d0e612e and passes `make
check-world`.

I agree that this is not suitable for a patch release.

Madeleine



Re: BUG #15307: Low numerical precision of (Co-) Variance

2018-10-03 Thread Tom Lane
Dean Rasheed  writes:
> The next question is whether or not to back-patch this. Although this
> was reported as a bug, my inclination is to apply this to HEAD only,
> based on the lack of prior complaints. That also matches our decision
> for other similar patches, e.g., 7d9a4737c2.

+1 for no back-patch.  Even though the new results are hopefully more
accurate, they're still different from before, and that might cause
issues for somebody if it happens in a stable branch.

regards, tom lane



Re: BUG #15307: Low numerical precision of (Co-) Variance

2018-10-03 Thread Dean Rasheed
On Thu, 27 Sep 2018 at 14:22, Dean Rasheed  wrote:
> I'll post an updated patch in a while.
>

I finally got round to looking at this again. Here is an update, based
on the review comments.

The next question is whether or not to back-patch this. Although this
was reported as a bug, my inclination is to apply this to HEAD only,
based on the lack of prior complaints. That also matches our decision
for other similar patches, e.g., 7d9a4737c2.

Regards,
Dean
diff --git a/src/backend/utils/adt/float.c b/src/backend/utils/adt/float.c
new file mode 100644
index df35557..d1e12c1
--- a/src/backend/utils/adt/float.c
+++ b/src/backend/utils/adt/float.c
@@ -2401,13 +2401,39 @@ setseed(PG_FUNCTION_ARGS)
  *		float8_stddev_samp	- produce final result for float STDDEV_SAMP()
  *		float8_stddev_pop	- produce final result for float STDDEV_POP()
  *
+ * The naive schoolbook implementation of these aggregates works by
+ * accumulating sum(X) and sum(X^2).  However, this approach suffers from
+ * large rounding errors in the final computation of quantities like the
+ * population variance (N*sum(X^2) - sum(X)^2) / N^2, since each of the
+ * intermediate terms is potentially very large, while the difference is often
+ * quite small.
+ *
+ * Instead we use the Youngs-Cramer algorithm [1] which works by accumulating
+ * Sx=sum(X) and Sxx=sum((X-Sx/N)^2), using a numerically stable algorithm to
+ * incrementally update those quantities.  The final computations of each of
+ * the aggregate values is then trivial and gives more accurate results (for
+ * example, the population variance is just Sxx/N).  This algorithm is also
+ * fairly easy to generalize to allow parallel execution without loss of
+ * precision (see, for example, [2]).  For more details, and a comparison of
+ * this with other algorithms, see [3].
+ *
  * The transition datatype for all these aggregates is a 3-element array
- * of float8, holding the values N, sum(X), sum(X*X) in that order.
+ * of float8, holding the values N, Sx, Sxx in that order.
  *
  * Note that we represent N as a float to avoid having to build a special
  * datatype.  Given a reasonable floating-point implementation, there should
  * be no accuracy loss unless N exceeds 2 ^ 52 or so (by which time the
  * user will have doubtless lost interest anyway...)
+ *
+ * [1] Some Results Relevant to Choice of Sum and Sum-of-Product Algorithms,
+ * E. A. Youngs and E. M. Cramer, Technometrics Vol 13, No 3, August 1971.
+ *
+ * [2] Updating Formulae and a Pairwise Algorithm for Computing Sample
+ * Variances, T. F. Chan, G. H. Golub & R. J. LeVeque, COMPSTAT 1982.
+ *
+ * [3] Numerically Stable Parallel Computation of (Co-)Variance, Erich
+ * Schubert and Michael Gertz, Proceedings of the 30th International
+ * Conference on Scientific and Statistical Database Management, 2018.
  */
 
 static float8 *
@@ -2441,18 +2467,90 @@ float8_combine(PG_FUNCTION_ARGS)
 	ArrayType  *transarray2 = PG_GETARG_ARRAYTYPE_P(1);
 	float8	   *transvalues1;
 	float8	   *transvalues2;
-
-	if (!AggCheckCallContext(fcinfo, NULL))
-		elog(ERROR, "aggregate function called in non-aggregate context");
+	float8		N1,
+				Sx1,
+				Sxx1,
+				N2,
+				Sx2,
+				Sxx2,
+				tmp,
+				N,
+				Sx,
+				Sxx;
 
 	transvalues1 = check_float8_array(transarray1, "float8_combine", 3);
 	transvalues2 = check_float8_array(transarray2, "float8_combine", 3);
 
-	transvalues1[0] = transvalues1[0] + transvalues2[0];
-	transvalues1[1] = float8_pl(transvalues1[1], transvalues2[1]);
-	transvalues1[2] = float8_pl(transvalues1[2], transvalues2[2]);
+	N1 = transvalues1[0];
+	Sx1 = transvalues1[1];
+	Sxx1 = transvalues1[2];
 
-	PG_RETURN_ARRAYTYPE_P(transarray1);
+	N2 = transvalues2[0];
+	Sx2 = transvalues2[1];
+	Sxx2 = transvalues2[2];
+
+	/*
+	 * The transition values combine using a generalization of the
+	 * Youngs-Cramer algorithm as follows:
+	 *
+	 *	N = N1 + N2
+	 *	Sx = Sx1 + Sx2
+	 *	Sxx = Sxx1 + Sxx2 + N1 * N2 * (Sx1/N1 - Sx2/N2)^2 / N;
+	 *
+	 * It's worth handling the special cases N1 = 0 and N2 = 0 separately
+	 * since those cases are trivial, and we then don't need to worry about
+	 * division-by-zero errors in the general case.
+	 *
+	 */
+	if (N1 == 0.0)
+	{
+		N = N2;
+		Sx = Sx2;
+		Sxx = Sxx2;
+	}
+	else if (N2 == 0.0)
+	{
+		N = N1;
+		Sx = Sx1;
+		Sxx = Sxx1;
+	}
+	else
+	{
+		N = N1 + N2;
+		Sx = float8_pl(Sx1, Sx2);
+		tmp = Sx1 / N1 - Sx2 / N2;
+		Sxx = Sxx1 + Sxx2 + N1 * N2 * tmp * tmp / N;
+		check_float8_val(Sxx, isinf(Sxx1) || isinf(Sxx2), true);
+	}
+
+	/*
+	 * If we're invoked as an aggregate, we can cheat and modify our first
+	 * parameter in-place to reduce palloc overhead. Otherwise we construct a
+	 * new array with the updated transition data and return it.
+	 */
+	if (AggCheckCallContext(fcinfo, NULL))
+	{
+		transvalues1[0] = N;
+		transvalues1[1] = Sx;
+		transvalues1[2] = Sxx;
+
+		PG_RETURN_ARRAYTYPE_P(transarray1);
+	}
+	else
+	{
+		Datum		transdatums[3];
+		ArrayType 
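
As a standalone illustration of why the Youngs-Cramer form matters (a toy
program, not part of the patch), compare the schoolbook accumulation with
the incremental accumulation for three values with a large mean and a tiny
variance; the true population variance is 2/3:

#include <stdio.h>

int
main(void)
{
	double		naive_sx = 0.0,
				naive_sxx = 0.0;
	double		yc_sx = 0.0,
				yc_sxx = 0.0;
	double		n = 0.0;
	double		vals[] = {1e9 - 1, 1e9, 1e9 + 1};

	for (int i = 0; i < 3; i++)
	{
		double		x = vals[i];

		/* schoolbook: accumulate sum(X) and sum(X^2) */
		naive_sx += x;
		naive_sxx += x * x;

		/* Youngs-Cramer: accumulate Sx and Sxx incrementally */
		n += 1.0;
		yc_sx += x;
		if (n > 1.0)
		{
			double		tmp = x * n - yc_sx;

			yc_sxx += tmp * tmp / (n * (n - 1.0));
		}
	}

	/* the schoolbook result cancels catastrophically; Youngs-Cramer doesn't */
	printf("schoolbook var_pop    = %g\n",
		   (naive_sxx - naive_sx * naive_sx / n) / n);
	printf("Youngs-Cramer var_pop = %g\n", yc_sxx / n);
	return 0;
}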

Re: [RFC] Removing "magic" oids

2018-10-03 Thread Andreas Karlsson
On 09/30/2018 05:48 AM, Andres Freund wrote:
> I think we should drop WITH OIDs support.


+1


2) We need to be able to manually insert into catalog tables. Both
initdb and emergency surgery.  My current hack is doing so directly
in nodeModifyTable.c but that's beyond ugly.  I think we should add
an explicit DEFAULT clause to those columns with something like
nextoid('tablename', 'name_of_index') or such.


Adding a function to get the next OID sounds like a good solution for 
both our catalog and legacy applications. The only potential 
disadvantage that I see is that this function becomes something we need 
to maintain if we ever decide to move from OIDs to sequences.
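
For what it's worth, a purely hypothetical sketch of such a function as a
thin wrapper around the existing GetNewOidWithIndex() (the name, the
signature, and the assumption that oid becomes an ordinary column named
"oid" are all mine, not from the patch; error handling omitted):

#include "postgres.h"
#include "access/heapam.h"
#include "catalog/catalog.h"
#include "fmgr.h"
#include "utils/lsyscache.h"

PG_FUNCTION_INFO_V1(nextoid);

Datum
nextoid(PG_FUNCTION_ARGS)
{
	Oid			reloid = PG_GETARG_OID(0);
	Oid			idxoid = PG_GETARG_OID(1);
	AttrNumber	oidcol = get_attnum(reloid, "oid");
	Relation	rel = heap_open(reloid, RowExclusiveLock);
	Oid			newoid = GetNewOidWithIndex(rel, idxoid, oidcol);

	heap_close(rel, RowExclusiveLock);
	PG_RETURN_OID(newoid);
}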



3) The quickest way to deal with the bootstrap code was to just assign
all oids for oid carrying tables that don't yet have any assigned.
That doesn't generally seem terrible, although it's certainly badly
implemented right now.  That'd mean we'd have three ranges of oids
probably, unless we somehow have the bootstrap code advance the
in-database oid counter to a right state before we start with
post-bootstrap work.  I like the idea of all bootstrap time oids
being determined by genbki.pl (I think that'll allow to remove some
code too).


Agreed, having genbki.pl determine all oids in the bootstrap data sounds 
ideal.


Andreas




Re: Performance improvements for src/port/snprintf.c

2018-10-03 Thread Tom Lane
Andres Freund  writes:
> On 2018-10-02 17:54:31 -0400, Tom Lane wrote:
>> Here's a version of this patch rebased over commit 625b38ea0.

> Cool.  Let's get that in...

Cool, I'll push it shortly.

>> While there might be value in implementing our own float printing code,
>> I have a pretty hard time getting excited about the cost/benefit ratio
>> of that.  I think that what we probably really ought to do here is hack
>> float4out/float8out to bypass the extra overhead, as in the 0002 patch
>> below.

> I'm thinking we should do a bit more than just that hack. I'm thinking
> of something (barely tested) like

Meh.  The trouble with that is that it relies on the platform's snprintf,
not sprintf, and that brings us right back into a world of portability
hurt.  I don't feel that the move to C99 gets us out of worrying about
noncompliant snprintfs --- we're only requiring a C99 *compiler*, not
libc.  See buildfarm member gharial for a counterexample.

I'm happy to look into whether using strfromd when available buys us
anything over using sprintf.  I'm not entirely convinced that it will,
because of the need to ASCII-ize and de-ASCII-ize the precision, but
it's worth checking.

> FWIW, I think there's still a significant argument to be made that we
> should work on our floating point IO performance. Both on the input and
> output side. It's a significant practical problem. But both a fix like
> you describe, and my proposal, should bring us to at least the previous
> level of performance for the hot paths. So that'd then just be an
> independent consideration.

Well, an independent project anyway.  I concur that it would have value;
but whether it's worth the effort, and the possible behavioral changes,
is not very clear to me.

regards, tom lane



Re: partition tree inspection functions

2018-10-03 Thread Jesper Pedersen

Hi Michael,

On 10/2/18 11:37 PM, Michael Paquier wrote:

isleaf is not particularly useful as it can just be fetched with a join
on pg_class/relkind.  So I think that we had better remove it.



Removing isleaf would require extra round trips to the server to get 
that information. So, I think we should keep it.


Best regards,
 Jesper



Re[2]: Alter index rename concurrently to

2018-10-03 Thread Andrey Klychkov

> Based on these assertions, here is my proposed patch. It lowers the
> lock level for index renaming to ShareUpdateExclusiveLock.

Hi, Peter

I did a small review of your patch:
Server source code was taken from https://github.com/postgres/postgres.git
1. The patch applied without errors, except for a part related to documentation:
error: patch failed: doc/src/sgml/ref/alter_index.sgml:50
error: doc/src/sgml/ref/alter_index.sgml: patch does not apply
2. The code compiled successfully, configured with:
# ./configure CFLAGS="-O0" --enable-debug --enable-cassert --enable-depend 
--without-zlib
3. 'make' / 'make install' completed successfully.
4. The compiled instance started without any errors.
5. I ran several tests with pgbench (-c 4 -j 4 -T 360) that modified data in 
columns indexed by a primary key and an ordinary btree index.
At the same time a script was running that repeatedly renamed the indexes in 
transactions with pg_sleep(1).
After several tests no errors were found.
6. pg_upgrade from 10 to 12 (patched) completed without any errors or warnings.
7. Code style:
+RenameRelationInternal(Oid myrelid, const char *newrelname, bool is_internal, 
bool is_index)
This line is longer than 80 chars.
Thank you
>Tuesday, 14 August 2018, 9:33 +03:00 from Peter Eisentraut 
>:
>
>On 31/07/2018 23:10, Peter Eisentraut wrote:
>> On 27/07/2018 16:16, Robert Haas wrote:
>>> With respect to this particular patch, I don't know whether there are
>>> any hazards of the second type.  What I'd check, if it were me, is
>>> what structures in the index's relcache entry are going to get rebuilt
>>> when the index name changes.  If any of those look like things that
>>> something that somebody could hold a pointer to during the course of
>>> query execution, or more generally be relying on not to change during
>>> the course of query execution, then you've got a problem.
>> 
>> I have investigated this, and I think it's safe.  Relcache reloads for
>> open indexes are already handled specially in RelationReloadIndexInfo().
>>  The only pointers that get replaced there are rd_amcache and
>> rd_options.  I have checked where those are used, and it looks like they
>> are not used across possible relcache reloads.  The code structure in
>> those two cases make this pretty unlikely even in the future.  Also,
>> violations would probably be caught by CLOBBER_CACHE_ALWAYS.
>
>Based on these assertions, here is my proposed patch.  It lowers the
>lock level for index renaming to ShareUpdateExclusiveLock.
>
>-- 
>Peter Eisentraut  http://www.2ndQuadrant.com/
>PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Regards,
Andrey Klychkov


shared buffer manager problems and redesign

2018-10-03 Thread Jamison, Kirk
Hello, hackers

(Actually I'm not sure whether I should post this here or on the pgsql-general 
mailing list.)
The time-consuming behavior of scanning the shared buffer mapping that happens 
in TRUNCATE, and in VACUUM when deleting the trailing empty pages of a 
relation [1], as well as data corruption on file truncation error [2], etc., 
have been discussed again a few times recently.

Buffer Manager redesign/restructure

Andres Freund has worked on this before and described the design and methods 
for how a buffer radix tree could be implemented [4].
I think it is worth considering, because a lot of solutions were proposed in 
previous threads [1] [2] [3] [4],
but we could not arrive at a consensus, as it was pointed out that some 
methods lead to more complexity.
An ordered buffer mapping implementation (changing the current buffer mapping 
implementation) keeps coming up
in these discussions because it would potentially address these problems.

But before we can work on a POC patch, I think we should start a common 
discussion of the potential

a.) data structure design of the modified buffer manager: open-relations 
hash table, buffer radix tree, etc.

b.) buffer tags and locks

c.) implementation, operations (loading pages, flushing/writing out 
buffers), complexities, etc.

However, from what I understand, it is realistically not possible to have 
this committed by PG12 given the complexity and the time required.
There is also the question of whether some of these problems could be 
resolved by a solution that does not redesign the shared buffer manager.
So I think it is important to start discussing this soon.

I hope to hear more insights, ideas, suggestions, truth-bombs, or so. :)

Thank you very much.

Regards,
Kirk

References
[1] "reloption to prevent VACUUM from truncating empty pages at the end of 
relation"
https://www.postgresql.org/message-id/flat/CAHGQGwE5UqFqSq1%3DkV3QtTUtXphTdyHA-8rAj4A%3DY%2Be4kyp3BQ%40mail.gmail.com
[2] "Truncation failure in autovacuum results in data corruption (duplicate 
keys)"
https://www.postgresql.org/message-id/flat/5BBC590AE8DF4ED1A170E4D48F1B53AC@tunaPC
[3] "WIP: long transactions on hot standby feedback replica / proof of concept"
https://www.postgresql.org/message-id/flat/c9374921e50a5e8fb1ecf04eb8c6ebc3%40postgrespro.ru
[4] "Reducing the size of BufferTag & remodeling forks"
https://www.postgresql.org/message-id/flat/20150702133619.GB16267%40alap3.anarazel.de




Re: [HACKERS] Transactions involving multiple postgres foreign servers, take 2

2018-10-03 Thread Chris Travers
On Wed, Oct 3, 2018 at 9:41 AM Chris Travers 
wrote:

> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, failed
> Implements feature:   not tested
> Spec compliant:   not tested
> Documentation:tested, failed
>
> I am hoping I am not out of order in writing this before the commitfest
> starts.  The patch is big and long and so I wanted to start on this while
> traffic is slow.
>
> I find this patch quite welcome and very close to a minimum viable
> version.  The few significant limitations can be resolved later.  One thing
> I may have missed in the documentation is a discussion of the limits of the
> current approach.  I think this would be important to document because the
> caveats of the current approach are significant, but the people who need it
> will have the knowledge to work with issues if they come up.
>
> The major caveat I see in our past discussions and (if I read the patch
> correctly) is that the resolver goes through global transactions
> sequentially and does not move on to the next until the previous one is
> resolved.  This means that if I have a global transaction on server A, with
> foreign servers B and C, and I have another one on server A with foreign
> servers C and D, if server B goes down at the wrong moment, the background
> worker does not look like it will detect the failure and move on to try to
> resolve the second, so server D will have a badly set vacuum horizon until
> this is resolved.  Also if I read the patch correctly, it looks like one
> can invoke SQL commands to remove the bad transaction to allow processing
> to continue and manual resolution (this is good and necessary because in
> this area there is no ability to have perfect recoverability without
> occasional administrative action).  I would really like to see more
> documentation of failure cases and appropriate administrative action at
> present.  Otherwise this is I think a minimum viable addition and I think
> we want it.
>
> It is possible i missed that in the documentation.  If so, my objection
> stands aside.  If it is welcome I am happy to take a first crack at such
> docs.
>

After further testing I am pretty sure I misread the patch.  It looks like
one can have multiple resolvers which can, in fact, work through a queue
together solving this problem.  So the objection above is not valid and I
withdraw that objection.  I will re-review the docs in light of the
experience.


>
> To my mind that's the only blocker in the code (but see below).  I can say
> without a doubt that I would expect we would use this feature once
> available.
>
> --
>
> Testing however failed.
>
> make installcheck-world fails with errors like the following:
>
>  -- Modify foreign server and raise an error
>   BEGIN;
>   INSERT INTO ft7_twophase VALUES(8);
> + ERROR:  prepread foreign transactions are disabled
> + HINT:  Set max_prepared_foreign_transactions to a nonzero value.
>   INSERT INTO ft8_twophase VALUES(NULL); -- violation
> ! ERROR:  current transaction is aborted, commands ignored until end of
> transaction block
>   ROLLBACK;
>   SELECT * FROM ft7_twophase;
> ! ERROR:  prepread foreign transactions are disabled
> ! HINT:  Set max_prepared_foreign_transactions to a nonzero value.
>   SELECT * FROM ft8_twophase;
> ! ERROR:  prepread foreign transactions are disabled
> ! HINT:  Set max_prepared_foreign_transactions to a nonzero value.
>   -- Rollback foreign transaction that involves both 2PC-capable
>   -- and 2PC-non-capable foreign servers.
>   BEGIN;
>   INSERT INTO ft8_twophase VALUES(7);
> + ERROR:  prepread foreign transactions are disabled
> + HINT:  Set max_prepared_foreign_transactions to a nonzero value.
>   INSERT INTO ft9_not_twophase VALUES(7);
> + ERROR:  current transaction is aborted, commands ignored until end of
> transaction block
>   ROLLBACK;
>   SELECT * FROM ft8_twophase;
> ! ERROR:  prepread foreign transactions are disabled
> ! HINT:  Set max_prepared_foreign_transactions to a nonzero value.
>
> make installcheck in the contrib directory shows the same, so that's the
> easiest way of reproducing, at least on a new installation.  I think the
> test cases will have to handle that sort of setup.
>
> make check in the contrib directory passes.
>
> For reasons of test failures, I am setting this back to waiting on author.
>
> --
> I had a few other thoughts that I figure are worth sharing with the
> community on this patch with the idea that once it is in place, this may
> open up more options for collaboration in the area of federated and
> distributed storage generally.  I could imagine other foreign data wrappers
> using this API, and folks might want to refactor out the atomic handling
> part so that extensions that do not use the foreign data wrapper structure
> could use it as well (while this looks like a classic SQL/MED issue, I am
> not sure that only foreign data wrappers woul

Re: [HACKERS] Transactions involving multiple postgres foreign servers, take 2

2018-10-03 Thread Chris Travers
On Wed, Oct 3, 2018 at 9:56 AM Chris Travers 
wrote:

>
>
> On Wed, Oct 3, 2018 at 9:41 AM Chris Travers 
> wrote:
>
>>
>> (errmsg("preparing foreign transactions
> (max_prepared_foreign_transactions > 0) requires maX_foreign_xact_resolvers
> > 0")));
>

Two more critical notes here which I think are small blockers.

The error message above references a config variable that does not exist.

The correct name of the config parameter is
max_foreign_transaction_resolvers

 Setting that along with the following to 10 caused the tests to pass, but
again it fails on default configs:

max_prepared_foreign_transactions, max_prepared_transactions

>
>
>
> --
> Best Regards,
> Chris Travers
> Head of Database
>
> Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
> Saarbrücker Straße 37a, 10405 Berlin
>
>

-- 
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin


Re: [HACKERS] Transactions involving multiple postgres foreign servers, take 2

2018-10-03 Thread Chris Travers
On Wed, Oct 3, 2018 at 9:41 AM Chris Travers 
wrote:

> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, failed
> Implements feature:   not tested
> Spec compliant:   not tested
> Documentation:tested, failed
>

Also one really minor point:  I think this is a typo (maX vs max)?

(errmsg("preparing foreign transactions (max_prepared_foreign_transactions
> 0) requires maX_foreign_xact_resolvers > 0")));




-- 
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin


Re: pgsql: Improve autovacuum logging for aggressive and anti-wraparound ru

2018-10-03 Thread Michael Paquier
On Wed, Oct 03, 2018 at 04:37:29PM +0900, Masahiko Sawada wrote:
> Thank you for the comment. Attached the updated patch.

Thanks for the patch.  This looks clean to me at a quick glance, not
tested though.
--
Michael


signature.asc
Description: PGP signature


Re: Segfault when creating partition with a primary key and sql_drop trigger exists

2018-10-03 Thread Michael Paquier
On Fri, Sep 28, 2018 at 12:17:00PM +0900, Michael Paquier wrote:
> I think that Alvaro should definitely look at this patch to be sure, or
> I could do it, but I would need to spend way more time on this and check
> event trigger interactions. 
> 
> Anyway, I was struggling a bit with where to add a regression test.
> event_trigger.sql makes the most sense, but in the tests for drops the
> objects are created before the event trigger is defined, so things would
> need to be moved around so that the original problem is reproducible.
> Perhaps you have an idea for that?

Okay.  I have spent more time on this issue, and I have been able to
integrate a test in the existing event_trigger.sql which is able to
reproduce the reported failure.  Attached is what I am finishing with.

I still want to do more testing on it, and the day is ending here.

Thoughts?
--
Michael
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 9229f619d2..b0cb5514d3 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -50,6 +50,7 @@
 #include "catalog/pg_type.h"
 #include "catalog/storage.h"
 #include "commands/tablecmds.h"
+#include "commands/event_trigger.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
 #include "miscadmin.h"
@@ -212,7 +213,8 @@ relationHasPrimaryKey(Relation rel)
 void
 index_check_primary_key(Relation heapRel,
 		IndexInfo *indexInfo,
-		bool is_alter_table)
+		bool is_alter_table,
+		IndexStmt *stmt)
 {
 	List	   *cmds;
 	int			i;
@@ -280,7 +282,11 @@ index_check_primary_key(Relation heapRel,
 	 * unduly.
 	 */
 	if (cmds)
+	{
+		EventTriggerAlterTableStart((Node *) stmt);
 		AlterTableInternal(RelationGetRelid(heapRel), cmds, true);
+		EventTriggerAlterTableEnd();
+	}
 }
 
 /*
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index ab3d9a0a48..4fc279e86f 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -666,7 +666,7 @@ DefineIndex(Oid relationId,
 	 * Extra checks when creating a PRIMARY KEY index.
 	 */
 	if (stmt->primary)
-		index_check_primary_key(rel, indexInfo, is_alter_table);
+		index_check_primary_key(rel, indexInfo, is_alter_table, stmt);
 
 	/*
 	 * If this table is partitioned and we're creating a unique index or a
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index c145385f84..27f97fdff3 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -7074,7 +7074,7 @@ ATExecAddIndexConstraint(AlteredTableInfo *tab, Relation rel,
 
 	/* Extra checks needed if making primary key */
 	if (stmt->primary)
-		index_check_primary_key(rel, indexInfo, true);
+		index_check_primary_key(rel, indexInfo, true, stmt);
 
 	/* Note we currently don't support EXCLUSION constraints here */
 	if (stmt->primary)
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index f20c5f789b..35a29f3498 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -40,7 +40,8 @@ typedef enum
 
 extern void index_check_primary_key(Relation heapRel,
 		IndexInfo *indexInfo,
-		bool is_alter_table);
+		bool is_alter_table,
+		IndexStmt *stmt);
 
#define	INDEX_CREATE_IS_PRIMARY				(1 << 0)
 #define	INDEX_CREATE_ADD_CONSTRAINT			(1 << 1)
diff --git a/src/test/regress/expected/event_trigger.out b/src/test/regress/expected/event_trigger.out
index 6175a10d77..15528a7cb2 100644
--- a/src/test/regress/expected/event_trigger.out
+++ b/src/test/regress/expected/event_trigger.out
@@ -349,6 +349,18 @@ CREATE SCHEMA evttrig
 	CREATE TABLE one (col_a SERIAL PRIMARY KEY, col_b text DEFAULT 'forty two')
 	CREATE INDEX one_idx ON one (col_b)
 	CREATE TABLE two (col_c INTEGER CHECK (col_c > 0) REFERENCES one DEFAULT 42);
+-- Partitioned tables with shared indexes
+CREATE TABLE evttrig.parted (
+id int PRIMARY KEY)
+PARTITION BY RANGE (id);
+CREATE TABLE evttrig.part_1_10 PARTITION OF evttrig.parted (id)
+  FOR VALUES FROM (1) TO (10);
+CREATE TABLE evttrig.part_10_20 PARTITION OF evttrig.parted (id)
+  FOR VALUES FROM (10) TO (20) PARTITION BY RANGE (id);
+CREATE TABLE evttrig.part_10_15 PARTITION OF evttrig.part_10_20 (id)
+  FOR VALUES FROM (10) TO (15);
+CREATE TABLE evttrig.part_15_20 PARTITION OF evttrig.part_10_20 (id)
+  FOR VALUES FROM (15) TO (20);
 ALTER TABLE evttrig.two DROP COLUMN col_c;
 NOTICE:  NORMAL: orig=t normal=f istemp=f type=table column identity=evttrig.two.col_c name={evttrig,two,col_c} args={}
 NOTICE:  NORMAL: orig=f normal=t istemp=f type=table constraint identity=two_col_c_check on evttrig.two name={evttrig,two,two_col_c_check} args={}
@@ -359,14 +371,20 @@ NOTICE:  NORMAL: orig=t normal=f istemp=f type=table constraint identity=one_pke
 DROP INDEX evttrig.one_idx;
 NOTICE:  NORMAL: orig=t normal=f istemp=f type=index identity=evttrig.one_idx name={evttrig,one_idx} args={}
 DROP SCHEMA evttrig CASCADE;
-NOTICE:  drop cascades to 2 other objects
+NOT

Re: [HACKERS] Transactions involving multiple postgres foreign servers, take 2

2018-10-03 Thread Chris Travers
The following review has been posted through the commitfest application:
make installcheck-world:  tested, failed
Implements feature:   not tested
Spec compliant:   not tested
Documentation:tested, failed

I am hoping I am not out of order in writing this before the commitfest starts. 
 The patch is big and long, and so I wanted to start on this while traffic is slow.

I find this patch quite welcome and very close to a minimum viable version.  
The few significant limitations can be resolved later.  One thing I may have 
missed in the documentation is a discussion of the limits of the current 
approach.  I think this would be important to document because the caveats of 
the current approach are significant, but the people who need it will have the 
knowledge to work with issues if they come up.

The major caveat I see in our past discussions and (if I read the patch 
correctly) is that the resolver goes through global transactions sequentially 
and does not move on to the next until the previous one is resolved.  This 
means that if I have a global transaction on server A, with foreign servers B 
and C, and I have another one on server A with foreign servers C and D, if 
server B goes down at the wrong moment, the background worker does not look 
like it will detect the failure and move on to try to resolve the second, so 
server D will have a badly set vacuum horizon until this is resolved.  Also if 
I read the patch correctly, it looks like one can invoke SQL commands to remove 
the bad transaction to allow processing to continue and manual resolution (this 
is good and necessary because in this area there is no ability to have perfect 
recoverability without occasional administrative action).  I would really like 
to see more documentation of failure cases and appropriate administrative 
action at present.  Otherwise this is I think a minimum viable addition and I 
think we want it.

It is possible i missed that in the documentation.  If so, my objection stands 
aside.  If it is welcome I am happy to take a first crack at such docs.

To my mind that's the only blocker in the code (but see below).  I can say 
without a doubt that I would expect we would use this feature once available.

--

Testing however failed.

make installcheck-world fails with errors like the following:

 -- Modify foreign server and raise an error
  BEGIN;
  INSERT INTO ft7_twophase VALUES(8);
+ ERROR:  prepread foreign transactions are disabled
+ HINT:  Set max_prepared_foreign_transactions to a nonzero value.
  INSERT INTO ft8_twophase VALUES(NULL); -- violation
! ERROR:  current transaction is aborted, commands ignored until end of 
transaction block
  ROLLBACK;
  SELECT * FROM ft7_twophase;
! ERROR:  prepread foreign transactions are disabled
! HINT:  Set max_prepared_foreign_transactions to a nonzero value.
  SELECT * FROM ft8_twophase;
! ERROR:  prepread foreign transactions are disabled
! HINT:  Set max_prepared_foreign_transactions to a nonzero value.
  -- Rollback foreign transaction that involves both 2PC-capable
  -- and 2PC-non-capable foreign servers.
  BEGIN;
  INSERT INTO ft8_twophase VALUES(7);
+ ERROR:  prepread foreign transactions are disabled
+ HINT:  Set max_prepared_foreign_transactions to a nonzero value.
  INSERT INTO ft9_not_twophase VALUES(7);
+ ERROR:  current transaction is aborted, commands ignored until end of 
transaction block
  ROLLBACK;
  SELECT * FROM ft8_twophase;
! ERROR:  prepread foreign transactions are disabled
! HINT:  Set max_prepared_foreign_transactions to a nonzero value.

make installcheck in the contrib directory shows the same, so that's the 
easiest way of reproducing, at least on a new installation.  I think the test 
cases will have to handle that sort of setup.

make check in the contrib directory passes.

For reasons of test failures, I am setting this back to waiting on author.

--
I had a few other thoughts that I figure are worth sharing with the community 
on this patch with the idea that once it is in place, this may open up more 
options for collaboration in the area of federated and distributed storage 
generally.  I could imagine other foreign data wrappers using this API, and 
folks might want to refactor out the atomic handling part so that extensions 
that do not use the foreign data wrapper structure could use it as well (while 
this looks like a classic SQL/MED issue, I am not sure that only foreign data 
wrappers would be interested in the API).

The new status of this patch is: Waiting on Author


Re: pgsql: Improve autovacuum logging for aggressive and anti-wraparound ru

2018-10-03 Thread Masahiko Sawada
On Tue, Oct 2, 2018 at 11:15 PM Tom Lane  wrote:
>
> Michael Paquier  writes:
> > My brain is rather fried for the rest of the day...  But we could just
> > be looking at using USE_ASSERT_CHECKING.  Thoughts from other are
> > welcome.
>
> I'd go with folding the condition into a plain Assert.  Then it's
> obvious that no code is added in a non-assert build.  I can see that
> some cases might be so complicated that that isn't nice, but this
> case doesn't seem to qualify.
>

Thank you for the comment. Attached the updated patch.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


fix_autovacuum_log_v2.patch
Description: Binary data


Re: file cloning in pg_upgrade and CREATE DATABASE

2018-10-03 Thread Michael Paquier
On Tue, Oct 02, 2018 at 11:07:06PM -0300, Alvaro Herrera wrote:
> I see.  Peter is proposing to have a fourth mode, essentially
> --transfer-mode=clone-or-copy.

Yes, a mode which depends on what the file system supports.  Perhaps
"safe" or "fast" could be another name, meaning the fastest method
available that does not destroy the existing cluster's data.

> Thinking about a generic tool that wraps pg_upgrade (say, Debian's
> wrapper for it) this makes sense: just use the fastest non-destructive
> mode available.  Which ones are available depends on what the underlying
> filesystem is, so it's not up to the tool's writer to decide which to
> use ahead of time.

This could have merit.  Now, it seems to me that we have two separate
concepts here, which should be addressed separately:
1) cloning file mode, which is the new actual feature.
2) automatic mode, which is a subset of the copy mode and the new clone
mode.  What I am not sure about is whether we should consider the link
mode as being part of this automatic selection concept.
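
For reference, the primitive that a clone mode relies on looks roughly like
this (Linux-only sketch using the FICLONE ioctl; other platforms would need
their own mechanism, and this is not the actual patch):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>			/* FICLONE */

/* Create dst as a copy-on-write clone of src: fast, and non-destructive
 * with respect to the source file. */
static void
clone_file(const char *src, const char *dst)
{
	int			srcfd = open(src, O_RDONLY);
	int			dstfd = open(dst, O_WRONLY | O_CREAT | O_EXCL, 0600);

	if (srcfd < 0 || dstfd < 0 || ioctl(dstfd, FICLONE, srcfd) < 0)
	{
		perror("clone_file");
		exit(1);
	}
	close(srcfd);
	close(dstfd);
}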
--
Michael


signature.asc
Description: PGP signature