Re: Skylake-S warning
On 10/3/18 11:29 PM, Daniel Wood wrote:
> If running benchmarks or you are a customer who is currently impacted by
> GetSnapshotData() on high-end multisocket systems, be wary of Skylake-S.
>
> Performance differences of nearly 2X can be seen on select-only pgbench due
> to nothing else but unlucky choices for max_connections. Scale 1000, 192
> local clients on a 2-socket 48-core Skylake-S (Xeon Platinum 8175M @
> 2.50 GHz) system. pgbench -S

Hi,

Could it be related to:
https://www.postgresql.org/message-id/D2B9F2A20670C84685EF7D183F2949E2373E66%40gigant.nidsa.net ?
RE: Protect syscache from bloating with negative cache entries
Hi, thank you for the explanation.

>From: Kyotaro HORIGUCHI [mailto:horiguchi.kyot...@lab.ntt.co.jp]
>> Can I confirm about catcache pruning?
>> syscache_memory_target is the max figure per CatCache.
>> (Any CatCache has the same max value.) So the total max size of
>> catalog caches is estimated around or slightly more than # of SysCache
>> array times syscache_memory_target.
>
>Right.
>
>> If correct, I'm thinking writing down the above estimation in the
>> document would help db administrators with estimation of memory usage.
>> The current description might lead to the misunderstanding that
>> syscache_memory_target is the total size of the catalog cache, in my
>> impression.
>
>Honestly I'm not sure that is the right design. However, I don't think
>providing such a formula to users helps, since they don't know exactly how
>many CatCaches and brethren live in their server, it is a soft limit, and
>finally only a few or just one catalog can reach the limit.

Yeah, I agree that such a formula is not suited for the document. But if
users don't know how many catcaches and brethren are used in postgres, then
how about changing syscache_memory_target into a total soft limit for the
catcache, rather than a size limit on each individual catcache? Internally,
syscache_memory_target can be divided by the number of SysCaches and do its
work. The total amount would be easier to understand for users who don't
know the detailed contents of the catalog caches. Or, if users can tell how
many/what kind of catcaches exist, for instance by using the system view you
provided in the previous email, the current design looks good to me.

>The current design is based on the assumption that we would have only one
>extremely-growable cache in one use case.
>
>> Related to the above, I just thought changing syscache_memory_target
>> per CatCache would make memory usage more efficient.
>
>We could easily have per-cache settings in CatCache, but how do we provide
>the knobs for them?
>I can guess only too many solutions for that.

Agreed.

>> Though I haven't checked if there's a case where each system catalog
>> cache's memory usage varies largely, the pg_class cache might need more
>> memory than others, and others might need less.
>> But it would be difficult for users to check each CatCache's memory
>> usage and tune it, because right now postgresql hasn't provided a handy
>> way to check them.
>
>I supposed that this is used without such a means. Someone who suffers
>syscache bloat just sets this GUC to avoid the bloat. End.

Yeah, I took the purpose wrong.

>Apart from that, in the current patch, syscache_memory_target is not exact
>at all in the first place, to avoid the overhead of counting the correct
>size. The major difference comes from the size of the cache tuple itself.
>But I came to think it is too much to omit.
>
>As a *PoC*, in the attached patch (which applies to current master), the
>sizes of CTups are counted as the catcache size.
>
>It also provides a pg_catcache_size system view just to give a rough idea
>of how such a view looks. I'll consider more on that, but do you have any
>opinion on this?
>
>=# select relid::regclass, indid::regclass, size from pg_syscache_sizes
>   order by size desc;
>    relid     |              indid               |  size
>--------------+----------------------------------+--------
> pg_class     | pg_class_oid_index               | 131072
> pg_class     | pg_class_relname_nsp_index       | 131072
> pg_cast      | pg_cast_source_target_index      |   5504
> pg_operator  | pg_operator_oprname_l_r_n_index  |   4096
> pg_statistic | pg_statistic_relid_att_inh_index |   2048
> pg_proc      | pg_proc_proname_args_nsp_index   |   2048
>..

Great! I like this view. One extreme idea would be adding all the members
printed by CatCachePrintStats(), which is only enabled with -DCATCACHE_STATS
at this moment. All of the members seem too much for customers who are
trying to change the cache limit size, but some of the members may be
useful; for example, cc_hits would indicate that the current cache limit
size is too small.
>> Another option is that users only specify the total memory target size,
>> and postgres dynamically changes each CatCache's memory target size
>> according to a certain metric.
>> (, which still seems difficult and expensive to develop per benefit)
>> What do you think about this?
>
>Given that few caches bloat at once, its effect is not so different from
>the current design.

Yes, agreed.

>> As you commented here, the guc variables syscache_memory_target and
>> syscache_prune_min_age are used for both syscache and relcache (HTAB), r
Re: Shouldn't ExecShutdownNode() be called from standard_ExecutorFinish()?
On Thu, Oct 4, 2018 at 12:33 AM Andres Freund wrote:
>
> Hi,
>
> There's a few cases where by the time ExecutorFinish() is called,
> ExecShutdownNode() has not yet been called. As, among other reasons,
> ExecShutdownNode() also collects instrumentation, I think that's
> problematic.
>
> In ExplainOnePlan() we call
>
>     /* run cleanup too */
>     ExecutorFinish(queryDesc);
>
> and then print the majority of the explain data:
>
>     /* Create textual dump of plan tree */
>     ExplainPrintPlan(es, queryDesc);
>
> and only then shut down the entire query:
>
>     ExecutorEnd(queryDesc);
>
> which seems to mean that if a node hasn't yet been shut down for some
> reason, we'll not have information that's normally collected in
> ExecShutdownNode().

Ideally, ExecShutdownNode would have been called by the end of the
ExecutorRun phase. Can you tell which particular case you are bothered
about?

> ISTM, we should have a new EState member that runs ExecShutdownNode()
> in standard_ExecutorEnd() if not already done.

Even if it wasn't called before for any reason, I think the relevant work
should be done via standard_ExecutorEnd->ExecEndPlan->ExecEndNode. For
example, for a Gather node, it will call ExecEndGather to perform the
necessary work.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Secondary index access optimizations
On 12 September 2018 at 08:32, Konstantin Knizhnik wrote:
> Also the patch proposed by you is much simpler and does mostly the same.
> Yes, it is not covering CHECK constraints,

I started to look at this and found a problem in regard to varno during the
predicate_implied_by() test. The problem is that the partition bound is
always stored as varno=1 (for example, see how get_qual_for_list() calls
makeVar()). This causes the patch to fail in cases where the partitioned
table is not varno=1.

You're also incorrectly using rinfo->clause to pass to
predicate_implied_by(). This is a problem because the quals stored there
have not been translated to have the child varattnos. childqual is the
correct thing to use, as that's just been translated. You may have not used
it because the varnos will have been converted to the child's varno, which
will never be varno=1, so you might have found that not to work due to the
missing code to change the varnos to 1.

I've attached the diff for allpaths.c (only) that I ended up with to make it
work. This causes the output of many other regression tests to change, so
you'll need to go over these and verify everything is correct again.

Please, can you also add a test which exercises this code with a partition
whose columns are in a different order than its parent's? Having an INT and
a TEXT column is best, as if the translations are done incorrectly it's
likely to result in a crash, which will alert us to the issue. It would be
good to also verify that the test causes a crash if you temporarily put the
code back to using the untranslated qual.

Thanks for working on this.

-- 
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

skip_implied_child_quals_allpaths.diff
Description: Binary data
Re: SerializeParamList vs machines with strict alignment
On Wed, Oct 3, 2018 at 10:02 AM Amit Kapila wrote:
> > > Now chipmunk also failed for the same test.
>
> > > What I find more interesting is
> > > that both of the live Sparc critters are happy --- so despite
> > > Thomas' statements upthread, they're coping with unaligned accesses.
> > > Maybe you should have back-patched the test to older branches so
> > > we could see what castoroides and protosciurus would do. But it's
> > > probably not worth additional delay.
>
> > Agreed, I will push the code-fix on HEAD and code+test in back-branches.
>
> Pushed, let's wait and see if this can make all the failing buildfarm
> members (due to this issue) happy.

All the affected members (gharial, chipmunk, anole) are happy. It feels good
to see chipmunk becoming green after so many days. Thanks a lot, Tom, for
your assistance.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: Add SKIP LOCKED to VACUUM and ANALYZE
On 10/3/18, 7:10 PM, "Michael Paquier" wrote:
> Thanks, I have committed the patch, after minor tweaks to the docs.

Thanks!

> Mainly, in the VACUUM page, I have added a mention that VACUUM ANALYZE
> can block for row sampling. I have also removed for now the set of
> recommendations the patch added, as we still gather statistics on the
> partitioned tables from the row samples gathered from the partitions, so
> recommending to list only the leaves is not completely correct either
> all the time.

Good catch. The committed version of the documentation looks good to me.

> We could perhaps add something about such recommendations. A separate
> section covering more general things may be interesting.

Agreed, this could be a good addition to the "Notes" sections.

Nathan
RE: Function for listing archive_status directory
Hi Christoph,

I think it is convenient to be able to check the archive_status directory's
contents. I reviewed the patch. It applies and passes the regression tests.
I checked the code. It follows the patch which added pg_ls_waldir() and
pg_ls_logdir(), so I think it is good.

There is one point I care about. All similar functions are named
pg_ls_***dir. It is clear that these functions return directory contents.
If the new function intends to display the contents of the directory, the
pg_ls_***dir style might be better (e.g. pg_ls_archive_statusdir). But
everyone knows archive_status is a directory... If you want to follow the
standard naming, then you may add the "dir".

Have you seen this thread?
https://www.postgresql.org/message-id/flat/92f458a2-6459-44b8-a7f2-2add32250...@amazon.com
They are also discussing a generic pg_ls function.

Regards,
Aya Iwata
Re: Skylake-S warning
On 2018-10-03 17:01:44 -0700, Daniel Wood wrote:
> > On October 3, 2018 at 3:55 PM Andres Freund wrote:
> > In the thread around
> > https://www.postgresql.org/message-id/20160411214029.ce3fw6zxim5k6...@alap3.anarazel.de
> > I'd found doing more aggressive padding helped a lot. Unfortunately I
> > didn't pursue this further :(
>
> Interesting. Looks like you saw the same thing I see now.
>
> > I really suspect we're going to have to change the layout of PGXACT data
> > in a way that makes a contiguous scan possible. That'll probably require
> > some ugliness because a naive "use first free slot" scheme obviously is
> > sensitive to there being holes. But I suspect that a scheme where
> > backends would occasionally try to move themselves to an earlier slot
> > wouldn't be too hard to implement.
>
> I've also thought of this. But someone could create a thousand connections
> and then disconnect all but the first and last, creating a huge hole.

Yea, that's what I was suggesting to address with "a scheme where backends
would occasionally try to move themselves to an earlier slot". I was
basically thinking that on occasions where a backend holds ProcArrayLock
exclusively, it could just compare the current number of connections and
its slot, and move itself earlier if it's more than, say, 10% after the
#num-connections slot. Or something. The advantage of that approach would
be that the size of the change is probably fairly manageable.

Greetings,

Andres Freund
Re: Skylake-S warning
One other thought: could we update pgxact->xmin less often? What would be
the impact of this lower bound being lower than it would normally be with
the existing scheme? Yes, it needs to be moved forward "occasionally".

FYI, be careful with padding PGXACTs to a full cache line. With 1024
clients you'd completely blow out the L1 cache. The benefit with fewer than
200 clients is dubious.

One problem with multiple (5) pgxacts in a single cache line is that you
may get a miss fetching xmin and then a second miss fetching xid, because
an invalidation occurs between the 2 fetches from updating any of the other
pgxacts on the same cache line. I've found some improvement fetching both
xmin and xid with a single 64-bit fetch. This prevents the invalidation
between the two fetches. Similarly, updating them with a single 64-bit
write helps. Yes, this is uber tweaky.
Re: partition tree inspection functions
On Wed, Oct 03, 2018 at 08:12:59AM -0400, Jesper Pedersen wrote:
> Removing isleaf would require extra round trips to the server to get
> that information. So, I think we should keep it.

I don't really get your point about extra round trips to the server;
getting the same level of information is as simple as a join between the
result set of pg_partition_tree() and pg_class (better to add schema
qualification and aliases to relations, by the way):

=# SELECT relid::regclass, parentrelid::regclass, level,
          relkind != 'p' AS isleaf
   FROM pg_partition_tree('ptif_test'::regclass), pg_class
   WHERE oid = relid;
    relid    | parentrelid | level | isleaf
-------------+-------------+-------+--------
 ptif_test   | null        |     0 | f
 ptif_test0  | ptif_test   |     1 | f
 ptif_test1  | ptif_test   |     1 | f
 ptif_test2  | ptif_test   |     1 | t
 ptif_test01 | ptif_test0  |     2 | t
 ptif_test11 | ptif_test1  |     2 | t
(6 rows)

-- 
Michael
Re: [RFC] Removing "magic" oids
On 30/09/18 05:48, Andres Freund wrote: > 5a) The fact that system table oids don't show up in selects by default >makes it more work than necessary to look at catalogs. As a user, this is a big pain point for me. +1 for getting rid of it. -- Vik Fearing +33 6 46 75 15 36 http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
Re: Add SKIP LOCKED to VACUUM and ANALYZE
On Wed, Oct 03, 2018 at 04:20:42PM +0000, Bossart, Nathan wrote:
> Here is what I have so far for the docs:
>
> [snip]

Thanks, I have committed the patch, after minor tweaks to the docs. Mainly,
in the VACUUM page, I have added a mention that VACUUM ANALYZE can block
for row sampling. I have also removed for now the set of recommendations
the patch added, as we still gather statistics on the partitioned tables
from the row samples gathered from the partitions, so recommending to list
only the leaves is not completely correct either all the time.

We could perhaps add something about such recommendations. A separate
section covering more general things may be interesting.

-- 
Michael
Re: Skylake-S warning
> On October 3, 2018 at 3:55 PM Andres Freund wrote:
> In the thread around
> https://www.postgresql.org/message-id/20160411214029.ce3fw6zxim5k6...@alap3.anarazel.de
> I'd found doing more aggressive padding helped a lot. Unfortunately I
> didn't pursue this further :(

Interesting. Looks like you saw the same thing I see now.

> I really suspect we're going to have to change the layout of PGXACT data
> in a way that makes a contiguous scan possible. That'll probably require
> some ugliness because a naive "use first free slot" scheme obviously is
> sensitive to there being holes. But I suspect that a scheme where
> backends would occasionally try to move themselves to an earlier slot
> wouldn't be too hard to implement.

I've also thought of this. But someone could create a thousand connections
and then disconnect all but the first and last, creating a huge hole.

One thought would be to have a bit array of used entries and to bias toward
using free lower-indexed entries. Only two cache lines of 1024 bits would
need to be scanned to handle 1024 connections. ProcArrayAdd() would just do
an atomic xchg to set the used bit.

I was even thinking of decomposing PGXACT into separate arrays of XIDs,
XMINs, etc. Then have a bitmap of those entries where the XID was valid.
When you set/clear your XID you'd also set/clear this bit. With the
select-only workload the XIDs are all "Invalid", thus scanning this bit
array of zeroed 64-bit entries would be quite efficient. The vacuum and
logical decoding flags could be directly done as a bitmap.

Having an array of 1-byte subtxn counts creates another opportunity for
improvement. Scanning for a non-zero entry in environments which use few
subtxns is efficient. If subtxns are used a lot, then the repeated PGXACT
cache-line invalidations when setting nxids are offset by the fact that you
can grab 8 of them in a single fetch.

- Dan
Re: Skylake-S warning
Hi,

On 2018-10-03 14:29:39 -0700, Daniel Wood wrote:
> If running benchmarks or you are a customer who is currently
> impacted by GetSnapshotData() on high-end multisocket systems, be wary
> of Skylake-S.

> Performance differences of nearly 2X can be seen on select-only
> pgbench due to nothing else but unlucky choices for max_connections.
> Scale 1000, 192 local clients on a 2-socket 48-core Skylake-S (Xeon
> Platinum 8175M @ 2.50 GHz) system. pgbench -S

FWIW, I've seen performance differences of that magnitude going back to at
least Nehalem. But not on every larger system, interestingly.

> If this is indeed just disadvantageous placement of structures/arrays
> in memory then you might also find that after upgrading a previous
> good choice for max_connections becomes a bad choice if things move
> around.

In the thread around
https://www.postgresql.org/message-id/20160411214029.ce3fw6zxim5k6...@alap3.anarazel.de
I'd found doing more aggressive padding helped a lot. Unfortunately I
didn't pursue this further :(

> NOTE2: It is unclear why PG needs to support over 64K sessions. At
> about 10MB per backend (at the low end) the empty backends alone would
> consume 640GB of memory! Changing pgprocnos from int to short gives
> me the following results.

I've argued this before. After that we reduced MAX_BACKENDS to 0x3 - I
personally think we could go to 16 bits without it being a practical
problem.

I really suspect we're going to have to change the layout of PGXACT data
in a way that makes a contiguous scan possible. That'll probably require
some ugliness because a naive "use first free slot" scheme obviously is
sensitive to there being holes. But I suspect that a scheme where
backends would occasionally try to move themselves to an earlier slot
wouldn't be too hard to implement.

Greetings,

Andres Freund
DROP DATABASE doesn't force other backends to close FDs
Hi,

Joerg reported on IRC that after a DROP DATABASE the space of the dropped
database wasn't available to the OS until he killed a session that was
active from before the drop. I was kinda incredulous at first, the code
looks sane... Thomas figured out significant parts of this.

dropdb() does a fairly careful dance to ensure processes close FDs for
about-to-be-removed files:

	/*
	 * Drop pages for this database that are in the shared buffer cache. This
	 * is important to ensure that no remaining backend tries to write out a
	 * dirty buffer to the dead database later...
	 */
	DropDatabaseBuffers(db_id);

	/*
	 * Tell the stats collector to forget it immediately, too.
	 */
	pgstat_drop_database(db_id);

	/*
	 * Tell checkpointer to forget any pending fsync and unlink requests for
	 * files in the database; else the fsyncs will fail at next checkpoint, or
	 * worse, it will delete files that belong to a newly created database
	 * with the same OID.
	 */
	ForgetDatabaseFsyncRequests(db_id);

	/*
	 * Force a checkpoint to make sure the checkpointer has received the
	 * message sent by ForgetDatabaseFsyncRequests. On Windows, this also
	 * ensures that background procs don't hold any open files, which would
	 * cause rmdir() to fail.
	 */
	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);

	/*
	 * Remove all tablespace subdirs belonging to the database.
	 */
	remove_dbtablespaces(db_id);

By my code reading, that ensures that FDs held open by checkpointer and
bgwriter are closed in a relatively timely fashion (although I don't think
there's a nice time-bound guarantee for bgwriter, which could be
problematic for Windows).

That unfortunately disregards that normal backends could have plenty of
files open for the to-be-dropped database, if there's any sort of
shared_buffers pressure. Whenever a backend writes out a dirty victim page
belonging to another database, it'll also open files therein.
I thought that'd not be a problem because surely we do a smgrclose() at
some proper point to avoid that issue. But we don't.

I don't quite know how to trivially [1] guarantee that an invalidation we
send out is processed by all backends before continuing. But even if we
just send out an invalidation to all backends that instructs them to do an
smgrcloseall() the next time they do invalidation processing, we'd be much
better off than where we are now (where we'll never close those files until
there's a lot of fd pressure or the backend exits).

I wonder if this is responsible for some issues we had with Windows in the
past...

[1] I think we need something like an acknowledge protocol for a few
things, most recently online enabling of checksums anyway. What I had in
mind is simple:

1) one global 64-bit request counter in shared memory
2) per-backend counters in shared memory, that acknowledge up to which
   point requests have been processed
3) per-purpose procsignals that request e.g. that checksums get enabled,
   that smgrcloseall has been processed, etc.
4) a function that sends out such a signal to all backends and then waits
   till all such requests have been processed

This obviously shouldn't be used for very frequent occurrences, but the
use-cases I have in mind should never be that.

Greetings,

Andres Freund
Skylake-S warning
If running benchmarks or you are a customer who is currently impacted by
GetSnapshotData() on high-end multisocket systems, be wary of Skylake-S.

Performance differences of nearly 2X can be seen on select-only pgbench due
to nothing else but unlucky choices for max_connections. Scale 1000, 192
local clients on a 2-socket 48-core Skylake-S (Xeon Platinum 8175M @
2.50 GHz) system. pgbench -S

Results from 5 runs varying max_connections from 400 to 405:

max_conn      TPS
     400   677639
     401  1146776
     402  1122140
     403   765664
     404   671455
     405  1190277

perf top shows about 21% GetSnapshotData() with the good numbers and 48%
with the bad numbers. This problem is not seen on a 2-socket 32-core
Haswell system. Being a one-man show, I lack some of the diagnostic tools
to drill down further. My suspicion is that the fact that Intel has lowered
the L2 associativity from 8 (Haswell) to 4 (Skylake-S) may be the cause.
The other possibility is that at higher core counts the shared 16-way
inclusive associative L3 cache becomes insufficient. Perhaps that is why
Intel has moved to an exclusive L3 cache on Skylake-SP.

If this is indeed just disadvantageous placement of structures/arrays in
memory, then you might also find that after upgrading, a previously good
choice for max_connections becomes a bad choice if things move around.

NOTE: int pgprocno = pgprocnos[index]; is where the big increase in time
occurs in GetSnapshotData(). This array is largely read-only, once all
connections are established, easily fits in the L1, and is not next to
anything else causing invalidations.

NOTE2: It is unclear why PG needs to support over 64K sessions. At about
10MB per backend (at the low end) the empty backends alone would consume
640GB of memory! Changing pgprocnos from int to short gives me the
following results:

max_conn      TPS
     400   780119
     401  1129286
     402  1263093
     403   887021
     404   679891
     405  1218118

While this change is significant on large Skylake systems, it is likely
just a trivial improvement on other systems or workloads.
Re: executor relation handling
I wrote:
> Amit Langote writes:
>> Should this check that we're not in a parallel worker process?

> Hmm. I've not seen any failures in the parallel parts of the regular
> regression tests, but maybe I'd better do a force_parallel_mode
> run before committing.

> In general, I'm not on board with the idea that parallel workers don't
> need to get their own locks, so I don't really want to exclude parallel
> workers from this check. But if it's not safe for that today, fixing it
> is beyond the scope of this particular patch.

So the place where that came out in the wash is the commit I just made
(9a3cebeaa) to change the executor from taking table locks to asserting
that somebody else took them already. To make that work, I had to make
both ExecOpenScanRelation and relation_open skip checking for lock-held if
IsParallelWorker().

This makes me entirely uncomfortable with the idea that parallel workers
can be allowed to not take any locks of their own. There is no basis for
arguing that we have field proof that that's safe, because *up to now,
parallel workers in fact did take their own locks*. And it seems unsafe on
its face, because there's nothing that really guarantees that the parent
process won't go away while children are still running. (elog(FATAL) seems
like a counterexample, for instance.)

I think that we ought to adjust parallel query to insist that children do
take locks, and then revert the IsParallelWorker() exceptions I made here.
I plan to leave that point in abeyance till we've got the rest of these
changes in place, though. The easiest way to do it will doubtless change
once we've centralized the executor's table-opening logic, so trying to
code it right now seems like a waste of effort.

			regards, tom lane
Re: Query is over 2x slower with jit=on
Hi,

On 2018-10-01 00:32:18 -0700, Lukas Fittl wrote:
> On Tue, Sep 25, 2018 at 1:17 PM, Andres Freund wrote:
> >
> > I've pushed the change without that bit - it's just a few additional
> > lines if we want to change that.
>
> It seems that since this commit, JIT statistics are now only being
> printed for parallel query plans. This is due to ExplainPrintJIT being
> called before ExecutorEnd, so in a non-parallel query,
> queryDesc->estate->es_jit_combined_instr will never get set.

Ugh.

> Attached a patch that instead introduces a new ExplainPrintJITSummary
> method that summarizes the statistics before they get printed.

Yea, I had something like that, and somehow concluded that it wasn't
needed, and moved it to the wrong place :/.

> I've also removed an (I believe) unnecessary "if (estate->es_instrument)"
> check that prevented EXPLAIN without ANALYZE from showing whether JIT
> would be used or not.
>
> In addition this also updates a missed section in the documentation with
> the new stats output format.

Thanks a lot for the patch! I've pushed it with some mild changes (renamed
es_jit_combined_stats to worker_stats; changed ExplainPrintJITSummary to
not look at es_jit, but es_jit_flags, as theoretically es_jit might not be
allocated; very minor comment changes).

- Andres
Shouldn't ExecShutdownNode() be called from standard_ExecutorFinish()?
Hi,

There are a few cases where by the time ExecutorFinish() is called,
ExecShutdownNode() has not yet been called. As, among other reasons,
ExecShutdownNode() also collects instrumentation, I think that's
problematic.

In ExplainOnePlan() we call

	/* run cleanup too */
	ExecutorFinish(queryDesc);

and then print the majority of the explain data:

	/* Create textual dump of plan tree */
	ExplainPrintPlan(es, queryDesc);

and only then shut down the entire query:

	ExecutorEnd(queryDesc);

which seems to mean that if a node hasn't yet been shut down for some
reason, we'll not have information that's normally collected in
ExecShutdownNode().

ISTM, we should have a new EState member that runs ExecShutdownNode() in
standard_ExecutorEnd() if not already done.

Am I missing something?

Greetings,

Andres Freund
Re: Performance improvements for src/port/snprintf.c
Andres Freund writes:
> On 2018-10-03 14:01:35 -0400, Tom Lane wrote:
>> BTW, so far as I can tell on F28, strfromd isn't exposed without
>> "-D__STDC_WANT_IEC_60559_BFP_EXT__", which seems fairly scary;
>> what else does that affect?

> So I don't think anything's needed to enable that in pg, given that we
> define _GNU_SOURCE

Ah, OK. I thought my test case had _GNU_SOURCE defined already, but it
didn't. You might want to do something like what I stuck in for strchrnul,
though:

/*
 * glibc's <string.h> declares strchrnul only if _GNU_SOURCE is defined.
 * While we typically use that on glibc platforms, configure will set
 * HAVE_STRCHRNUL whether it's used or not. Fill in the missing declaration
 * so that this file will compile cleanly with or without _GNU_SOURCE.
 */
#ifndef _GNU_SOURCE
extern char *strchrnul(const char *s, int c);
#endif

			regards, tom lane
Re: Performance improvements for src/port/snprintf.c
I wrote:
>> Hm. I guess that'd be a bit better, but I'm not sure it's worth it. How
>> about we simply add a static assert that long long isn't bigger than
>> int64?

> WFM, I'll make it happen.

Actually, while writing a comment to go with that assertion, I decided this
was dumb. If we're expecting the compiler to have "long long", and if we're
convinced that no platforms define "long long" as wider than 64 bits, we
may as well go with the s/int64/long long/g solution. That should result in
no code change on any platform today. And it will still work correctly, if
maybe a bit inefficiently, on some hypothetical future platform where long
long is wider. We (or our successors) can worry about optimizing that when
the time comes.

			regards, tom lane
Re: Performance improvements for src/port/snprintf.c
Hi,

On 2018-10-03 14:01:35 -0400, Tom Lane wrote:
> Andres Freund writes:
>> So when using pg's snprintf() to print a single floating point number
>> with precision, we get nearly a 10% boost.

> I just tested that using my little standalone testbed, and I failed
> to replicate the result. I do see that strfromd is slightly faster,
> but it's just a few percent measuring snprintf.c in isolation --- in
> the overall context of COPY, I don't see how you get to 10% net savings.

I just tested your patch, and I see (best of three):

master:                     16224.727 ms
hack-use-of-strfromd.patch: 14944.927 ms

So not quite 10%, but pretty close.

CREATE TABLE somefloats(id serial, data1 float8, data2 float8, data3 float8);
INSERT INTO somefloats(data1, data2, data3)
	SELECT random(), random(), random() FROM generate_series(1, 1000);
VACUUM FREEZE somefloats;
COPY somefloats TO '/dev/null';

What difference do you see?

> So I continue to think there's something fishy about your test case.

> BTW, so far as I can tell on F28, strfromd isn't exposed without
> "-D__STDC_WANT_IEC_60559_BFP_EXT__", which seems fairly scary;
> what else does that affect?

My copy says:

#undef __GLIBC_USE_IEC_60559_BFP_EXT
#if defined __USE_GNU || defined __STDC_WANT_IEC_60559_BFP_EXT__
# define __GLIBC_USE_IEC_60559_BFP_EXT 1
#else
# define __GLIBC_USE_IEC_60559_BFP_EXT 0
#endif

And __USE_GNU is enabled by

#ifdef _GNU_SOURCE
# define __USE_GNU 1
#endif

So I don't think anything's needed to enable that in pg, given that we
define _GNU_SOURCE.

Greetings,

Andres Freund
Re: Performance improvements for src/port/snprintf.c
On 2018-Oct-03, Tom Lane wrote:
> Andres Freund writes:
> BTW, so far as I can tell on F28, strfromd isn't exposed without
> "-D__STDC_WANT_IEC_60559_BFP_EXT__", which seems fairly scary;
> what else does that affect?

https://en.cppreference.com/w/c/experimental/fpext1

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Performance improvements for src/port/snprintf.c
Andres Freund writes:
> So when using pg's snprintf() to print a single floating point number
> with precision, we get nearly a 10% boost.

I just tested that using my little standalone testbed, and I failed
to replicate the result. I do see that strfromd is slightly faster,
but it's just a few percent measuring snprintf.c in isolation --- in
the overall context of COPY, I don't see how you get to 10% net savings.
So I continue to think there's something fishy about your test case.

BTW, so far as I can tell on F28, strfromd isn't exposed without
"-D__STDC_WANT_IEC_60559_BFP_EXT__", which seems fairly scary;
what else does that affect?

regards, tom lane

diff --git a/src/port/snprintf.c b/src/port/snprintf.c
index b9b6add..f75369c 100644
--- a/src/port/snprintf.c
+++ b/src/port/snprintf.c
@@ -1137,17 +1137,19 @@ fmtfloat(double value, char type, int forcesign, int leftjust,
 		zeropadlen = precision - prec;
 		fmt[0] = '%';
 		fmt[1] = '.';
-		fmt[2] = '*';
-		fmt[3] = type;
-		fmt[4] = '\0';
-		vallen = sprintf(convert, fmt, prec, value);
+		fmt[2] = (prec / 100) + '0';
+		fmt[3] = ((prec % 100) / 10) + '0';
+		fmt[4] = (prec % 10) + '0';
+		fmt[5] = type;
+		fmt[6] = '\0';
+		vallen = strfromd(convert, sizeof(convert), fmt, value);
 	}
 	else
 	{
 		fmt[0] = '%';
 		fmt[1] = type;
 		fmt[2] = '\0';
-		vallen = sprintf(convert, fmt, value);
+		vallen = strfromd(convert, sizeof(convert), fmt, value);
 	}
 	if (vallen < 0)
 		goto fail;
Re: Performance improvements for src/port/snprintf.c
Andres Freund writes: > On 2018-10-03 13:18:35 -0400, Tom Lane wrote: >> Having said that, maybe there's a case for changing the type spec >> in only the va_arg() call, and leaving snprintf's related local >> variables as int64. (Is that what you actually meant?) Then, >> if long long really is different from int64, at least we have >> predictable truncation behavior after fetching the value, rather >> than undefined behavior while fetching it. > Hm. I guess that'd be a bit better, but I'm not sure it's worth it. How > about we simply add a static assert that long long isn't bigger than > int64? WFM, I'll make it happen. regards, tom lane
Re: Performance improvements for src/port/snprintf.c
On 2018-10-03 13:40:03 -0400, Tom Lane wrote: > Andres Freund writes: > > On 2018-10-03 13:31:09 -0400, Tom Lane wrote: > >> I do not see the point of messing with snprintf.c here. I doubt that > >> strfromd is faster than the existing sprintf call (because the latter > >> can use ".*" instead of serializing and deserializing the precision). > > > I'm confused, the numbers I posted clearly show that it's faster? > > Those were in the context of whether float8out went through snprintf.c > or directly to strfromd, no? I measured both, changing float8out directly, and just adapting snprintf.c: > snprintf using sprintf via pg_double_to_string: > 16195.787 > > snprintf using strfromd via pg_double_to_string: > 14856.974 ms > > float8out using sprintf via pg_double_to_string: > 16176.169 > > float8out using strfromd via pg_double_to_string: > 13532.698 So when using pg's snprintf() to print a single floating point number with precision, we get nearly a 10% boost. The win unsurprisingly is bigger when not going through snprintf.c. Greetings, Andres Freund
Re: Performance improvements for src/port/snprintf.c
Hi, On 2018-10-03 13:18:35 -0400, Tom Lane wrote: > I wrote: > > Andres Freund writes: > >> - I know it's not new, but is it actually correct to use va_arg(args, > >> int64) > >> for ATYPE_LONGLONG? > > > Well, the problem with just doing s/int64/long long/g is that the > > code would then fail on compilers without a "long long" type. > > We could ifdef our way around that, but I don't think the code would > > end up prettier. > > I spent a bit more time thinking about that point. My complaint about > lack of long long should be moot given that we're now requiring C99. True, I didn't think of that. > So the two cases we need to worry about are (1) long long exists and > is 64 bits, and (2) long long exists and is wider than 64 bits. In > case (1) there's nothing actively wrong with the code as it stands. > In case (2), if we were to fix the problem by s/int64/long long/g, > the result would be that we'd be doing the arithmetic for all > integer-to-text conversions in 128 bits, which seems likely to be > pretty expensive. Yea, that seems quite undesirable. > So a "real" fix would probably require having separate versions of > fmtint for long and long long. I'm not terribly excited about > going there. I can see it happening some day when/if we need to > use 128-bit math more extensively than today, but I do not think > that day is close. Right, that seems a bit off. > (Are there *any* platforms where "long long" is 128 bits today?) Not that I'm aware off. > Having said that, maybe there's a case for changing the type spec > in only the va_arg() call, and leaving snprintf's related local > variables as int64. (Is that what you actually meant?) Then, > if long long really is different from int64, at least we have > predictable truncation behavior after fetching the value, rather > than undefined behavior while fetching it. Hm. I guess that'd be a bit better, but I'm not sure it's worth it. How about we simply add a static assert that long long isn't bigger than int64? 
Greetings, Andres Freund
Re: Performance improvements for src/port/snprintf.c
Andres Freund writes: > On 2018-10-03 13:31:09 -0400, Tom Lane wrote: >> I do not see the point of messing with snprintf.c here. I doubt that >> strfromd is faster than the existing sprintf call (because the latter >> can use ".*" instead of serializing and deserializing the precision). > I'm confused, the numbers I posted clearly show that it's faster? Those were in the context of whether float8out went through snprintf.c or directly to strfromd, no? regards, tom lane
Re: Performance improvements for src/port/snprintf.c
Hi, On 2018-10-03 13:31:09 -0400, Tom Lane wrote: > I do not see the point of messing with snprintf.c here. I doubt that > strfromd is faster than the existing sprintf call (because the latter > can use ".*" instead of serializing and deserializing the precision). I'm confused, the numbers I posted clearly show that it's faster? Greetings, Andres Freund
Re: Performance improvements for src/port/snprintf.c
Andres Freund writes: > On 2018-10-03 12:54:52 -0400, Tom Lane wrote: >> (1) The need to use sprintf for portability means that we need very >> tight constraints on the precision spec *and* the buffer size *and* >> the format type (%f pretty much destroys certainty about how long the >> output string is). So this isn't going to be general purpose code. >> I think just writing it into float[48]out is sufficient. > Well, the numbers suggest it's also useful to do so from snprintf - it's > not that rare that we output floating point numbers from semi > performance critical code, even leaving aside float[48]out. So I'm not > convinced that we shouldn't do this from within snprintf.c too. Now we > could open-code it twice, but i'm not sure I see the point. I do not see the point of messing with snprintf.c here. I doubt that strfromd is faster than the existing sprintf call (because the latter can use ".*" instead of serializing and deserializing the precision). Even if it is, I do not want to expose an attractive-nuisance API in a header, and I think this would be exactly that. > If we just define the API as having to guarantee there's enough space > for the output format, I think it'll work well enough for now? No, because that's a recipe for buffer-overflow bugs. It's *hard* to be sure the buffer is big enough, and easy to make breakable assumptions. > snprintf.c already assumes everything floating point can be output in > 1024 chars, no? Indeed, and it's got hacks like a forced limit to precision 350 in order to make that safe. I don't want to be repeating the reasoning in fmtfloat() in a bunch of other places. regards, tom lane
Re: Performance improvements for src/port/snprintf.c
I wrote: > Andres Freund writes: >> - I know it's not new, but is it actually correct to use va_arg(args, int64) >> for ATYPE_LONGLONG? > Well, the problem with just doing s/int64/long long/g is that the > code would then fail on compilers without a "long long" type. > We could ifdef our way around that, but I don't think the code would > end up prettier. I spent a bit more time thinking about that point. My complaint about lack of long long should be moot given that we're now requiring C99. So the two cases we need to worry about are (1) long long exists and is 64 bits, and (2) long long exists and is wider than 64 bits. In case (1) there's nothing actively wrong with the code as it stands. In case (2), if we were to fix the problem by s/int64/long long/g, the result would be that we'd be doing the arithmetic for all integer-to-text conversions in 128 bits, which seems likely to be pretty expensive. So a "real" fix would probably require having separate versions of fmtint for long and long long. I'm not terribly excited about going there. I can see it happening some day when/if we need to use 128-bit math more extensively than today, but I do not think that day is close. (Are there *any* platforms where "long long" is 128 bits today?) Having said that, maybe there's a case for changing the type spec in only the va_arg() call, and leaving snprintf's related local variables as int64. (Is that what you actually meant?) Then, if long long really is different from int64, at least we have predictable truncation behavior after fetching the value, rather than undefined behavior while fetching it. regards, tom lane
Re: Performance improvements for src/port/snprintf.c
Hi, On 2018-10-03 12:54:52 -0400, Tom Lane wrote: > Andres Freund writes: > > It seems the general "use strfromd if available" approach is generally > > useful, even if we need to serialize the precision. > > Agreed. > > > Putting it into an > > inline appears to be helpful, avoids some of the otherwise precision > > related branches. Do you have any feelings about which header to put > > the code in? I used common/string.h so far. > > I do not think it should be in a header, for two reasons: > > (1) The need to use sprintf for portability means that we need very > tight constraints on the precision spec *and* the buffer size *and* > the format type (%f pretty much destroys certainty about how long the > output string is). So this isn't going to be general purpose code. > I think just writing it into float[48]out is sufficient. Well, the numbers suggest it's also useful to do so from snprintf - it's not that rare that we output floating point numbers from semi performance critical code, even leaving aside float[48]out. So I'm not convinced that we shouldn't do this from within snprintf.c too. Now we could open-code it twice, but i'm not sure I see the point. If we just define the API as having to guarantee there's enough space for the output format, I think it'll work well enough for now? snprintf.c already assumes everything floating point can be output in 1024 chars, no? Greetings, Andres Freund
Re: Performance improvements for src/port/snprintf.c
Andres Freund writes: > It seems the general "use strfromd if available" approach is generally > useful, even if we need to serialize the precision. Agreed. > Putting it into an > inline appears to be helpful, avoids some of the otherwise precision > related branches. Do you have any feelings about which header to put > the code in? I used common/string.h so far. I do not think it should be in a header, for two reasons: (1) The need to use sprintf for portability means that we need very tight constraints on the precision spec *and* the buffer size *and* the format type (%f pretty much destroys certainty about how long the output string is). So this isn't going to be general purpose code. I think just writing it into float[48]out is sufficient. (2) It's already the case that most code trying to emit floats ought to go through float[48]out, in order to have standardized treatment of Inf and NaN. Providing some other API in a common header would just create a temptation to break that policy. Now, if we did write our own float output code then we would standardize Inf/NaN outputs inside that, and both of these issues would go away ... but I think what we'd do is provide something strfromd-like as an alternate API for that code, so we still won't need a wrapper. And anyway it doesn't sound like either of us care to jump that hurdle right now. regards, tom lane
Re: Performance improvements for src/port/snprintf.c
Andres Freund writes: > On 2018-10-03 11:59:27 -0400, Tom Lane wrote: >> I experimented with adding an initial check for "format is exactly %s" >> at the top of dopr(), and couldn't get excited about that. Instrumenting >> things showed that the optimization fired in only 1.8% of the calls >> during a run of our core regression tests. Now, that might not count >> as a really representative workload, but it doesn't make me think that >> the case is worth optimizing for us. > Seems right. I also have a hard time to believe that any of those "%s" > printfs are performance critical - we'd hopefully just have avoided the > sprintf in that case. Yup, that's probably a good chunk of the reason why there aren't very many. But we *do* use %s a lot to assemble multiple strings or combine them with fixed text, which is why the other form of the optimization is useful. >> But then it occurred to me that there's more than one way to skin this >> cat. We could, for an even cheaper extra test, detect that any one >> format specifier is just "%s", and use the same kind of fast-path >> within the loop. With the same sort of instrumentation, I found that >> a full 45% of the format specs executed in the core regression tests >> are just %s. That makes me think that a patch along the lines of the >> attached is a good win for our use-cases. Comparing to Fedora 28's >> glibc, this gets us to > Hm, especially if we special case the float->string conversions directly > at the hot callsites, that seems reasonable. I kinda wish we could just > easily move the format string processing to compile-time, but given > translatability that won't be widely possible even if it were otherwise > feasible. Yeah, there's a mighty big pile of infrastructure that depends on the way *printf works. I agree that one way or another we're going to be special-casing float8out and float4out. 
BTW, I poked around in the related glibc sources the other day, and it seemed like they are doing some sort of quasi-compilation of format strings. I couldn't figure out how they made it pay, though --- without some way to avoid re-compiling the same format string over and over, seems like it couldn't net out as a win. But if they are avoiding that, I didn't find where. regards, tom lane
Re: Performance improvements for src/port/snprintf.c
Hi,

On 2018-10-03 12:22:13 -0400, Tom Lane wrote:
> Andres Freund writes:
> > On 2018-10-03 12:07:32 -0400, Tom Lane wrote:
> >> [ scratches head ... ] How would that work? Seems like it necessarily
> >> adds a strlen() call to whatever we'd be doing otherwise. palloc isn't
> >> going to be any faster just from asking it for slightly fewer bytes.
> >> I think there might be something wrong with your test scenario ...
> >> or there's more noise in the numbers than you thought.
>
> > I guess the difference is that we're more likely to find reusable chunks
> > in aset.c and/or need fewer OS allocations. As the memory is going to
> > be touched again very shortly afterwards, the cache effects probably are
> > negligible.
>
> > The strlen definitely shows up in profiles, it just seems to save at
> > least as much as it costs.
>
> > Doesn't strike me as THAT odd?
>
> What it strikes me as is excessively dependent on one particular test
> scenario. I don't mind optimizations that are tradeoffs between
> well-understood costs, but this smells like handwaving that's going to
> lose as much or more often than winning, once it hits the real world.

I'm not particularly wedded to doing the allocation differently - I was
just mildly wondering if the increased size of the allocations could be
problematic. And that led me to testing that. And reporting it. I don't
think the real-world test differences are that large in this specific
case, but whatever.

It seems the general "use strfromd if available" approach is generally
useful, even if we need to serialize the precision. Putting it into an
inline appears to be helpful, avoids some of the otherwise
precision-related branches. Do you have any feelings about which header
to put the code in? I used common/string.h so far.

Greetings,

Andres Freund
Re: Performance improvements for src/port/snprintf.c
Hi, On 2018-10-03 11:59:27 -0400, Tom Lane wrote: > I wrote: > > ... However, I did add recent glibc (Fedora 28) > > to the mix, and I was interested to discover that they seem to have > > added a fast-path for format strings that are exactly "%s", just as > > NetBSD did. I wonder if we should reconsider our position on doing > > that. It'd be a simple enough addition... > > I experimented with adding an initial check for "format is exactly %s" > at the top of dopr(), and couldn't get excited about that. Instrumenting > things showed that the optimization fired in only 1.8% of the calls > during a run of our core regression tests. Now, that might not count > as a really representative workload, but it doesn't make me think that > the case is worth optimizing for us. Seems right. I also have a hard time to believe that any of those "%s" printfs are performance critical - we'd hopefully just have avoided the sprintf in that case. > But then it occurred to me that there's more than one way to skin this > cat. We could, for an even cheaper extra test, detect that any one > format specifier is just "%s", and use the same kind of fast-path > within the loop. With the same sort of instrumentation, I found that > a full 45% of the format specs executed in the core regression tests > are just %s. That makes me think that a patch along the lines of the > attached is a good win for our use-cases. Comparing to Fedora 28's > glibc, this gets us to Hm, especially if we special case the float->string conversions directly at the hot callsites, that seems reasonable. I kinda wish we could just easily move the format string processing to compile-time, but given translatability that won't be widely possible even if it were otherwise feasible. Greetings, Andres Freund
Re: libpq host/hostaddr/conninfo inconsistencies
Sorry for the late answer.

On 9/30/18 10:21 AM, Fabien COELHO wrote:

sh> psql "host=127.0.0.2 hostaddr=127.0.0.1"

I'm not sure that this is the issue. The user defined the host name and
psql shows it.

The issue is that "host" is an ip, "\conninfo" will inform wrongly that
you are connected to "127.0.0.2", but the actual connection is really to
"127.0.0.1". This is plain misleading, and I consider this level of
unhelpfulness more a bug than a feature.

I didn't think that this is an issue, because I understood "host" as just
a host's display name when "hostaddr" is defined. So a user may name
127.0.0.1 (hostaddr) "happy_host", for example. It needn't be a real
host.

I searched for other use cases of PQhost(). In the PostgreSQL source code
I found that it is used in pg_dump and psql to connect to some instance.
There is the following issue with PQhost() and psql (pg_dump could have
it too, see CloneArchive() in pg_backup_archiver.c and _connectDB() in
pg_backup_db.c):

$ psql "host=host_1,host_2 hostaddr=127.0.0.1,127.0.0.3 dbname=postgres"
=# \conninfo
You are connected to database "postgres" as user "artur" on host "host_1" at port "5432".
=# \connect test
could not translate host name "host_1" to address: Name or service not known
Previous connection kept

So in the example above you cannot reuse the connection string with
\connect. What do you think?

I cannot agree with you. When I was learning libpq I found the
host/hostaddr rules description useful. And I disagree that it is good to
remove it (as the patch does). Of course it is only my point of view and
others may have another opinion.

I'm not sure I understand your concern. Do you mean that you would prefer
the document to keep describing that host/hostaddr/port accepts one
value, and then have in some other place or at the end of the option
documentation a line that says, "by the way, we really accept lists, and
they must be somehow consistent between host/hostaddr/port"?
I wrote about the following part of the documentation:

-Using hostaddr instead of host allows the
-application to avoid a host name look-up, which might be important
-in applications with time constraints. However, a host name is
-required for GSSAPI or SSPI authentication
-methods, as well as for verify-full SSL
-certificate verification. The following rules are used:
-
> ...

So I think the description of these rules is useful here and shouldn't be
removed. Your patch removes it, and maybe it shouldn't do that.

But now I realised that the patch breaks this behavior and backward
compatibility is broken. The patch gives me an error if I specify only
hostaddr:

psql -d "hostaddr=127.0.0.1"
psql: host "/tmp" cannot have an hostaddr "127.0.0.1"

This is the expected modified behavior: hostaddr can only be specified on
a host when it is a name, which is not the case here. See the comment
above about backward compatibility.

psql without the patch can connect to an instance if I specify only
hostaddr.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Re: Performance improvements for src/port/snprintf.c
Andres Freund writes:
> On 2018-10-03 12:07:32 -0400, Tom Lane wrote:
>> [ scratches head ... ] How would that work? Seems like it necessarily
>> adds a strlen() call to whatever we'd be doing otherwise. palloc isn't
>> going to be any faster just from asking it for slightly fewer bytes.
>> I think there might be something wrong with your test scenario ...
>> or there's more noise in the numbers than you thought.

> I guess the difference is that we're more likely to find reusable chunks
> in aset.c and/or need fewer OS allocations. As the memory is going to
> be touched again very shortly afterwards, the cache effects probably are
> negligible.

> The strlen definitely shows up in profiles, it just seems to save at
> least as much as it costs.

> Doesn't strike me as THAT odd?

What it strikes me as is excessively dependent on one particular test
scenario. I don't mind optimizations that are tradeoffs between
well-understood costs, but this smells like handwaving that's going to
lose as much or more often than winning, once it hits the real world.

regards, tom lane
Re: Performance improvements for src/port/snprintf.c
Hi,

On 2018-10-03 12:07:32 -0400, Tom Lane wrote:
> Andres Freund writes:
> > FWIW, it seems that using a local buffer and then pstrdup'ing that in
> > float8out_internal is a bit faster, and would probably save a bit of
> > memory on average:
> > float8out using sprintf via pg_double_to_string, pstrdup:
> > 15370.774
> > float8out using strfromd via pg_double_to_string, pstrdup:
> > 13498.331
>
> [ scratches head ... ] How would that work? Seems like it necessarily
> adds a strlen() call to whatever we'd be doing otherwise. palloc isn't
> going to be any faster just from asking it for slightly fewer bytes.
> I think there might be something wrong with your test scenario ...
> or there's more noise in the numbers than you thought.

I guess the difference is that we're more likely to find reusable chunks
in aset.c and/or need fewer OS allocations. As the memory is going to be
touched again very shortly afterwards, the cache effects probably are
negligible.

The strlen definitely shows up in profiles, it just seems to save at
least as much as it costs.

Doesn't strike me as THAT odd?

Greetings,

Andres Freund
Re: Performance improvements for src/port/snprintf.c
Andres Freund writes:
> FWIW, it seems that using a local buffer and then pstrdup'ing that in
> float8out_internal is a bit faster, and would probably save a bit of
> memory on average:
> float8out using sprintf via pg_double_to_string, pstrdup:
> 15370.774
> float8out using strfromd via pg_double_to_string, pstrdup:
> 13498.331

[ scratches head ... ] How would that work? Seems like it necessarily
adds a strlen() call to whatever we'd be doing otherwise. palloc isn't
going to be any faster just from asking it for slightly fewer bytes.
I think there might be something wrong with your test scenario ...
or there's more noise in the numbers than you thought.

regards, tom lane
Re: Performance improvements for src/port/snprintf.c
I wrote: > ... However, I did add recent glibc (Fedora 28) > to the mix, and I was interested to discover that they seem to have > added a fast-path for format strings that are exactly "%s", just as > NetBSD did. I wonder if we should reconsider our position on doing > that. It'd be a simple enough addition... I experimented with adding an initial check for "format is exactly %s" at the top of dopr(), and couldn't get excited about that. Instrumenting things showed that the optimization fired in only 1.8% of the calls during a run of our core regression tests. Now, that might not count as a really representative workload, but it doesn't make me think that the case is worth optimizing for us. But then it occurred to me that there's more than one way to skin this cat. We could, for an even cheaper extra test, detect that any one format specifier is just "%s", and use the same kind of fast-path within the loop. With the same sort of instrumentation, I found that a full 45% of the format specs executed in the core regression tests are just %s. That makes me think that a patch along the lines of the attached is a good win for our use-cases. Comparing to Fedora 28's glibc, this gets us to Test case: %s snprintf time = 8.83615 ms total, 8.83615e-06 ms per iteration pg_snprintf time = 23.9372 ms total, 2.39372e-05 ms per iteration ratio = 2.709 Test case: %sx snprintf time = 59.4481 ms total, 5.94481e-05 ms per iteration pg_snprintf time = 29.8983 ms total, 2.98983e-05 ms per iteration ratio = 0.503 versus what we have as of this morning's commit: Test case: %s snprintf time = 7.7427 ms total, 7.7427e-06 ms per iteration pg_snprintf time = 26.2439 ms total, 2.62439e-05 ms per iteration ratio = 3.390 Test case: %sx snprintf time = 61.4452 ms total, 6.14452e-05 ms per iteration pg_snprintf time = 32.7516 ms total, 3.27516e-05 ms per iteration ratio = 0.533 The penalty for non-%s cases seems to be a percent or so, although it's barely above the noise floor in my tests. 
regards, tom lane

diff --git a/src/port/snprintf.c b/src/port/snprintf.c
index cad7345..b9b6add 100644
*** a/src/port/snprintf.c
--- b/src/port/snprintf.c
*** dopr(PrintfTarget *target, const char *f
*** 431,436 ****
--- 431,449 ----
  			/* Process conversion spec starting at *format */
  			format++;
+
+ 			/* Fast path for conversion spec that is exactly %s */
+ 			if (*format == 's')
+ 			{
+ 				format++;
+ 				strvalue = va_arg(args, char *);
+ 				Assert(strvalue != NULL);
+ 				dostr(strvalue, strlen(strvalue), target);
+ 				if (target->failed)
+ 					break;
+ 				continue;
+ 			}
+
  			fieldwidth = precision = zpad = leftjust = forcesign = 0;
  			longflag = longlongflag = pointflag = 0;
  			fmtpos = accum = 0;
Re: Early WIP/PoC for inlining CTEs
On Tue, Oct 02, 2018 at 12:03:06PM +0200, Andreas Karlsson wrote:
> Hi,
>
> Here is an updated patch which adds some simple syntax for adding the
> optimization barrier. For example:
>
> WITH x AS MATERIALIZED (
>     SELECT 1
> )
> SELECT * FROM x;
>
> Andreas

This is great!

Is there any meaningful distinction between "inlining," by which I mean
converting to a subquery, and predicate pushdown, which would happen, at
least for a first cut, at the rewrite stage?

Best,
David.
-- 
David Fetter http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: Performance improvements for src/port/snprintf.c
Hi, On 2018-10-03 08:20:14 -0400, Tom Lane wrote: > Andres Freund writes: > >> While there might be value in implementing our own float printing code, > >> I have a pretty hard time getting excited about the cost/benefit ratio > >> of that. I think that what we probably really ought to do here is hack > >> float4out/float8out to bypass the extra overhead, as in the 0002 patch > >> below. > > > I'm thinking we should do a bit more than just that hack. I'm thinking > > of something (barely tested) like > > Meh. The trouble with that is that it relies on the platform's snprintf, > not sprintf, and that brings us right back into a world of portability > hurt. I don't feel that the move to C99 gets us out of worrying about > noncompliant snprintfs --- we're only requiring a C99 *compiler*, not > libc. See buildfarm member gharial for a counterexample. Oh, we could just use sprintf() and tell strfromd the buffer is large enough. I only used snprintf because it seemed more symmetric, and because I was at most 1/3 awake. > I'm happy to look into whether using strfromd when available buys us > anything over using sprintf. I'm not entirely convinced that it will, > because of the need to ASCII-ize and de-ASCII-ize the precision, but > it's worth checking. It's definitely faster. 
It's not a full-blown format parser, so I guess the cost of the
conversion isn't too bad:
https://sourceware.org/git/?p=glibc.git;a=blob;f=stdlib/strfrom-skeleton.c;hb=HEAD#l68

CREATE TABLE somefloats(id serial, data1 float8, data2 float8, data3 float8);
INSERT INTO somefloats(data1, data2, data3) SELECT random(), random(), random() FROM generate_series(1, 1000);
VACUUM FREEZE somefloats;

I'm comparing the times of:
COPY somefloats TO '/dev/null';

master (including your commit): 16177.202 ms
snprintf using sprintf via pg_double_to_string: 16195.787
snprintf using strfromd via pg_double_to_string: 14856.974 ms
float8out using sprintf via pg_double_to_string: 16176.169
float8out using strfromd via pg_double_to_string: 13532.698

FWIW, it seems that using a local buffer and then pstrdup'ing that in
float8out_internal is a bit faster, and would probably save a bit of
memory on average:

float8out using sprintf via pg_double_to_string, pstrdup: 15370.774
float8out using strfromd via pg_double_to_string, pstrdup: 13498.331

Greetings,

Andres Freund
Re: BUG #15307: Low numerical precision of (Co-) Variance
On Wed, Oct 3, 2018 at 9:04 AM Dean Rasheed wrote: > > On Thu, 27 Sep 2018 at 14:22, Dean Rasheed wrote: > > I'll post an updated patch in a while. > > > > I finally got round to looking at this again. Here is an update, based > on the review comments. > > The next question is whether or not to back-patch this. Although this > was reported as a bug, my inclination is to apply this to HEAD only, > based on the lack of prior complaints. That also matches our decision > for other similar patches, e.g., 7d9a4737c2. This diff looks good to me. Also, it applies cleanly against abd9ca377d669a6e0560e854d7e987438d0e612e and passes `make check-world`. I agree that this is not suitable for a patch release. Madeleine
Re: BUG #15307: Low numerical precision of (Co-) Variance
Dean Rasheed writes: > The next question is whether or not to back-patch this. Although this > was reported as a bug, my inclination is to apply this to HEAD only, > based on the lack of prior complaints. That also matches our decision > for other similar patches, e.g., 7d9a4737c2. +1 for no back-patch. Even though the new results are hopefully more accurate, they're still different from before, and that might cause issues for somebody if it happens in a stable branch. regards, tom lane
Re: BUG #15307: Low numerical precision of (Co-) Variance
On Thu, 27 Sep 2018 at 14:22, Dean Rasheed wrote:
> I'll post an updated patch in a while.

I finally got round to looking at this again. Here is an update, based
on the review comments.

The next question is whether or not to back-patch this. Although this
was reported as a bug, my inclination is to apply this to HEAD only,
based on the lack of prior complaints. That also matches our decision
for other similar patches, e.g., 7d9a4737c2.

Regards,
Dean

diff --git a/src/backend/utils/adt/float.c b/src/backend/utils/adt/float.c
index df35557..d1e12c1 100644
--- a/src/backend/utils/adt/float.c
+++ b/src/backend/utils/adt/float.c
@@ -2401,13 +2401,39 @@ setseed(PG_FUNCTION_ARGS)
  * float8_stddev_samp	- produce final result for float STDDEV_SAMP()
  * float8_stddev_pop	- produce final result for float STDDEV_POP()
  *
+ * The naive schoolbook implementation of these aggregates works by
+ * accumulating sum(X) and sum(X^2). However, this approach suffers from
+ * large rounding errors in the final computation of quantities like the
+ * population variance (N*sum(X^2) - sum(X)^2) / N^2, since each of the
+ * intermediate terms is potentially very large, while the difference is often
+ * quite small.
+ *
+ * Instead we use the Youngs-Cramer algorithm [1] which works by accumulating
+ * Sx=sum(X) and Sxx=sum((X-Sx/N)^2), using a numerically stable algorithm to
+ * incrementally update those quantities. The final computations of each of
+ * the aggregate values is then trivial and gives more accurate results (for
+ * example, the population variance is just Sxx/N). This algorithm is also
+ * fairly easy to generalize to allow parallel execution without loss of
+ * precision (see, for example, [2]). For more details, and a comparison of
+ * this with other algorithms, see [3].
+ *
  * The transition datatype for all these aggregates is a 3-element array
- * of float8, holding the values N, sum(X), sum(X*X) in that order.
+ * of float8, holding the values N, Sx, Sxx in that order.
  *
  * Note that we represent N as a float to avoid having to build a special
  * datatype. Given a reasonable floating-point implementation, there should
  * be no accuracy loss unless N exceeds 2 ^ 52 or so (by which time the
  * user will have doubtless lost interest anyway...)
+ *
+ * [1] Some Results Relevant to Choice of Sum and Sum-of-Product Algorithms,
+ * E. A. Youngs and E. M. Cramer, Technometrics Vol 13, No 3, August 1971.
+ *
+ * [2] Updating Formulae and a Pairwise Algorithm for Computing Sample
+ * Variances, T. F. Chan, G. H. Golub & R. J. LeVeque, COMPSTAT 1982.
+ *
+ * [3] Numerically Stable Parallel Computation of (Co-)Variance, Erich
+ * Schubert and Michael Gertz, Proceedings of the 30th International
+ * Conference on Scientific and Statistical Database Management, 2018.
  */

@@ -2441,18 +2467,90 @@ float8_combine(PG_FUNCTION_ARGS)
 	ArrayType  *transarray2 = PG_GETARG_ARRAYTYPE_P(1);
 	float8	   *transvalues1;
 	float8	   *transvalues2;
-
-	if (!AggCheckCallContext(fcinfo, NULL))
-		elog(ERROR, "aggregate function called in non-aggregate context");
+	float8		N1,
+				Sx1,
+				Sxx1,
+				N2,
+				Sx2,
+				Sxx2,
+				tmp,
+				N,
+				Sx,
+				Sxx;

 	transvalues1 = check_float8_array(transarray1, "float8_combine", 3);
 	transvalues2 = check_float8_array(transarray2, "float8_combine", 3);
-	transvalues1[0] = transvalues1[0] + transvalues2[0];
-	transvalues1[1] = float8_pl(transvalues1[1], transvalues2[1]);
-	transvalues1[2] = float8_pl(transvalues1[2], transvalues2[2]);
+	N1 = transvalues1[0];
+	Sx1 = transvalues1[1];
+	Sxx1 = transvalues1[2];

-	PG_RETURN_ARRAYTYPE_P(transarray1);
+	N2 = transvalues2[0];
+	Sx2 = transvalues2[1];
+	Sxx2 = transvalues2[2];
+
+	/*
+	 * The transition values combine using a generalization of the
+	 * Youngs-Cramer algorithm as follows:
+	 *
+	 * N = N1 + N2
+	 * Sx = Sx1 + Sx2
+	 * Sxx = Sxx1 + Sxx2 + N1 * N2 * (Sx1/N1 - Sx2/N2)^2 / N;
+	 *
+	 * It's worth handling the special cases N1 = 0 and N2 = 0 separately
+ * since those cases are trivial, and we then don't need to worry about + * division-by-zero errors in the general case. + * + */ + if (N1 == 0.0) + { + N = N2; + Sx = Sx2; + Sxx = Sxx2; + } + else if (N2 == 0.0) + { + N = N1; + Sx = Sx1; + Sxx = Sxx1; + } + else + { + N = N1 + N2; + Sx = float8_pl(Sx1, Sx2); + tmp = Sx1 / N1 - Sx2 / N2; + Sxx = Sxx1 + Sxx2 + N1 * N2 * tmp * tmp / N; + check_float8_val(Sxx, isinf(Sxx1) || isinf(Sxx2), true); + } + + /* + * If we're invoked as an aggregate, we can cheat and modify our first + * parameter in-place to reduce palloc overhead. Otherwise we construct a + * new array with the updated transition data and return it. + */ + if (AggCheckCallContext(fcinfo, NULL)) + { + transvalues1[0] = N; + transvalues1[1] = Sx; + transvalues1[2] = Sxx; + + PG_RETURN_ARRAYTYPE_P(transarray1); + } + else + { + Datum transdatums[3]; + ArrayType
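The Youngs-Cramer transition and combine steps described in the comment above are easy to sketch outside the server. The following Python illustration mirrors the arithmetic only (it is not the float.c code): the state is the same (N, Sx, Sxx) triple, and the combine step handles the N1 = 0 / N2 = 0 cases separately, as the patch does, to avoid division by zero.

```python
def yc_accum(state, x):
    """Youngs-Cramer transition step: state is (N, Sx, Sxx), where
    Sx = sum(X) and Sxx = sum((X - Sx/N)^2)."""
    n, sx, sxx = state
    n += 1.0
    sx += x
    if n > 1.0:
        tmp = x * n - sx
        sxx += tmp * tmp / (n * (n - 1.0))
    return (n, sx, sxx)

def yc_combine(s1, s2):
    """Combine two partial states, as in parallel aggregation."""
    n1, sx1, sxx1 = s1
    n2, sx2, sxx2 = s2
    if n1 == 0.0:          # trivial cases first, so the general
        return s2          # branch never divides by zero
    if n2 == 0.0:
        return s1
    n = n1 + n2
    tmp = sx1 / n1 - sx2 / n2
    return (n, sx1 + sx2, sxx1 + sxx2 + n1 * n2 * tmp * tmp / n)

def var_pop(state):
    """The final computation is trivial: population variance is Sxx/N."""
    n, _, sxx = state
    return sxx / n

# Values with a large common offset: the naive (N*sum(X^2) - sum(X)^2)/N^2
# formulation loses most of its precision here; Youngs-Cramer does not.
data = [1e9 + k for k in range(10)]
state = (0.0, 0.0, 0.0)
for x in data:
    state = yc_accum(state, x)
```

For this data, `var_pop(state)` returns 8.25, the exact population variance of the offsets 0..9, and combining two half-states gives the same answer, which is the parallel-execution property the comment mentions.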
Re: [RFC] Removing "magic" oids
On 09/30/2018 05:48 AM, Andres Freund wrote: > I think we should drop WITH OIDs support. +1 2) We need to be able to manually insert into catalog tables. Both initdb and emergency surgery. My current hack is doing so directly in nodeModifyTable.c but that's beyond ugly. I think we should add an explicit DEFAULT clause to those columns with something like nextoid('tablename', 'name_of_index') or such. Adding a function to get the next OID sounds like a good solution for both our catalog and legacy applications. The only potential disadvantage that I see is that this function becomes something we need to maintain if we ever decide to move from OIDs to sequences. 3) The quickest way to deal with the bootstrap code was to just assign all oids for oid-carrying tables that don't yet have any assigned. That doesn't generally seem terrible, although it's certainly badly implemented right now. That'd mean we'd probably have three ranges of oids, unless we somehow have the bootstrap code advance the in-database oid counter to the right state before we start with post-bootstrap work. I like the idea of all bootstrap-time oids being determined by genbki.pl (I think that'll allow us to remove some code too). Agreed, having genbki.pl determine all oids in the bootstrap data sounds ideal. Andreas
Re: Performance improvements for src/port/snprintf.c
Andres Freund writes: > On 2018-10-02 17:54:31 -0400, Tom Lane wrote: >> Here's a version of this patch rebased over commit 625b38ea0. > Cool. Let's get that in... Cool, I'll push it shortly. >> While there might be value in implementing our own float printing code, >> I have a pretty hard time getting excited about the cost/benefit ratio >> of that. I think that what we probably really ought to do here is hack >> float4out/float8out to bypass the extra overhead, as in the 0002 patch >> below. > I'm thinking we should do a bit more than just that hack. I'm thinking > of something (barely tested) like Meh. The trouble with that is that it relies on the platform's snprintf, not sprintf, and that brings us right back into a world of portability hurt. I don't feel that the move to C99 gets us out of worrying about noncompliant snprintfs --- we're only requiring a C99 *compiler*, not libc. See buildfarm member gharial for a counterexample. I'm happy to look into whether using strfromd when available buys us anything over using sprintf. I'm not entirely convinced that it will, because of the need to ASCII-ize and de-ASCII-ize the precision, but it's worth checking. > FWIW, I think there's still a significant argument to be made that we > should work on our floating point IO performance. Both on the input and > output side. It's a significant practical problem. But both a fix like > you describe, and my proposal, should bring us to at least the previous > level of performance for the hot paths. So that'd then just be an > independent consideration. Well, an independent project anyway. I concur that it would have value; but whether it's worth the effort, and the possible behavioral changes, is not very clear to me. regards, tom lane
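The round-trip concern behind the float8out discussion can be demonstrated outside C. A Python sketch (illustrative only, not server code; it assumes the historical behavior that float8out printed DBL_DIG = 15 significant digits by default, adjustable via extra_float_digits): 15 digits do not always reproduce the stored double, while 17 digits always round-trip on IEEE 754 systems.

```python
import random

def roundtrips(x, fmt):
    """True if formatting x with fmt and parsing it back is lossless."""
    return float(fmt % x) == x

random.seed(42)
xs = [random.uniform(-1e6, 1e6) for _ in range(1000)]

# 15 significant digits (DBL_DIG, the historical float8out default):
fail15 = sum(not roundtrips(x, "%.15g") for x in xs)

# 17 significant digits (DBL_DIG + 2) always suffice for a double:
fail17 = sum(not roundtrips(x, "%.17g") for x in xs)
```

Many random doubles fail the 15-digit round trip and none fail at 17, which is why a fast shortest-round-trip printer is attractive independent of the snprintf work.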
Re: partition tree inspection functions
Hi Michael, On 10/2/18 11:37 PM, Michael Paquier wrote: isleaf is not particularly useful as it can just be fetched with a join on pg_class/relkind. So I think that we had better remove it. Removing isleaf would require extra round trips to the server to get that information. So, I think we should keep it. Best regards, Jesper
Re[2]: Alter index rename concurrently to
> Based on these assertions, here is my proposed patch. It lowers the > lock level for index renaming to ShareUpdateExclusiveLock. Hi, Peter, I did a small review of your patch: Server source code was taken from https://github.com/postgres/postgres.git 1. The patch applied without errors, except for the documentation part: error: patch failed: doc/src/sgml/ref/alter_index.sgml:50 error: doc/src/sgml/ref/alter_index.sgml: patch does not apply 2. The code compiled successfully, configured with: # ./configure CFLAGS="-O0" --enable-debug --enable-cassert --enable-depend --without-zlib 3. 'make' / 'make install' completed successfully. 4. The compiled instance started without errors. 5. I ran several pgbench tests (with -c 4 -j 4 -T 360) that modified data in columns indexed by a primary key and an ordinary btree index. At the same time, a script repeatedly renamed the indexes inside transactions containing pg_sleep(1). After several runs, no errors were found. 6. pg_upgrade from 10 to 12 (patched) completed without errors or warnings. 7. Code style: +RenameRelationInternal(Oid myrelid, const char *newrelname, bool is_internal, bool is_index) This line is longer than 80 chars. Thank you >Tuesday, August 14, 2018, 9:33 +03:00 from Peter Eisentraut >: > >On 31/07/2018 23:10, Peter Eisentraut wrote: >> On 27/07/2018 16:16, Robert Haas wrote: >>> With respect to this particular patch, I don't know whether there are >>> any hazards of the second type. What I'd check, if it were me, is >>> what structures in the index's relcache entry are going to get rebuilt >>> when the index name changes. If any of those look like things that >>> somebody could hold a pointer to during the course of query execution, >>> or more generally be relying on not to change during the course of >>> query execution, then you've got a problem. >> >> I have investigated this, and I think it's safe. 
Relcache reloads for >> open indexes are already handled specially in RelationReloadIndexInfo(). >> The only pointers that get replaced there are rd_amcache and >> rd_options. I have checked where those are used, and it looks like they >> are not used across possible relcache reloads. The code structure in >> those two cases make this pretty unlikely even in the future. Also, >> violations would probably be caught by CLOBBER_CACHE_ALWAYS. > >Based on these assertions, here is my proposed patch. It lowers the >lock level for index renaming to ShareUpdateExclusiveLock. > >-- >Peter Eisentraut http://www.2ndQuadrant.com/ >PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Regards, Andrey Klychkov
shared buffer manager problems and redesign
Hello, hackers (Actually, I'm not sure whether I should post this here or on the pgsql-general mailing list.) There have been several recent discussions about the time-consuming scan of the shared buffer mapping that happens in TRUNCATE and in VACUUM when dropping trailing empty pages [1], about data corruption on file-truncation errors [2], etc. Buffer Manager redesign/restructure Andres Freund has worked on this before and described a design for how a buffer radix tree could be implemented [4]. I think it is worth considering because many solutions were proposed in previous threads [1] [2] [3] [4], but we could not reach consensus, as it was pointed out that some methods introduce additional complexity. An ordered buffer mapping (replacing the current buffer mapping implementation) kept coming up in those discussions as something that would potentially address the problems. But before we work on a POC patch, I think we should start a common discussion covering a.) the data structure design of the modified buffer manager: open-relations hash table, buffer radix tree, etc. b.) buffer tags and locks c.) implementation and operations (loading pages, flushing/writing out buffers), complexities, etc. However, from what I understand, it is realistically not possible to have this committed by PG12 given the complexity and time required. There is also the question of whether some of these problems could be solved without redesigning the shared buffer manager. So I think it is important to discuss this soon. I hope to hear more insights, ideas, suggestions, truth-bombs, and so on. :) Thank you very much. 
Regards, Kirk References [1] "reloption to prevent VACUUM from truncating empty pages at the end of relation" https://www.postgresql.org/message-id/flat/CAHGQGwE5UqFqSq1%3DkV3QtTUtXphTdyHA-8rAj4A%3DY%2Be4kyp3BQ%40mail.gmail.com [2] "Truncation failure in autovacuum results in data corruption (duplicate keys)" https://www.postgresql.org/message-id/flat/5BBC590AE8DF4ED1A170E4D48F1B53AC@tunaPC [3] "WIP: long transactions on hot standby feedback replica / proof of concept" https://www.postgresql.org/message-id/flat/c9374921e50a5e8fb1ecf04eb8c6ebc3%40postgrespro.ru [4] "Reducing the size of BufferTag & remodeling forks" https://www.postgresql.org/message-id/flat/20150702133619.GB16267%40alap3.anarazel.de
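To make the motivation for an ordered buffer mapping concrete: with today's flat hash table keyed by buffer tag, truncation must scan every buffer header to find a relation's buffers, whereas any structure enumerable per relation (a radix tree as Andres sketched, or otherwise) reduces that to a walk over just that relation's buffers. A toy Python illustration of the idea only, not a proposal for the actual server data structure:

```python
from collections import defaultdict

class BufferMapping:
    """Toy buffer mapping keyed per relation, so truncation visits
    only the target relation's buffers instead of scanning all of
    shared_buffers."""
    def __init__(self):
        self.by_rel = defaultdict(dict)   # rel -> {blocknum: buffer_id}

    def insert(self, rel, blocknum, buf_id):
        self.by_rel[rel][blocknum] = buf_id

    def truncate(self, rel, new_nblocks):
        """Drop buffers at or beyond new_nblocks.  Cost is proportional
        to the relation's buffer count, not to NBuffers."""
        blocks = self.by_rel[rel]
        dropped = [b for b in blocks if b >= new_nblocks]
        for b in dropped:
            del blocks[b]
        return sorted(dropped)
```

The real design questions in the thread (buffer tags, locking, flush ordering) are exactly the parts this toy omits.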
Re: [HACKERS] Transactions involving multiple postgres foreign servers, take 2
On Wed, Oct 3, 2018 at 9:41 AM Chris Travers wrote: > The following review has been posted through the commitfest application: > make installcheck-world: tested, failed > Implements feature: not tested > Spec compliant: not tested > Documentation:tested, failed > > I am hoping I am not out of order in writing this before the commitfest > starts. The patch is big and long and so wanted to start on this while > traffic is slow. > > I find this patch quite welcome and very close to a minimum viable > version. The few significant limitations can be resolved later. One thing > I may have missed in the documentation is a discussion of the limits of the > current approach. I think this would be important to document because the > caveats of the current approach are significant, but the people who need it > will have the knowledge to work with issues if they come up. > > The major caveat I see in our past discussions and (if I read the patch > correctly) is that the resolver goes through global transactions > sequentially and does not move on to the next until the previous one is > resolved. This means that if I have a global transaction on server A, with > foreign servers B and C, and I have another one on server A with foreign > servers C and D, if server B goes down at the wrong moment, the background > worker does not look like it will detect the failure and move on to try to > resolve the second, so server D will have a badly set vacuum horizon until > this is resolved. Also if I read the patch correctly, it looks like one > can invoke SQL commands to remove the bad transaction to allow processing > to continue and manual resolution (this is good and necessary because in > this area there is no ability to have perfect recoverability without > occasional administrative action). I would really like to see more > documentation of failure cases and appropriate administrative action at > present. Otherwise this is I think a minimum viable addition and I think > we want it. 
> > It is possible i missed that in the documentation. If so, my objection > stands aside. If it is welcome I am happy to take a first crack at such > docs. > After further testing I am pretty sure I misread the patch. It looks like one can have multiple resolvers which can, in fact, work through a queue together solving this problem. So the objection above is not valid and I withdraw that objection. I will re-review the docs in light of the experience. > > To my mind thats the only blocker in the code (but see below). I can say > without a doubt that I would expect we would use this feature once > available. > > -- > > Testing however failed. > > make installcheck-world fails with errors like the following: > > -- Modify foreign server and raise an error > BEGIN; > INSERT INTO ft7_twophase VALUES(8); > + ERROR: prepread foreign transactions are disabled > + HINT: Set max_prepared_foreign_transactions to a nonzero value. > INSERT INTO ft8_twophase VALUES(NULL); -- violation > ! ERROR: current transaction is aborted, commands ignored until end of > transaction block > ROLLBACK; > SELECT * FROM ft7_twophase; > ! ERROR: prepread foreign transactions are disabled > ! HINT: Set max_prepared_foreign_transactions to a nonzero value. > SELECT * FROM ft8_twophase; > ! ERROR: prepread foreign transactions are disabled > ! HINT: Set max_prepared_foreign_transactions to a nonzero value. > -- Rollback foreign transaction that involves both 2PC-capable > -- and 2PC-non-capable foreign servers. > BEGIN; > INSERT INTO ft8_twophase VALUES(7); > + ERROR: prepread foreign transactions are disabled > + HINT: Set max_prepared_foreign_transactions to a nonzero value. > INSERT INTO ft9_not_twophase VALUES(7); > + ERROR: current transaction is aborted, commands ignored until end of > transaction block > ROLLBACK; > SELECT * FROM ft8_twophase; > ! ERROR: prepread foreign transactions are disabled > ! HINT: Set max_prepared_foreign_transactions to a nonzero value. 
> > make installcheck in the contrib directory shows the same, so that's the > easiest way of reproducing, at least on a new installation. I think the > test cases will have to handle that sort of setup. > > make check in the contrib directory passes. > > For reasons of test failures, I am setting this back to waiting on author. > > -- > I had a few other thoughts that I figure are worth sharing with the > community on this patch with the idea that once it is in place, this may > open up more options for collaboration in the area of federated and > distributed storage generally. I could imagine other foreign data wrappers > using this API, and folks might want to refactor out the atomic handling > part so that extensions that do not use the foreign data wrapper structure > could use it as well (while this looks like a classic SQL/MED issue, I am > not sure that only foreign data wrappers woul
Re: [HACKERS] Transactions involving multiple postgres foreign servers, take 2
On Wed, Oct 3, 2018 at 9:56 AM Chris Travers wrote: > > > On Wed, Oct 3, 2018 at 9:41 AM Chris Travers > wrote: > >> >> (errmsg("preparing foreign transactions > (max_prepared_foreign_transactions > 0) requires maX_foreign_xact_resolvers > > 0"))); > Two more critical notes here which I think are small blockers. The error message above references a config variable that does not exist. The correct name of the config parameter is max_foreign_transaction_resolvers. Setting that, along with max_prepared_foreign_transactions and max_prepared_transactions, to 10 caused the tests to pass, but they still fail on a default configuration. > > > > -- > Best Regards, > Chris Travers > Head of Database > > Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com > Saarbrücker Straße 37a, 10405 Berlin > > -- Best Regards, Chris Travers Head of Database Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com Saarbrücker Straße 37a, 10405 Berlin
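For anyone trying to reproduce the passing runs, the three parameters Chris identifies amount to a postgresql.conf fragment like the following (the value 10 is simply what was used in his testing; it is not a recommendation):

```
max_prepared_transactions = 10
max_prepared_foreign_transactions = 10
max_foreign_transaction_resolvers = 10
```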
Re: [HACKERS] Transactions involving multiple postgres foreign servers, take 2
On Wed, Oct 3, 2018 at 9:41 AM Chris Travers wrote: > The following review has been posted through the commitfest application: > make installcheck-world: tested, failed > Implements feature: not tested > Spec compliant: not tested > Documentation:tested, failed > Also one really minor point: I think this is a typo (maX vs max)? (errmsg("preparing foreign transactions (max_prepared_foreign_transactions > 0) requires maX_foreign_xact_resolvers > 0"))); -- Best Regards, Chris Travers Head of Database Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com Saarbrücker Straße 37a, 10405 Berlin
Re: pgsql: Improve autovacuum logging for aggressive and anti-wraparound ru
On Wed, Oct 03, 2018 at 04:37:29PM +0900, Masahiko Sawada wrote: > Thank you for the comment. Attached the updated patch. Thanks for the patch. This looks clean to me at quick glance, not tested though.. -- Michael signature.asc Description: PGP signature
Re: Segfault when creating partition with a primary key and sql_drop trigger exists
On Fri, Sep 28, 2018 at 12:17:00PM +0900, Michael Paquier wrote: > I think that Alvaro should definitely look at this patch to be sure, or > I could do it, but I would need to spend way more time on this and check > event trigger interactions. > > Anyway, I was struggling a bit regarding the location where adding a > regression test. event_trigger.sql makes the most sense but in tests > for drops the objects are created before the event trigger is defined, > so that would need to move around so as the original problem is > reproducible. Perhaps you have an idea for that? Okay. I have spent more time on this issue, and I have been able to integrate a test in the existing event_trigger.sql which is able to reproduce the reported failure. Attached is what I am finishing with. I still want to do more testing on it, and the day is ending here. Thoughts? -- Michael diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c index 9229f619d2..b0cb5514d3 100644 --- a/src/backend/catalog/index.c +++ b/src/backend/catalog/index.c @@ -50,6 +50,7 @@ #include "catalog/pg_type.h" #include "catalog/storage.h" #include "commands/tablecmds.h" +#include "commands/event_trigger.h" #include "commands/trigger.h" #include "executor/executor.h" #include "miscadmin.h" @@ -212,7 +213,8 @@ relationHasPrimaryKey(Relation rel) void index_check_primary_key(Relation heapRel, IndexInfo *indexInfo, - bool is_alter_table) + bool is_alter_table, + IndexStmt *stmt) { List *cmds; int i; @@ -280,7 +282,11 @@ index_check_primary_key(Relation heapRel, * unduly. 
*/ if (cmds) + { + EventTriggerAlterTableStart((Node *) stmt); AlterTableInternal(RelationGetRelid(heapRel), cmds, true); + EventTriggerAlterTableEnd(); + } } /* diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c index ab3d9a0a48..4fc279e86f 100644 --- a/src/backend/commands/indexcmds.c +++ b/src/backend/commands/indexcmds.c @@ -666,7 +666,7 @@ DefineIndex(Oid relationId, * Extra checks when creating a PRIMARY KEY index. */ if (stmt->primary) - index_check_primary_key(rel, indexInfo, is_alter_table); + index_check_primary_key(rel, indexInfo, is_alter_table, stmt); /* * If this table is partitioned and we're creating a unique index or a diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index c145385f84..27f97fdff3 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -7074,7 +7074,7 @@ ATExecAddIndexConstraint(AlteredTableInfo *tab, Relation rel, /* Extra checks needed if making primary key */ if (stmt->primary) - index_check_primary_key(rel, indexInfo, true); + index_check_primary_key(rel, indexInfo, true, stmt); /* Note we currently don't support EXCLUSION constraints here */ if (stmt->primary) diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h index f20c5f789b..35a29f3498 100644 --- a/src/include/catalog/index.h +++ b/src/include/catalog/index.h @@ -40,7 +40,8 @@ typedef enum extern void index_check_primary_key(Relation heapRel, IndexInfo *indexInfo, - bool is_alter_table); + bool is_alter_table, + IndexStmt *stmt); #define INDEX_CREATE_IS_PRIMARY(1 << 0) #define INDEX_CREATE_ADD_CONSTRAINT (1 << 1) diff --git a/src/test/regress/expected/event_trigger.out b/src/test/regress/expected/event_trigger.out index 6175a10d77..15528a7cb2 100644 --- a/src/test/regress/expected/event_trigger.out +++ b/src/test/regress/expected/event_trigger.out @@ -349,6 +349,18 @@ CREATE SCHEMA evttrig CREATE TABLE one (col_a SERIAL PRIMARY KEY, col_b text DEFAULT 'forty two') 
CREATE INDEX one_idx ON one (col_b) CREATE TABLE two (col_c INTEGER CHECK (col_c > 0) REFERENCES one DEFAULT 42); +-- Partitioned tables with shared indexes +CREATE TABLE evttrig.parted ( +id int PRIMARY KEY) +PARTITION BY RANGE (id); +CREATE TABLE evttrig.part_1_10 PARTITION OF evttrig.parted (id) + FOR VALUES FROM (1) TO (10); +CREATE TABLE evttrig.part_10_20 PARTITION OF evttrig.parted (id) + FOR VALUES FROM (10) TO (20) PARTITION BY RANGE (id); +CREATE TABLE evttrig.part_10_15 PARTITION OF evttrig.part_10_20 (id) + FOR VALUES FROM (10) TO (15); +CREATE TABLE evttrig.part_15_20 PARTITION OF evttrig.part_10_20 (id) + FOR VALUES FROM (15) TO (20); ALTER TABLE evttrig.two DROP COLUMN col_c; NOTICE: NORMAL: orig=t normal=f istemp=f type=table column identity=evttrig.two.col_c name={evttrig,two,col_c} args={} NOTICE: NORMAL: orig=f normal=t istemp=f type=table constraint identity=two_col_c_check on evttrig.two name={evttrig,two,two_col_c_check} args={} @@ -359,14 +371,20 @@ NOTICE: NORMAL: orig=t normal=f istemp=f type=table constraint identity=one_pke DROP INDEX evttrig.one_idx; NOTICE: NORMAL: orig=t normal=f istemp=f type=index identity=evttrig.one_idx name={evttrig,one_idx} args={} DROP SCHEMA evttrig CASCADE; -NOTICE: drop cascades to 2 other objects +NOT
Re: [HACKERS] Transactions involving multiple postgres foreign servers, take 2
The following review has been posted through the commitfest application: make installcheck-world: tested, failed Implements feature: not tested Spec compliant: not tested Documentation:tested, failed I am hoping I am not out of order in writing this before the commitfest starts. The patch is big and long and so wanted to start on this while traffic is slow. I find this patch quite welcome and very close to a minimum viable version. The few significant limitations can be resolved later. One thing I may have missed in the documentation is a discussion of the limits of the current approach. I think this would be important to document because the caveats of the current approach are significant, but the people who need it will have the knowledge to work with issues if they come up. The major caveat I see in our past discussions and (if I read the patch correctly) is that the resolver goes through global transactions sequentially and does not move on to the next until the previous one is resolved. This means that if I have a global transaction on server A, with foreign servers B and C, and I have another one on server A with foreign servers C and D, if server B goes down at the wrong moment, the background worker does not look like it will detect the failure and move on to try to resolve the second, so server D will have a badly set vacuum horizon until this is resolved. Also if I read the patch correctly, it looks like one can invoke SQL commands to remove the bad transaction to allow processing to continue and manual resolution (this is good and necessary because in this area there is no ability to have perfect recoverability without occasional administrative action). I would really like to see more documentation of failure cases and appropriate administrative action at present. Otherwise this is I think a minimum viable addition and I think we want it. It is possible i missed that in the documentation. If so, my objection stands aside. 
If it is welcome I am happy to take a first crack at such docs. To my mind thats the only blocker in the code (but see below). I can say without a doubt that I would expect we would use this feature once available. -- Testing however failed. make installcheck-world fails with errors like the following: -- Modify foreign server and raise an error BEGIN; INSERT INTO ft7_twophase VALUES(8); + ERROR: prepread foreign transactions are disabled + HINT: Set max_prepared_foreign_transactions to a nonzero value. INSERT INTO ft8_twophase VALUES(NULL); -- violation ! ERROR: current transaction is aborted, commands ignored until end of transaction block ROLLBACK; SELECT * FROM ft7_twophase; ! ERROR: prepread foreign transactions are disabled ! HINT: Set max_prepared_foreign_transactions to a nonzero value. SELECT * FROM ft8_twophase; ! ERROR: prepread foreign transactions are disabled ! HINT: Set max_prepared_foreign_transactions to a nonzero value. -- Rollback foreign transaction that involves both 2PC-capable -- and 2PC-non-capable foreign servers. BEGIN; INSERT INTO ft8_twophase VALUES(7); + ERROR: prepread foreign transactions are disabled + HINT: Set max_prepared_foreign_transactions to a nonzero value. INSERT INTO ft9_not_twophase VALUES(7); + ERROR: current transaction is aborted, commands ignored until end of transaction block ROLLBACK; SELECT * FROM ft8_twophase; ! ERROR: prepread foreign transactions are disabled ! HINT: Set max_prepared_foreign_transactions to a nonzero value. make installcheck in the contrib directory shows the same, so that's the easiest way of reproducing, at least on a new installation. I think the test cases will have to handle that sort of setup. make check in the contrib directory passes. For reasons of test failures, I am setting this back to waiting on author. 
-- I had a few other thoughts that I figure are worth sharing with the community on this patch with the idea that once it is in place, this may open up more options for collaboration in the area of federated and distributed storage generally. I could imagine other foreign data wrappers using this API, and folks might want to refactor out the atomic handling part so that extensions that do not use the foreign data wrapper structure could use it as well (while this looks like a classic SQL/MED issue, I am not sure that only foreign data wrappers would be interested in the API. The new status of this patch is: Waiting on Author
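The resolver behavior discussed above — whether a single stuck global transaction blocks everything queued behind it — can be sketched in a few lines. This is a toy model only, not the patch's code (the actual implementation uses background workers issuing COMMIT/ROLLBACK PREPARED against the foreign servers); it just shows why skipping currently-unresolvable entries, or running multiple resolvers over a shared queue, keeps independent transactions moving.

```python
def resolve_pass(pending, reachable):
    """One resolver pass over pending (xid, server) entries.  Entries
    on unreachable servers are kept for a later retry instead of
    blocking every entry queued behind them."""
    resolved, remaining = [], []
    for xid, server in pending:
        if server in reachable:
            resolved.append((xid, server))   # COMMIT/ROLLBACK PREPARED ok
        else:
            remaining.append((xid, server))  # retry on a later pass
    return resolved, remaining

# xid 1 prepared on servers B and C; xid 2 prepared on C and D; B is down.
pending = [(1, "B"), (1, "C"), (2, "C"), (2, "D")]
resolved, remaining = resolve_pass(pending, reachable={"C", "D"})
# xid 2 resolves completely, so D's xmin horizon can advance even
# though xid 1's entry on B must wait for B to come back.
```

A strictly sequential resolver would instead stop at (1, "B") and never reach xid 2's entries, which is exactly the vacuum-horizon hazard described in the review.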
Re: pgsql: Improve autovacuum logging for aggressive and anti-wraparound ru
On Tue, Oct 2, 2018 at 11:15 PM Tom Lane wrote: > > Michael Paquier writes: > > My brain is rather fried for the rest of the day... But we could just > > be looking at using USE_ASSERT_CHECKING. Thoughts from other are > > welcome. > > I'd go with folding the condition into a plain Assert. Then it's > obvious that no code is added in a non-assert build. I can see that > some cases might be so complicated that that isn't nice, but this > case doesn't seem to qualify. > Thank you for the comment. Attached the updated patch. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center fix_autovacuum_log_v2.patch Description: Binary data
Re: file cloning in pg_upgrade and CREATE DATABASE
On Tue, Oct 02, 2018 at 11:07:06PM -0300, Alvaro Herrera wrote: > I see. Peter is proposing to have a fourth mode, essentially > --transfer-mode=clone-or-copy. Yes, a mode which depends on what the file system supports. Perhaps "safe" or "fast" could be another name, meaning the fastest available method that does not destroy the existing cluster's data. > Thinking about a generic tool that wraps pg_upgrade (say, Debian's > wrapper for it) this makes sense: just use the fastest non-destructive > mode available. Which ones are available depends on what the underlying > filesystem is, so it's not up to the tool's writer to decide which to > use ahead of time. This could have merit. Now, it seems to me that we have two separate concepts here, which should be addressed separately: 1) a file-cloning mode, which is the actual new feature. 2) an automatic mode, which selects between the copy mode and the new clone mode. What I am not sure about is whether we should consider the link mode as part of this automatic-selection concept. -- Michael signature.asc Description: PGP signature
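The "fastest non-destructive mode available" idea boils down to a runtime fallback chain: attempt a filesystem-level clone, and fall back to an ordinary copy if the filesystem cannot do it. A minimal Python sketch of that chain (assumptions: Linux, where FICLONE from <linux/fs.h> is the reflink ioctl a clone mode would rely on; on filesystems without reflink support the ioctl fails and we fall back to a plain copy):

```python
import shutil

FICLONE = 0x40049409  # Linux ioctl number for a reflink clone
                      # (assumption: matches <linux/fs.h>)

def transfer_file(src, dst):
    """Copy src to dst using the fastest non-destructive method
    available: try a clone first, fall back to an ordinary copy."""
    with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
        try:
            import fcntl
            fcntl.ioctl(fdst.fileno(), FICLONE, fsrc.fileno())
            return "clone"
        except (OSError, ImportError, AttributeError):
            pass  # filesystem (or platform) cannot clone
        fsrc.seek(0)
        shutil.copyfileobj(fsrc, fdst)
        return "copy"
```

Link mode is deliberately absent from this chain, matching the open question above: unlike clone and copy, hard-linking leaves the old cluster unusable once the new one starts, so it is not "non-destructive" in the same sense.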