Re: INSERT INTO SELECT, Why Parallelism is not selected?
On Fri, Jul 24, 2020 at 7:36 PM Tom Lane wrote:
> > Robert Haas writes:
> > Well, I think the comments could be more clear - for the insert case
> > specifically - about which cases you think are and are not safe.

Okay, I'll update the patch accordingly.

> Yeah, the proposed comment changes don't actually add much.  Also
> please try to avoid inserting non-ASCII into the source code;
> at least in my mail reader, that attachment has some.

I don't see any non-ASCII characters in the patch. I have applied the patch and checked it (via the vi editor) as well, but I don't see any non-ASCII characters. How can I verify that?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: Row estimates for empty tables
On Sat, Jul 25, 2020 at 0:34, Tom Lane wrote:
> [ redirecting to -hackers ]
>
> I wrote:
> > The core issue here is "how do we know whether the table is likely to stay
> > empty?".  I can think of a couple of more or less klugy solutions:

For these special cases it is probably possible to ensure an ANALYZE before any SELECT. The table is created, then it is analyzed, and after that it is published and used for SELECT. Usually such a table is never modified again. Because it is a special case, the solution does not need to be too sophisticated. But a built-in solution could be designed more generally.

> > 1. Arrange to send out a relcache inval when adding the first page to
> > a table, and then remove the planner hack for disbelieving relpages = 0.
> > I fear this'd be a mess from a system structural standpoint, but it might
> > work fairly transparently.
>
> I experimented with doing this.  It's not hard to code, if you don't mind
> having RelationGetBufferForTuple calling CacheInvalidateRelcache.  I'm not
> sure whether that code path might cause any long-term problems, but it
> seems to work OK right now.  However, this solution causes massive
> "failures" in the regression tests as a result of plans changing.  I'm
> sure that's partly because we use so many small tables in the tests.
> Nonetheless, it's not promising from the standpoint of not causing
> unexpected problems in the real world.
>
> > 2. Establish the convention that vacuuming or analyzing an empty table
> > is what you do to tell the system that this state is going to persist.
> > That's more or less what the existing comments in plancat.c envision,
> > but we never made a definition for how the occurrence of that event
> > would be recorded in the catalogs, other than setting relpages > 0.
> > Rather than adding another pg_class column, I'm tempted to say that
> > vacuum/analyze should set relpages to a minimum of 1, even if the
> > relation has zero pages.
> > I also tried this, and it seems a lot more promising: no existing
> > regression test cases change.  So perhaps we should do the attached
> > or something like it.

I am sending a patch that has been used for years in GoodData. I am not sure if the company uses 0 or 1, but I can ask.

Regards

Pavel

> regards, tom lane

diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index c814733b22..d9ca9904e1 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -33,6 +33,7 @@
 /* GUC variables */
 char	   *default_table_access_method = DEFAULT_TABLE_ACCESS_METHOD;
 bool		synchronize_seqscans = true;
+int			fake_pages = 10;
 
 /*
@@ -591,10 +592,10 @@ table_block_relation_estimate_size(Relation rel, int32 *attr_widths,
 	 * Totally empty parent tables are quite common, so we should be willing
 	 * to believe that they are empty.
 	 */
-	if (curpages < 10 &&
+	if (curpages < fake_pages &&
 		relpages == 0 &&
 		!rel->rd_rel->relhassubclass)
-		curpages = 10;
+		curpages = fake_pages;
 
 	/* report estimated # pages */
 	*pages = curpages;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 11b3f050bc..6954274c28 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2195,6 +2195,17 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"fake_pages", PGC_USERSET, QUERY_TUNING_OTHER,
+			gettext_noop("Sets a number of pages for planner when relation has no pages."),
+			NULL,
+			GUC_UNIT_BLOCKS
+		},
+		&fake_pages,
+		10, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"max_standby_archive_delay", PGC_SIGHUP, REPLICATION_STANDBY,
 			gettext_noop("Sets the maximum delay before canceling queries when a hot standby server is processing archived WAL data."),
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index eb18739c36..f997f9dbc4 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -30,6 +30,8 @@
 extern char *default_table_access_method;
 extern bool synchronize_seqscans;
+extern int	fake_pages;
+
 struct BulkInsertStateData;
 struct IndexInfo;
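The planner heuristic this patch makes tunable can be sketched in isolation. This is a simplified model of what table_block_relation_estimate_size does, not the actual PostgreSQL code; `estimate_pages` and the standalone `fake_pages` variable are illustrative names:

```c
#include <stdbool.h>

/* Stand-in for the proposed fake_pages GUC (default 10, matching the
 * currently hard-coded constant). */
static int fake_pages = 10;

/*
 * Simplified model of the planner's page estimate for a heap:
 * if the relation currently looks tiny (curpages < fake_pages),
 * has never been vacuumed/analyzed (relpages == 0), and has no
 * inheritance children, assume fake_pages pages rather than
 * trusting the possibly-transient small size.
 */
static int
estimate_pages(int curpages, int relpages, bool has_subclass)
{
    if (curpages < fake_pages && relpages == 0 && !has_subclass)
        return fake_pages;
    return curpages;
}
```

With Tom Lane's variant downthread, VACUUM/ANALYZE record relpages = 1 even for a genuinely empty table, so the `relpages == 0` test stops firing and `estimate_pages(0, 1, false)` returns the measured 0 instead of the inflated default.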
Re: handle a ECPG_bytea typo
On Fri, Jul 24, 2020 at 1:35 PM Wang, Shenhao wrote:
>
> Hi, hackers
>
> The source looks like:
>
> case ECPGt_bytea:
> {
>     struct ECPGgeneric_varchar *variable =
>         (struct ECPGgeneric_varchar *) (var->value);
>     ...
> }
>
> I think the developer intended to use struct ECPGgeneric_bytea instead of
> struct ECPGgeneric_varchar.
>
> Is this thought right?
>
> I have written a patch to fix this typo.

The changes look correct to me. The reason it worked before is that the structure members are the same for both data structures, ECPGgeneric_bytea and ECPGgeneric_varchar.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
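Why the wrong cast was harmless at runtime can be illustrated with two layout-identical structs. The member layout below (a length followed by a data array) mirrors what the review describes for the ECPG generic structs, but it is an assumption for illustration, not the exact ecpglib definitions:

```c
#include <assert.h>
#include <stddef.h>

/* Two structs with identical member lists, as ECPGgeneric_varchar and
 * ECPGgeneric_bytea reportedly have (assumed layout for illustration). */
struct generic_varchar
{
    int  len;
    char arr[1];
};

struct generic_bytea
{
    int  len;
    char arr[1];
};

/* Identical layout means a pointer cast between the two types reads the
 * same bytes, so the typo was invisible at runtime -- but fixing it keeps
 * the code honest and robust against the structs ever diverging. */
static void
check_layout(void)
{
    assert(sizeof(struct generic_varchar) == sizeof(struct generic_bytea));
    assert(offsetof(struct generic_varchar, len) ==
           offsetof(struct generic_bytea, len));
    assert(offsetof(struct generic_varchar, arr) ==
           offsetof(struct generic_bytea, arr));
}
```

This is exactly why such typos deserve fixing even when nothing misbehaves: the compiler cannot warn about a cast that happens to work today.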
Re: Parallel worker hangs while handling errors.
On Fri, Jul 24, 2020 at 12:41 PM Bharath Rupireddy wrote:
>
> On Thu, Jul 23, 2020 at 12:54 PM vignesh C wrote:
> >
> > Thanks for reviewing and adding your thoughts. My comments are inline.
> >
> > On Fri, Jul 17, 2020 at 1:21 PM Bharath Rupireddy wrote:
> > >
> > > The same hang issue can occur (though I'm not able to back it up with a
> > > use case) in the cases where EmitErrorReport() is called from an
> > > "if (sigsetjmp(local_sigjmp_buf, 1) != 0)" block, such as in
> > > autovacuum.c, bgwriter.c, bgworker.c, checkpointer.c, walwriter.c, and
> > > postgres.c.
> >
> > I'm not sure if this can occur in other cases.
>
> I checked further on this point: it can't occur in the other cases, as
> mq_putmessage() only gets used for parallel workers
> (ParallelWorkerMain() --> pq_redirect_to_shm_mq()).
>
> > > Note that, in all sigsetjmp blocks, we intentionally
> > > HOLD_INTERRUPTS() so as not to cause any issues while performing error
> > > handling. My concern here is that if we directly call
> > > BackgroundWorkerUnblockSignals(), which opens up all the signals, our
> > > main intention of holding interrupts or blocking signals may go away.
> > >
> > > Since the main problem for this hang issue is the blocking of SIGUSR1
> > > in sigsetjmp, can't we just unblock only SIGUSR1 instead of unblocking
> > > all signals? I tried this with the parallel copy hang, and the issue
> > > is resolved.
> >
> > On putting further thought into this, I feel just unblocking SIGUSR1
> > would be the right approach in this case. I'm attaching a new patch
> > which unblocks the SIGUSR1 signal. I have verified that the original
> > issue with the WIP parallel copy patch gets fixed. I have made changes
> > only in bgworker.c, as we require the parallel worker to receive this
> > signal and continue processing. I have not included the changes for
> > other processes as I'm not sure if this scenario is applicable to them.
> Since we are sure that this hang issue can occur only for parallel
> workers, the change in StartBackgroundWorker's sigsetjmp block should
> only be made for the parallel worker case. Also, there can be a lot of
> other callbacks executed and processing done in proc_exit() for which
> we might not need SIGUSR1 unblocked. So, let's undo the unblocking
> right after EmitErrorReport() so as not to cause any new issues.
>
> Attaching a modified v2 patch: it has the unblocking only for parallel
> workers, undoing it after EmitErrorReport(), and some adjustments in
> the comments.

I have made slight changes on top of the patch to remove duplicate code; attached is a v3 patch for the same. The parallel worker hang issue gets resolved, and make check & make check-world pass.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

From b48849d091302dfdac4fc004195b3c532fdb9dcb Mon Sep 17 00:00:00 2001
From: Vignesh C
Date: Sat, 25 Jul 2020 06:38:40 +0530
Subject: [PATCH v3] Fix for parallel worker hangs while handling errors.

Worker is not able to receive signals while processing the error flow.
The worker hangs in this case because when the worker is started the
signals are masked using sigprocmask. Unblocking of signals is done by
calling BackgroundWorkerUnblockSignals in ParallelWorkerMain. Now, due
to error handling, the worker has jumped to setjmp in the
StartBackgroundWorker function. Here the signals are in a blocked
state, hence the signal is not received by the worker process.

Authors: Vignesh C, Bharath Rupireddy
---
 src/backend/postmaster/bgworker.c | 35 +++
 1 file changed, 35 insertions(+)

diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index beb5e85..0b9f214 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -671,6 +671,23 @@ bgworker_sigusr1_handler(SIGNAL_ARGS)
 }
 
 /*
+ * update_parallel_worker_sigmask - add or remove a signal from sigmask.
+ */
+static void
+update_parallel_worker_sigmask(BackgroundWorker *worker, int signum,
+							   bool isadd)
+{
+	if ((worker->bgw_flags & BGWORKER_CLASS_PARALLEL) != 0)
+	{
+		if (isadd)
+			sigaddset(, signum);
+		else
+			sigdelset(, signum);
+		PG_SETMASK();
+	}
+}
+
+/*
  * Start a new background worker
  *
  * This is the main entry point for background worker, to be called from
@@ -747,6 +764,16 @@ StartBackgroundWorker(void)
 	 */
 	if (sigsetjmp(local_sigjmp_buf, 1) != 0)
 	{
+		/*
+		 * In the case of parallel workers, unblock the SIGUSR1 signal; it
+		 * was blocked when the postmaster forked us. The leader process will
+		 * send SIGUSR1 to the worker process (the worker process will be in
+		 * a waiting state when there is no space available) to indicate that
+		 * shared memory space has been freed up. Once the signal is
+		 * received, the worker process will continue populating the error
+		 * message.
+		 */
+		update_parallel_worker_sigmask(worker, SIGUSR1, false);
+
 		/* Since not using PG_TRY, must
Re: Include access method in listTables output
On Thu, Jul 16, 2020 at 7:35 PM Georgios wrote:
>
> ‐‐‐ Original Message ‐‐‐
> On Saturday, July 11, 2020 3:16 PM, vignesh C wrote:
>
> > On Mon, Jul 6, 2020 at 1:24 PM Georgios gkokola...@protonmail.com wrote:
> >
> > > ‐‐‐ Original Message ‐‐‐
> > > On Monday, July 6, 2020 3:12 AM, Michael Paquier mich...@paquier.xyz wrote:
> > >
> > > > On Sun, Jul 05, 2020 at 07:13:10AM +0530, vignesh C wrote:
> > > >
> > > > > I'm not sure if we should include showViews; I had seen that the
> > > > > access method was not getting selected for views.
> > > >
> > > > +1.  These have no physical storage, so you are looking here for
> > > > relkinds that satisfy RELKIND_HAS_STORAGE().
> > >
> > > Thank you for the review.
> > > Find attached v4 of the patch.
> >
> > Thanks for fixing the comments.
> > Patch applies cleanly, make check & make check-world passes.
> > I was not sure if any documentation change is required or not for this
> > in doc/src/sgml/ref/psql-ref.sgml.  Thoughts?
>
> Thank you for the comment. I added a line in the docs. Admittedly, it might
> need tweaking.
>
> Please find version 5 of the patch attached.

Most changes look fine; minor comments:

 \pset tuples_only false
 -- check conditional tableam display
--- Create a heap2 table am handler with heapam handler
+\pset expanded off
+CREATE SCHEMA conditional_tableam_display;
+CREATE ROLE conditional_tableam_display_role;
+ALTER SCHEMA conditional_tableam_display OWNER TO conditional_tableam_display_role;
+SET search_path TO conditional_tableam_display;
 CREATE ACCESS METHOD heap_psql TYPE TABLE HANDLER heap_tableam_handler;

This comment might have been removed unintentionally; do you want to add it back?

+-- access method column should not be displayed for sequences
+\ds+
+List of relations
+ Schema | Name | Type | Owner | Persistence | Size | Description
++--+--+---+-+--+-
+(0 rows)

We can include one test for views.
+	if (pset.sversion >= 12 && !pset.hide_tableam &&
+		(showTables || showMatViews || showIndexes))
+		appendPQExpBuffer(&buf,
+						  ",\n  am.amname as \"%s\"",
+						  gettext_noop("Access Method"));
+
 	/*
 	 * As of PostgreSQL 9.0, use pg_table_size() to show a more accurate
 	 * size of a table, including FSM, VM and TOAST tables.
@@ -3772,6 +3778,12 @@ listTables(const char *tabtypes, const char *pattern, bool verbose, bool showSystem)
 	appendPQExpBufferStr(&buf,
 						 "\nFROM pg_catalog.pg_class c"
 						 "\n     LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace");
+
+	if (pset.sversion >= 12 && !pset.hide_tableam &&
+		(showTables || showMatViews || showIndexes))

If the conditions in both places are the same, do you want to make it a macro?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Re: Row estimates for empty tables
On Sat, 25 Jul 2020 at 10:34, Tom Lane wrote:
> I wrote:
> > 1. Arrange to send out a relcache inval when adding the first page to
> > a table, and then remove the planner hack for disbelieving relpages = 0.
> > I fear this'd be a mess from a system structural standpoint, but it might
> > work fairly transparently.
>
> I experimented with doing this.  It's not hard to code, if you don't mind
> having RelationGetBufferForTuple calling CacheInvalidateRelcache.  I'm not
> sure whether that code path might cause any long-term problems, but it
> seems to work OK right now.  However, this solution causes massive
> "failures" in the regression tests as a result of plans changing.  I'm
> sure that's partly because we use so many small tables in the tests.
> Nonetheless, it's not promising from the standpoint of not causing
> unexpected problems in the real world.

I guess all these changes would be the planner moving towards plans that suit having fewer rows in the given table. If so, that does seem quite scary, as we already have enough problems from the planner choosing poor plans when it thinks there are fewer rows than there actually are. Don't we need to keep something like the 10-page estimate there so that safer plans are produced before auto-vacuum gets in and gathers some proper stats?

I think, if anything, we'd want to move in the direction of producing more cautious plans when the estimated number of rows is low. Perhaps especially so when the planner opts to perform a non-parameterized nested loop join because it thinks a RelOptInfo with, say, 3 correlated (unbeknown to the planner) base restriction quals will produce just 1 row, when it actually produces many more.

> > 2. Establish the convention that vacuuming or analyzing an empty table
> > is what you do to tell the system that this state is going to persist.
> > That's more or less what the existing comments in plancat.c envision,
> > but we never made a definition for how the occurrence of that event
> > would be recorded in the catalogs, other than setting relpages > 0.
> > Rather than adding another pg_class column, I'm tempted to say that
> > vacuum/analyze should set relpages to a minimum of 1, even if the
> > relation has zero pages.
>
> I also tried this, and it seems a lot more promising: no existing
> regression test cases change.  So perhaps we should do the attached
> or something like it.

This sounds like a more plausible solution. At least this way there's an escape hatch for people who suffer due to this.

David
Re: Improving connection scalability: GetSnapshotData()
On 2020-07-24 18:15:15 -0300, Ranier Vilela wrote:
> On Fri, Jul 24, 2020 at 2:16 PM, Andres Freund wrote:
> > On 2020-07-24 14:05:04 -0300, Ranier Vilela wrote:
> > > Latest Postgres
> > > Windows 64 bits
> > > msvc 2019 64 bits
> > >
> > > Patches applied v12-0001 to v12-0007:
> > >
> > > C:\dll\postgres\contrib\pgstattuple\pgstatapprox.c(74,28): warning C4013:
> > > 'GetOldestXmin' indefinido; assumindo extern retornando int
> > > [C:\dll\postgres
> > > C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(569,29): warning
> > > C4013: 'GetOldestXmin' indefinido; assumindo extern retornando int
> > > [C:\dll\postgres\pg_visibility.vcxproj]
> > > C:\dll\postgres\contrib\pgstattuple\pgstatapprox.c(74,56): error C2065:
> > > 'PROCARRAY_FLAGS_VACUUM': identificador nao declarado
> > > [C:\dll\postgres\pgstattuple.vcxproj]
> > > C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(569,58): error
> > > C2065: 'PROCARRAY_FLAGS_VACUUM': identificador nao declarado
> > > [C:\dll\postgres\pg_visibility.vcxproj]
> > > C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(686,70): error
> > > C2065: 'PROCARRAY_FLAGS_VACUUM': identificador nao declarado
> > > [C:\dll\postgres\pg_visibility.vcxproj]
> >
> > I don't know what that's about - there's no call to GetOldestXmin() in
> > pgstatapprox and pg_visibility after patch 0002? And similarly, the
> > PROCARRAY_* references are also removed in the same patch?
>
> Maybe they need to be removed from these places, no?
>
> C:\dll\postgres\contrib>grep -d GetOldestXmin *.c
> File pgstattuple\pgstatapprox.c:
> OldestXmin = GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM);
> File pg_visibility\pg_visibility.c:
> OldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);
> * deadlocks, because surely GetOldestXmin() should never take
> RecomputedOldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);

The 0002 patch changed those files:

diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 68d580ed1e0..37206c50a21 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -563,17 +563,14 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
 	TransactionId OldestXmin = InvalidTransactionId;
 
-	if (all_visible)
-	{
-		/* Don't pass rel; that will fail in recovery. */
-		OldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);
-	}
-
 	rel = relation_open(relid, AccessShareLock);
 
 	/* Only some relkinds have a visibility map */
 	check_relation_relkind(rel);
 
+	if (all_visible)
+		OldestXmin = GetOldestNonRemovableTransactionId(rel);
+
 	nblocks = RelationGetNumberOfBlocks(rel);
 
 	/*
@@ -679,11 +676,12 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 				 * From a concurrency point of view, it sort of sucks to
 				 * retake ProcArrayLock here while we're holding the buffer
 				 * exclusively locked, but it should be safe against
-				 * deadlocks, because surely GetOldestXmin() should never take
-				 * a buffer lock. And this shouldn't happen often, so it's
-				 * worth being careful so as to avoid false positives.
+				 * deadlocks, because surely GetOldestNonRemovableTransactionId()
+				 * should never take a buffer lock. And this shouldn't happen
+				 * often, so it's worth being careful so as to avoid false
+				 * positives.
 				 */
-				RecomputedOldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);
+				RecomputedOldestXmin = GetOldestNonRemovableTransactionId(rel);
 
 				if (!TransactionIdPrecedes(OldestXmin, RecomputedOldestXmin))
 					record_corrupt_item(items, _self);
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index dbc0fa11f61..3a99333d443 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -71,7 +71,7 @@ statapprox_heap(Relation rel, output_type *stat)
 	BufferAccessStrategy bstrategy;
 	TransactionId OldestXmin;
 
-	OldestXmin = GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM);
+	OldestXmin = GetOldestNonRemovableTransactionId(rel);
 
 	bstrategy = GetAccessStrategy(BAS_BULKREAD);
 
 	nblocks = RelationGetNumberOfBlocks(rel);

Greetings,

Andres Freund
Re: hashagg slowdown due to spill changes
Hi,

On 2020-06-12 14:37:15 -0700, Andres Freund wrote:
> On 2020-06-11 11:14:02 -0700, Jeff Davis wrote:
> > On Thu, 2020-06-11 at 10:45 -0700, Andres Freund wrote:
> > > Did you run any performance tests?
> >
> > Yes, I reproduced your ~12% regression from V12, and this patch nearly
> > eliminated it for me.
>
> I spent a fair bit of time looking at the difference.  Jeff had let me
> know on chat that he was still seeing some difference, but couldn't
> quite figure out where that was.
>
> Trying it out myself, I observed that the patch helped, but not that
> much.  After a bit I found one major reason why:
> LookupTupleHashEntryHash() assigned the hash to the pointer provided by
> the caller before doing the insertion.  That ended up causing a pipeline
> stall (I assume it's store forwarding, but I'm not sure).  Moving the
> assignment to the caller's variable to after the insertion got rid of
> that.

This is still not resolved. We're right now slower than 12. It's effectively not possible to do performance comparisons right now. This was nearly two months ago.

Greetings,

Andres Freund
Re: Row estimates for empty tables
[ redirecting to -hackers ]

I wrote:
> The core issue here is "how do we know whether the table is likely to stay
> empty?".  I can think of a couple of more or less klugy solutions:

> 1. Arrange to send out a relcache inval when adding the first page to
> a table, and then remove the planner hack for disbelieving relpages = 0.
> I fear this'd be a mess from a system structural standpoint, but it might
> work fairly transparently.

I experimented with doing this.  It's not hard to code, if you don't mind
having RelationGetBufferForTuple calling CacheInvalidateRelcache.  I'm not
sure whether that code path might cause any long-term problems, but it
seems to work OK right now.  However, this solution causes massive
"failures" in the regression tests as a result of plans changing.  I'm
sure that's partly because we use so many small tables in the tests.
Nonetheless, it's not promising from the standpoint of not causing
unexpected problems in the real world.

> 2. Establish the convention that vacuuming or analyzing an empty table
> is what you do to tell the system that this state is going to persist.
> That's more or less what the existing comments in plancat.c envision,
> but we never made a definition for how the occurrence of that event
> would be recorded in the catalogs, other than setting relpages > 0.
> Rather than adding another pg_class column, I'm tempted to say that
> vacuum/analyze should set relpages to a minimum of 1, even if the
> relation has zero pages.

I also tried this, and it seems a lot more promising: no existing
regression test cases change.  So perhaps we should do the attached
or something like it.

			regards, tom lane

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 1bbc4598f7..243cdaa409 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -594,6 +594,19 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	new_frozen_xid = scanned_all_unfrozen ? FreezeLimit : InvalidTransactionId;
 	new_min_multi = scanned_all_unfrozen ? MultiXactCutoff : InvalidMultiXactId;
 
+	/*
+	 * Another corner case is that if the relation is physically zero-length,
+	 * we don't want to record relpages=reltuples=0 in pg_class; what we want
+	 * to record is relpages=1, reltuples=0.  This tells the planner that the
+	 * relation has been seen to be empty by VACUUM or ANALYZE, so it should
+	 * not override a small measured table size.
+	 */
+	if (new_rel_pages == 0)
+	{
+		Assert(new_live_tuples == 0);
+		new_rel_pages = 1;
+	}
+
 	vac_update_relstats(onerel,
 						new_rel_pages,
 						new_live_tuples,
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 4b2bb29559..0a92450e8a 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -585,11 +585,9 @@ table_block_relation_estimate_size(Relation rel, int32 *attr_widths,
 	 * doesn't happen instantaneously, and it won't happen at all for cases
 	 * such as temporary tables.)
 	 *
-	 * We approximate "never vacuumed" by "has relpages = 0", which means this
-	 * will also fire on genuinely empty relations.  Not great, but
-	 * fortunately that's a seldom-seen case in the real world, and it
-	 * shouldn't degrade the quality of the plan too much anyway to err in
-	 * this direction.
+	 * We test "never vacuumed" by seeing whether relpages = 0.  VACUUM and
+	 * ANALYZE will never set relpages to less than 1, even if the relation is
+	 * in fact zero length.
 	 *
 	 * If the table has inheritance children, we don't apply this heuristic.
 	 * Totally empty parent tables are quite common, so we should be willing
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 924ef37c81..f0247e350f 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -620,6 +620,16 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 		visibilitymap_count(onerel, , NULL);
 
+		/*
+		 * As in VACUUM, replace relpages=reltuples=0 with relpages=1,
+		 * reltuples=0.
+		 */
+		if (relpages == 0)
+		{
+			Assert(totalrows == 0);
+			relpages = 1;
+		}
+
 		vac_update_relstats(onerel,
 							relpages,
 							totalrows,
Re: Improving connection scalability: GetSnapshotData()
On Fri, Jul 24, 2020 at 2:16 PM, Andres Freund wrote:
> On 2020-07-24 14:05:04 -0300, Ranier Vilela wrote:
> > Latest Postgres
> > Windows 64 bits
> > msvc 2019 64 bits
> >
> > Patches applied v12-0001 to v12-0007:
> >
> > C:\dll\postgres\contrib\pgstattuple\pgstatapprox.c(74,28): warning C4013:
> > 'GetOldestXmin' indefinido; assumindo extern retornando int
> > [C:\dll\postgres
> > C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(569,29): warning
> > C4013: 'GetOldestXmin' indefinido; assumindo extern retornando int
> > [C:\dll\postgres\pg_visibility.vcxproj]
> > C:\dll\postgres\contrib\pgstattuple\pgstatapprox.c(74,56): error C2065:
> > 'PROCARRAY_FLAGS_VACUUM': identificador nao declarado
> > [C:\dll\postgres\pgstattuple.vcxproj]
> > C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(569,58): error
> > C2065: 'PROCARRAY_FLAGS_VACUUM': identificador nao declarado
> > [C:\dll\postgres\pg_visibility.vcxproj]
> > C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(686,70): error
> > C2065: 'PROCARRAY_FLAGS_VACUUM': identificador nao declarado
> > [C:\dll\postgres\pg_visibility.vcxproj]
>
> I don't know what that's about - there's no call to GetOldestXmin() in
> pgstatapprox and pg_visibility after patch 0002? And similarly, the
> PROCARRAY_* references are also removed in the same patch?

Maybe they need to be removed from these places, no?

C:\dll\postgres\contrib>grep -d GetOldestXmin *.c
File pgstattuple\pgstatapprox.c:
OldestXmin = GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM);
File pg_visibility\pg_visibility.c:
OldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);
* deadlocks, because surely GetOldestXmin() should never take
RecomputedOldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);

regards,
Ranier Vilela
Re: Default setting for enable_hashagg_disk
On Fri, Jul 24, 2020 at 12:16 PM Tomas Vondra wrote:
> Maybe, but we're nowhere close to these limits. See this table which I
> posted earlier:
>
>     2MB   Planned Partitions: 64    HashAgg Batches: 4160
>     4MB   Planned Partitions: 128   HashAgg Batches: 16512
>     8MB   Planned Partitions: 256   HashAgg Batches: 21488
>    64MB   Planned Partitions: 32    HashAgg Batches: 2720
>   256MB   Planned Partitions: 8     HashAgg Batches: 8
>
> This is from the non-parallel runs on the i5 machine with the 32GB data
> set; the first column is work_mem. We're nowhere near the 1024 limit,
> and the cardinality estimates are pretty good.
>
> OTOH the number of batches is much higher, so clearly there was some
> recursive spilling happening. What I find strange is that this grows
> with work_mem and only starts dropping after 64MB.

Could that be caused by clustering in the data?

If the input data is in totally random order, then we have a good chance of never having to spill skewed "common" values. That is, we're bound to encounter the common values before entering spill mode, and so those common values will continue to be usefully aggregated until we're done with the initial groups (i.e. until the in-memory hash table is cleared in order to process spilled input tuples). This is great because the common values get aggregated without ever spilling, and most of the work is done before we even begin with the spilled tuples. If, on the other hand, the common values are concentrated together in the input...

Assuming that I have this right, I would also expect simply having more memory to ameliorate the problem. If you only have/need 4 or 8 partitions, then you can fit a higher proportion of the total number of groups for the whole dataset in the hash table (at the point when you first enter spill mode). I think it follows that the "nailed" hash table entries/groupings will better characterize the dataset as a whole.

> Also, how could the amount of I/O be almost constant in all these cases?
> Surely more recursive spilling should do more I/O, but the Disk Usage
> reported by explain analyze does not show anything like that ...

Not sure, but might that just be because logtape.c can recycle disk space?

As I said in my last e-mail, it's pretty reasonable to assume that the vast majority of external sorts are one-pass. It follows that disk usage can be thought of as almost the same thing as total I/O for tuplesort. But the same heuristic isn't reasonable when thinking about hash agg. Hash agg might write out much less data than the total memory used for the equivalent "peak optimal nospill" hash agg case -- or much more. (Again, reiterating what I said in my last e-mail.)

-- 
Peter Geoghegan
Re: [PATCH] fix GIN index search sometimes losing results
Pavel Borisov writes:
> On Wed, Jul 22, 2020 at 19:10, Tom Lane wrote:
> > The other issue we have to agree on is whether we want to sneak this
> > fix into v13, or wait another year for it.  I feel like it's pretty
> > late to be making potentially API-breaking changes, but on the other
> > hand this is undoubtedly a bug fix.
>
> I am convinced patch 0001 is necessary and sufficient to fix the bug, so
> I think it's very much worth adding it to v13.

Agreed, and done.

> As for 0002, I see the beauty of this change, but I also see the value of
> leaving the defaults as they were before.
> The change of CALC_NOT behavior doesn't seem to be a source of big
> changes, though.  I'm just not convinced it is very much needed.
> The best way, I think, is to leave 0002 until the next version and add
> commentary in the code that this default behavior of NOT doing
> TS_EXEC_CALC_NOT is inherited from the past, so basically any caller
> should set this flag (see patch 0003-add-comments-on-calc-not).

I don't think it's a great plan to make these two changes in two successive
versions.  They're going to affect basically the same set of outside
callers, at least if you assume that every TS_execute caller will be
supplying its own callback function.  So we might as well force people to
make both updates at once.  Also, if there is anyone who thinks they need
"skip NOT" behavior, this'd be a great time to reconsider.

I revised 0002 to still define a flag for skipping NOTs, but it's not the
default and is indeed unused in the core code now.

			regards, tom lane
Re: Default setting for enable_hashagg_disk
On Fri, Jul 24, 2020 at 11:03:54AM -0700, Peter Geoghegan wrote:
> On Fri, Jul 24, 2020 at 8:19 AM Robert Haas wrote:
> > This is all really good analysis, I think, but this seems like the key
> > finding.  It seems like we don't really understand what's actually
> > getting written.  Whether we use hash or sort doesn't seem like it
> > should have this kind of impact on how much data gets written, and
> > whether we use CP_SMALL_TLIST or project when needed doesn't seem like
> > it should matter like this either.
>
> Isn't this more or less the expected behavior in the event of partitions
> that are spilled recursively?  The cases that Tomas tested were mostly
> cases where work_mem was tiny relative to the data being aggregated.
>
> The following is an extract from commit 1f39bce0215 showing some stuff
> added to the beginning of nodeAgg.c:
>
> + * We also specify a min and max number of partitions per spill. Too few might
> + * mean a lot of wasted I/O from repeated spilling of the same tuples. Too
> + * many will result in lots of memory wasted buffering the spill files (which
> + * could instead be spent on a larger hash table).
> + */
> +#define HASHAGG_PARTITION_FACTOR 1.50
> +#define HASHAGG_MIN_PARTITIONS 4
> +#define HASHAGG_MAX_PARTITIONS 1024

Maybe, but we're nowhere close to these limits. See this table which I
posted earlier:

    2MB   Planned Partitions: 64    HashAgg Batches: 4160
    4MB   Planned Partitions: 128   HashAgg Batches: 16512
    8MB   Planned Partitions: 256   HashAgg Batches: 21488
   64MB   Planned Partitions: 32    HashAgg Batches: 2720
  256MB   Planned Partitions: 8     HashAgg Batches: 8

This is from the non-parallel runs on the i5 machine with the 32GB data
set; the first column is work_mem. We're nowhere near the 1024 limit, and
the cardinality estimates are pretty good.

OTOH the number of batches is much higher, so clearly there was some
recursive spilling happening. What I find strange is that this grows with
work_mem and only starts dropping after 64MB.

Also, how could the amount of I/O be almost constant in all these cases?
Surely more recursive spilling should do more I/O, but the Disk Usage
reported by explain analyze does not show anything like that ...

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [Patch] ALTER SYSTEM READ ONLY
On Fri, Jul 24, 2020 at 7:34 AM Robert Haas wrote: > > On Thu, Jul 23, 2020 at 12:11 PM Soumyadeep Chakraborty > wrote: > > In the read-only level I was suggesting, I wasn't suggesting that we > > stop WAL flushes, in fact we should flush the WAL before we mark the > > system as read-only. Once the system declares itself as read-only, it > > will not perform any more on-disk changes; It may perform all the > > flushes it needs as a part of the read-only request handling. > > I think that's already how the patch works, or at least how it should > work. You stop new writes, flush any existing WAL, and then declare > the system read-only. That can all be done quickly. > True, except for the fact that it allows dirty buffers to be flushed after the ALTER command returns. > > What I am saying is it doesn't have to be just the queries. I think we > > can cater to all the other use cases simply by forcing a checkpoint > > before marking the system as read-only. > > But that part can't, which means that if we did that, it would break > the feature for the originally intended use case. I'm not on board > with that. > Referring to the options you presented in [1]: I am saying that we should allow for both: with a checkpoint (#2) (can also be #3) and without a checkpoint (#1) before having the ALTER command return, by having different levels of read-onlyness. We should have syntax variants for these. The syntax should not be an ALTER SYSTEM SET as you have pointed out before. Perhaps: ALTER SYSTEM READ ONLY; -- #2 or #3 ALTER SYSTEM READ ONLY WAL; -- #1 ALTER SYSTEM READ WRITE; or even: ALTER SYSTEM FREEZE; -- #2 or #3 ALTER SYSTEM FREEZE WAL; -- #1 ALTER SYSTEM UNFREEZE; Regards, Soumyadeep (VMware) [1] http://postgr.es/m/ca+tgmoz-c3dz9qwhwmm4bc36n4u0xz2oyenewmf+bwokbyd...@mail.gmail.com
Re: [Patch] ALTER SYSTEM READ ONLY
On Fri, Jul 24, 2020 at 7:32 AM Robert Haas wrote: > > On Wed, Jul 22, 2020 at 6:03 PM Soumyadeep Chakraborty > wrote: > > So if we are not going to address those cases, we should change the > > syntax and remove the notion of read-only. It could be: > > > > ALTER SYSTEM SET wal_writes TO off|on; > > or > > ALTER SYSTEM SET prohibit_wal TO off|on; > > This doesn't really work because of the considerations mentioned in > http://postgr.es/m/ca+tgmoakctzozr0xeqalfimbcje2rgcbazf4eybpxjtnetp...@mail.gmail.com Ah yes. We should then have ALTER SYSTEM WAL {PERMIT|PROHIBIT}. I don't think we should say "READ ONLY" if we still allow on-disk file changes after the ALTER SYSTEM command returns (courtesy dirty buffer flushes) because it does introduce confusion, especially to an audience not privy to this thread. When people hear "read-only" they may think of static on-disk files immediately. > Contrary to what you write, I don't think either #2 or #3 is > sufficient to enable checksums, at least not without some more > engineering, because the server would cache the state from the control > file, and a bunch of blocks from the database. I guess it would work > if you did a server restart afterward, but I think there are better > ways of supporting online checksum enabling that don't require > shutting down the server, or even making it read-only; and there's > been significant work done on those already. Agreed. As you mentioned, if we did do #2 or #3, we would be able to do pg_checksums on a server that was shut down or that had crashed while it was in a read-only state, which is what Michael was asking for in [1]. I think it's just cleaner if we allow for this. I don't have enough context to enumerate use cases for the advantages or opportunities that would come with an assurance that the cluster's files are frozen (and not covered by any existing utilities), but surely there are some? 
Like the possibility of pg_upgrade on a running server while it can entertain read-only queries? Surely, that's a nice one! Of course, some or all of these utilities would need to be taught about read-only mode. Regards, Soumyadeep [1] http://postgr.es/m/20200626095921.gf1...@paquier.xyz
Re: Making CASE error handling less surprising
pá 24. 7. 2020 v 19:46 odesílatel Tom Lane napsal: > Andres Freund writes: > > Wouldn't the rule that I proposed earlier, namely that sub-expressions > > that involve only "proper" constants continue to get evaluated even > > within CASE, largely address that? > > The more I think about that the less I like it. It'd make the behavior > even harder to reason about than it is now, and it doesn't fix the issue > for subquery pullup cases. > > Basically this seems like a whole lot of thrashing to try to preserve > all the details of a behavior that is kind of accidental to begin with. > The argument that it's a performance issue seems hypothetical too, > rather than founded on any observed results. > > BTW, to the extent that there is a performance issue, we could perhaps > fix it if we resurrected the "cache stable subexpressions" patch that > was kicking around a year or two ago. That'd give us both > at-most-one-evaluation and no-evaluation-until-necessary behaviors, > if we made sure to apply it to stable CASE arms. > +1 regards Pavel > regards, tom lane > > >
Re: Default setting for enable_hashagg_disk
On Fri, Jul 24, 2020 at 1:40 AM Tomas Vondra wrote: > Maybe, not sure what exactly you think is pathological? The trouble is > hashagg has to spill input tuples but the memory used in no-spill case > represents aggregated groups, so I'm not sure how you could extrapolate > from that ... Yeah, but when hash agg enters spill mode it will continue to advance the transition states for groups already in the hash table, which could be quite a significant effect. The peak memory usage for an equivalent no-spill hash agg is therefore kind of related to the amount of I/O needed for spilling. I suppose that you mostly tested cases where memory was in very short supply, where that breaks down completely. Perhaps it wasn't helpful for me to bring that factor into this discussion -- it's not as if there is any doubt that hash agg is spilling a lot more here in any case. > Not sure, but I think we need to spill roughly as much as sort, so it > seems a bit strange that (a) we're spilling 2x as much data and yet the > cost is so much lower. ISTM that the amount of I/O that hash agg performs can vary *very* widely for the same data. This is mostly determined by work_mem, but there are second order effects. OTOH, the amount of I/O that a sort must do is practically fixed. You can quibble with that characterisation a bit because of multi-pass sorts, but not really -- multi-pass sorts are generally quite rare. I think that we need a more sophisticated cost model for this in cost_agg(). Maybe the "pages_written" calculation could be pessimized. However, it doesn't seem like this is precisely an issue with I/O costs. -- Peter Geoghegan
Re: Missing CFI in hlCover()?
I tried to duplicate a multiple-second ts_headline call here, and failed to, so there must be something I'm missing. Can you provide a concrete example? I'd like to do some analysis with perf. regards, tom lane
Re: Default setting for enable_hashagg_disk
On Fri, Jul 24, 2020 at 8:19 AM Robert Haas wrote: > This is all really good analysis, I think, but this seems like the key > finding. It seems like we don't really understand what's actually > getting written. Whether we use hash or sort doesn't seem like it > should have this kind of impact on how much data gets written, and > whether we use CP_SMALL_TLIST or project when needed doesn't seem like > it should matter like this either. Isn't this more or less the expected behavior in the event of partitions that are spilled recursively? The cases that Tomas tested were mostly cases where work_mem was tiny relative to the data being aggregated. The following is an extract from commit 1f39bce0215 showing some stuff added to the beginning of nodeAgg.c:

+ * We also specify a min and max number of partitions per spill. Too few might
+ * mean a lot of wasted I/O from repeated spilling of the same tuples. Too
+ * many will result in lots of memory wasted buffering the spill files (which
+ * could instead be spent on a larger hash table).
+ */
+#define HASHAGG_PARTITION_FACTOR 1.50
+#define HASHAGG_MIN_PARTITIONS 4
+#define HASHAGG_MAX_PARTITIONS 1024

-- Peter Geoghegan
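As a side note for readers, the way these constants interact can be sketched in isolation. The following is a simplified illustration, not the actual nodeAgg.c logic (which also sizes the partition count from memory limits); it only shows the pad-and-clamp behavior the quoted comment describes, and `choose_num_partitions` is a hypothetical name.

```c
#include <assert.h>

/* Constants quoted above from commit 1f39bce0215 (nodeAgg.c). */
#define HASHAGG_PARTITION_FACTOR 1.50
#define HASHAGG_MIN_PARTITIONS 4
#define HASHAGG_MAX_PARTITIONS 1024

/*
 * Simplified sketch: estimate how many spill partitions to open from the
 * ratio of input groups to groups that fit in memory, pad the estimate by
 * HASHAGG_PARTITION_FACTOR, and clamp it to the min/max bounds.
 */
static int
choose_num_partitions(double input_groups, double groups_in_memory)
{
	double		npartitions;

	if (groups_in_memory < 1)
		groups_in_memory = 1;

	npartitions = (input_groups / groups_in_memory) * HASHAGG_PARTITION_FACTOR;

	if (npartitions < HASHAGG_MIN_PARTITIONS)
		npartitions = HASHAGG_MIN_PARTITIONS;
	if (npartitions > HASHAGG_MAX_PARTITIONS)
		npartitions = HASHAGG_MAX_PARTITIONS;

	return (int) npartitions;
}
```

Note that in Tomas's table the planned counts stay well below the 1024 cap; the high batch counts come from recursive spilling, not from this clamp.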
Re: Missing CFI in hlCover()?
Stephen Frost writes: > * Tom Lane (t...@sss.pgh.pa.us) wrote: >> Hm. I'd vote for a CFI within the recursion in TS_execute(), if there's >> not one there yet. Maybe hlFirstIndex needs one too --- if there are >> a lot of words in the query, maybe that could be slow? Did you pin the >> blame down any more precisely than hlCover? > I've definitely seen hlFirstIndex take a few seconds to run (while > running this under gdb and stepping through), so that could be a good > choice to place one (perhaps even additionally to this...). I have to > admit to wondering if we shouldn't consider having one in > check_stack_depth() to try and reduce the risk of us forgetting to have > one in sensible places, though I've not really looked at all the callers > and that might not be reasonable in some cases (though I wonder if maybe > we consider having a 'default' version that has a CFI, and an alternate > that doesn't...). Adding it to check_stack_depth doesn't really seem like a reasonable proposal to me; aside from failing to separate concerns, running a long time is quite distinct from taking a lot of stack. The reason I'm eyeing TS_execute is that it involves callbacks to functions that might be pretty complex in themselves, eg during index scans. So that would guard a lot of territory besides hlCover. But hlFirstIndex could use a CFI too, if you've seen it take that long. (I wonder if we need to try to make it faster. I'd supposed that the loop was cheap enough to be a non-problem, but with large enough documents maybe not? It seems like converting to a hash table could be worthwhile for a large doc.) regards, tom lane
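Tom's hash-table idea for hlFirstIndex can be sketched as follows. This is purely illustrative, standalone code, not PostgreSQL's actual data structures: the names, the plain int word ids, and the fixed-size open-addressing table are all assumptions. The point is that a word-to-first-position map built once per document makes each subsequent first-occurrence lookup O(1) instead of a linear scan over the whole document.

```c
#include <assert.h>

/*
 * Illustrative sketch only: precompute, for each distinct word id in a
 * parsed document, the position of its first occurrence, using a small
 * open-addressing hash table.
 */
#define TABLE_SIZE 1024			/* power of two; assumed > distinct words */

typedef struct
{
	int			word;			/* -1 means empty slot */
	int			first_pos;
} FirstIndexEntry;

static void
build_first_index(const int *words, int nwords, FirstIndexEntry *table)
{
	for (int i = 0; i < TABLE_SIZE; i++)
		table[i].word = -1;

	/* walk backwards so the earliest occurrence of each word wins */
	for (int i = nwords - 1; i >= 0; i--)
	{
		unsigned	slot = (unsigned) words[i] & (TABLE_SIZE - 1);

		while (table[slot].word != -1 && table[slot].word != words[i])
			slot = (slot + 1) & (TABLE_SIZE - 1);
		table[slot].word = words[i];
		table[slot].first_pos = i;
	}
}

/* O(1) expected-time replacement for a linear first-occurrence scan */
static int
first_index(const FirstIndexEntry *table, int word)
{
	unsigned	slot = (unsigned) word & (TABLE_SIZE - 1);

	while (table[slot].word != -1)
	{
		if (table[slot].word == word)
			return table[slot].first_pos;
		slot = (slot + 1) & (TABLE_SIZE - 1);
	}
	return -1;					/* word not present */
}
```

The build cost is one pass over the document, so this only pays off when hlFirstIndex is called repeatedly for the same large document, which is exactly the headline-generation pattern being discussed.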
display offset along with block number in vacuum errors
Hi hackers, We discussed in another email thread[1] that it would be helpful to display the offset along with the block number in vacuum errors. Here I am proposing a patch to do that. In commit b61d161 (Introduce vacuum errcontext to display additional information), we added a vacuum errcontext to display additional information (the block number), so that in case of a vacuum error we can identify on which block we are getting the error. In addition to the block number, displaying the offset will be even more helpful for users. To display the offset, I am proposing two different methods (thanks to Robert for suggesting these 2 methods):

*Method 1:* We can report the TID as well as the block number in errcontext.

- errcontext("while scanning block %u of relation \"%s.%s\"",
-            errinfo->blkno, errinfo->relnamespace, errinfo->relname);
+ errcontext("while scanning block %u and offset %u of relation \"%s.%s\"",
+            errinfo->blkno, errinfo->offnum, errinfo->relnamespace, errinfo->relname);

This fix requires more calls to update_vacuum_error_info(). Attaching the v01_0001 patch for this method.

*Method 2:* We can improve the error messages by passing the relevant TID to heap_prepare_freeze_tuple and having it report the TID as part of the error message or in the error detail.

 ereport(ERROR,
         (errcode(ERRCODE_DATA_CORRUPTED),
-         errmsg_internal("found xmin %u from before relfrozenxid %u",
+         errmsg_internal("for block %u and offnum %u, found xmin %u from before relfrozenxid %u",
+                         ItemPointerGetBlockNumber(tid),
+                         ItemPointerGetOffsetNumber(tid),
                          xid, relfrozenxid)));

Attaching the v01_0002 patch for this method. Please let me know your thoughts. 
[1] : http://postgr.es/m/20200713223822.az6fo3m2x4t42...@alap3.anarazel.de -- Thanks and Regards Mahendra Singh Thalor EnterpriseDB: http://www.enterprisedb.com v01_0001-Added-offset-with-block-number-in-vacuum-errcontext.patch Description: Binary data v01_0002-Added-block-and-offset-to-errors-of-heap_prepare_fre.patch Description: Binary data
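To make Method 1 concrete, here is a standalone sketch (not the patch itself; `VacErrInfo` and the function name are illustrative stand-ins for whatever the patch actually uses) of how the extended errcontext message would be formatted, using the message text from the diff above:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Illustrative stand-in for the vacuum error-callback state; not the
 * patch's actual struct.
 */
typedef struct
{
	const char *relnamespace;
	const char *relname;
	unsigned	blkno;
	unsigned	offnum;
} VacErrInfo;

/*
 * Sketch of what the errcontext callback would produce with Method 1: the
 * block number plus the offset, identifying the exact TID being scanned.
 */
static void
vacuum_error_context(const VacErrInfo *errinfo, char *buf, size_t buflen)
{
	snprintf(buf, buflen,
			 "while scanning block %u and offset %u of relation \"%s.%s\"",
			 errinfo->blkno, errinfo->offnum,
			 errinfo->relnamespace, errinfo->relname);
}
```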
Re: Making CASE error handling less surprising
Andres Freund writes: > Wouldn't the rule that I proposed earlier, namely that sub-expressions > that involve only "proper" constants continue to get evaluated even > within CASE, largely address that? The more I think about that the less I like it. It'd make the behavior even harder to reason about than it is now, and it doesn't fix the issue for subquery pullup cases. Basically this seems like a whole lot of thrashing to try to preserve all the details of a behavior that is kind of accidental to begin with. The argument that it's a performance issue seems hypothetical too, rather than founded on any observed results. BTW, to the extent that there is a performance issue, we could perhaps fix it if we resurrected the "cache stable subexpressions" patch that was kicking around a year or two ago. That'd give us both at-most-one-evaluation and no-evaluation-until-necessary behaviors, if we made sure to apply it to stable CASE arms. regards, tom lane
Re: Making CASE error handling less surprising
On July 24, 2020 10:30:37 AM PDT, Pavel Stehule wrote: >pá 24. 7. 2020 v 19:13 odesílatel Andres Freund >napsal: >> Your earlier example of a WHEN ... THEN upper('constant') ... would >> still have the upper('constant') be evaluated, because it doesn't >> involve a parameter. And e.g. THEN upper('constant') * $1 would also >> still have the upper('constant') be evaluated, just the >multiplication >> with $1 wouldn't get evaluated. >> >> >> I'm not sure what you're concerned about with the one-shot bit? >> > >Now query parameters are evaluated like constant. How's that related to one-shot plans? >I can imagine WHERE clause like WHERE col = CASE $1 WHEN true THEN >upper($2) ELSE $2 END > >I remember applications that use these strange queries to support >parameterized behaviour - like case sensitive or case insensitive >searching. I don't buy this as a significant issue. This could much more sensibly be written as an OR. If the policy is that we cannot regress anything then there's no way we can improve at all. Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: Making CASE error handling less surprising
pá 24. 7. 2020 v 19:13 odesílatel Andres Freund napsal: > Hi, > > On 2020-07-24 19:03:30 +0200, Pavel Stehule wrote: > > pá 24. 7. 2020 v 18:49 odesílatel Andres Freund > napsal: > > > Wouldn't the rule that I proposed earlier, namely that sub-expressions > > > that involve only "proper" constants continue to get evaluated even > > > within CASE, largely address that? > > > > > > > It doesn't solve a possible performance problem with one shot (EXECUTE > stmt > > plpgsql) queries, or with parameterized queries > > What precisely are you thinking of here? Most expressions involving > parameters would still get constant evaluated - it'd just be inside CASE > etc that they wouldn't anymore? Do you think it's that common to have a > parameter reference inside an expression inside a CASE where it's > crucial that that parameter reference gets constant evaluated? I'd think > that's a bit of a stretch. > > Your earlier example of a WHEN ... THEN upper('constant') ... would > still have the upper('constant') be evaluated, because it doesn't > involve a parameter. And e.g. THEN upper('constant') * $1 would also > still have the upper('constant') be evaluated, just the multiplication > with $1 wouldn't get evaluated. > > > I'm not sure what you're concerned about with the one-shot bit? > Now query parameters are evaluated like constant. I can imagine WHERE clause like WHERE col = CASE $1 WHEN true THEN upper($2) ELSE $2 END I remember applications that use these strange queries to support parameterized behaviour - like case sensitive or case insensitive searching. > Greetings, > > Andres Freund >
Re: Making CASE error handling less surprising
Hi, On 2020-07-23 22:34:53 -0400, Tom Lane wrote: > Andres Freund writes: > > I'm a bit worried about a case like: > > > CREATE FUNCTION yell(int, int) > > RETURNS int > > IMMUTABLE > > LANGUAGE SQL AS $$ > >SELECT CASE WHEN $1 != 0 THEN 17 / $2 ELSE NULL END > > $$; > > > EXPLAIN SELECT yell(g.i, 0) FROM generate_series(1, 10) g(i); > > > I don't think the parameters here would have been handled before > > inlining, right? > > Ah, I see what you mean. Yeah, that throws an error today, and it > still would with the patch I was envisioning (attached), because > inlining does Param substitution in a different way. I'm not > sure that we could realistically fix the inlining case with this > sort of approach. Thinking about it a bit, it seems we could solve that fairly easily if we don't replace function parameters with actual Const nodes, but instead with a PseudoConst parameter. If we map that to the same expression evaluation step that should be fairly cheap to implement, basically adding a bunch of 'case PseudoConst:' to the Const cases. That way we could take the type of constness into account before constant folding. Alternatively we could add a new field to Const, indicating the 'source' or 'context' of the Const, which we then could take into account during constant evaluation. > I think this bears out the comment I made before that this approach > still leaves us with a very complicated behavior. Maybe we should > stick with the previous approach, possibly supplemented with a > leakproofness exception. ISTM that most of the complication has to be dealt with in either approach? Greetings, Andres Freund
Re: Making CASE error handling less surprising
Robert Haas writes: > Like Pavel, and I think implicitly Dagfinn and Andres, I'm not sure I > believe this. Pavel's example is a good one. The leakproof exception > helps, but it doesn't cover everything. Users I've encountered throw > things like date_trunc() and lpad() into SQL code and expect them to > behave (from a performance point of view) like constants, but they > also expect 1/0 not to get evaluated too early when e.g. CASE is used. > It's difficult to meet both sets of expectations at the same time and > we're probably never going to have a perfect solution, but I think > you're minimizing the concern too much here. I've quoted this point before, but ... we can make queries arbitrarily fast, if we don't have to give the right answer. I think we've seen enough complaints on this topic now to make it clear that what we're doing today with CASE is the wrong answer. The performance argument can be made to cut both ways, too. If somebody's got a very expensive function in a CASE arm that they don't expect to reach, having it be evaluated anyway because it's got constant inputs isn't going to make them happy. The real bottom line is: if you don't want to do this, how else do you want to fix the problem? I'm no longer willing to deny that there is a problem. > I don't think I believe this either. I don't think an average user is > going to expect to behave differently from (SELECT > ). Agreed, that's poorly (or not at all?) documented. But it's been true all along, and this patch isn't changing that behavior at all. I'm not sure if we should do anything more than improve the docs, but in any case it seems independent of the CASE issue. > The current behavior isn't great, but at least it handles these > cases consistently. Really? regards, tom lane
Re: Improving connection scalability: GetSnapshotData()
On 2020-07-24 14:05:04 -0300, Ranier Vilela wrote: > Latest Postgres > Windows 64 bits > msvc 2019 64 bits > > Patches applied v12-0001 to v12-0007: > > C:\dll\postgres\contrib\pgstattuple\pgstatapprox.c(74,28): warning C4013: > 'GetOldestXmin' indefinido; assumindo extern retornando int > [C:\dll\postgres > C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(569,29): warning > C4013: 'GetOldestXmin' indefinido; assumindo extern retornando int > [C:\dll\postgres\pg_visibility. > vcxproj] > C:\dll\postgres\contrib\pgstattuple\pgstatapprox.c(74,56): error C2065: > 'PROCARRAY_FLAGS_VACUUM': identificador nao declarado > [C:\dll\postgres\pgstattuple.vcxproj] > C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(569,58): error > C2065: 'PROCARRAY_FLAGS_VACUUM': identificador nao declarado > [C:\dll\postgres\pg_visibility.vcxproj] > C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(686,70): error > C2065: 'PROCARRAY_FLAGS_VACUUM': identificador nao declarado > [C:\dll\postgres\pg_visibility.vcxproj] I don't know that's about - there's no call to GetOldestXmin() in pgstatapprox and pg_visibility after patch 0002? And similarly, the PROCARRAY_* references are also removed in the same patch? Greetings, Andres Freund
Re: Making CASE error handling less surprising
Hi, On 2020-07-24 19:03:30 +0200, Pavel Stehule wrote: > pá 24. 7. 2020 v 18:49 odesílatel Andres Freund napsal: > > Wouldn't the rule that I proposed earlier, namely that sub-expressions > > that involve only "proper" constants continue to get evaluated even > > within CASE, largely address that? > > > > It doesn't solve a possible performance problem with one shot (EXECUTE stmt > plpgsql) queries, or with parameterized queries What precisely are you thinking of here? Most expressions involving parameters would still get constant evaluated - it'd just be inside CASE etc that they wouldn't anymore? Do you think it's that common to have a parameter reference inside an expression inside a CASE where it's crucial that that parameter reference gets constant evaluated? I'd think that's a bit of a stretch. Your earlier example of a WHEN ... THEN upper('constant') ... would still have the upper('constant') be evaluated, because it doesn't involve a parameter. And e.g. THEN upper('constant') * $1 would also still have the upper('constant') be evaluated, just the multiplication with $1 wouldn't get evaluated. I'm not sure what you're concerned about with the one-shot bit? Greetings, Andres Freund
Re: [Patch] ALTER SYSTEM READ ONLY
On Thu, Jul 23, 2020 at 10:14 PM Amul Sul wrote: > > On Fri, Jul 24, 2020 at 6:28 AM Soumyadeep Chakraborty > wrote: > > In case it is necessary, the patch set does not wait for the checkpoint to > > complete before marking the system as read-write. Refer: > > > > /* Set final state by clearing in-progress flag bit */ > > if (SetWALProhibitState(wal_state & > ~(WALPROHIBIT_TRANSITION_IN_PROGRESS))) > > { > > if ((wal_state & WALPROHIBIT_STATE_READ_ONLY) != 0) > > ereport(LOG, (errmsg("system is now read only"))); > > else > > { > > /* Request checkpoint */ > > RequestCheckpoint(CHECKPOINT_IMMEDIATE); > > ereport(LOG, (errmsg("system is now read write"))); > > } > > } > > > > We should RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT) before > > we SetWALProhibitState() and do the ereport(), if we have a read-write > > state change request. > > > +1, I too have the same question. > > > > FWIW, I don't think we can request CHECKPOINT_WAIT for this place; otherwise, I > think it will be a deadlock case -- checkpointer process waiting for itself. We should really just call CreateCheckPoint() here instead of RequestCheckpoint(). > > 3. Some of the functions that were added such as GetWALProhibitState(), > > IsWALProhibited() etc could be declared static inline. > > > IsWALProhibited() can be static but not GetWALProhibitState() since it > needed to > be accessible from other files. If you place a static inline function in a header file, it will be accessible from other files. E.g. pg_atomic_* functions. Regards, Soumyadeep
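A minimal sketch of the header pattern Soumyadeep mentions (illustrative only: the flag names mirror those quoted above, but these definitions, the plain global standing in for shared memory, and the function bodies are assumptions, not the patch's code). Because the functions are static inline, every file that includes the header gets its own inlinable copy, so the state accessor is usable from other files without an exported function symbol:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * These definitions would live in a header, included by any file that
 * needs to query the WAL-prohibit state.  Flag names follow the snippet
 * quoted in the mail; everything else here is a hypothetical sketch.
 */
#define WALPROHIBIT_STATE_READ_ONLY			(1 << 0)
#define WALPROHIBIT_TRANSITION_IN_PROGRESS	(1 << 1)

static uint32_t wal_state;		/* stand-in for the real shared-memory state */

static inline uint32_t
GetWALProhibitState(void)
{
	return wal_state;
}

static inline bool
IsWALProhibited(void)
{
	return (GetWALProhibitState() & WALPROHIBIT_STATE_READ_ONLY) != 0;
}
```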
Re: Improving connection scalability: GetSnapshotData()
Latest Postgres Windows 64 bits msvc 2019 64 bits Patches applied v12-0001 to v12-0007: C:\dll\postgres\contrib\pgstattuple\pgstatapprox.c(74,28): warning C4013: 'GetOldestXmin' indefinido; assumindo extern retornando int [C:\dll\postgres C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(569,29): warning C4013: 'GetOldestXmin' indefinido; assumindo extern retornando int [C:\dll\postgres\pg_visibility. vcxproj] C:\dll\postgres\contrib\pgstattuple\pgstatapprox.c(74,56): error C2065: 'PROCARRAY_FLAGS_VACUUM': identificador nao declarado [C:\dll\postgres\pgstattuple.vcxproj] C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(569,58): error C2065: 'PROCARRAY_FLAGS_VACUUM': identificador nao declarado [C:\dll\postgres\pg_visibility.vcxproj] C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(686,70): error C2065: 'PROCARRAY_FLAGS_VACUUM': identificador nao declarado [C:\dll\postgres\pg_visibility.vcxproj] regards, Ranier Vilela
Re: Making CASE error handling less surprising
pá 24. 7. 2020 v 18:49 odesílatel Andres Freund napsal: > Hi, > > On 2020-07-24 12:31:05 -0400, Robert Haas wrote: > > On Thu, Jul 23, 2020 at 12:57 PM Tom Lane wrote: > > > Every so often we get a complaint like [1] about how a CASE should have > > > prevented a run-time error and didn't, because constant-folding tried > > > to evaluate a subexpression that would not have been entered at > run-time. > > > > Yes, I've heard such complaints from other sources as well. > > > > > It struck me that it would not be hard to improve this situation a > great > > > deal. If, within a CASE subexpression that isn't certain to be > executed > > > at runtime, we refuse to pre-evaluate *any* function (essentially, > treat > > > them all as volatile), then we should largely get the semantics that > > > users expect. There's some potential for query slowdown if a CASE > > > contains a constant subexpression that we formerly reduced at plan time > > > and now do not, but that doesn't seem to me to be a very big deal. > > > > Like Pavel, and I think implicitly Dagfinn and Andres, I'm not sure I > > believe this. Pavel's example is a good one. The leakproof exception > > helps, but it doesn't cover everything. Users I've encountered throw > > things like date_trunc() and lpad() into SQL code and expect them to > > behave (from a performance point of view) like constants, but they > > also expect 1/0 not to get evaluated too early when e.g. CASE is used. > > It's difficult to meet both sets of expectations at the same time and > > we're probably never going to have a perfect solution, but I think > > you're minimizing the concern too much here. > > Wouldn't the rule that I proposed earlier, namely that sub-expressions > that involve only "proper" constants continue to get evaluated even > within CASE, largely address that? 
> It doesn't solve a possible performance problem with one shot (EXECUTE stmt plpgsql) queries, or with parameterized queries > > > I don't think I believe this either. I don't think an average user is > going to expect to behave differently from (SELECT > > ). This one actually bothers me more than the previous > > one. How would we even document it? Sometimes things get inlined, > > sometimes they don't. Sometimes subqueries get pulled up, sometimes > > not. The current behavior isn't great, but at least it handles these > > cases consistently. Getting the easy cases "right" while making the > > behavior in more complex cases harder to understand is not necessarily > > a win. > > Well, if we formalize the desired behaviour it's probably a lot easier > to work towards implementing it in additional cases (like > subselects). It doesn't seem too hard to keep track of whether a specific > subquery can evaluate constants in a certain way, if that's what we > need. > > Greetings, > > Andres Freund > > >
Re: HOT vs freezing issue causing "cannot freeze committed xmax"
Hi, On 2020-07-24 11:06:58 -0400, Robert Haas wrote: > On Thu, Jul 23, 2020 at 2:10 PM Andres Freund wrote: > > In the case the HOT logic triggers, we'll call > > heap_prepare_freeze_tuple() even when the tuple is dead. > > I think this is very bad. I've always been confused about these > comments, but I couldn't quite put my finger on the problem. Now I > think I can: the comments here imagine that we have the option either > to set tupgone, causing the line pointer to be marked unused by an > eventual call to lazy_vacuum_page(), or that we can decline to set > tupgone, which will leave the tuple around to be handled by the next > vacuum. Yea. I think the only saving grace is that it's not obvious when the situation can arise without prior corruption. But even if that's actuall impossible, it seems extremely fragile. I stared at heap_prune_chain() for quite a while and couldn't convince myself either way. > However, we don't really have a choice at all. A choice implies that > either option is correct, and therefore we can elect the one we > prefer. But here, it's not just that one option is incorrect, but that > both options are incorrect. Setting tupgone controls whether or not > the tuple is considered for freezing. If we DON'T consider freezing > it, then we might manage to advance relfrozenxid while an older XID > still exists in the table. If we DO consider freezing it, we will > correctly conclude that it needs to be frozen, but then the freezing > code is in an impossible situation, because it has no provision for > getting rid of tuples, only for keeping them. Its logic assumes that > whenever we are freezing xmin or xmax we do that in a way that causes > the tuple to be visible to everyone, but this tuple should be visible > to no one. So with your changes it now errors out instead of > corrupting data, but that's just replacing one bad thing (data > corruption) with another (VACUUM failures). 
I suspect that the legitimate cases hitting this branch won't error out, because then xmin/xmax aren't old enough to need to be frozen. > I think the actual correct behavior here is to mark the line pointer > as dead. That's not trivial, because just doing that naively will break HOT. > The easiest way to accomplish that is probably to have > lazy_scan_heap() just emit an extra XLOG_HEAP2_CLEAN record beyond > whatever HOT-pruning already did, if it's necessary. A better solution > would probably be to merge HOT-pruning with setting things all-visible > and have a single function that does both, but that seems a lot more > invasive, and definitely unsuitable for back-patching. I suspect that merging pruning and this logic in lazy_scan_heap() really is the only proper way to solve this kind of issue. I wonder if, given we don't know if this issue can be hit in a real database, and given that it already triggers an error, the right way to deal with this in the back-branches is to emit a more precise error message. I.e. if we hit this branch, and either xmin/xmax are older than the cutoff, then we issue a more specific ERROR. Greetings, Andres Freund
Re: Making CASE error handling less surprising
Hi, On 2020-07-24 12:31:05 -0400, Robert Haas wrote: > On Thu, Jul 23, 2020 at 12:57 PM Tom Lane wrote: > > Every so often we get a complaint like [1] about how a CASE should have > > prevented a run-time error and didn't, because constant-folding tried > > to evaluate a subexpression that would not have been entered at run-time. > > Yes, I've heard such complaints from other sources as well. > > > It struck me that it would not be hard to improve this situation a great > > deal. If, within a CASE subexpression that isn't certain to be executed > > at runtime, we refuse to pre-evaluate *any* function (essentially, treat > > them all as volatile), then we should largely get the semantics that > > users expect. There's some potential for query slowdown if a CASE > > contains a constant subexpression that we formerly reduced at plan time > > and now do not, but that doesn't seem to me to be a very big deal. > > Like Pavel, and I think implicitly Dagfinn and Andres, I'm not sure I > believe this. Pavel's example is a good one. The leakproof exception > helps, but it doesn't cover everything. Users I've encountered throw > things like date_trunc() and lpad() into SQL code and expect them to > behave (from a performance point of view) like constants, but they > also expect 1/0 not to get evaluated too early when e.g. CASE is used. > It's difficult to meet both sets of expectations at the same time and > we're probably never going to have a perfect solution, but I think > you're minimizing the concern too much here. Wouldn't the rule that I proposed earlier, namely that sub-expressions that involve only "proper" constants continue to get evaluated even within CASE, largely address that? > I don't think I believe this either. I don't think an average user is > going to expect to behave differently from (SELECT > ). This one actually bothers me more than the previous > one. How would we even document it? Sometimes things get inlined, > sometimes they don't. 
Sometimes subqueries get pulled up, sometimes > not. The current behavior isn't great, but at least it handles these > cases consistently. Getting the easy cases "right" while making the > behavior in more complex cases harder to understand is not necessarily > a win. Well, if we formalize the desired behaviour it's probably a lot easier to work towards implementing it in additional cases (like subselects). It doesn't seem too hard to keep track of whether a specific subquery can evaluate constants in a certain way, if that's what we need. Greetings, Andres Freund
Re: Missing CFI in hlCover()?
Greetings, * Tom Lane (t...@sss.pgh.pa.us) wrote: > Stephen Frost writes: > > I'm looking into an issue that we're seeing on the PG archives server > > with runaway queries that don't seem to ever want to end- and ignore > > signals. > > > This is PG11, 11.8-1.pgdg100+1 specifically on Debian/buster and what > > we're seeing is the loop in hlCover() (wparser_def.c:2071 to 2093) is > > lasting an awful long time without any CFI call. It's possible the CFI > > call should actually go elsewhere, but the complete lack of any CFI in > > wparser_def.c or tsvector_op.c seems a bit concerning. > > > I'm suspicious there's something else going on here that's causing this > > to take a long time but I don't have any smoking gun as yet. > > Hm. I'd vote for a CFI within the recursion in TS_execute(), if there's > not one there yet. Maybe hlFirstIndex needs one too --- if there are > a lot of words in the query, maybe that could be slow? Did you pin the > blame down any more precisely than hlCover? I've definitely seen hlFirstIndex take a few seconds to run (while running this under gdb and stepping through), so that could be a good choice to place one (perhaps even additionally to this...). I have to admit to wondering if we shouldn't consider having one in check_stack_depth() to try and reduce the risk of us forgetting to have one in sensible places, though I've not really looked at all the callers and that might not be reasonable in some cases (though I wonder if maybe we consider having a 'default' version that has a CFI, and an alternate that doesn't...). The depth of recursion for TS_execute_recurse() is actually not all that bad either though (only 6 or so, as the query string here is: "ERROR: The required file is not available"), so maybe that isn't really the right thing to be thinking here. Down in checkcondition_HL(), checkval->len is 213601, and it seems to pretty much always end up with a result of TS_NO, but doesn't seem to take all that long to arrive at that. 
Over in hlFirstIndex():

hlFirstIndex (prs=0x7ffc2d4b16c0, prs=0x7ffc2d4b16c0, pos=219518,
    query=0x559619f81528) at ./build/../src/backend/tsearch/wparser_def.c:2013
2013    hlFirstIndex(HeadlineParsedText *prs, TSQuery query, int pos)
(gdb) n
2026    if (item->type == QI_VAL &&
(gdb)
2029    item++;
(gdb) p pos
$72 = 219518
(gdb) p prs->curwords
$73 = 583766
(gdb)

Here's a full backtrace down to the checkcondition_HL():

(gdb) i s
#0  checkcondition_HL (opaque=0x7ffc2d4b11f0, val=0x559619f815c0, data=0x0)
    at ./build/../src/backend/tsearch/wparser_def.c:1981
#1  0x559617eced2b in TS_execute_recurse (curitem=0x559619f815c0,
    arg=arg@entry=0x7ffc2d4b11f0, flags=flags@entry=0,
    chkcond=chkcond@entry=0x559617df0120 )
    at ./build/../src/backend/utils/adt/tsvector_op.c:1872
#2  0x559617ecedd1 in TS_execute_recurse (curitem=0x559619f815a8,
    arg=arg@entry=0x7ffc2d4b11f0, flags=flags@entry=0,
    chkcond=chkcond@entry=0x559617df0120 )
    at ./build/../src/backend/utils/adt/tsvector_op.c:1892
#3  0x559617ecedd1 in TS_execute_recurse (curitem=0x559619f81590,
    arg=arg@entry=0x7ffc2d4b11f0, flags=flags@entry=0,
    chkcond=chkcond@entry=0x559617df0120 )
    at ./build/../src/backend/utils/adt/tsvector_op.c:1892
#4  0x559617ecedd1 in TS_execute_recurse (curitem=0x559619f81578,
    arg=arg@entry=0x7ffc2d4b11f0, flags=flags@entry=0,
    chkcond=chkcond@entry=0x559617df0120 )
    at ./build/../src/backend/utils/adt/tsvector_op.c:1892
#5  0x559617ecedd1 in TS_execute_recurse (curitem=0x559619f81560,
    arg=arg@entry=0x7ffc2d4b11f0, flags=flags@entry=0,
    chkcond=chkcond@entry=0x559617df0120 )
    at ./build/../src/backend/utils/adt/tsvector_op.c:1892
#6  0x559617ecedd1 in TS_execute_recurse (curitem=0x559619f81548,
    arg=arg@entry=0x7ffc2d4b11f0, flags=flags@entry=0,
    chkcond=chkcond@entry=0x559617df0120 )
    at ./build/../src/backend/utils/adt/tsvector_op.c:1892
#7  0x559617ecedd1 in TS_execute_recurse (curitem=curitem@entry=0x559619f81530,
    arg=arg@entry=0x7ffc2d4b11f0, flags=flags@entry=0,
    chkcond=chkcond@entry=0x559617df0120 )
    at ./build/../src/backend/utils/adt/tsvector_op.c:1892
#8  0x559617ed26d9 in TS_execute (curitem=curitem@entry=0x559619f81530,
    arg=arg@entry=0x7ffc2d4b11f0, flags=flags@entry=0,
    chkcond=chkcond@entry=0x559617df0120 )
    at ./build/../src/backend/utils/adt/tsvector_op.c:1854
#9  0x559617df041e in hlCover (prs=prs@entry=0x7ffc2d4b16c0,
    query=query@entry=0x559619f81528, p=p@entry=0x7ffc2d4b12a0,
    q=q@entry=0x7ffc2d4b12a4)
    at ./build/../src/backend/tsearch/wparser_def.c:2075
#10 0x559617df1a2d in mark_hl_words (max_words=35, min_words=15,
    shortword=3, highlightall=, query=, prs=0x7ffc2d4b16c0) at
Re: Mark unconditionally-safe implicit coercions as leakproof
On Fri, Jul 24, 2020 at 12:17 PM Tom Lane wrote: > I went through the system's built-in implicit coercions to see > which ones are unconditionally successful. These could all be > marked leakproof, as per attached patch. This came up in the > context of the nearby discussion about CASE, but it seems like > an independent improvement. If you have a function f(int8) > that is leakproof, you don't want it to effectively become > non-leakproof when you apply it to an int4 or int2 column. > > One that I didn't mark leakproof is rtrim1(), which is the > infrastructure for char(n) to text coercion. It looks like it > actually does qualify right now, but the code is long enough and > complex enough that I think such a marking would be a bit unsafe. > > Any objections? IMHO, this is a nice improvement. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
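As a quick way to see which implicit coercions are (or are not) currently marked leakproof, one can query the catalogs directly. This is only an illustrative sketch, using the standard pg_cast and pg_proc columns:

```sql
-- List implicit casts and whether their underlying function is leakproof.
-- castcontext = 'i' selects implicit coercions; binary-coercible casts
-- (castfunc = 0) drop out via the join.
SELECT c.castsource::regtype AS source,
       c.casttarget::regtype AS target,
       p.proname,
       p.proleakproof
FROM pg_cast c
JOIN pg_proc p ON p.oid = c.castfunc
WHERE c.castcontext = 'i'
ORDER BY 1, 2;
```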
Re: Making CASE error handling less surprising
On Thu, Jul 23, 2020 at 12:57 PM Tom Lane wrote: > Every so often we get a complaint like [1] about how a CASE should have > prevented a run-time error and didn't, because constant-folding tried > to evaluate a subexpression that would not have been entered at run-time. Yes, I've heard such complaints from other sources as well. > It struck me that it would not be hard to improve this situation a great > deal. If, within a CASE subexpression that isn't certain to be executed > at runtime, we refuse to pre-evaluate *any* function (essentially, treat > them all as volatile), then we should largely get the semantics that > users expect. There's some potential for query slowdown if a CASE > contains a constant subexpression that we formerly reduced at plan time > and now do not, but that doesn't seem to me to be a very big deal. Like Pavel, and I think implicitly Dagfinn and Andres, I'm not sure I believe this. Pavel's example is a good one. The leakproof exception helps, but it doesn't cover everything. Users I've encountered throw things like date_trunc() and lpad() into SQL code and expect them to behave (from a performance point of view) like constants, but they also expect 1/0 not to get evaluated too early when e.g. CASE is used. It's difficult to meet both sets of expectations at the same time and we're probably never going to have a perfect solution, but I think you're minimizing the concern too much here. > This is not a complete fix, because if you write a sub-SELECT the > contents of the sub-SELECT are not processed by the outer query's > eval_const_expressions pass; instead, we look at it within the > sub-SELECT itself, and in that context there's no apparent reason > to avoid const-folding. So >CASE WHEN x < 0 THEN (SELECT 1/0) END > fails even if x is never less than zero. 
I don't see any great way > to avoid that, and I'm not particularly concerned about it anyhow; > usually the point of a sub-SELECT like this is to be decoupled from > outer query evaluation, so that the behavior should not be that > surprising. I don't think I believe this either. I don't think an average user is going to expect to behave differently from (SELECT ). This one actually bothers me more than the previous one. How would we even document it? Sometimes things get inlined, sometimes they don't. Sometimes subqueries get pulled up, sometimes not. The current behavior isn't great, but at least it handles these cases consistently. Getting the easy cases "right" while making the behavior in more complex cases harder to understand is not necessarily a win. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
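The two failure modes under discussion can be reproduced directly. In this sketch, `tab` and its integer column `x` are hypothetical stand-ins:

```sql
-- Constant-folding evaluates 1/0 at plan time today, so this can fail
-- even if no row ever has x < 0:
SELECT CASE WHEN x < 0 THEN 1/0 ELSE 0 END FROM tab;

-- The sub-SELECT variant is const-folded within the subquery itself,
-- where no enclosing CASE is visible, so it fails as well:
SELECT CASE WHEN x < 0 THEN (SELECT 1/0) END FROM tab;
```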
Re: Missing CFI in hlCover()?
Stephen Frost writes: > I'm looking into an issue that we're seeing on the PG archives server > with runaway queries that don't seem to ever want to end- and ignore > signals. > This is PG11, 11.8-1.pgdg100+1 specifically on Debian/buster and what > we're seeing is the loop in hlCover() (wparser_def.c:2071 to 2093) is > lasting an awful long time without any CFI call. It's possible the CFI > call should actually go elsewhere, but the complete lack of any CFI in > wparser_def.c or tsvector_op.c seems a bit concerning. > I'm suspicious there's something else going on here that's causing this > to take a long time but I don't have any smoking gun as yet. Hm. I'd vote for a CFI within the recursion in TS_execute(), if there's not one there yet. Maybe hlFirstIndex needs one too --- if there are a lot of words in the query, maybe that could be slow? Did you pin the blame down any more precisely than hlCover? regards, tom lane
Mark unconditionally-safe implicit coercions as leakproof
I went through the system's built-in implicit coercions to see which ones are unconditionally successful. These could all be marked leakproof, as per attached patch. This came up in the context of the nearby discussion about CASE, but it seems like an independent improvement. If you have a function f(int8) that is leakproof, you don't want it to effectively become non-leakproof when you apply it to an int4 or int2 column.

One that I didn't mark leakproof is rtrim1(), which is the infrastructure for char(n) to text coercion. It looks like it actually does qualify right now, but the code is long enough and complex enough that I think such a marking would be a bit unsafe.

Any objections?

regards, tom lane

diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 4b5af32440..3fed5725cb 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -698,10 +698,10 @@
   proname => 'dlog1', prorettype => 'float8', proargtypes => 'float8',
   prosrc => 'dlog1' },
 { oid => '235', descr => 'convert int2 to float8',
-  proname => 'float8', prorettype => 'float8', proargtypes => 'int2',
+  proname => 'float8', proleakproof => 't', prorettype => 'float8', proargtypes => 'int2',
   prosrc => 'i2tod' },
 { oid => '236', descr => 'convert int2 to float4',
-  proname => 'float4', prorettype => 'float4', proargtypes => 'int2',
+  proname => 'float4', proleakproof => 't', prorettype => 'float4', proargtypes => 'int2',
   prosrc => 'i2tof' },
 { oid => '237', descr => 'convert float8 to int2',
   proname => 'int2', prorettype => 'int2', proargtypes => 'float8',
@@ -879,25 +879,25 @@
   proargtypes => 'float8 float8 float8 int4',
   prosrc => 'width_bucket_float8' },
 { oid => '311', descr => 'convert float4 to float8',
-  proname => 'float8', prorettype => 'float8', proargtypes => 'float4',
+  proname => 'float8', proleakproof => 't', prorettype => 'float8', proargtypes => 'float4',
   prosrc => 'ftod' },
 { oid => '312', descr => 'convert float8 to float4',
   proname => 'float4', prorettype => 'float4', proargtypes => 'float8',
   prosrc => 'dtof' },
 { oid => '313', descr => 'convert int2 to int4',
-  proname => 'int4', prorettype => 'int4', proargtypes => 'int2',
+  proname => 'int4', proleakproof => 't', prorettype => 'int4', proargtypes => 'int2',
   prosrc => 'i2toi4' },
 { oid => '314', descr => 'convert int4 to int2',
   proname => 'int2', prorettype => 'int2', proargtypes => 'int4',
   prosrc => 'i4toi2' },
 { oid => '316', descr => 'convert int4 to float8',
-  proname => 'float8', prorettype => 'float8', proargtypes => 'int4',
+  proname => 'float8', proleakproof => 't', prorettype => 'float8', proargtypes => 'int4',
   prosrc => 'i4tod' },
 { oid => '317', descr => 'convert float8 to int4',
   proname => 'int4', prorettype => 'int4', proargtypes => 'float8',
   prosrc => 'dtoi4' },
 { oid => '318', descr => 'convert int4 to float4',
-  proname => 'float4', prorettype => 'float4', proargtypes => 'int4',
+  proname => 'float4', proleakproof => 't', prorettype => 'float4', proargtypes => 'int4',
   prosrc => 'i4tof' },
 { oid => '319', descr => 'convert float4 to int4',
   proname => 'int4', prorettype => 'int4', proargtypes => 'float4',
@@ -1150,16 +1150,16 @@
   proname => 'text', prorettype => 'text', proargtypes => 'bpchar',
   prosrc => 'rtrim1' },
 { oid => '406', descr => 'convert name to text',
-  proname => 'text', prorettype => 'text', proargtypes => 'name',
+  proname => 'text', proleakproof => 't', prorettype => 'text', proargtypes => 'name',
   prosrc => 'name_text' },
 { oid => '407', descr => 'convert text to name',
-  proname => 'name', prorettype => 'name', proargtypes => 'text',
+  proname => 'name', proleakproof => 't', prorettype => 'name', proargtypes => 'text',
   prosrc => 'text_name' },
 { oid => '408', descr => 'convert name to char(n)',
   proname => 'bpchar', prorettype => 'bpchar', proargtypes => 'name',
   prosrc => 'name_bpchar' },
 { oid => '409', descr => 'convert char(n) to name',
-  proname => 'name', prorettype => 'name', proargtypes => 'bpchar',
+  proname => 'name', proleakproof => 't', prorettype => 'name', proargtypes => 'bpchar',
   prosrc => 'bpchar_name' },
 { oid => '449', descr => 'hash',
@@ -1338,10 +1338,10 @@
   proname => 'int4', prorettype => 'int4', proargtypes => 'int8',
   prosrc => 'int84' },
 { oid => '481', descr => 'convert int4 to int8',
-  proname => 'int8', prorettype => 'int8', proargtypes => 'int4',
+  proname => 'int8', proleakproof => 't', prorettype => 'int8', proargtypes => 'int4',
   prosrc => 'int48' },
 { oid => '482', descr => 'convert int8 to float8',
-  proname => 'float8', prorettype => 'float8', proargtypes => 'int8',
+  proname => 'float8', proleakproof => 't', prorettype => 'float8', proargtypes => 'int8',
   prosrc => 'i8tod' },
 { oid => '483', descr => 'convert float8 to int8',
   proname => 'int8', prorettype => 'int8', proargtypes => 'float8',
@@ -1359,7
Re: Any objections to implementing LogicalDecodeMessageCB for pgoutput?
Hi, On 2020-07-24 11:33:52 -0400, Dave Cramer wrote: > For logical replication there is no need to implement this, but others are > using the pgoutput plugin for Change Data Capture. The reason they are > using pgoutput is because it is guaranteed to be available as it is in core > postgres. > > Implementing LogicalDecodeMessageCB provides some synchronization facility > that is not easily replicated. It's definitely useful. Probably needs to be a parameter that signals whether they should be sent out? Greetings, Andres Freund
Missing CFI in hlCover()?
Greetings, I'm looking into an issue that we're seeing on the PG archives server with runaway queries that don't seem to ever want to end- and ignore signals. This is PG11, 11.8-1.pgdg100+1 specifically on Debian/buster and what we're seeing is the loop in hlCover() (wparser_def.c:2071 to 2093) is lasting an awful long time without any CFI call. It's possible the CFI call should actually go elsewhere, but the complete lack of any CFI in wparser_def.c or tsvector_op.c seems a bit concerning. I'm suspicious there's something else going on here that's causing this to take a long time but I don't have any smoking gun as yet. Thanks, Stephen
Re: Default setting for enable_hashagg_disk
On Fri, Jul 24, 2020 at 11:18:48AM -0400, Robert Haas wrote:
>On Thu, Jul 23, 2020 at 9:22 PM Tomas Vondra wrote:
>>                      2MB    4MB    8MB    64MB   256MB
>> ------------------------------------------------------
>> hash                 6.71   6.70   6.73   6.44   5.81
>> hash CP_SMALL_TLIST  5.28   5.26   5.24   5.04   4.54
>> sort                 3.41   3.41   3.41   3.57   3.45
>>
>> So sort writes ~3.4GB of data, give or take. But hashagg/master writes
>> almost 6-7GB of data, i.e. almost twice as much. Meanwhile, with the
>> original CP_SMALL_TLIST we'd write "only" ~5GB of data. That's still
>> much more than the 3.4GB of data written by sort (which has to spill
>> everything, while hashagg only spills rows not covered by the groups
>> that fit into work_mem).
>>
>> I initially assumed this is due to writing the hash value to the tapes,
>> and the rows are fairly narrow (only about 40B per row), so a 4B hash
>> could make a difference - but certainly not this much. Moreover, that
>> does not explain the difference between master and the now-reverted
>> CP_SMALL_TLIST, I think.
>
> This is all really good analysis, I think, but this seems like the key
> finding. It seems like we don't really understand what's actually
> getting written. Whether we use hash or sort doesn't seem like it
> should have this kind of impact on how much data gets written, and
> whether we use CP_SMALL_TLIST or project when needed doesn't seem like
> it should matter like this either.

I think for CP_SMALL_TLIST at least some of the extra data can be attributed to writing the hash value along with the tuple, which sort obviously does not do. With the 32GB data set (the i5 machine), there are ~20M rows in the lineitem table, and with 4B hash values that's about 732MB of extra data. That's about the 50% of the difference between sort and CP_SMALL_TLIST, and I'd dare to speculate the other 50% is due to LogicalTape internals (pointers to the next block, etc.) The question is why master has 2x the overhead of CP_SMALL_TLIST, if it's meant to write the same set of columns etc.
regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Custom compression methods
On Mon, Jun 29, 2020 at 12:31 PM Andres Freund wrote: > > This "custom compression methods" thread - vintage 2017 - Original > > code by Nikita Glukhov, later work by Ildus Kurbangaliev > > The "pluggable compression support" thread - vintage 2013 - Andres Freund > > The "plgz performance" thread - vintage 2019 - Petr Jelinek > > > > Anyone want to point to a FOURTH implementation of this feature? > > To be clear, I don't think the 2003 patch should be considered as being > "in the running". I guess you mean 2013, not 2003? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Any objections to implementing LogicalDecodeMessageCB for pgoutput?
For logical replication there is no need to implement this, but others are using the pgoutput plugin for Change Data Capture. The reason they are using pgoutput is because it is guaranteed to be available as it is in core postgres. Implementing LogicalDecodeMessageCB provides some synchronization facility that is not easily replicated. Thoughts ? Dave Cramer
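For reference, decoding messages originate from pg_logical_emit_message(), available since PostgreSQL 9.6; a plugin implementing LogicalDecodeMessageCB receives these in the change stream:

```sql
-- Transactional message: decoded only if the transaction commits.
SELECT pg_logical_emit_message(true, 'my-prefix', 'payload bytes');

-- Non-transactional message: decoded immediately, regardless of the
-- fate of the surrounding transaction.
SELECT pg_logical_emit_message(false, 'my-prefix', 'payload bytes');
```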
Re: Making CASE error handling less surprising
I wrote:
> Ah, I see what you mean. Yeah, that throws an error today, and it
> still would with the patch I was envisioning (attached), because
> inlining does Param substitution in a different way. I'm not
> sure that we could realistically fix the inlining case with this
> sort of approach.

Here's another example that we can't possibly fix with Param substitution hacking, because there are no Params involved in the first place:

select f1, case when f1 = 42 then 1/i else null end
from (select f1, 0 as i from int4_tbl) ss;

Pulling up the subquery results in "1/0", so this fails today, even though "f1 = 42" is never true.

Attached is a v3 patch that incorporates the leakproofness idea. As shown in the new case.sql tests, this does fix both the SQL function and subquery-pullup cases. To keep the join regression test results the same, I marked int48() as leakproof, which is surely safe enough. Probably we should make a push to mark all unconditionally-safe implicit coercions as leakproof, but that's a separate matter.

regards, tom lane

diff --git a/src/backend/optimizer/util/clauses.c b/src/backend/optimizer/util/clauses.c
index e04b144072..48e31b16d6 100644
--- a/src/backend/optimizer/util/clauses.c
+++ b/src/backend/optimizer/util/clauses.c
@@ -61,6 +61,16 @@ typedef struct
 	AggClauseCosts *costs;
 } get_agg_clause_costs_context;
 
+typedef enum
+{
+	/* Ordering is important here! */
+	ece_eval_nothing,			/* be unconditionally safe */
+	ece_eval_leakproof,			/* eval leakproof immutable functions */
+	ece_eval_immutable,			/* eval immutable functions too */
+	ece_eval_stable,			/* eval stable functions too */
+	ece_eval_volatile			/* eval volatile functions too */
+} ece_eval_mode;
+
 typedef struct
 {
 	ParamListInfo boundParams;
@@ -68,6 +78,7 @@ typedef struct
 	List	   *active_fns;
 	Node	   *case_val;
 	bool		estimate;
+	ece_eval_mode eval_mode;
 } eval_const_expressions_context;
 
 typedef struct
@@ -119,6 +130,8 @@ static Node *eval_const_expressions_mutator(Node *node,
 static bool contain_non_const_walker(Node *node, void *context);
 static bool ece_function_is_safe(Oid funcid,
 								 eval_const_expressions_context *context);
+static bool ece_function_form_is_safe(Form_pg_proc func_form,
+									  eval_const_expressions_context *context);
 static Node *apply_const_relabel(Node *arg, Oid rtype, int32 rtypmod,
 								 Oid rcollid, CoercionForm rformat,
 								 int rlocation);
@@ -2264,6 +2277,7 @@
 	context.active_fns = NIL;	/* nothing being recursively simplified */
 	context.case_val = NULL;	/* no CASE being examined */
 	context.estimate = false;	/* safe transformations only */
+	context.eval_mode = ece_eval_immutable; /* eval immutable functions */
 	return eval_const_expressions_mutator(node, );
 }
@@ -2280,8 +2294,11 @@
 * available by the caller of planner(), even if the Param isn't marked
 * constant. This effectively means that we plan using the first supplied
 * value of the Param.
- * 2. Fold stable, as well as immutable, functions to constants.
+ * 2. Fold stable, as well as immutable, functions to constants. The risk
+ * that the result might change from planning time to execution time is
+ * worth taking, as we otherwise couldn't get an estimate at all.
 * 3. Reduce PlaceHolderVar nodes to their contained expressions.
+ * 4. Ignore domain constraints, assuming that CoerceToDomain will succeed.
 *
 */
 Node *
@@ -2295,6 +2312,7 @@
 	context.active_fns = NIL;	/* nothing being recursively simplified */
 	context.case_val = NULL;	/* no CASE being examined */
 	context.estimate = true;	/* unsafe transformations OK */
+	context.eval_mode = ece_eval_stable;	/* eval stable functions */
 	return eval_const_expressions_mutator(node, );
 }
@@ -2961,7 +2979,22 @@
 				 * opportunity to reduce constant test conditions. For
 				 * example this allows
 				 *		CASE 0 WHEN 0 THEN 1 ELSE 1/0 END
-				 * to reduce to 1 rather than drawing a divide-by-0 error.
+				 * to reduce to 1 rather than drawing a divide-by-0 error,
+				 * since we'll throw away the ELSE without processing it.
+				 *
+				 * Even in not-all-constant cases, the user might be expecting
+				 * the CASE to prevent run-time errors, for example in
+				 *		CASE WHEN j > 0 THEN 1 ELSE 1/0 END
+				 * Since division is immutable, we'd ordinarily simplify the
+				 * division and hence draw the divide-by-zero error at plan
+				 * time, even if j is never zero at run time. To avoid that,
+				 * reduce eval_mode to ece_eval_leakproof while processing any
+				 * test condition or result value that will not certainly be
+				 * evaluated at run-time. We expect that leakproof immutable
+				 * functions will not throw any errors (except perhaps in
Re: Default setting for enable_hashagg_disk
On Thu, Jul 23, 2020 at 9:22 PM Tomas Vondra wrote:
>                      2MB    4MB    8MB    64MB   256MB
> ------------------------------------------------------
> hash                 6.71   6.70   6.73   6.44   5.81
> hash CP_SMALL_TLIST  5.28   5.26   5.24   5.04   4.54
> sort                 3.41   3.41   3.41   3.57   3.45
>
> So sort writes ~3.4GB of data, give or take. But hashagg/master writes
> almost 6-7GB of data, i.e. almost twice as much. Meanwhile, with the
> original CP_SMALL_TLIST we'd write "only" ~5GB of data. That's still
> much more than the 3.4GB of data written by sort (which has to spill
> everything, while hashagg only spills rows not covered by the groups
> that fit into work_mem).
>
> I initially assumed this is due to writing the hash value to the tapes,
> and the rows are fairly narrow (only about 40B per row), so a 4B hash
> could make a difference - but certainly not this much. Moreover, that
> does not explain the difference between master and the now-reverted
> CP_SMALL_TLIST, I think.

This is all really good analysis, I think, but this seems like the key finding. It seems like we don't really understand what's actually getting written. Whether we use hash or sort doesn't seem like it should have this kind of impact on how much data gets written, and whether we use CP_SMALL_TLIST or project when needed doesn't seem like it should matter like this either. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
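A sketch of the kind of comparison being discussed above (the TPC-H-style `lineitem` table and grouping column are placeholders; the actual benchmark queries are not shown in this excerpt, and the spill volume appears in EXPLAIN's "Disk Usage" output):

```sql
SET work_mem = '4MB';

-- Bias the planner toward a hash aggregate and observe spilling:
SET enable_sort = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT l_partkey, count(*) FROM lineitem GROUP BY l_partkey;

-- Now force the sort-based strategy for comparison:
RESET enable_sort;
SET enable_hashagg = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT l_partkey, count(*) FROM lineitem GROUP BY l_partkey;
```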
Re: HOT vs freezing issue causing "cannot freeze committed xmax"
On Thu, Jul 23, 2020 at 2:10 PM Andres Freund wrote: > In the case the HOT logic triggers, we'll call > heap_prepare_freeze_tuple() even when the tuple is dead. I think this is very bad. I've always been confused about these comments, but I couldn't quite put my finger on the problem. Now I think I can: the comments here imagine that we have the option either to set tupgone, causing the line pointer to be marked unused by an eventual call to lazy_vacuum_page(), or that we can decline to set tupgone, which will leave the tuple around to be handled by the next vacuum. However, we don't really have a choice at all. A choice implies that either option is correct, and therefore we can elect the one we prefer. But here, it's not just that one option is incorrect, but that both options are incorrect. Setting tupgone controls whether or not the tuple is considered for freezing. If we DON'T consider freezing it, then we might manage to advance relfrozenxid while an older XID still exists in the table. If we DO consider freezing it, we will correctly conclude that it needs to be frozen, but then the freezing code is in an impossible situation, because it has no provision for getting rid of tuples, only for keeping them. Its logic assumes that whenever we are freezing xmin or xmax we do that in a way that causes the tuple to be visible to everyone, but this tuple should be visible to no one. So with your changes it now errors out instead of corrupting data, but that's just replacing one bad thing (data corruption) with another (VACUUM failures). I think the actual correct behavior here is to mark the line pointer as dead. The easiest way to accomplish that is probably to have lazy_scan_heap() just emit an extra XLOG_HEAP2_CLEAN record beyond whatever HOT-pruning already did, if it's necessary. 
A better solution would probably be to merge HOT-pruning with setting things all-visible and have a single function that does both, but that seems a lot more invasive, and definitely unsuitable for back-patching. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Improve planner cost estimations for alternative subplans
Hi Tom, sorry for the delay.

> I experimented with adding a number-of-evaluations parameter to
> cost_qual_eval, and found that the majority of call sites do have
> something realistic they can pass. The attached very-much-WIP patch
> shows my results so far. There's a lot of loose ends:

I like the idea, so if alternative subplans remain there, I think we should implement it. Best, Alex
Re: Improve planner cost estimations for alternative subplans
Hi Melanie, Sorry for the delay. I've just started looking at this patch today, but I was wondering if you might include a test case which minimally reproduces the original problem you had. I could reproduce it with an easier generated data set, please see attached. However, to be honest with you, while searching I encountered a few examples of the opposite behavior, when the patched version was slower than the master branch. So I'm not so sure whether we should use the patch, maybe we should rather consider Tom's approach. Best, Alex altplan-example.sql Description: application/sql
Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks
On Fri, Jul 24, 2020 at 7:10 AM Bharath Rupireddy wrote: > I looked at what actually llvm_shutdown() does? It frees up JIT stacks, also > if exists perf related resource, using LLVMOrcDisposeInstance() and > LLVMOrcUnregisterPerf(), that were dynamically allocated in > llvm_session_initialize through a JIT library function > LLVMOrcCreateInstance() [1]. > > It looks like there is no problem in moving llvm_shutdown to either > on_shmem_exit() or on_proc_exit(). If it doesn't involve shared memory, I guess it can be on_proc_exit() rather than on_shmem_exit(). I guess the other question is why we're doing it at all. What resources are being allocated that wouldn't be freed up by process exit anyway? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [Patch] ALTER SYSTEM READ ONLY
On Thu, Jul 23, 2020 at 10:04 PM Andres Freund wrote: > I think we should consider introducing XACTFATAL or such, guaranteeing > the transaction gets aborted, without requiring a FATAL. This has been > needed for enough cases that it's worthwhile. Seems like that would need a separate discussion, apart from this thread. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [Patch] ALTER SYSTEM READ ONLY
On Thu, Jul 23, 2020 at 12:11 PM Soumyadeep Chakraborty wrote: > In the read-only level I was suggesting, I wasn't suggesting that we > stop WAL flushes, in fact we should flush the WAL before we mark the > system as read-only. Once the system declares itself as read-only, it > will not perform any more on-disk changes; It may perform all the > flushes it needs as a part of the read-only request handling. I think that's already how the patch works, or at least how it should work. You stop new writes, flush any existing WAL, and then declare the system read-only. That can all be done quickly. > What I am saying is it doesn't have to be just the queries. I think we > can cater to all the other use cases simply by forcing a checkpoint > before marking the system as read-only. But that part can't, which means that if we did that, it would break the feature for the originally intended use case. I'm not on board with that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [Patch] ALTER SYSTEM READ ONLY
On Wed, Jul 22, 2020 at 6:03 PM Soumyadeep Chakraborty wrote: > So if we are not going to address those cases, we should change the > syntax and remove the notion of read-only. It could be: > > ALTER SYSTEM SET wal_writes TO off|on; > or > ALTER SYSTEM SET prohibit_wal TO off|on; This doesn't really work because of the considerations mentioned in http://postgr.es/m/ca+tgmoakctzozr0xeqalfimbcje2rgcbazf4eybpxjtnetp...@mail.gmail.com > If we are going to try to make it truly read-only, and cater to the > other use cases, we have to: > > Perform a checkpoint before declaring the system read-only (i.e. before > the command returns). This may be expensive of course, as Andres has > pointed out in this thread, but it is a price that has to be paid. If we > do this checkpoint, then we can avoid an additional shutdown checkpoint > and an end-of-recovery checkpoint (if we restart the primary after a > crash while in read-only mode). Also, we would have to prevent any > operation that touches control files, which I am not sure we do today in > the current patch. It's basically impossible to create a system for fast failover that involves a checkpoint. See my comments at http://postgr.es/m/ca+tgmoye8ucgtyfgfnv3vwpztygsdksu2f4mniqhkar_ukb...@mail.gmail.com - you can't achieve five nines or even four nines of availability if you have to wait for a checkpoint that might take twenty minutes. I have nothing against a feature that does what you're describing, but this feature is designed to make fast failover easier to accomplish, and it's not going to succeed if it involves a checkpoint. > Why not have the best of both worlds? Consider: > > ALTER SYSTEM SET read_only to {off, on, wal}; > > -- on: wal writes off + no writes to disk > -- off: default > -- wal: only wal writes off > > Of course, there can probably be better syntax for the above.

There are a few things you can imagine doing here:

1. Freeze WAL writes but allow dirty buffers to be flushed afterward. This is the most useful thing for fast failover, I would argue, because it's quick and the fact that some dirty buffers may not be written doesn't matter.

2. Freeze WAL writes except a final checkpoint which will flush dirty buffers along the way. This is like shutting the system down cleanly and bringing it back up as a standby, except without performing a shutdown.

3. Freeze WAL writes and write out all dirty buffers without actually checkpointing. This is sort of a hybrid of #1 and #2. It's probably not much faster than #2 but it avoids generating any more WAL.

4. Freeze WAL writes and just keep all the dirty buffers cached, without writing them out. This seems like a bad idea for the reasons mentioned in Amul's reply. The system might not be able to respond even to read-only queries any more if shared_buffers is full of unevictable dirty buffers.

Either #2 or #3 is sufficient to take a filesystem level snapshot of the cluster while it's running, but I'm not sure why that's interesting. You can already do that sort of thing by using pg_basebackup or by running pg_start_backup() and pg_stop_backup() and copying the directory in the middle, and you can do all of that while the cluster is accepting writes, which seems like it will usually be more convenient. If you do want this, you have several options, like running a checkpoint immediately followed by ALTER SYSTEM READ ONLY (so that the amount of WAL generated during the backup is small but maybe not none); or shutting down the system cleanly and restarting it as a standby; or maybe using the proposed pg_ctl demote feature mentioned on a separate thread. Contrary to what you write, I don't think either #2 or #3 is sufficient to enable checksums, at least not without some more engineering, because the server would cache the state from the control file, and a bunch of blocks from the database.
I guess it would work if you did a server restart afterward, but I think there are better ways of supporting online checksum enabling that don't require shutting down the server, or even making it read-only; and there's been significant work done on those already. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Making CASE error handling less surprising
On Fri, Jul 24, 2020 at 4:35 AM Tom Lane wrote:
> Andres Freund writes:
> > I'm a bit worried about a case like:
> >
> > CREATE FUNCTION yell(int, int)
> > RETURNS int
> > IMMUTABLE
> > LANGUAGE SQL AS $$
> >    SELECT CASE WHEN $1 != 0 THEN 17 / $2 ELSE NULL END
> > $$;
> >
> > EXPLAIN SELECT yell(g.i, 0) FROM generate_series(1, 10) g(i);
> >
> > I don't think the parameters here would have been handled before
> > inlining, right?
>
> Ah, I see what you mean. Yeah, that throws an error today, and it
> still would with the patch I was envisioning (attached), because
> inlining does Param substitution in a different way. I'm not
> sure that we could realistically fix the inlining case with this
> sort of approach.
>
> I think this bears out the comment I made before that this approach
> still leaves us with a very complicated behavior. Maybe we should
> stick with the previous approach, possibly supplemented with a
> leakproofness exception.

I am actually not so sure this is a good idea. Here are two doubts I have.

1. The problem of when a given SQL expression is evaluated crops up in a wide variety of different contexts and, worst case, causes far more damage than queries which always error. Removing the lower-hanging fruit while leaving cases like:

    select lock_foo(id), * from foo where somefield > 100;
    -- which rows does lock_foo(id) run on? Does it matter?

is going to legitimize these complaints in a way which will be very hard to address unless we also want to eventually be able to specify when volatile functions may be run. The two cases don't look the same, but they are manifestations of the same problem, which is that when you execute a SQL query you have no control over when expressions are actually run.

2. The refusal to fold immutables within CASE statements here means we may need more tricks to get around the planner if we hit pathological cases in performance. I am not convinced this is a net win.
If we go this route, would it be too much to ask to allow a GUC variable to preserve the old behavior? > regards, tom lane > > -- Best Regards, Chris Travers Head of Database Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com Saarbrücker Straße 37a, 10405 Berlin
Re: INSERT INTO SELECT, Why Parallelism is not selected?
Robert Haas writes: > Well, I think the comments could be more clear - for the insert case > specifically - about which cases you think are and are not safe. Yeah, the proposed comment changes don't actually add much. Also please try to avoid inserting non-ASCII into the source code; at least in my mail reader, that attachment has some. regards, tom lane
Re: INSERT INTO SELECT, Why Parallelism is not selected?
On Fri, Jul 24, 2020 at 7:59 AM Amit Kapila wrote: > On Fri, Jul 17, 2020 at 11:24 AM Amit Kapila wrote: > > Do you have something else in mind? > > I am planning to commit the comments change patch attached in the > above email [1] next week sometime (probably Monday or Tuesday) unless > you have something more to add? Well, I think the comments could be more clear - for the insert case specifically - about which cases you think are and are not safe. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: renaming configure.in to configure.ac
Peter Eisentraut writes: > On 2020-07-17 10:46, Peter Eisentraut wrote: >> v1-0001-Rename-configure.in-to-configure.ac.patch > I have committed that, and I have sent feedback to the Autoconf > developers about our concerns about the slowness of some of the new tests. Sounds good. Do we want to try Noah's idea of temporarily committing the remaining changes, to see if the buildfarm is happy? regards, tom lane
Re: Default setting for enable_hashagg_disk
On Fri, Jul 24, 2020 at 10:40:47AM +0200, Tomas Vondra wrote: On Thu, Jul 23, 2020 at 07:33:45PM -0700, Peter Geoghegan wrote: On Thu, Jul 23, 2020 at 6:22 PM Tomas Vondra wrote: So let me share some fresh I/O statistics collected on the current code using iosnoop. I've done the tests on two different machines using the "aggregate part" of TPC-H Q17, i.e. essentially this: SELECT * FROM ( SELECT l_partkey AS agg_partkey, 0.2 * avg(l_quantity) AS avg_quantity FROM lineitem GROUP BY l_partkey OFFSET 10 ) part_agg; The OFFSET is there just to ensure we don't need to send anything to the client, etc. Thanks for testing this. So sort writes ~3.4GB of data, give or take. But hashagg/master writes almost 6-7GB of data, i.e. almost twice as much. Meanwhile, with the original CP_SMALL_TLIST we'd write "only" ~5GB of data. That's still much more than the 3.4GB of data written by sort (which has to spill everything, while hashagg only spills rows not covered by the groups that fit into work_mem). What I find when I run your query (with my own TPC-H DB that is smaller than what you used here -- 59,986,052 lineitem tuples) is that the sort required about 7x more memory than the hash agg to do everything in memory: 4,384,711KB for the quicksort vs 630,801KB peak hash agg memory usage. I'd be surprised if the ratio was very different for you -- but can you check? I can check, but it's not quite clear to me what are we looking for? Increase work_mem until there's no need to spill in either case? FWIW the hashagg needs about 4775953kB and the sort 33677586kB. So yeah, that's about 7x more. I think that's probably built into the TPC-H data set. It'd be easy to construct cases with much higher/lower factors. 
regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

                                  QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=15493271.13..15493271.13 rows=1 width=36) (actual time=331351.099..331351.099 rows=0 loops=1)
   ->  HashAggregate  (cost=15186647.64..15493271.13 rows=20441566 width=36) (actual time=318190.465..330956.383 rows=1500 loops=1)
         Group Key: lineitem.l_partkey
         Peak Memory Usage: 4775953kB
         ->  Seq Scan on lineitem  (cost=0.00..12936556.76 rows=450018176 width=9) (actual time=0.156..56051.850 rows=450019701 loops=1)
 Planning Time: 0.151 ms
 Execution Time: 331931.239 ms
(7 rows)

                                  QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=81298097.02..81298097.02 rows=1 width=36) (actual time=415344.639..415344.639 rows=0 loops=1)
   ->  GroupAggregate  (cost=77616337.21..81298097.02 rows=20441566 width=36) (actual time=209292.469..414951.954 rows=1500 loops=1)
         Group Key: lineitem.l_partkey
         ->  Sort  (cost=77616337.21..78741382.65 rows=450018176 width=9) (actual time=209292.435..329583.999 rows=450019701 loops=1)
               Sort Key: lineitem.l_partkey
               Sort Method: quicksort  Memory: 33677586kB
               ->  Seq Scan on lineitem  (cost=0.00..12936556.76 rows=450018176 width=9) (actual time=0.096..72474.733 rows=450019701 loops=1)
 Planning Time: 0.145 ms
 Execution Time: 417157.598 ms
(9 rows)
Re: [Patch] ALTER SYSTEM READ ONLY
On Fri, Jul 24, 2020 at 7:34 AM Andres Freund wrote: > > Hi, Thanks for looking at the patch. > > > From f0188a48723b1ae7372bcc6a344ed7868fdc40fb Mon Sep 17 00:00:00 2001 > > From: Amul Sul > > Date: Fri, 27 Mar 2020 05:05:38 -0400 > > Subject: [PATCH v3 2/6] Add alter system read only/write syntax > > > > Note that syntax doesn't have any implementation. > > --- > > src/backend/nodes/copyfuncs.c| 12 > > src/backend/nodes/equalfuncs.c | 9 + > > src/backend/parser/gram.y| 13 + > > src/backend/tcop/utility.c | 20 > > src/bin/psql/tab-complete.c | 6 -- > > src/include/nodes/nodes.h| 1 + > > src/include/nodes/parsenodes.h | 10 ++ > > src/tools/pgindent/typedefs.list | 1 + > > 8 files changed, 70 insertions(+), 2 deletions(-) > > Shouldn't there be at outfuncs support as well? Perhaps we even need > readfuncs, not immediately sure. Ok, can add that as well. > > > > > From 2c5db7db70d4cebebf574fbc47db7fbf7c440be1 Mon Sep 17 00:00:00 2001 > > From: Amul Sul > > Date: Fri, 19 Jun 2020 06:29:36 -0400 > > Subject: [PATCH v3 3/6] Implement ALTER SYSTEM READ ONLY using global > > barrier. > > > > Implementation: > > > > 1. When a user tried to change server state to WAL-Prohibited using > > ALTER SYSTEM READ ONLY command; AlterSystemSetWALProhibitState() will > > emit > > PROCSIGNAL_BARRIER_WAL_PROHIBIT_STATE_CHANGE barrier and will wait > > until the > > barrier has been absorbed by all the backends. > > > > 2. When a backend receives the WAL-Prohibited barrier, at that moment if > > it is already in a transaction and the transaction already assigned XID, > > then the backend will be killed by throwing FATAL(XXX: need more > > discussion > > on this) > > I think we should consider introducing XACTFATAL or such, guaranteeing > the transaction gets aborted, without requiring a FATAL. This has been > needed for enough cases that it's worthwhile. > As I am aware of, the existing code PostgresMain() uses FATAL to terminate the connection when protocol synchronization was lost. 
Currently, another in-progress proposal, "Terminate the idle sessions"[1], also uses FATAL, afaik.

> There are several cases where we WAL log without having an xid
> assigned. E.g. when HOT pruning during syscache lookups or such. Are
> there any cases where the check for being in recovery is followed by a
> CHECK_FOR_INTERRUPTS, before the WAL logging is done?

In the case of an operation without an xid, an error will be raised just before the point where the WAL record is expected. I haven't found the places you are asking about at a glance; I will search for them, but I am sure the current implementation is not missing the places where it is supposed to check the prohibited state and complain. Quick question: is it possible that pruning will happen with a SELECT query? It would be helpful if you or someone else could point me to the places where WAL can be generated even in the case of read-only queries.

> > 3. Otherwise, if that backend running transaction which yet to get XID
> > assigned we don't need to do anything special, simply call
> > ResetLocalXLogInsertAllowed() so that any future WAL insert in will
> > check XLogInsertAllowed() first which set ready only state appropriately.
> >
> > 4. A new transaction (from existing or new backend) starts as a read-only
> > transaction.
>
> Why do we need 4)? And doesn't that have the potential to be
> unnecessarily problematic if the server is subsequently brought out of
> the readonly state again?

The transaction that was started in the read-only system state will be read-only until the end. I think that shouldn't be too problematic.

> > 5. Auxiliary processes like autovacuum launcher, background writer,
> > checkpointer and walwriter won't do anything in WAL-Prohibited
> > server state until someone wakes us up. E.g. a backend might later on
> > request us to put the system back to read-write.
>
> Hm.
> It's not at all clear to me why bgwriter and walwriter shouldn't do
> anything in this state. bgwriter for example is even running entirely
> normally in a hot standby node?

I think I forgot to update the description when I reverted the walwriter changes. The current version doesn't have any changes to the walwriter. And bgwriter behaves the same as it does during recovery. Will update this, sorry for the confusion.

> > 6. At shutdown in WAL-Prohibited mode, we'll skip shutdown checkpoint
> > and xlog rotation. Starting up again will perform crash recovery (XXX:
> > need some discussion on this as well)
> >
> > 7. ALTER SYSTEM READ ONLY/WRITE is restricted on standby server.
> >
> > 8. Only super user can toggle WAL-Prohibit state.
> >
> > 9. Add system_is_read_only GUC show the system state -- will true when
> > system is wal prohibited or in
Re: INSERT INTO SELECT, Why Parallelism is not selected?
On Fri, Jul 17, 2020 at 11:24 AM Amit Kapila wrote: > > Do you have something else in mind? > I am planning to commit the comments change patch attached in the above email [1] next week sometime (probably Monday or Tuesday) unless you have something more to add? [1] - https://www.postgresql.org/message-id/CAA4eK1%2BRL7c_s%3D%2BTwAE6DJ1MmupbEiGCFLt97US%2BDMm6UxAjTA%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks
On Tue, Jul 21, 2020 at 1:17 AM Robert Haas wrote:
> > On Tue, Jul 7, 2020 at 12:55 PM Andres Freund wrote:
> > What are you proposing? For now we could easily enough work around this
> > by just making it a on_proc_exit() callback, but that doesn't really
> > change the fundamental issue imo.
>
> I think it would be more correct for it to be an on_proc_exit()
> callback, because before_shmem_exit() callbacks can and do perform
> actions which rely on an awful lot of the system being still in a
> working state. RemoveTempRelationsCallback() is a good example: it
> thinks it can start and end transactions and make a bunch of catalog
> changes. I don't know that any of that could use JIT, but moving the
> JIT cleanup to the on_shmem_exit() stage seems better. At that point,
> there shouldn't be anybody doing anything that relies on being able to
> perform logical changes to the database; we're just shutting down
> low-level subsystems at that point, and thus presumably not doing
> anything that could possibly need JIT.

I looked at what llvm_shutdown() actually does. It frees up JIT stacks and, if present, perf-related resources, using LLVMOrcDisposeInstance() and LLVMOrcUnregisterPerf(); these were dynamically allocated in llvm_session_initialize() through the JIT library function LLVMOrcCreateInstance() [1]. It looks like there is no problem in moving llvm_shutdown to either on_shmem_exit() or on_proc_exit().

[1] - https://llvm.org/doxygen/OrcCBindings_8cpp_source.html

> But I also agree that what pg_start_backup() was doing before v13 was
> wrong; that's why I committed
> 303640199d0436c5e7acdf50b837a027b5726594. The only reason I didn't
> back-patch it is because the consequences are so minor I didn't think
> it was worth worrying about. We could, though. I'd be somewhat
> inclined to both do that and also change LLVM to use on_proc_exit() in
> master, but I don't feel super-strongly about it.
Patch: v1-0001-Move-llvm_shutdown-to-on_proc_exit-list-from-befo.patch
This moves llvm_shutdown to the on_proc_exit() callback list. Please consider this change for master and, if possible, for versions <= 13. Basic JIT use cases and regression tests work fine with the patch.

Patches: PG11-0001-Fix-minor-problems-with-non-exclusive-backup-clea.patch and PG12-0001-Fix-minor-problems-with-non-exclusive-backup-cleanup.patch
Please consider back-patching commit 303640199d0436c5e7acdf50b837a027b5726594 (the above two patches are for this commit) to versions < 13, to fix the abort issue. Note that the two patches have no difference in code; I just made one applicable to PG11.

Patch: v1-0001-Modify-cancel_before_shmem_exit-comments.patch
This patch modifies the cancel_before_shmem_exit() function comment to reflect the safe usage of the before_shmem_exit_list callback mechanism, and removes the point "For simplicity, only the latest entry can be removed*", as it suggests that there is still scope for improvement in cancel_before_shmem_exit()'s search mechanism.

With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com

From 5fae4b3a046789fb7b54e689c979b01cb322aaf9 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy
Date: Thu, 23 Jul 2020 17:56:21 +0530
Subject: [PATCH v1] Move llvm_shutdown to on_proc_exit list from before_shmem_exit.

llvm_shutdown frees up dynamically allocated memory for JIT compilers in a session/backend. Having this as on_proc_exit doesn't cause any harm.
---
 src/backend/jit/llvm/llvmjit.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/jit/llvm/llvmjit.c b/src/backend/jit/llvm/llvmjit.c
index af8b34aaaf..43bed78a52 100644
--- a/src/backend/jit/llvm/llvmjit.c
+++ b/src/backend/jit/llvm/llvmjit.c
@@ -683,7 +683,7 @@ llvm_session_initialize(void)
 	}
 #endif

-	before_shmem_exit(llvm_shutdown, 0);
+	on_proc_exit(llvm_shutdown, 0);

 	llvm_session_initialized = true;
-- 
2.25.1

From 3b0a048afd7ad6d8564a54d13397c72f6eadc5da Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy
Date: Fri, 24 Jul 2020 08:55:52 +0530
Subject: [PATCH v5] Fix minor problems with non-exclusive backup cleanup.

The previous coding imagined that it could call before_shmem_exit() when a non-exclusive backup began and then remove the previously-added handler by calling cancel_before_shmem_exit() when that backup ended. However, this only works provided that nothing else in the system has registered a before_shmem_exit() hook in the interim, because cancel_before_shmem_exit() is documented to remove a callback only if it is the latest callback registered. It also only works if nothing can ERROR out between the time that sessionBackupState is reset and the time that cancel_before_shmem_exit() is called, which doesn't seem to be strictly true. To fix, leave the handler installed for the lifetime of the session, arrange to install it just once, and teach it to quietly do nothing if there isn't a non-exclusive backup in process. This is a bug, but for now I'm not going to
Re: renaming configure.in to configure.ac
On 2020-07-17 10:46, Peter Eisentraut wrote: v1-0001-Rename-configure.in-to-configure.ac.patch I have committed that, and I have sent feedback to the Autoconf developers about our concerns about the slowness of some of the new tests. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: recovering from "found xmin ... from before relfrozenxid ..."
Hi All, Attached is the patch that adds heap_force_kill(regclass, tid[]) and heap_force_freeze(regclass, tid[]) functions which Robert mentioned in the first email in this thread. The patch basically adds an extension named pg_surgery that contains these functions. Please have a look and let me know your feedback. Thank you. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com On Thu, Jul 16, 2020 at 9:44 PM Robert Haas wrote: > On Thu, Jul 16, 2020 at 10:00 AM Robert Haas > wrote: > > I see your point, though: the tuple has to be able to survive > > HOT-pruning in order to cause a problem when we check whether it needs > > freezing. > > Here's an example where the new sanity checks fail on an invisible > tuple without any concurrent transactions: > > $ initdb > $ pg_ctl start -l ~/logfile > $ createdb > $ psql > > create table simpsons (a int, b text); > vacuum freeze; > > $ cat > txid.sql > select txid_current(); > $ pgbench -t 131072 -c 8 -j 8 -n -f txid.sql > $ psql > > insert into simpsons values (1, 'homer'); > > $ pg_ctl stop > $ pg_resetwal -x 1000 $PGDATA > $ pg_ctl start -l ~/logfile > $ psql > > update pg_class set relfrozenxid = (relfrozenxid::text::integer + > 200)::text::xid where relname = 'simpsons'; > > rhaas=# select * from simpsons; > a | b > ---+--- > (0 rows) > > rhaas=# vacuum simpsons; > ERROR: found xmin 1049082 from before relfrozenxid 2000506 > CONTEXT: while scanning block 0 of relation "public.simpsons" > > This is a fairly insane situation, because we should have relfrozenxid > < tuple xid < xid counter, but instead we have xid counter < tuple xid > < relfrozenxid, but it demonstrates that it's possible to have a > database which is sufficiently corrupt that you can't escape from the > new sanity checks using only INSERT, UPDATE, and DELETE. 
> > Now, an even easier way to create a table with a tuple that prevents > vacuuming and also can't just be deleted is to simply remove a > required pg_clog file (and maybe restart the server to clear out any > cached data in the SLRUs). What we typically do with customers who > need to recover from that situation today is give them a script to > fabricate a bogus CLOG file that shows all transactions as committed > (or, perhaps, aborted). But I think that the tools proposed on this > thread might be a better approach in certain cases. If the problem is > that a pg_clog file vanished, then recreating it with whatever content > you think is closest to what was probably there before is likely the > best you can do. But if you've got some individual tuples with crazy > xmin values, you don't really want to drop matching files in pg_clog; > it's better to fix the tuples. > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company > > > From 75a6ba0e86b6d199201477c5ebd60258ef16f181 Mon Sep 17 00:00:00 2001 From: ashu Date: Fri, 24 Jul 2020 11:26:45 +0530 Subject: [PATCH] Add contrib/pg_surgery to perform surgery on the damaged heap tables. This contrib module basically adds a couple of functions named heap_force_kill and heap_force_freeze that can be used in the scripts to perform surgery on the damaged heap tables. Ashutosh Sharma. 
---
 contrib/Makefile                           |   1 +
 contrib/pg_surgery/Makefile                |  23 ++
 contrib/pg_surgery/expected/pg_surgery.out | 109 +
 contrib/pg_surgery/heap_surgery_funcs.c    | 356 +
 contrib/pg_surgery/pg_surgery--1.0.sql     |  14 ++
 contrib/pg_surgery/pg_surgery.control      |   5 +
 contrib/pg_surgery/sql/pg_surgery.sql      |  72 ++
 7 files changed, 580 insertions(+)
 create mode 100644 contrib/pg_surgery/Makefile
 create mode 100644 contrib/pg_surgery/expected/pg_surgery.out
 create mode 100644 contrib/pg_surgery/heap_surgery_funcs.c
 create mode 100644 contrib/pg_surgery/pg_surgery--1.0.sql
 create mode 100644 contrib/pg_surgery/pg_surgery.control
 create mode 100644 contrib/pg_surgery/sql/pg_surgery.sql

diff --git a/contrib/Makefile b/contrib/Makefile
index 1846d41..07d5734 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -35,6 +35,7 @@ SUBDIRS = \
 	pg_standby \
 	pg_stat_statements \
 	pg_trgm \
+	pg_surgery \
 	pgcrypto \
 	pgrowlocks \
 	pgstattuple \
diff --git a/contrib/pg_surgery/Makefile b/contrib/pg_surgery/Makefile
new file mode 100644
index 000..baf2a88
--- /dev/null
+++ b/contrib/pg_surgery/Makefile
@@ -0,0 +1,23 @@
+# contrib/pg_surgery/Makefile
+
+MODULE_big = pg_surgery
+OBJS = \
+	$(WIN32RES) \
+	heap_surgery_funcs.o
+
+EXTENSION = pg_surgery
+DATA = pg_surgery--1.0.sql
+PGFILEDESC = "pg_surgery - perform surgery on the damaged heap table"
+
+REGRESS = pg_surgery
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/pg_surgery
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git
Re: Default setting for enable_hashagg_disk
On Thu, Jul 23, 2020 at 07:33:45PM -0700, Peter Geoghegan wrote: On Thu, Jul 23, 2020 at 6:22 PM Tomas Vondra wrote: So let me share some fresh I/O statistics collected on the current code using iosnoop. I've done the tests on two different machines using the "aggregate part" of TPC-H Q17, i.e. essentially this: SELECT * FROM ( SELECT l_partkey AS agg_partkey, 0.2 * avg(l_quantity) AS avg_quantity FROM lineitem GROUP BY l_partkey OFFSET 10 ) part_agg; The OFFSET is there just to ensure we don't need to send anything to the client, etc. Thanks for testing this. So sort writes ~3.4GB of data, give or take. But hashagg/master writes almost 6-7GB of data, i.e. almost twice as much. Meanwhile, with the original CP_SMALL_TLIST we'd write "only" ~5GB of data. That's still much more than the 3.4GB of data written by sort (which has to spill everything, while hashagg only spills rows not covered by the groups that fit into work_mem). What I find when I run your query (with my own TPC-H DB that is smaller than what you used here -- 59,986,052 lineitem tuples) is that the sort required about 7x more memory than the hash agg to do everything in memory: 4,384,711KB for the quicksort vs 630,801KB peak hash agg memory usage. I'd be surprised if the ratio was very different for you -- but can you check? I can check, but it's not quite clear to me what are we looking for? Increase work_mem until there's no need to spill in either case? I think that there is something pathological about this spill behavior, because it sounds like the precise opposite of what you might expect when you make a rough extrapolation of what disk I/O will be based on the memory used in no-spill cases (as reported by EXPLAIN ANALYZE). Maybe, not sure what exactly you think is pathological? The trouble is hashagg has to spill input tuples but the memory used in no-spill case represents aggregated groups, so I'm not sure how you could extrapolate from that ... 
FWIW one more suspicious thing that I forgot to mention is the behavior of the "planned partitions" depending on work_mem, which looks like this:

   2MB    Planned Partitions:  64    HashAgg Batches:  4160
   4MB    Planned Partitions: 128    HashAgg Batches: 16512
   8MB    Planned Partitions: 256    HashAgg Batches: 21488
  64MB    Planned Partitions:  32    HashAgg Batches:  2720
 256MB    Planned Partitions:   8    HashAgg Batches:     8

I'd expect the number of planned partitions to decrease (slowly) as work_mem increases, but it seems to increase initially. Seems a bit strange, but maybe it's expected.

What I find really surprising is the costing - despite writing about twice as much data, the hashagg cost is estimated to be much lower than the sort. For example on the i5 machine, the hashagg cost is ~10M, while the sort cost is almost 42M. Despite using almost twice as much disk. And the costing is exactly the same for master and the CP_SMALL_TLIST. That does make it sound like the costs of the hash agg aren't being represented. I suppose it isn't clear if this is a costing issue because it isn't clear if the execution time performance itself is pathological or is instead something that must be accepted as the cost of spilling the hash agg in a general kind of way. Not sure, but I think we need to spill roughly as much as sort, so it seems a bit strange that we're spilling 2x as much data and yet the cost is so much lower.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Implement UNLOGGED clause for COPY FROM
On Fri, 17 Jul 2020 at 13:23, osumi.takami...@fujitsu.com wrote: > > Hi, > > > AFAICS, we can already accomplish basically the same thing as what you want > > to > > do like this: > > > > alter table foo set unlogged; > > copy foo from ...; > > alter table foo set logged; > This didn't satisfy what I wanted. > In case that 'foo' has huge amount of rows at the beginning, > this example would spend much time to copy > the contents of 'foo' twice to swap relfilenodes atomically. > When that loaded data by COPY is big too, its execution time becomes much > longer. > > > You keep on ignoring the indexes... not to mention replication. > Sorry for having made you think like this. > > When the server crash occurs during data loading of COPY UNLOGGED, > it's a must to keep index consistent of course. > I'm thinking that to rebuild the indexes on the target table would work. > > In my opinion, UNLOGGED clause must be designed to guarantee that > where the data loaded by this clause is written starts from the end of all > other data blocks. > Plus, those blocks needs to be protected by any write of other transactions > during the copy. > Apart from that, the server must be aware of which block is the first block, > or the range about where it started or ended in preparation for the crash. > > During the crash recovery, those points are helpful to recognize and detach > such blocks > in order to solve a situation that the loaded data is partially synced to the > disk and the rest isn't. How do online backup and archive recovery work? Suppose that the user executes pg_basebackup during COPY UNLOGGED running, the physical backup might have the portion of tuples loaded by COPY UNLOGGED but these data are not recovered. It might not be a problem because the operation is performed without WAL records. But what if an insertion happens after COPY UNLOGGED but before pg_stop_backup()? 
I think that a new tuple could be inserted at the end of the table, following the data loaded by COPY UNLOGGED. With your approach described above, the newly inserted tuple will be recovered during archive recovery, but it either will be removed if we replay the insertion WAL then truncate the table, or won't be inserted due to a missing block if we truncate the table then replay the insertion WAL, resulting in losing the tuple even though the user's insertion reported success. Regards,

-- 
Masahiko Sawada  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: deferred primary key and logical replication
The patch no longer applies, because of additions in the test source. Otherwise, I have tested the patch and confirmed that updates and deletes on tables with deferred primary keys work with logical replication. The new status of this patch is: Waiting on Author
Re: logical replication empty transactions
Sorry, I replied in the wrong thread. Please ignore above mail. > >
handle an ECPG_bytea typo
Hi, hackers

The source looks like:

    case ECPGt_bytea:
        {
            struct ECPGgeneric_varchar *variable =
                (struct ECPGgeneric_varchar *) (var->value);
            ..
        }

I think the developer intended to use struct ECPGgeneric_bytea instead of struct ECPGgeneric_varchar. Is this understanding right? I have written a patch to fix this typo: 0001-Fix-ECPGt_bytea-typo.patch
Re: logical replication empty transactions
The patch no longer applies, because of additions in the test source. Otherwise, I have tested the patch and confirmed that updates and deletes on tables with deferred primary keys work with logical replication. The new status of this patch is: Waiting on Author
Re: Parallel Seq Scan vs kernel read ahead
Hi Soumyadeep, Thanks for re-running the tests. On Thu, 23 Jul 2020 at 06:01, Soumyadeep Chakraborty wrote: > On Tue, Jul 14, 2020 at 8:52 PM David Rowley wrote: > > It would be good to see EXPLAIN (ANALYZE, BUFFERS) with SET > > track_io_timing = on; for each value of max_parallel_workers. > > As for running EXPLAIN ANALYZE, running that on this system incurs a > non-trivial amount of overhead. The overhead is simply staggering. I didn't really intend for that to be used to get an accurate overall timing for the query. It was more to get an indication of whether the reads are actually hitting the disk or not. I mentioned to Kirk in [1] that his read speed might be a bit higher than what his disk can actually cope with. I'm not too sure about the HDD he mentions, but if it's a single HDD then reading at an average speed of 3457 MB/sec seems quite a bit too fast. It seems more likely, in his case, that those reads are mostly coming from the kernel's cache. David [1] https://www.postgresql.org/message-id/CAApHDvoDzAzXEp+Ay2CfT3U=zcd5nld7k9_y936bnhjzs5j...@mail.gmail.com
Re: Parallel worker hangs while handling errors.
On Thu, Jul 23, 2020 at 12:54 PM vignesh C wrote:
>
> Thanks for reviewing and adding your thoughts; my comments are inline.
>
> On Fri, Jul 17, 2020 at 1:21 PM Bharath Rupireddy wrote:
> >
> > The same hang issue can occur (though I'm not able to back it up with a
> > use case) in the cases where EmitErrorReport() is called from an
> > "if (sigsetjmp(local_sigjmp_buf, 1) != 0)" block, such as in
> > autovacuum.c, bgwriter.c, bgworker.c, checkpointer.c, walwriter.c, and
> > postgres.c.
> >
> > I'm not sure if this can occur in other cases.
>
> I checked further on this point: it can't occur in the other cases, as
> mq_putmessage() only gets used by parallel workers (ParallelWorkerMain()
> --> pq_redirect_to_shm_mq()).
>
> > Note that, in all sigsetjmp blocks, we intentionally
> > HOLD_INTERRUPTS(), to not cause any issues while performing error
> > handling. I'm concerned that if we directly call
> > BackgroundWorkerUnblockSignals(), which opens up all the signals,
> > our main intention of holding interrupts or blocking signals may be
> > defeated.
> >
> > Since the main problem for this hang issue is the blocking of SIGUSR1,
> > in sigsetjmp, can't we just unblock only SIGUSR1 instead of unblocking
> > all signals? I tried this with the parallel copy hang, and the issue
> > is resolved.
>
> On further thought, I feel just unblocking SIGUSR1 would be the right
> approach in this case. I'm attaching a new patch which unblocks the
> SIGUSR1 signal. I have verified that the original issue with the WIP
> parallel copy patch gets fixed. I have made changes only in bgworker.c
> as we require the parallel worker to receive this signal and continue
> processing. I have not included the changes for other processes as I'm
> not sure if this scenario is applicable to other processes.
> Since we are sure that this hang issue can occur only for parallel
> workers, the change in StartBackgroundWorker's sigsetjmp block should
> only be made for the parallel worker case. Also, there can be a lot of
> other callback execution and processing in proc_exit() for which we
> might not need SIGUSR1 unblocked, so let's undo the unblocking right
> after EmitErrorReport() to avoid causing any new issues.

Attaching a modified v2 patch: it has the unblocking only for parallel workers, undoes it after EmitErrorReport(), and has some adjustments in the comments. I verified this fix for the parallel copy hang issue. Also, make check and make check-world pass.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

From 6bd924f2e7ff90d6c293f131b8ddb20898b9950d Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy
Date: Fri, 24 Jul 2020 12:01:14 +0530
Subject: [PATCH v2] Fix for Parallel worker hangs while handling errors.

The worker is not able to receive signals while processing the error
flow. The worker hangs in this case because when the worker is started
the signals are masked using sigprocmask. Unblocking of signals is done
by calling BackgroundWorkerUnblockSignals in ParallelWorkerMain. Now,
due to error handling, the worker has jumped to sigsetjmp in the
StartBackgroundWorker function. Here the signals are in a blocked
state, hence the signal is not received by the worker process.
---
 src/backend/postmaster/bgworker.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index beb5e85434..ee187689a4 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -747,6 +747,20 @@ StartBackgroundWorker(void)
 	 */
 	if (sigsetjmp(local_sigjmp_buf, 1) != 0)
 	{
+		/*
+		 * In case of parallel workers, unblock the SIGUSR1 signal; it was
+		 * blocked when the postmaster forked us. The leader process will
+		 * send SIGUSR1 to the worker process (the worker process will be in
+		 * a waiting state as there is no space available) to indicate that
+		 * shared memory space has been freed up. Once the signal is
+		 * received, the worker process will continue populating the error
+		 * message.
+		 */
+		if ((worker->bgw_flags & BGWORKER_CLASS_PARALLEL) != 0)
+		{
+			sigdelset(&BlockSig, SIGUSR1);
+			PG_SETMASK(&BlockSig);
+		}
+
 		/* Since not using PG_TRY, must reset error stack by hand */
 		error_context_stack = NULL;
@@ -756,6 +770,18 @@ StartBackgroundWorker(void)
 		/* Report the error to the server log */
 		EmitErrorReport();
 
+		/*
+		 * Undo the unblocking of SIGUSR1 which was done above, so as not to
+		 * cause any further issues from unblocking SIGUSR1 during the
+		 * execution of callbacks and other processing that will be done
+		 * during proc_exit().
+		 */
+		if ((worker->bgw_flags & BGWORKER_CLASS_PARALLEL) != 0)
+		{
+			sigaddset(&BlockSig, SIGUSR1);
+			PG_SETMASK(&BlockSig);
+		}
+
 		/*
 		 * Do we need more cleanup here? For shmem-connected bgworkers, we
 		 * will call InitProcess below, which will install ProcKill as exit
--
2.25.1