Re: Fast DSM segments
On Mon, Jul 27, 2020 at 2:45 PM Thomas Munro wrote: > Here's a new version, using the name min_dynamic_shared_memory, which > sounds better to me. Any objections? I also fixed the GUC's maximum > setting so that it's sure to fit in size_t. I pushed it like that. Happy to rename the GUC if someone has a better idea. I don't really love the way dsm_create()'s code flows, but I didn't see another way to do this within the existing constraints. I think it'd be nice to rewrite this thing to get rid of the random number-based handles that are directly convertible to key_t/pathname, and instead use something holding {slot number, generation number}. Then you could improve that code flow and get rid of several cases of linear array scans under an exclusive lock. The underlying key_t/pathname would live in the slot. You'd need a new way to find the control segment itself after a restart, where dsm_cleanup_using_control_segment() cleans up after the previous incarnation, but I think that just requires putting the key_t/pathname directly in PGShmemHeader, instead of a new {slot number, generation number} style handle. Or maybe a separate mapped file opened by well known pathname, or something like that.
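To make the {slot number, generation number} idea concrete, here is a minimal sketch of how such a handle could be packed; the bit widths, macro names, and layout are purely illustrative assumptions, not taken from any existing patch:

#include <stdint.h>

/*
 * Illustrative only: encode a control-segment slot index plus a generation
 * counter in a single 32-bit handle.  The handle is no longer convertible
 * to a key_t/pathname; the key_t/pathname lives in the slot itself, and a
 * stale handle is detected because its generation no longer matches.
 */
typedef uint32_t dsm_handle_sketch;

#define DSM_SKETCH_SLOT_BITS   10   /* up to 1024 slots, say */
#define DSM_SKETCH_MAKE(slot, gen) \
    (((uint32_t) (gen) << DSM_SKETCH_SLOT_BITS) | (uint32_t) (slot))
#define DSM_SKETCH_SLOT(h)        ((h) & ((1U << DSM_SKETCH_SLOT_BITS) - 1))
#define DSM_SKETCH_GENERATION(h)  ((h) >> DSM_SKETCH_SLOT_BITS)

With something like this, dsm_attach() could index straight into the slot array and check the generation, rather than scanning the whole array under an exclusive lock.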
Re: Fix minor source code comment mistake
On Thu, Jul 30, 2020 at 05:57:40PM +0900, Michael Paquier wrote: > Indeed. Let's fix this. And done. -- Michael
Re: Threading in BGWorkers (!)
> > I see no good reason to believe that the signal handler issue is the only > one. Even if it is, > not being able to call any postgres infrastructure is a pretty huge > handicap. > (changed emails to get rid of the nasty employer notice...) It's at least a workable handicap that I'm happy to deal with. I can say with 100% confidence that people coming from non C languages will be using threading in Postgres backends as interop matures (and it's happening fast now). A lot of the time they won't even know they are using threads as it will be done by libraries they make use of transparently. Let's help them to avoid unsafe code now, not wait until they show up on this list with a critical failure and tap at the big sign that says "NO THREADING". - james
RE: [Proposal] Add accumulated statistics for wait event
> This patch fails to apply to HEAD, please submit a rebased version. I've > marked this as Waiting on Author. Sorry for my absence. Unfortunately I haven't had time to work on this patch in this CF. I plan to be back in the next CF to work on this patch and also review other patches. --- Yoshikazu IMAI
RE: [Patch] Optimize dropping of relation buffers using dlist
On Friday, July 31, 2020 2:37 AM, Konstantin Knizhnik wrote: > The following review has been posted through the commitfest application: > make installcheck-world: tested, passed > Implements feature: tested, passed > Spec compliant: not tested > Documentation: not tested > > I have tested this patch under various workloads and hardware (including Power2 > server with 384 virtual cores) and didn't find any performance regression. > > The new status of this patch is: Ready for Committer Thank you very much, Konstantin, for testing the patch with different workloads. I wonder if I need to modify some of the documentation. I'll leave the final review to the committer(s) as well. Regards, Kirk Jamison
Re: Add Information during standby recovery conflicts
On Sat, 11 Jul 2020 at 21:56, Drouvot, Bertrand wrote: > > > On 7/7/20 9:43 AM, Masahiko Sawada wrote: > > Fully makes sense, the new patch version attached is now producing: > >> 2020-07-06 06:10:36.022 UTC [14035] LOG: waiting for recovery conflict > >> on snapshot > > How about adding the subject? that is, "recovery is waiting for > > recovery conflict on %s" or "recovery process is waiting for > > conflict on %s". > > > The subject is now added in the new attached patch (I did not include > the pid as it is part of the log prefix). > > It now looks like: > > 2020-07-11 12:00:41.092 UTC [23217] LOG: recovery is waiting for > recovery conflict on snapshot > 2020-07-11 12:00:41.092 UTC [23217] DETAIL: There is 1 blocking > connection(s). > 2020-07-11 12:00:41.092 UTC [23217] CONTEXT: WAL redo at 0/4A0A6BF0 for > Heap2/CLEAN: remxid 1128 > WAL record received at 2020-07-11 12:00:41.092231+00 > tbs 1663 db 13586 rel 16805, fork main, blkno 0 > > > >> 2020-07-06 06:10:36.022 UTC [14035] DETAIL: WAL record received at > >> 2020-07-06 06:10:36.021963+00. > >> Tablespace/database/relation are 1663/13586/16672, fork is main > >> and block is 0. > >> There is 1 blocking connection(s). > > To follow the existing log message, perhaps this can be something like > > "WAL record received at %s, rel %u/%u/%u, fork %s, blkno %u. %d > > processes". But I'm not sure the errdetail is the best place to > > display the WAL information as I mentioned in the latter part of this > > email. > > moved to the context and formatted the same way as the current > Standby/LOCK context. > > > > Ok. Let's keep this thread for the new attached patch that focus on the > >> recovery process waiting. > > Thank you for updating the patch! > > > > I've tested the latest patch. > > > Thank you for testing and reviewing! > > > > On recovery conflict on lock and on > > bufferpin, if max_standby_streaming_delay is disabled (set to -1), the > > logs don't appear even if log_recovery_conflicts is true. > > > Nice catch! it is fixed in the new attached patch (the log reporting has > been moved out of StandbyTimeoutHandler()). > > > > > Here is random comments on the code: > > > > + recovery_conflict_main_message = psprintf("waiting for > > recovery conflict on %s", > > + > > get_procsignal_reason_desc(reason)); > > : > > + ereport(LOG, > > + (errmsg("%s", > > recovery_conflict_main_message), > > + errdetail("%s\n" "There is %d blocking > > connection(s).", wal_record_detail_str, num_waitlist_entries))); > > > > It's not translation-support-friendly. I think the message "waiting > > for recovery conflict on %s" should be surrounded by _(). Or we can > > just put it to ereport as follows: > > > > ereport(LOG, > > (errmsg("waiting for recovery conflicts on %s", > > get_procsignal_reason_desc(reason)) > > ... > > > changed in the new attached patch. > > > > > > --- > > + oldcontext = MemoryContextSwitchTo(ErrorContext); > > + econtext = error_context_stack; > > + > > + if (XLogRecGetBlockTag(econtext->arg, 0, , , )) > > > > I don't think it's a good idea to rely on error_context_stack because > > other codes might set another error context before reaching here in > > the future. > > > right, changed in the new attached patch: this is now done in > rm_redo_error_callback() and using the XLogReaderState passed as argument. 
> > > > > > --- > > + if (XLogRecGetBlockTag(econtext->arg, 0, , , )) > > + wal_record_detail_str = psprintf("WAL record received > > at %s.\nTablespace/database/relation are %u/%u/%u, fork is %s and > > block is %u.", > > + receipt_time_str, > > rnode.spcNode, rnode.dbNode, rnode.relNode, > > + forkNames[forknum], > > + blknum); > > > > There might be a block tag in block ids other than 0. > > > right, fixed in the new attached patch. > > > > I'm not sure the > > errdetail is the best place where we display WAL information. > > > moved to context in the new attached patch. > > > > For > > instance, we display both the relation oid and block number depending > > on RM as follows: > > > > 2020-07-07 15:50:30.733 JST [13344] LOG: waiting for recovery conflict on > > lock > > 2020-07-07 15:50:30.733 JST [13344] DETAIL: WAL record received at > > 2020-07-07 15:50:27.73378+09. > > There is 1 blocking connection(s). > > 2020-07-07 15:50:30.733 JST [13344] CONTEXT: WAL redo at 0/328 > > for Standby/LOCK: xid 506 db 13586 rel 16384 > > > > I wonder if we can display the details of redo WAL information by > > improving xlog_outdesc() or rm_redo_error_callback() so that it > > displays relfilenode, forknum, and block number. What do you think? > > > I think that fully make sense to move this to rm_redo_error_callback(). >
Re: windows config.pl question
sorry I meant file src/tools/msvc/Solution.pm routine sub GetOpenSSLVersion has the follwing line: qq("$self->{options}->{openssl}\\bin\\openssl.exe" version 2>&1); in our distribution openssl.exe isn’t in the $self->{options}->{openssl}\bin\ location dm > On Jul 30, 2020, at 10:25 PM, Dmitry Markman wrote: > > Hi Michael, thanks a lot > > I figured it out, UNC works indeed > > however I found at least 2 problems (at least in our 3p harness) > > 1. in our configuration openssl executable went to lib (I don’t know why), so > I had to change line in Configure script > and point to openssl.exe file. Sure I probably can change our makefile to > create bin directory and put openssl.exe there > but it’s not my file, maybe later > > 2. if PostgreSQL is situated on network drive that mapped to say disk z:, > then build script failed: > > Z:\3p\derived\win64\PostgreSQL\source\src\tools\msvc>build > Detected hardware platform: Win32 > Generating win32ver.rc for src/backend > Generating win32ver.rc for src/timezone > Generating win32ver.rc for src/backend/snowball > Generating win32ver.rc for src/pl/plpgsql/src > Generating win32ver.rc for src/backend/replication/libpqwalreceiver > Generating win32ver.rc for src/backend/replication/pgoutput > Generating win32ver.rc for src/interfaces/ecpg/pgtypeslib > > . . . . . . . . . . . . > > Building the projects in this solution one at a time. To enable parallel > build, please add the "/m" switch. > Build started 7/30/2020 5:52:12 PM. > Project "Z:\3p\derived\win64\PostgreSQL\source\pgsql.sln" on node 1 (default > targets). > Z:\3p\derived\win64\PostgreSQL\source\pgsql.sln.metaproj : error MSB4126: The > specified solution configuration "Release > |x64" is invalid. Please specify a valid solution configuration using the > Configuration and Platform properties (e.g. M > SBuild.exe Solution.sln /p:Configuration=Debug /p:Platform="Any CPU") or > leave those properties blank to use the defaul > t solution configuration. [Z:\3p\derived\win64\PostgreSQL\source\pgsql.sln] > Done Building Project "Z:\3p\derived\win64\PostgreSQL\source\pgsql.sln" > (default targets) -- FAILED. > > > Build FAILED. > > "Z:\3p\derived\win64\PostgreSQL\source\pgsql.sln" (default target) (1) -> > (ValidateSolutionConfiguration target) -> > Z:\3p\derived\win64\PostgreSQL\source\pgsql.sln.metaproj : error MSB4126: > The specified solution configuration "Relea > se|x64" is invalid. Please specify a valid solution configuration using the > Configuration and Platform properties (e.g. > MSBuild.exe Solution.sln /p:Configuration=Debug /p:Platform="Any CPU") or > leave those properties blank to use the defa > ult solution configuration. [Z:\3p\derived\win64\PostgreSQL\source\pgsql.sln] > >0 Warning(s) >1 Error(s) > > > Time Elapsed 00:00:00.37 > > Z:\3p\derived\win64\PostgreSQL\source\src\tools\msvc> > > > the same works just fine if it’s on c: drive > > all PostgreSQL distribution is in the Z:\3p\derived\win64\PostgreSQL\source > folder > > > > network UNC path is mapped to Z: > > thanks again for your help > > dm > > > >> On Jul 30, 2020, at 9:59 PM, Michael Paquier wrote: >> >> On Thu, Jul 30, 2020 at 06:55:28AM -0400, Dmitry Markman wrote: >>> icu => , >>> >>> is it correct? >> >> Exactly. >> >>> if it’s correct does build support UNC paths? >> >> Yes, these work. One take to be aware of is that the quoting of the >> paths can be annoying. 
FWIW, I just use single quotes with normal >> slashes as that's a no-brainer, say: >> openssl => 'C:/OpenSSL-hoge/', >> >> If you are able to break the scripts with a given path, this would >> likely be a bug we should address. For example, we had lately >> complains about the build scripts breaking once you inserted spaces in >> the python or OpenSSL paths (see beb2516 and ad7595b). >> -- >> Michael >
Re: Index Skip Scan (new UniqueKeys)
On Mon, 13 Jul 2020 at 10:18, Floris Van Nee wrote: > One question about the unique keys - probably for Andy or David: I've looked > in the archives to find arguments for/against using Expr nodes or > EquivalenceClasses in the Unique Keys patch. However, I couldn't really find > a clear answer about why the current patch uses Expr rather than > EquivalenceClasses. At some point David mentioned "that probably Expr nodes > were needed rather than EquivalenceClasses", but it's not really clear to me > why. What were the thoughts behind this? I'm still not quite sure on this either way. I did think EquivalenceClasses were more suitable before I wrote the POC patch for unique keys. But after that, I had in mind that Exprs might be better. The reason I thought this was due to the fact that the DISTINCT clause list is a bunch of Exprs and if the UniqueKeys were EquivalenceClasses then checking to see if the DISTINCT can be skipped turned into something more complex that required looking through lists of ec_members rather than just checking if the uniquekey exprs were a subset of the DISTINCT clause. Thinking about it a bit harder, if we did use Exprs then it would mean it a case like the following wouldn't work for Andy's DISTINCT no-op stuff. CREATE TABLE xy (x int primary key, y int not null); SELECT DISTINCT y FROM xy WHERE x=y; whereas if we use EquivalenceClasses then we'll find that we have an EC with x,y in it and can skip the DISTINCT since we have a UniqueKey containing that EquivalenceClass. Also, looking at what Andy wrote to make a case like the following work in his populate_baserel_uniquekeys() function in the 0002 patch: CREATE TABLE ab (a int, b int, primary key(a,b)); SELECT DISTINCT a FROM ab WHERE b = 1; it's a bit uninspiring. Really what we want here when checking if we can skip doing the DISTINCT is a UniqueKey set using EquivalenceClasses as we can just insist that any unmatched UniqueKey items have an ec_is_const == true. However, that means we have to loop through the ec_members of the EquivalenceClasses in the uniquekeys during the DISTINCT check. That's particularly bad when you consider that in a partitioned table case there might be an ec_member for each child partition and there could be 1000s of child partitions and following those ec_members chains is going to be too slow. My current thoughts are that we should be using EquivalenceClasses but we should first add some infrastructure to make them perform better. My current thoughts are that we do something like what I mentioned in [1] or something more like what Andres mentions in [2]. After that, we could either make EquivalenceClass.ec_members a hash table or binary search tree. Or even perhaps just have a single hash table/BST for all EquivalenceClasses that allows very fast lookups from {Expr} -> {EquivalenceClass}. I think an Expr can only belong in a single non-merged EquivalenceClass. So when we do merging of EquivalenceClasses we could just repoint that data structure to point to the new EquivalenceClass. We'd never point to ones that have ec_merged != NULL. This would also allow us to fix the poor performance in regards to get_eclass_for_sort_expr() for partitioned tables. So, it seems the patch dependency chain for skip scans just got a bit longer :-( David [1] https://www.postgresql.org/message-id/CAApHDvrEXcadNYAAdq6RO0eKZUG6rRHXJGAbpzj8y432gCD9bA%40mail.gmail.com [2] https://www.postgresql.org/message-id/flat/20190920051857.2fhnvhvx4qdddviz%40alap3.anarazel.de#c3add3919c534591eae2179a6c82742c
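As a rough sketch of the {Expr} -> {EquivalenceClass} lookup structure mentioned above (names and layout are hypothetical; an actual implementation would presumably be built on simplehash.h or a similar facility):

/*
 * Hypothetical entry in a planner-global Expr -> EquivalenceClass map.
 * Because a given Expr can only belong to one non-merged EquivalenceClass,
 * merging two classes would simply repoint matching entries at the
 * surviving class (the one whose ec_merged is NULL).
 */
typedef struct ExprECLookupEntrySketch
{
    struct Expr *expr;                      /* hash key: the expression */
    struct EquivalenceClass *eclass;        /* current, non-merged class */
} ExprECLookupEntrySketch;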
Re: Ought to use heap_multi_insert() for pg_attribute/depend insertions?
On Thu, Jul 30, 2020 at 11:23:38PM +0200, Daniel Gustafsson wrote: > Looks good, thanks. Let's close the CF entry with this and open a new for the > pg_depend part when that's done. I have applied the patch, thanks. And actually, I have found just after that CREATE DATABASE gets severely impacted by the number of slots initialized when copying the template dependencies if there are few of them. The fix is as simple as delaying the initialization of the slots once we know they will be used. In my initial tests, I was using fsync = off, so I missed that. Sorry about that. The attached fixes this regression by delaying the slot initialization until we know that it will be used. This does not matter for pg_attribute as we know in advance the number of attributes to insert. -- Michael diff --git a/src/backend/catalog/pg_shdepend.c b/src/backend/catalog/pg_shdepend.c index ef2b87927c..3c63f52a24 100644 --- a/src/backend/catalog/pg_shdepend.c +++ b/src/backend/catalog/pg_shdepend.c @@ -809,15 +809,18 @@ copyTemplateDependencies(Oid templateDbId, Oid newDbId) int slotCount; CatalogIndexState indstate; TupleTableSlot **slot; - int nslots; + int nslots, max_slots; + bool slot_init = true; sdepRel = table_open(SharedDependRelationId, RowExclusiveLock); sdepDesc = RelationGetDescr(sdepRel); - nslots = MAX_PGSHDEPEND_INSERT_BYTES / sizeof(FormData_pg_shdepend); - slot = palloc(sizeof(TupleTableSlot *) * nslots); - for (int i = 0; i < nslots; i++) - slot[i] = MakeSingleTupleTableSlot(sdepDesc, ); + /* + * Allocate the slots to use, but delay initialization until we know + * that they will be used. + */ + max_slots = MAX_PGSHDEPEND_INSERT_BYTES / sizeof(FormData_pg_shdepend); + slot = palloc(sizeof(TupleTableSlot *) * max_slots); indstate = CatalogOpenIndexes(sdepRel); @@ -842,6 +845,9 @@ copyTemplateDependencies(Oid templateDbId, Oid newDbId) { Form_pg_shdepend shdep; + if (slot_init) + slot[slotCount] = MakeSingleTupleTableSlot(sdepDesc, ); + ExecClearTuple(slot[slotCount]); shdep = (Form_pg_shdepend) GETSTRUCT(tup); @@ -858,10 +864,11 @@ copyTemplateDependencies(Oid templateDbId, Oid newDbId) slotCount++; /* If slots are full, insert a batch of tuples */ - if (slotCount == nslots) + if (slotCount == max_slots) { CatalogTuplesMultiInsertWithInfo(sdepRel, slot, slotCount, indstate); slotCount = 0; + slot_init = false; } } @@ -874,6 +881,8 @@ copyTemplateDependencies(Oid templateDbId, Oid newDbId) CatalogCloseIndexes(indstate); table_close(sdepRel, RowExclusiveLock); + /* drop only the number of slots used */ + nslots = slot_init ? slotCount : max_slots; for (int i = 0; i < nslots; i++) ExecDropSingleTupleTableSlot(slot[i]); pfree(slot); signature.asc Description: PGP signature
Re: Cache relation sizes?
On Sat, Jun 20, 2020 at 10:32 AM Thomas Munro wrote: > Rebased. I'll add this to the open commitfest. I traced the recovery process while running pgbench -M prepared -c16 -j16 -t10000 (= 160,000 transactions). With the patch, the number of lseeks went from 1,080,661 (6.75 per pgbench transaction) to just 85. I went ahead and pushed this patch. There's still the matter of crazy numbers of lseeks in regular backends; looking at all processes while running the above test, I get 1,469,060 (9.18 per pgbench transaction) without -M prepared, and 193,722 with -M prepared (1.21 per pgbench transaction). Fixing that with this approach will require bullet-proof shared invalidation, but I think it's doable in later work.
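For illustration only, the basic shape of the optimization is to consult a cached size rather than issuing lseek(SEEK_END) on every size check; this simplified sketch uses made-up names and ignores the shared-memory and invalidation aspects a real implementation has to deal with:

#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/* Illustrative cache entry; a real implementation would share this across backends. */
typedef struct RelSizeCacheEntrySketch
{
    off_t   nbytes;     /* last known size of the relation file */
    bool    valid;      /* false until first lookup, or after invalidation */
} RelSizeCacheEntrySketch;

static off_t
sketch_get_file_size(int fd, RelSizeCacheEntrySketch *entry)
{
    if (!entry->valid)
    {
        entry->nbytes = lseek(fd, 0, SEEK_END);
        entry->valid = true;
    }
    return entry->nbytes;
}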
Re: windows config.pl question
Hi Michael, thanks a lot I figured it out, UNC works indeed however I found at least 2 problems (at least in our 3p harness) 1. in our configuration openssl executable went to lib (I don’t know why), so I had to change line in Configure script and point to openssl.exe file. Sure I probably can change our makefile to create bin directory and put openssl.exe there but it’s not my file, maybe later 2. if PostgreSQL is situated on network drive that mapped to say disk z:, then build script failed: Z:\3p\derived\win64\PostgreSQL\source\src\tools\msvc>build Detected hardware platform: Win32 Generating win32ver.rc for src/backend Generating win32ver.rc for src/timezone Generating win32ver.rc for src/backend/snowball Generating win32ver.rc for src/pl/plpgsql/src Generating win32ver.rc for src/backend/replication/libpqwalreceiver Generating win32ver.rc for src/backend/replication/pgoutput Generating win32ver.rc for src/interfaces/ecpg/pgtypeslib . . . . . . . . . . . . Building the projects in this solution one at a time. To enable parallel build, please add the "/m" switch. Build started 7/30/2020 5:52:12 PM. Project "Z:\3p\derived\win64\PostgreSQL\source\pgsql.sln" on node 1 (default targets). Z:\3p\derived\win64\PostgreSQL\source\pgsql.sln.metaproj : error MSB4126: The specified solution configuration "Release |x64" is invalid. Please specify a valid solution configuration using the Configuration and Platform properties (e.g. M SBuild.exe Solution.sln /p:Configuration=Debug /p:Platform="Any CPU") or leave those properties blank to use the defaul t solution configuration. [Z:\3p\derived\win64\PostgreSQL\source\pgsql.sln] Done Building Project "Z:\3p\derived\win64\PostgreSQL\source\pgsql.sln" (default targets) -- FAILED. Build FAILED. "Z:\3p\derived\win64\PostgreSQL\source\pgsql.sln" (default target) (1) -> (ValidateSolutionConfiguration target) -> Z:\3p\derived\win64\PostgreSQL\source\pgsql.sln.metaproj : error MSB4126: The specified solution configuration "Relea se|x64" is invalid. Please specify a valid solution configuration using the Configuration and Platform properties (e.g. MSBuild.exe Solution.sln /p:Configuration=Debug /p:Platform="Any CPU") or leave those properties blank to use the defa ult solution configuration. [Z:\3p\derived\win64\PostgreSQL\source\pgsql.sln] 0 Warning(s) 1 Error(s) Time Elapsed 00:00:00.37 Z:\3p\derived\win64\PostgreSQL\source\src\tools\msvc> the same works just fine if it’s on c: drive all PostgreSQL distribution is in the Z:\3p\derived\win64\PostgreSQL\source folder network UNC path is mapped to Z: thanks again for your help dm > On Jul 30, 2020, at 9:59 PM, Michael Paquier wrote: > > On Thu, Jul 30, 2020 at 06:55:28AM -0400, Dmitry Markman wrote: >> icu => , >> >> is it correct? > > Exactly. > >> if it’s correct does build support UNC paths? > > Yes, these work. One take to be aware of is that the quoting of the > paths can be annoying. FWIW, I just use single quotes with normal > slashes as that's a no-brainer, say: > openssl => 'C:/OpenSSL-hoge/', > > If you are able to break the scripts with a given path, this would > likely be a bug we should address. For example, we had lately > complains about the build scripts breaking once you inserted spaces in > the python or OpenSSL paths (see beb2516 and ad7595b). > -- > Michael
Re: Use of "long" in incremental sort code
On Fri, 3 Jul 2020 at 07:47, James Coleman wrote: > Patch using int64 attached. I added this to the open items list for PG13. David
Re: windows config.pl question
On Thu, Jul 30, 2020 at 06:55:28AM -0400, Dmitry Markman wrote: > icu => , > > is it correct? Exactly. > if it’s correct does build support UNC paths? Yes, these work. One take to be aware of is that the quoting of the paths can be annoying. FWIW, I just use single quotes with normal slashes as that's a no-brainer, say: openssl => 'C:/OpenSSL-hoge/', If you are able to break the scripts with a given path, this would likely be a bug we should address. For example, we had lately complains about the build scripts breaking once you inserted spaces in the python or OpenSSL paths (see beb2516 and ad7595b). -- Michael
Re: HashAgg's batching counter starts at 0, but Hash's starts at 1. (now: incremental sort)
On Thu, Jul 30, 2020 at 6:40 PM Justin Pryzby wrote: > Feel free to close it out. I'm satisfied that we've had a discussion about > it. Closed it out. -- Peter Geoghegan
Re: HashAgg's batching counter starts at 0, but Hash's starts at 1. (now: incremental sort)
On Thu, Jul 30, 2020 at 6:39 PM James Coleman wrote: > I very much do not like this approach, and I think it's actually > fundamentally wrong, at least for the memory check. Quicksort is not the only > option that uses memory. For now, there's only one option that spills to disk > (external merge sort), but there's no reason it has to remain that way. I wouldn't be surprised if it was possible to get SORT_TYPE_EXTERNAL_SORT even today (though I'm not sure if that's truly possible). That will happen for a regular sort node if we require randomAccess to the sort, and it happens to spill -- we can randomly access the final tape, but cannot do a final on-the-fly merge. Say for a merge join. -- Peter Geoghegan
Re: fixing pg_ctl with relative paths
At 2014-01-28 21:11:54, "Bruce Momjian" wrote: >On Mon, Jul 1, 2013 at 08:10:14PM -0400, Josh Kupershmidt wrote: >> On Thu, Jun 27, 2013 at 11:47 AM, Fujii Masao wrote: >> > On Thu, Jun 27, 2013 at 10:36 AM, Josh Kupershmidt >> > wrote: >> >> On Wed, Jun 26, 2013 at 12:22 PM, Fujii Masao >> >> wrote: >> >>> Though this is a corner case, the patch doesn't seem to handle properly >> >>> the case >> >>> where "-D" appears as other option value, e.g., -k option value, in >> >>> postmaster.opts >> >>> file. >> >> >> >> Could I see a command-line example of what you mean? >> > >> > postmaster -k "-D", for example. Of course, it's really a corner case :) >> >> Oh, I see. I was able to trip up strip_datadirs() with something like >> >> $ PGDATA="/my/data/" postmaster -k "-D" -S 100 & >> $ pg_ctl -D /my/data/ restart >> >> that example causes pg_ctl to fail to start the server after stopping >> it, although perhaps you could even trick the server into starting >> with the wrong options. Of course, similar problems exists today in >> other cases, such as with the relative paths issue this patch is >> trying to address, or a datadir containing embedded quotes. >> >> I am eager to see the relative paths issue fixed, but maybe we need to >> bite the bullet and sort out the escaping of command-line options in >> the rest of pg_ctl first, so that a DataDir like "/tmp/here's a \" >> quote" can consistently be used by pg_ctl {start|stop|restart} before >> we can fix this wart. > >Where are we on this patch? > >-- > Bruce Momjian http://momjian.us > EnterpriseDB http://enterprisedb.com > > + Everyone has their own god. + > > >-- >Sent via pgsql-hackers mailing list (pgsql-hack...@postgresql.org) >To make changes to your subscription: >http://www.postgresql.org/mailpref/pgsql-hackers > Hi, I encountered the same problem. I want to know is there a final conclusion? thank you very much!
Re: HashAgg's batching counter starts at 0, but Hash's starts at 1. (now: incremental sort)
On Thu, Jul 30, 2020 at 06:33:32PM -0700, Peter Geoghegan wrote: > On Thu, Jul 30, 2020 at 5:22 PM Justin Pryzby wrote: > > Because filtering out zero values is exactly what's intended to be avoided > > for > > nontext output. > > > > I think checking whether the method was used should result in the same > > output, > > without the literal check for zero value (which itself sets a bad example). > > It seems fine to me as-is. What about SORT_TYPE_TOP_N_HEAPSORT? Or any > other sort methods we add in the future? Feel free to close it out. I'm satisfied that we've had a discussion about it. -- Justin
Re: HashAgg's batching counter starts at 0, but Hash's starts at 1. (now: incremental sort)
On Thu, Jul 30, 2020 at 8:22 PM Justin Pryzby wrote: > On Wed, Jul 29, 2020 at 09:18:44PM -0700, Peter Geoghegan wrote: > > On Wed, Jul 29, 2020 at 9:05 PM Justin Pryzby > wrote: > > > So my 2ndary suggestion is to conditionalize based on the method > rather than > > > value of space used. > > > > What's the advantage of doing it that way? > > Because filtering out zero values is exactly what's intended to be avoided > for > nontext output. > > I think checking whether the method was used should result in the same > output, > without the literal check for zero value (which itself sets a bad example). > > --- a/src/backend/commands/explain.c > +++ b/src/backend/commands/explain.c > @@ -2824,13 +2824,13 @@ > show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo, > appendStringInfo(, "%s Groups", groupLabel); > ExplainOpenGroup("Incremental Sort Groups", > groupName.data, true, es); > ExplainPropertyInteger("Group Count", NULL, > groupInfo->groupCount, es); > > ExplainPropertyList("Sort Methods Used", methodNames, es); > > - if (groupInfo->maxMemorySpaceUsed > 0) > + if (groupInfo->sortMethods & SORT_TYPE_QUICKSORT) > { > longavgSpace = > groupInfo->totalMemorySpaceUsed / groupInfo->groupCount; > const char *spaceTypeName; > StringInfoData memoryName; > > spaceTypeName = > tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY); > @@ -2841,13 +2841,13 @@ > show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo, > ExplainPropertyInteger("Average Sort Space Used", > "kB", avgSpace, es); > ExplainPropertyInteger("Peak Sort Space Used", > "kB", > > groupInfo->maxMemorySpaceUsed, es); > > ExplainCloseGroup("Sort Spaces", memoryName.data, > true, es); > } > - if (groupInfo->maxDiskSpaceUsed > 0) > + if (groupInfo->sortMethods & SORT_TYPE_EXTERNAL_SORT) > { > longavgSpace = > groupInfo->totalDiskSpaceUsed / groupInfo->groupCount; > const char *spaceTypeName; > StringInfoData diskName; > > spaceTypeName = > tuplesort_space_type_name(SORT_SPACE_TYPE_DISK); > I very much do not like this approach, and I think it's actually fundamentally wrong, at least for the memory check. Quicksort is not the only option that uses memory. For now, there's only one option that spills to disk (external merge sort), but there's no reason it has to remain that way. And in the future we might accurately report memory consumed even when we've eventually spilled to disk also, so memory used would be relevant potentially even if no in-memory sort was ever performed. So I'm pretty confident checking the space used is the correct way to do this. James
Re: new heapcheck contrib module
> On Jul 30, 2020, at 5:53 PM, Robert Haas wrote: > > On Thu, Jul 30, 2020 at 6:10 PM Mark Dilger > wrote: >> No, that wasn't my concern. I was thinking about CLOG entries disappearing >> during the scan as a consequence of concurrent vacuums, and the effect that >> would have on the validity of the cached [relfrozenxid..next_valid_xid] >> range. In the absence of corruption, I don't immediately see how this would >> cause any problems. But for a corrupt table, I'm less certain how it would >> play out. > > Oh, hmm. I wasn't thinking about that problem. I think the only way > this can happen is if we read a page and then, before we try to look > up the CID, vacuum zooms past, finishes the whole table, and truncates > clog. But if that's possible, then it seems like it would be an issue > for SELECT as well, and it apparently isn't, or we would've done > something about it by now. I think the reason it's not possible is > because of the locking rules described in > src/backend/storage/buffer/README, which require that you hold a > buffer lock until you've determined that the tuple is visible. Since > you hold a share lock on the buffer, a VACUUM that hasn't already > processed that freeze the tuples in that buffer; it would need an > exclusive lock on the buffer to do that. Therefore it can't finish and > truncate clog either. > > Now, you raise the question of whether this is still true if the table > is corrupt, but I don't really see why that makes any difference. > VACUUM is supposed to freeze each page it encounters, to the extent > that such freezing is necessary, and with Andres's changes, it's > supposed to ERROR out if things are messed up. We can postulate a bug > in that logic, but inserting a VACUUM-blocking lock into this tool to > guard against a hypothetical vacuum bug seems strange to me. Why would > the right solution not be to fix such a bug if and when we find that > there is one? Since I can't think of a plausible concrete example of corruption which would elicit the problem I was worrying about, I'll withdraw the argument. But that leaves me wondering about a comment that Andres made upthread: > On Apr 20, 2020, at 12:42 PM, Andres Freund wrote: > I don't think random interspersed uses of CLogTruncationLock are a good > idea. If you move to only checking visibility after tuple fits into > [relfrozenxid, nextXid), then you don't need to take any locks here, as > long as a lock against vacuum is taken (which I think this should do > anyway). — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: HashAgg's batching counter starts at 0, but Hash's starts at 1. (now: incremental sort)
On Thu, Jul 30, 2020 at 5:22 PM Justin Pryzby wrote: > Because filtering out zero values is exactly what's intended to be avoided for > nontext output. > > I think checking whether the method was used should result in the same output, > without the literal check for zero value (which itself sets a bad example). It seems fine to me as-is. What about SORT_TYPE_TOP_N_HEAPSORT? Or any other sort methods we add in the future? The way that we flatten maxDiskSpaceUsed and maxMemorySpaceUsed into "space used" on output might be kind of questionable, but it's something that we have to live with for the foreseeable future. I don't think that this is a bad example -- we don't output maxDiskSpaceUsed or maxMemorySpaceUsed at the conceptual level. -- Peter Geoghegan
Re: Missing CFI in hlCover()?
Stephen Frost writes: > * Tom Lane (t...@sss.pgh.pa.us) wrote: >> Here's a proposed patch along that line. > I've back-patched this to 11 (which was just a bit of fuzz) and tested > it out with a couple of different queries that were causing issues > previously on the archive server, and they finish in a much more > reasonable time and react faster to cancel requests/signals. Yeah, I'd tried this locally using the data from the one test case you showed me, and it seemed to fix that. > So, looks good to me, and would certainly be nice to get this into the > next set of releases, so the archive server doesn't get stuck anymore. I'll push this tomorrow if nobody has objected to it. BTW, I had noticed last night that hlFirstIndex is being unreasonably stupid. Many of the "words" have null item pointers and hence can't possibly match any query item (I think because we have "words" for inter-word spaces/punctuation as well as the actual words). Checking that, as in the attached v2 patch, makes things a bit faster yet. regards, tom lane diff --git a/src/backend/tsearch/wparser_def.c b/src/backend/tsearch/wparser_def.c index 76b6f9aef0..92517e4c17 100644 --- a/src/backend/tsearch/wparser_def.c +++ b/src/backend/tsearch/wparser_def.c @@ -2011,13 +2011,18 @@ hlFirstIndex(HeadlineParsedText *prs, TSQuery query, int pos) for (i = pos; i < prs->curwords; i++) { /* ... scan the query to see if this word matches any operand */ - QueryItem *item = GETQUERY(query); + QueryOperand *worditem = prs->words[i].item; + QueryItem *item; int j; + if (worditem == NULL) + continue; /* certainly not a match */ + + item = GETQUERY(query); for (j = 0; j < query->size; j++) { if (item->type == QI_VAL && -prs->words[i].item == >qoperand) +worditem == >qoperand) return i; item++; } @@ -2028,8 +2033,14 @@ hlFirstIndex(HeadlineParsedText *prs, TSQuery query, int pos) /* * hlCover: try to find a substring of prs' word list that satisfies query * - * At entry, *p must be the first word index to consider (initialize this to - * zero, or to the next index after a previous successful search). + * At entry, *p must be the first word index to consider (initialize this + * to zero, or to the next index after a previous successful search). + * We will consider all substrings starting at or after that word, and + * containing no more than max_cover words. (We need a length limit to + * keep this from taking O(N^2) time for a long document with many query + * words but few complete matches. Actually, since checkcondition_HL is + * roughly O(N) in the length of the substring being checked, it's even + * worse than that.) * * On success, sets *p to first word index and *q to last word index of the * cover substring, and returns true. @@ -2038,7 +2049,8 @@ hlFirstIndex(HeadlineParsedText *prs, TSQuery query, int pos) * words used in the query. 
*/ static bool -hlCover(HeadlineParsedText *prs, TSQuery query, int *p, int *q) +hlCover(HeadlineParsedText *prs, TSQuery query, int max_cover, + int *p, int *q) { int pmin, pmax, @@ -2084,7 +2096,7 @@ hlCover(HeadlineParsedText *prs, TSQuery query, int *p, int *q) nextpmin = nextpmax; pmax = nextpmax; } - while (pmax >= 0); + while (pmax >= 0 && pmax - pmin < max_cover); /* No luck here, so try next feasible startpoint */ pmin = nextpmin; } @@ -2186,7 +2198,7 @@ get_next_fragment(HeadlineParsedText *prs, int *startpos, int *endpos, static void mark_hl_fragments(HeadlineParsedText *prs, TSQuery query, bool highlightall, int shortword, int min_words, - int max_words, int max_fragments) + int max_words, int max_fragments, int max_cover) { int32 poslen, curlen, @@ -2213,7 +2225,7 @@ mark_hl_fragments(HeadlineParsedText *prs, TSQuery query, bool highlightall, covers = palloc(maxcovers * sizeof(CoverPos)); /* get all covers */ - while (hlCover(prs, query, , )) + while (hlCover(prs, query, max_cover, , )) { startpos = p; endpos = q; @@ -2368,7 +2380,7 @@ mark_hl_fragments(HeadlineParsedText *prs, TSQuery query, bool highlightall, */ static void mark_hl_words(HeadlineParsedText *prs, TSQuery query, bool highlightall, - int shortword, int min_words, int max_words) + int shortword, int min_words, int max_words, int max_cover) { int p = 0, q = 0; @@ -2386,7 +2398,7 @@ mark_hl_words(HeadlineParsedText *prs, TSQuery query, bool highlightall, if (!highlightall) { /* examine all covers, select a headline using the best one */ - while (hlCover(prs, query, , )) + while (hlCover(prs, query, max_cover, , )) { /* * Count words (curlen) and interesting words (poslen) within @@ -2542,6 +2554,7 @@ prsd_headline(PG_FUNCTION_ARGS) int shortword = 3; int max_fragments = 0; bool highlightall = false; + int max_cover; ListCell *l; /* Extract configuration option values */ @@ -2581,6 +2594,15 @@
Re: new heapcheck contrib module
On Thu, Jul 30, 2020 at 6:10 PM Mark Dilger wrote: > No, that wasn't my concern. I was thinking about CLOG entries disappearing > during the scan as a consequence of concurrent vacuums, and the effect that > would have on the validity of the cached [relfrozenxid..next_valid_xid] > range. In the absence of corruption, I don't immediately see how this would > cause any problems. But for a corrupt table, I'm less certain how it would > play out. Oh, hmm. I wasn't thinking about that problem. I think the only way this can happen is if we read a page and then, before we try to look up the CID, vacuum zooms past, finishes the whole table, and truncates clog. But if that's possible, then it seems like it would be an issue for SELECT as well, and it apparently isn't, or we would've done something about it by now. I think the reason it's not possible is because of the locking rules described in src/backend/storage/buffer/README, which require that you hold a buffer lock until you've determined that the tuple is visible. Since you hold a share lock on the buffer, a VACUUM that hasn't already processed that freeze the tuples in that buffer; it would need an exclusive lock on the buffer to do that. Therefore it can't finish and truncate clog either. Now, you raise the question of whether this is still true if the table is corrupt, but I don't really see why that makes any difference. VACUUM is supposed to freeze each page it encounters, to the extent that such freezing is necessary, and with Andres's changes, it's supposed to ERROR out if things are messed up. We can postulate a bug in that logic, but inserting a VACUUM-blocking lock into this tool to guard against a hypothetical vacuum bug seems strange to me. Why would the right solution not be to fix such a bug if and when we find that there is one? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: proposal: type info support functions for functions that use "any" type
Daniel Gustafsson writes: > This patch has been bumped in CFs for the past year, with the thread stalled > and the last review comment being in support of rejection. Tom, do you still > feel it should be rejected in light of Pavel's latest posts? I have seen no convincing response to the concerns I raised in my last message in the thread [1], to wit that 1. I think the "flexibility" of letting a support function resolve the output type in some unspecified way is mostly illusory, because if it doesn't do it in a way that's morally equivalent to polymorphism, it's doing it wrong. Also, I'm not that excited about improving polymorphism in a way that is only accessible with specialized C code. The example of Oracle-like DECODE() could be handled about as well if we had a second set of anycompatible-style polymorphic types, that is something like decode(expr anycompatible, search1 anycompatible, result1 anycompatible2, search2 anycompatible, result2 anycompatible2, search3 anycompatible, result3 anycompatible2, ... ) returns anycompatible2; Admittedly, you'd need to write a separate declaration for each number of arguments you wanted to support, but they could all point at the same C function --- which'd be a lot simpler than in this patch, since it would not need to deal with any type coercions, only comparisons. I also argue that to the extent that the support function is reinventing polymorphism internally, it's going to be inferior to the parser's version. As an example, with Pavel's sample implementation, if a particular query needs a coercion from type X to type Y, that's nowhere visible in the parse tree. So you could drop the cast without being told that view so-and-so depends on it, leading to a run-time failure next time you try to use that view. Doing the same thing with normal polymorphism, the X-to-Y cast function would be used in the parse tree and so we'd know about the dependency. 2. I have no faith that the proposed implementation is correct or complete. As I complained earlier, a lot of places have special-case handling for polymorphism, and it seems like every one of them would need to know about this feature too. That is, to the extent that this patch's footprint is smaller than commit 24e2885ee -- which it is, by a lot -- I think those are bugs of omission. It will not work to have a situation where some parts of the backend resolve a function's result type as one thing and others resolve it as something else thanks to failure to account for this new feature. As a concrete example, it looks like we'd fail pretty hard if someone tried to use this facility in an aggregate support function. So my opinion is still what it was in January. regards, tom lane [1] https://www.postgresql.org/message-id/31501.1579036195%40sss.pgh.pa.us
Re: HashAgg's batching counter starts at 0, but Hash's starts at 1. (now: incremental sort)
On Wed, Jul 29, 2020 at 09:18:44PM -0700, Peter Geoghegan wrote: > On Wed, Jul 29, 2020 at 9:05 PM Justin Pryzby wrote: > > So my 2ndary suggestion is to conditionalize based on the method rather than > > value of space used. > > What's the advantage of doing it that way? Because filtering out zero values is exactly what's intended to be avoided for nontext output. I think checking whether the method was used should result in the same output, without the literal check for zero value (which itself sets a bad example). --- a/src/backend/commands/explain.c +++ b/src/backend/commands/explain.c @@ -2824,13 +2824,13 @@ show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo, appendStringInfo(, "%s Groups", groupLabel); ExplainOpenGroup("Incremental Sort Groups", groupName.data, true, es); ExplainPropertyInteger("Group Count", NULL, groupInfo->groupCount, es); ExplainPropertyList("Sort Methods Used", methodNames, es); - if (groupInfo->maxMemorySpaceUsed > 0) + if (groupInfo->sortMethods & SORT_TYPE_QUICKSORT) { longavgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount; const char *spaceTypeName; StringInfoData memoryName; spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY); @@ -2841,13 +2841,13 @@ show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo, ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es); ExplainPropertyInteger("Peak Sort Space Used", "kB", groupInfo->maxMemorySpaceUsed, es); ExplainCloseGroup("Sort Spaces", memoryName.data, true, es); } - if (groupInfo->maxDiskSpaceUsed > 0) + if (groupInfo->sortMethods & SORT_TYPE_EXTERNAL_SORT) { longavgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount; const char *spaceTypeName; StringInfoData diskName; spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
Re: Missing CFI in hlCover()?
Greetings, * Tom Lane (t...@sss.pgh.pa.us) wrote: > Stephen Frost writes: > > * Tom Lane (t...@sss.pgh.pa.us) wrote: > >> Stephen Frost writes: > >>> * Tom Lane (t...@sss.pgh.pa.us) wrote: > We could hard-code a rule like that, or we could introduce a new > explicit parameter for the maximum cover length. The latter would be > more flexible, but we need something back-patchable and I'm concerned > about the compatibility hazards of adding a new parameter in minor > releases. So on the whole I propose hard-wiring a multiplier of, > say, 10 for both these cases. > > >>> That sounds alright to me, though I do think we should probably still > >>> toss a CFI (or two) in this path somewhere as we don't know how long > >>> some of these functions might take... > > >> Yeah, of course. I'm still leaning to doing that in TS_execute_recurse. > > > Works for me. > > Here's a proposed patch along that line. I've back-patched this to 11 (which was just a bit of fuzz) and tested it out with a couple of different queries that were causing issues previously on the archive server, and they finish in a much more reasonable time and react faster to cancel requests/signals. If you'd like to play with it more, the PG11 installed on ark2 now has this patch built into it. So, looks good to me, and would certainly be nice to get this into the next set of releases, so the archive server doesn't get stuck anymore. :) Thanks! Stephen signature.asc Description: PGP signature
logtape.c stats don't account for unused "prefetched" block numbers
Commit 896ddf9b added prefetching to logtape.c to avoid excessive fragmentation in the context of hash aggs that spill and have many batches/tapes. Apparently the preallocation doesn't actually perform any filesystem operations, so the new mechanism should be zero overhead when "preallocated" blocks aren't actually used after all (right?). However, I notice that this breaks the statistics shown by things like trace_sort, and even EXPLAIN ANALYZE. LogicalTapeSetBlocks() didn't get the memo about preallocation. The easiest way to spot the issue is to compare trace_sort output on v13 with output for the same case in v12 -- the "%u disk blocks used" statistics are consistently higher on v13, especially for cases with many tapes. I spotted the bug when I noticed that v13 external sorts reportedly use more or less disk space when fewer or more tapes are involved (again, this came from trace_sort). That doesn't make sense -- the total amount of space used for external sort temp files should practically be fixed, aside from insignificant rounding effects. Reducing the amount of memory by orders of magnitude in a Postgres 12 tuplesort will hardly affect the "%u disk blocks used" trace_sort output at all. That's what we need to get back to. This bug probably won't be difficult to fix. Actually, we have had similar problems in the past. The fix could be as simple as teaching LogicalTapeSetBlocks() about this new variety of "sparse allocation". Although maybe the preallocation stuff should somehow be rolled into the much older nHoleBlocks stuff. Unsure. -- Peter Geoghegan
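As a sketch of what such a fix could look like (the struct and field names below are assumptions for illustration, not the actual logtape.c members), the reported total would be based on blocks actually written rather than block numbers merely reserved by preallocation:

/*
 * Sketch only: illustrative stand-in for the bookkeeping in logtape.c.
 * Reporting nBlocksWritten - nHoleBlocks keeps "disk blocks used" in line
 * with what the sort really wrote, so unused preallocated block numbers
 * no longer inflate trace_sort / EXPLAIN ANALYZE statistics.
 */
typedef struct LogicalTapeSetStatsSketch
{
    long    nBlocksAllocated;   /* highest block number handed out, including
                                 * preallocated-but-unused blocks */
    long    nBlocksWritten;     /* blocks actually written out */
    long    nHoleBlocks;        /* blocks skipped over and never written */
} LogicalTapeSetStatsSketch;

static long
sketch_LogicalTapeSetBlocks(const LogicalTapeSetStatsSketch *lts)
{
    return lts->nBlocksWritten - lts->nHoleBlocks;
}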
Re: pg_dump --where option
> On 10 Jul 2020, at 02:03, Cary Huang wrote: > > The following review has been posted through the commitfest application: > make installcheck-world: tested, failed > Implements feature: tested, failed > Spec compliant: tested, failed > Documentation:tested, failed > > Hi > > I had a look at the patch and it cleanly applies to postgres master branch. I > tried to do a quick test on the new "where clause" functionality and for the > most part it does the job as described and I'm sure some people will find > this feature useful to their database dump needs. However I tried the feature > with a case where I have a subquery in the where clause, but it seems to be > failing to dump the data. I ran the pg_dump like: > > $ pg_dump -d cary --where="test1:a3 = ( select max(aa1) from test2 )" > > testdump2 > $ pg_dump: error: processing of table "public.test1" failed > > both test1 and test2 exist in the database and the same subquery works under > psql. > > I also notice that the regression tests for pg_dump is failing due to the > patch, I think it is worth looking into the failure messages and also add > some test cases on the new "where" clause to ensure that it can cover as many > use cases as possible. As this is being reviewed, but time is running out in this CF, I'm moving this to the next CF. The entry will be moved to Waiting for Author based on the above review. cheers ./daniel
Re: ModifyTable overheads in generic plans
> On 1 Jul 2020, at 15:38, Amit Langote wrote: > Another thing I could do is decouple the patches to discuss here from > the patches of the other thread, which should be possible and might be > good to avoid back and forth between the two threads. It sounds like it would make it easier for reviewers, so if it's possible with a reasonable effort it might be worth it. I've moved this entry to the next CF for now. cheers ./daniel
Re: Online verification of checksums
> On 5 Jul 2020, at 13:52, Daniel Gustafsson wrote: > >> On 6 Apr 2020, at 23:15, Michael Banck wrote: > >> Probably we need to take a step back; > > This patch has been Waiting on Author since the last commitfest (and no longer > applies as well), and by the sounds of the thread there are some open issues > with it. Should it be Returned with Feedback to be re-opened with a fresh > take > on it? Marked as Returned with Feedback, please open a new entry in case there is a renewed interest with a new patch. cheers ./daniel
Re: Online checksums patch - once again
> On 29 Jul 2020, at 19:58, Robert Haas wrote: > Here are a bunch of comments based on a partial read-through of this > patch. Thanks a lot Robert and Justin for the reviews! With the commitfest wrap-up imminent and being on vacation I will have a hard time responding properly before the end of CF so I'm moving it to the next CF for now. cheers ./daniel
Re: proposal: type info support functions for functions that use "any" type
> On 26 Jan 2020, at 16:33, Pavel Stehule wrote: > I reread all related mails and I think it should be safe - or there is > the same risk as with using any C extensions for functions or hooks. This patch has been bumped in CFs for the past year, with the thread stalled and the last review comment being in support of rejection. Tom, do you still feel it should be rejected in light of Pavel's latest posts? cheers ./daniel
Re: new heapcheck contrib module
> On Jul 30, 2020, at 1:47 PM, Andres Freund wrote: > > Hi, > > On 2020-07-30 13:18:01 -0700, Mark Dilger wrote: >> Per tuple, tuple_is_visible() potentially checks whether the xmin or xmax >> committed via TransactionIdDidCommit. I am worried about concurrent >> truncation of clog entries causing I/O errors on SLRU lookup when performing >> that check. The three strategies I had for dealing with that were taking >> the XactTruncationLock (formerly known as CLogTruncationLock, for those >> reading this thread from the beginning), locking out vacuum, and the idea >> upthread from Andres about setting PROC_IN_VACUUM and such. Maybe I'm being >> dense and don't need to worry about this. But I haven't convinced myself of >> that, yet. > > I think it's not at all ok to look in the procarray or clog for xids > that are older than what you're announcing you may read. IOW I don't > think it's OK to just ignore the problem, or try to work around it by > holding XactTruncationLock. The current state of the patch is that concurrent vacuums are kept out of the table being checked by means of taking a ShareUpdateExclusive lock on the table being checked. In response to Robert's review, I was contemplating whether that was necessary, but you raise the interesting question of whether it is even sufficient. The logic in verify_heapam is currently relying on the ShareUpdateExclusive lock to prevent any of the xids in the range relfrozenxid..nextFullXid from being invalid arguments to TransactionIdDidCommit. Ignoring whether that is a good choice vis-a-vis performance, is that even a valid strategy? It sounds like you are saying it is not. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: new heapcheck contrib module
> On Jul 30, 2020, at 2:00 PM, Robert Haas wrote: > > On Thu, Jul 30, 2020 at 4:18 PM Mark Dilger > wrote: >>> Maybe I'm just being dense here -- exactly what problem are you worried >>> about? >> >> Per tuple, tuple_is_visible() potentially checks whether the xmin or xmax >> committed via TransactionIdDidCommit. I am worried about concurrent >> truncation of clog entries causing I/O errors on SLRU lookup when performing >> that check. The three strategies I had for dealing with that were taking >> the XactTruncationLock (formerly known as CLogTruncationLock, for those >> reading this thread from the beginning), locking out vacuum, and the idea >> upthread from Andres about setting PROC_IN_VACUUM and such. Maybe I'm being >> dense and don't need to worry about this. But I haven't convinced myself of >> that, yet. > > I don't get it. If you've already checked that the XIDs are >= > relfrozenxid and <= ReadNewFullTransactionId(), then this shouldn't be > a problem. It could be, if CLOG is hosed, which is possible, because > if the table is corrupted, why shouldn't CLOG also be corrupted? But > I'm not sure that's what your concern is here. No, that wasn't my concern. I was thinking about CLOG entries disappearing during the scan as a consequence of concurrent vacuums, and the effect that would have on the validity of the cached [relfrozenxid..next_valid_xid] range. In the absence of corruption, I don't immediately see how this would cause any problems. But for a corrupt table, I'm less certain how it would play out. The kind of scenario I'm worried about may not be possible in practice. I think it would depend on how vacuum behaves when scanning a corrupt table that is corrupt in some way that vacuum doesn't notice, and whether vacuum could finish scanning the table with the false belief that it has frozen all tuples with xids less than some cutoff. I thought it would be safer if that kind of thing were not happening during verify_heapam's scan of the table. Even if a careful analysis proved it was not an issue with the current coding of vacuum, I don't think there is any coding convention requiring future versions of vacuum to be hardened against corruption, so I don't see how I can rely on vacuum not causing such problems. I don't think this is necessarily a too-rare-to-care-about type concern, either. If corruption across multiple tables prevents autovacuum from succeeding, and the DBA doesn't get involved in scanning tables for corruption until the lack of successful vacuums impacts the production system, I imagine you could end up with vacuums repeatedly happening (or trying to happen) around the time the DBA is trying to fix tables, or perhaps drop them, or whatever, using verify_heapam for guidance on which tables are corrupted. Anyway, that's what I was thinking. I was imagining that calling TransactionIdDidCommit might keep crashing the backend while the DBA is trying to find and fix corruption, and that could get really annoying. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Batch insert in CTAS/MatView code
In an off-list discussion with Paul, we decided to withdraw this patch for now and instead create a new entry when there is a re-worked patch. This has now been done in the CF app. cheers ./daniel
Re: Identifying user-created objects
> On 1 Jul 2020, at 14:15, Daniel Gustafsson wrote: > >> On 4 Mar 2020, at 12:06, Masahiko Sawada >> wrote: > >> Attached updated patch that incorporated comments from Amit and Vignesh. > > This patch fails to compile due to an Oid collision in pg_proc.dat. Please > submit a new version with an Oid from the recommended range for new patches: > 8000-. See the below documentation page for more information on this. > > https://www.postgresql.org/docs/devel/system-catalog-initial-data.html > > I'm marking the entry Waiting on Author in the meantime. As no new patch has been presented, and the thread contains doubts over the proposed functionality, I'm marking this returned with feedback. cheers ./daniel
Re: psql - improve test coverage from 41% to 88%
> On 5 Jul 2020, at 13:38, Daniel Gustafsson wrote: > >> On 24 Mar 2020, at 15:47, David Steele wrote: > >> This patch no longer applies: http://cfbot.cputube.org/patch_27_2262.log >> >> CF entry has been updated to Waiting on Author. > > This patch hasn't been updated and still doesn't apply, do you intend to > rebase > it during this commitfest or should we move it to returned with feedback? It > can always be re-opened at a later date. As the thread has stalled, I've marked this Returned with Feedback. cheers ./daniel
Re: [Patch] Optimize dropping of relation buffers using dlist
The following review has been posted through the commitfest application: make installcheck-world: tested, passed Implements feature: tested, passed Spec compliant: not tested Documentation:not tested I have tested this patch at various workloads and hardware (including Power2 server with 384 virtual cores) and didn't find performance regression. The new status of this patch is: Ready for Committer
Re: Reducing WaitEventSet syscall churn
On Thu, Jul 30, 2020 at 5:50 PM Thomas Munro wrote: > I pushed those three patches, but will wait for more discussion on the rest. And here's a rebase, to make cfbot happy. From a760053ac6fea1b9a829e9a170902a555863ce66 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Mon, 24 Feb 2020 23:51:09 +1300 Subject: [PATCH v6 1/4] Use WL_EXIT_ON_PM_DEATH in FeBeWaitSet. Previously, we'd give either a FATAL message or a silent exit() when we detected postmaster death, depending on which wait point we were at. Make the exit more uniform, by using the standard exit facility. This will allow a later patch to use FeBeWaitSet in a new location without having to add more explicit handling for postmaster death, in line with our policy of reducing such clutter. Reviewed-by: Kyotaro Horiguchi Discussion: https://postgr.es/m/CA%2BhUKGJAC4Oqao%3DqforhNey20J8CiG2R%3DoBPqvfR0vOJrFysGw%40mail.gmail.com --- src/backend/libpq/be-secure.c | 28 src/backend/libpq/pqcomm.c| 2 +- 2 files changed, 1 insertion(+), 29 deletions(-) diff --git a/src/backend/libpq/be-secure.c b/src/backend/libpq/be-secure.c index 2ae507a902..aec0926d93 100644 --- a/src/backend/libpq/be-secure.c +++ b/src/backend/libpq/be-secure.c @@ -184,28 +184,6 @@ retry: WaitEventSetWait(FeBeWaitSet, -1 /* no timeout */ , , 1, WAIT_EVENT_CLIENT_READ); - /* - * If the postmaster has died, it's not safe to continue running, - * because it is the postmaster's job to kill us if some other backend - * exits uncleanly. Moreover, we won't run very well in this state; - * helper processes like walwriter and the bgwriter will exit, so - * performance may be poor. Finally, if we don't exit, pg_ctl will be - * unable to restart the postmaster without manual intervention, so no - * new connections can be accepted. Exiting clears the deck for a - * postmaster restart. - * - * (Note that we only make this check when we would otherwise sleep on - * our latch. We might still continue running for a while if the - * postmaster is killed in mid-query, or even through multiple queries - * if we never have to wait for read. We don't want to burn too many - * cycles checking for this very rare condition, and this should cause - * us to exit quickly in most cases.) - */ - if (event.events & WL_POSTMASTER_DEATH) - ereport(FATAL, - (errcode(ERRCODE_ADMIN_SHUTDOWN), - errmsg("terminating connection due to unexpected postmaster exit"))); - /* Handle interrupt. */ if (event.events & WL_LATCH_SET) { @@ -296,12 +274,6 @@ retry: WaitEventSetWait(FeBeWaitSet, -1 /* no timeout */ , , 1, WAIT_EVENT_CLIENT_WRITE); - /* See comments in secure_read. */ - if (event.events & WL_POSTMASTER_DEATH) - ereport(FATAL, - (errcode(ERRCODE_ADMIN_SHUTDOWN), - errmsg("terminating connection due to unexpected postmaster exit"))); - /* Handle interrupt. */ if (event.events & WL_LATCH_SET) { diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c index ac986c0505..6e91581c0b 100644 --- a/src/backend/libpq/pqcomm.c +++ b/src/backend/libpq/pqcomm.c @@ -222,7 +222,7 @@ pq_init(void) AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock, NULL, NULL); AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL); - AddWaitEventToSet(FeBeWaitSet, WL_POSTMASTER_DEATH, -1, NULL, NULL); + AddWaitEventToSet(FeBeWaitSet, WL_EXIT_ON_PM_DEATH, -1, NULL, NULL); } /* -- 2.20.1 From 91c7a5a8a04363274ea0e74a69d89e261d7a0e4f Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Mon, 24 Feb 2020 23:48:29 +1300 Subject: [PATCH v6 2/4] Introduce symbolic names for FeBeWaitSet positions. 
Previously we used hard coded 0 and 1 to refer to the socket and latch. Reviewed-by: Kyotaro Horiguchi Discussion: https://postgr.es/m/CA%2BhUKGJAC4Oqao%3DqforhNey20J8CiG2R%3DoBPqvfR0vOJrFysGw%40mail.gmail.com --- src/backend/libpq/be-secure.c | 4 ++-- src/backend/libpq/pqcomm.c| 18 +++--- src/backend/utils/init/miscinit.c | 4 ++-- src/include/libpq/libpq.h | 3 +++ 4 files changed, 22 insertions(+), 7 deletions(-) diff --git a/src/backend/libpq/be-secure.c b/src/backend/libpq/be-secure.c index aec0926d93..2d5cfbd6f8 100644 --- a/src/backend/libpq/be-secure.c +++ b/src/backend/libpq/be-secure.c @@ -179,7 +179,7 @@ retry: Assert(waitfor); - ModifyWaitEvent(FeBeWaitSet, 0, waitfor, NULL); + ModifyWaitEvent(FeBeWaitSet, FeBeWaitSetSocketPos, waitfor, NULL); WaitEventSetWait(FeBeWaitSet, -1 /* no timeout */ , , 1, WAIT_EVENT_CLIENT_READ); @@ -269,7 +269,7 @@ retry: Assert(waitfor); - ModifyWaitEvent(FeBeWaitSet, 0, waitfor, NULL); + ModifyWaitEvent(FeBeWaitSet, FeBeWaitSetSocketPos, waitfor, NULL); WaitEventSetWait(FeBeWaitSet, -1 /* no timeout */ , , 1, WAIT_EVENT_CLIENT_WRITE); diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c index
Re: Missing CFI in hlCover()?
Stephen Frost writes: > * Tom Lane (t...@sss.pgh.pa.us) wrote: >> Stephen Frost writes: >>> * Tom Lane (t...@sss.pgh.pa.us) wrote: We could hard-code a rule like that, or we could introduce a new explicit parameter for the maximum cover length. The latter would be more flexible, but we need something back-patchable and I'm concerned about the compatibility hazards of adding a new parameter in minor releases. So on the whole I propose hard-wiring a multiplier of, say, 10 for both these cases. >>> That sounds alright to me, though I do think we should probably still >>> toss a CFI (or two) in this path somewhere as we don't know how long >>> some of these functions might take... >> Yeah, of course. I'm still leaning to doing that in TS_execute_recurse. > Works for me. Here's a proposed patch along that line. regards, tom lane diff --git a/src/backend/tsearch/wparser_def.c b/src/backend/tsearch/wparser_def.c index 76b6f9aef0..1df1c0362d 100644 --- a/src/backend/tsearch/wparser_def.c +++ b/src/backend/tsearch/wparser_def.c @@ -2028,8 +2028,10 @@ hlFirstIndex(HeadlineParsedText *prs, TSQuery query, int pos) /* * hlCover: try to find a substring of prs' word list that satisfies query * - * At entry, *p must be the first word index to consider (initialize this to - * zero, or to the next index after a previous successful search). + * At entry, *p must be the first word index to consider (initialize this + * to zero, or to the next index after a previous successful search). + * We will consider all substrings starting at or after that word, and + * containing no more than max_cover words. * * On success, sets *p to first word index and *q to last word index of the * cover substring, and returns true. @@ -2038,7 +2040,8 @@ hlFirstIndex(HeadlineParsedText *prs, TSQuery query, int pos) * words used in the query. 
*/ static bool -hlCover(HeadlineParsedText *prs, TSQuery query, int *p, int *q) +hlCover(HeadlineParsedText *prs, TSQuery query, int max_cover, + int *p, int *q) { int pmin, pmax, @@ -2084,7 +2087,7 @@ hlCover(HeadlineParsedText *prs, TSQuery query, int *p, int *q) nextpmin = nextpmax; pmax = nextpmax; } - while (pmax >= 0); + while (pmax >= 0 && pmax - pmin < max_cover); /* No luck here, so try next feasible startpoint */ pmin = nextpmin; } @@ -2186,7 +2189,7 @@ get_next_fragment(HeadlineParsedText *prs, int *startpos, int *endpos, static void mark_hl_fragments(HeadlineParsedText *prs, TSQuery query, bool highlightall, int shortword, int min_words, - int max_words, int max_fragments) + int max_words, int max_fragments, int max_cover) { int32 poslen, curlen, @@ -2213,7 +2216,7 @@ mark_hl_fragments(HeadlineParsedText *prs, TSQuery query, bool highlightall, covers = palloc(maxcovers * sizeof(CoverPos)); /* get all covers */ - while (hlCover(prs, query, , )) + while (hlCover(prs, query, max_cover, , )) { startpos = p; endpos = q; @@ -2368,7 +2371,7 @@ mark_hl_fragments(HeadlineParsedText *prs, TSQuery query, bool highlightall, */ static void mark_hl_words(HeadlineParsedText *prs, TSQuery query, bool highlightall, - int shortword, int min_words, int max_words) + int shortword, int min_words, int max_words, int max_cover) { int p = 0, q = 0; @@ -2386,7 +2389,7 @@ mark_hl_words(HeadlineParsedText *prs, TSQuery query, bool highlightall, if (!highlightall) { /* examine all covers, select a headline using the best one */ - while (hlCover(prs, query, , )) + while (hlCover(prs, query, max_cover, , )) { /* * Count words (curlen) and interesting words (poslen) within @@ -2542,6 +2545,7 @@ prsd_headline(PG_FUNCTION_ARGS) int shortword = 3; int max_fragments = 0; bool highlightall = false; + int max_cover; ListCell *l; /* Extract configuration option values */ @@ -2581,6 +2585,15 @@ prsd_headline(PG_FUNCTION_ARGS) defel->defname))); } + /* + * We might eventually make max_cover a user-settable parameter, but for + * now, just compute a reasonable value based on max_words and + * max_fragments. + */ + max_cover = Max(max_words * 10, 100); + if (max_fragments > 0) + max_cover *= max_fragments; + /* in HighlightAll mode these parameters are ignored */ if (!highlightall) { @@ -2605,10 +2618,10 @@ prsd_headline(PG_FUNCTION_ARGS) /* Apply appropriate headline selector */ if (max_fragments == 0) mark_hl_words(prs, query, highlightall, shortword, - min_words, max_words); + min_words, max_words, max_cover); else mark_hl_fragments(prs, query, highlightall, shortword, - min_words, max_words, max_fragments); + min_words, max_words, max_fragments, max_cover); /* Fill in default values for string options */ if (!prs->startsel) diff --git a/src/backend/utils/adt/tsvector_op.c b/src/backend/utils/adt/tsvector_op.c index f01b1ee253..756a48a167 100644 ---
Re: OpenSSL randomness seeding
> On 26 Jul 2020, at 09:06, Michael Paquier wrote: > > On Wed, Jul 22, 2020 at 11:31:38PM +0200, Daniel Gustafsson wrote: >> Thanks for picking it up! > > For the archives, the patch set has been applied as ce4939f and > 15e4419 on HEAD. Thanks, Noah. Indeed, thanks! >>> Do you happen to know how OpenSSL 1.1.1 changed its reaction to forks in order >>> to make the RAND_poll() superfluous? (No need to research it if you don't.) >> >> I'm not entirely sure, but I do believe that 1.1.1 ported over the RNG from the >> FIPS module which re-seeds itself with fork() protection. There was however a >> bug, fixed in 1.1.1d or thereabouts as CVE-2019-1549, where the fork protection >> wasn't activated by default... so there is that. Since that bug was found, >> there have been tests introduced to catch any regression on that, which is >> comforting. > > No idea about this one actually. I did some more reading and AFAICT it won't be required in 1.1.1+, but it also won't cause any harm, so unless evidence of the latter emerges we may just as well leave it as an extra safeguard. Somewhat on topic though, 1.1.1 adds a RAND_priv_bytes function for random numbers that are supposed to be private and extra protected via its own DRBG. Maybe we should use that for SCRAM salts etc. in case we detect 1.1.1? cheers ./daniel
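To make the RAND_priv_bytes() idea concrete, here is a minimal sketch (not existing PostgreSQL code; the function name generate_secret_bytes is invented for the example) of preferring the private DRBG when building against OpenSSL 1.1.1 or newer, falling back to RAND_bytes() otherwise; both calls return 1 on success:

#include <openssl/opensslv.h>
#include <openssl/rand.h>

/* Illustrative only: draw secret material (e.g. a SCRAM salt) from the
 * DRBG reserved for private data when the OpenSSL version provides it. */
static int
generate_secret_bytes(unsigned char *buf, int len)
{
#if defined(OPENSSL_VERSION_NUMBER) && OPENSSL_VERSION_NUMBER >= 0x10101000L
    return RAND_priv_bytes(buf, len);
#else
    return RAND_bytes(buf, len);
#endif
}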
Re: Ought to use heap_multi_insert() for pg_attribute/depend insertions?
> On 30 Jul 2020, at 03:28, Michael Paquier wrote: > On Wed, Jul 29, 2020 at 11:34:07PM +0200, Daniel Gustafsson wrote: >> Fair enough, let's break out pg_depend and I'll have another go at that. > > Thanks. Attached is a polished version of the patch that I intend to > commit for pg_attribute and pg_shdepend. Let's look again at > pg_depend later, as there are also links with the handling of > dependencies for ALTER TABLE mainly. Looks good, thanks. Let's close the CF entry with this and open a new for the pg_depend part when that's done. cheers ./daniel
Re: new heapcheck contrib module
On Thu, Jul 30, 2020 at 4:18 PM Mark Dilger wrote: > > Maybe I'm just being dense here -- exactly what problem are you worried > > about? > > Per tuple, tuple_is_visible() potentially checks whether the xmin or xmax > committed via TransactionIdDidCommit. I am worried about concurrent > truncation of clog entries causing I/O errors on SLRU lookup when performing > that check. The three strategies I had for dealing with that were taking the > XactTruncationLock (formerly known as CLogTruncationLock, for those reading > this thread from the beginning), locking out vacuum, and the idea upthread > from Andres about setting PROC_IN_VACUUM and such. Maybe I'm being dense and > don't need to worry about this. But I haven't convinced myself of that, yet. I don't get it. If you've already checked that the XIDs are >= relfrozenxid and <= ReadNewFullTransactionId(), then this shouldn't be a problem. It could be, if CLOG is hosed, which is possible, because if the table is corrupted, why shouldn't CLOG also be corrupted? But I'm not sure that's what your concern is here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [Proposal] Global temporary tables
On Thu, Jul 30, 2020 at 8:09 AM wenjing zeng wrote: > Please continue to review the code. This patch is pretty light on comments. Many of the new functions have no header comments, for example. There are comments here and there in the body of the new functions that are added, and in places where existing code is changed there are comments here and there, but overall it's not a whole lot. There's no documentation and no README, either. Since this adds a new feature and a bunch of new SQL-callable functions that interact with that feature, the feature itself should be documented, along with its limitations and the new SQL-callable functions that interact with it. I think there should be either a lengthy comment in some suitable file, or maybe various comments in various files, or else a README file, that clearly sets out the major design principles behind the patch, and explaining also what that means in terms of features and limitations. Without that, it's really hard for anyone to jump into reviewing this code, and it will be hard for people who have to maintain it in the future to understand it, either. Or for users, for that matter. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: new heapcheck contrib module
Hi, On 2020-07-30 13:18:01 -0700, Mark Dilger wrote: > Per tuple, tuple_is_visible() potentially checks whether the xmin or xmax > committed via TransactionIdDidCommit. I am worried about concurrent > truncation of clog entries causing I/O errors on SLRU lookup when performing > that check. The three strategies I had for dealing with that were taking the > XactTruncationLock (formerly known as CLogTruncationLock, for those reading > this thread from the beginning), locking out vacuum, and the idea upthread > from Andres about setting PROC_IN_VACUUM and such. Maybe I'm being dense and > don't need to worry about this. But I haven't convinced myself of that, yet. I think it's not at all ok to look in the procarray or clog for xids that are older than what you're announcing you may read. IOW I don't think it's OK to just ignore the problem, or try to work around it by holding XactTruncationLock. Greetings, Andres Freund
Re: new heapcheck contrib module
> On Jul 29, 2020, at 12:52 PM, Robert Haas wrote: > > On Mon, Jul 20, 2020 at 5:02 PM Mark Dilger > wrote: >> I've made the options 'all-visible', 'all-frozen', and 'none'. It defaults >> to 'none'. > > That looks nice. > >>> I guess that >>> could still be expensive if there's a lot of them, but needing >>> ShareUpdateExclusiveLock rather than only AccessShareLock is a little >>> unfortunate. >> >> I welcome strategies that would allow for taking a lesser lock. > > I guess I'm not seeing why you need any particular strategy here. Say > that at the beginning you note the starting relfrozenxid of the table > -- I think I would lean toward just ignoring datfrozenxid and the > cluster-wide value completely. You also note the current value of the > transaction ID counter. Those are the two ends of the acceptable > range. > > Let's first consider the oldest acceptable XID, bounded by > relfrozenxid. If you see a value that is older than the relfrozenxid > value that you noted at the start, it is definitely invalid. If you > see a newer value, it could still be older than the table's current > relfrozenxid, but that doesn't seem very worrisome. If the user > vacuumed the table while they were running this tool, they can always > run the tool again afterward if they wish. Forcing the vacuum to wait > by taking ShareUpdateExclusiveLock doesn't actually solve anything > anyway: you STILL won't notice any problems the vacuum introduces, and > in fact you are now GUARANTEED not to notice them, plus now the vacuum > happens later. > > Now let's consider the newest acceptable XID, bounded by the value of > the transaction ID counter. Any time you see a newer XID than the last > value of the transaction ID counter that you observed, you go observe > it again. If the value from the table still looks invalid, then you > complain about it. Either way, you remember the new observation and > check future tuples against that value. I think the patch is already > doing this anyway; if it weren't, you'd need an even stronger lock, > one sufficient to prevent any insert/update/delete activity on the > table altogether. > > Maybe I'm just being dense here -- exactly what problem are you worried about? Per tuple, tuple_is_visible() potentially checks whether the xmin or xmax committed via TransactionIdDidCommit. I am worried about concurrent truncation of clog entries causing I/O errors on SLRU lookup when performing that check. The three strategies I had for dealing with that were taking the XactTruncationLock (formerly known as CLogTruncationLock, for those reading this thread from the beginning), locking out vacuum, and the idea upthread from Andres about setting PROC_IN_VACUUM and such. Maybe I'm being dense and don't need to worry about this. But I haven't convinced myself of that, yet. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
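For illustration, a minimal sketch of the range check described above; this is not the patch's code, and the XidRangeCheck struct, its field names, and the function name are invented for the example:

#include "postgres.h"
#include "access/transam.h"

/* Hypothetical per-scan state noted when the table scan begins */
typedef struct XidRangeCheck
{
    TransactionId startup_relfrozenxid;  /* relfrozenxid observed at start */
    TransactionId observed_next_xid;     /* next-xid counter observed so far */
} XidRangeCheck;

/*
 * Sketch: xids older than the relfrozenxid noted at the start of the scan
 * are treated as corrupt; xids at or beyond the last-observed next-xid
 * cause the counter to be re-read once before deciding.
 */
static bool
xid_in_acceptable_range(XidRangeCheck *ctx, TransactionId xid)
{
    if (TransactionIdPrecedes(xid, ctx->startup_relfrozenxid))
        return false;           /* older than relfrozenxid: report corruption */

    if (!TransactionIdPrecedes(xid, ctx->observed_next_xid))
    {
        /* Newer than anything observed; the counter may have advanced. */
        ctx->observed_next_xid =
            XidFromFullTransactionId(ReadNextFullTransactionId());
        if (!TransactionIdPrecedes(xid, ctx->observed_next_xid))
            return false;       /* still in the future: report corruption */
    }

    return true;
}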
Re: Threading in BGWorkers (!)
On Thu, Jul 30, 2020 at 2:55 PM Tom Lane wrote: > > Why not? Our answer to threading inside functions has been, for quite a > > while, that it's kinda ok if the threads never call into postgres and > > can never escape the lifetime of a function. But that's not actually > > really safe due to the signal handler issue. > > In other words, it's *not* safe and never has been. I see no good reason > to believe that the signal handler issue is the only one. Even if it is, > not being able to call any postgres infrastructure is a pretty huge > handicap. I find this line of argument really unfair. It's true that there might be issues other than the signal handler one, but so what? That is not a principled argument against fixing the signal handler part of the problem, which is the only *known* problem with the use case Andres describes. It is also true that it would be more useful to enable additional use cases rather than just this one, but that is not a principled argument against enabling this one. My only present concern about the proposal actually in front of us -- that is to say, use pthread_sigmask() rather than sigprocmask() -- is Thomas's observation that on his system doing so breaks the world. That seems to be quite a serious problem. If we are deciding whether to use one function or another some purpose and they are equally good for the core code but one is better for people who want to play around with extensions, we may as well use the one that's better for that purpose. We need not give such experimentation our unqualified endorsement; we need only decide against obstructing it unnecessarily. But when such a substitution risks breaking things that work today, the calculus gets a lot more complicated. Unless we can find a way to avoid that risk, I don't think this is a good trade-off. But more broadly I think it is well past time that we look into making the backend more thread-friendly. The fact that "it's *not* safe and never has been" has not prevented people from doing it. We don't end up with people going "oh, well sigprocmask uh oh so I better give up now." What we end up with is people just going right ahead and doing it, probably without even thinking about sigprocmask, and ending up with low-probability failure conditions, which seems likely to hurt PostgreSQL's reputation for reliability with no compensating benefit. Or alternatively they hack core, which sucks, or they switch to some non-PG database, which also sucks. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
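For reference, the substitution being discussed is mechanical at the call sites: pthread_sigmask() takes the same arguments as sigprocmask(), but its behavior is defined in a multithreaded process, whereas POSIX leaves sigprocmask() unspecified there. A hedged sketch (illustrative only, not PostgreSQL code) of how an extension's worker thread might shield itself from the backend's signal handlers:

#include <pthread.h>
#include <signal.h>

/* Block every signal in the calling (non-main) thread, saving the old mask. */
static void
block_signals_in_worker_thread(sigset_t *save_mask)
{
    sigset_t    block_all;

    sigfillset(&block_all);
    pthread_sigmask(SIG_BLOCK, &block_all, save_mask);
}

/* Restore the mask saved by block_signals_in_worker_thread(). */
static void
restore_signals_in_worker_thread(const sigset_t *save_mask)
{
    pthread_sigmask(SIG_SETMASK, save_mask, NULL);
}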
Re: Threading in BGWorkers (!)
Andres Freund writes: > On 2020-07-28 21:52:20 -0400, Tom Lane wrote: >> TBH, though, I do not buy this argument for a millisecond. I don't >> think that anything is going to come out of multithreading a bgworker >> but blood and tears. Perhaps someday we'll make a major push to >> make the backend code (somewhat(?)) thread safe ... but I'm not on >> board with making one-line-at-a-time changes in hopes of getting >> partway there. We need some kind of concrete plan for what is a >> usable amount of functionality and what has to be done to get it. > Why not? Our answer to threading inside functions has been, for quite a > while, that it's kinda ok if the threads never call into postgres and > can never escape the lifetime of a function. But that's not actually > really safe due to the signal handler issue. In other words, it's *not* safe and never has been. I see no good reason to believe that the signal handler issue is the only one. Even if it is, not being able to call any postgres infrastructure is a pretty huge handicap. So I stand by the position that we need an actual plan here, not just chipping away at one-liner things that might or might not improve matters noticeably. regards, tom lane
Re: Threading in BGWorkers (!)
Hi, On 2020-07-28 21:52:20 -0400, Tom Lane wrote: > >> The rationale here is that as far as I can tell this is the *only* blocker > >> to using multithreaded code in a BGWorker which can't be avoided by > >> adhering to strict code rules (eg: no PG calls from non-main threads, no > >> interaction with signals from non-main threads). > > TBH, though, I do not buy this argument for a millisecond. I don't > think that anything is going to come out of multithreading a bgworker > but blood and tears. Perhaps someday we'll make a major push to > make the backend code (somewhat(?)) thread safe ... but I'm not on > board with making one-line-at-a-time changes in hopes of getting > partway there. We need some kind of concrete plan for what is a > usable amount of functionality and what has to be done to get it. Why not? Our answer to threading inside functions has been, for quite a while, that it's kinda ok if the threads never call into postgres and can never escape the lifetime of a function. But that's not actually really safe due to the signal handler issue. Whether it's a normal backend or a bgworker doesn't really play a role here, no? Greetings, Andres Freund
Re: Threading in BGWorkers (!)
Hi, On 2020-07-29 13:41:02 +1200, Thomas Munro wrote: > One practical problem with this change is that some systems have a > stub definition of pthread_sigmask() that does nothing, when you don't > link against the threading library. Realistically, most *useful* > builds of PostgreSQL bring it in indirectly (via SSL, LDAP, blah > blah), but it so happens that a bare bones build and make check on > this here FreeBSD box hangs in CHECK DATABASE waiting for the > checkpointer to signal it. I can fix that by putting -lthr into > LDFLAGS, so we'd probably have to figure out how to do that on our > supported systems. Couldn't this be driven by --disable-thread-safety? Greetings, Andres Freund
Re: HyperLogLog.c and pg_leftmost_one_pos32()
On Thu, 2020-07-30 at 19:16 +0200, Tomas Vondra wrote: > > Essentially: > > initHyperLogLog(, 5) > > for i in 0 .. one billion > >addHyperLogLog(, hash(i)) > > estimateHyperLogLog > > > > The numbers are the same regardless of bwidth. > > > > Before my patch, it takes about 15.6s. After my patch, it takes > > about > > 6.6s, so it's more than a 2X speedup (including the hash > > calculation). > > > > Wow. That's a huge improvements. To be clear: the 2X+ speedup was on the tight loop test. > How does the whole test (data + query) look like? Is it particularly > rare / special case, or something reasonable to expect in practice? The whole-query test was: config: shared_buffers=8GB jit = off max_parallel_workers_per_gather=0 setup: create table t_1m_20(i int); vacuum (freeze, analyze) t_1m_20; insert into t_1m_20 select (random()*100)::int4 from generate_series(1,2000); query: set work_mem='2048kB'; SELECT pg_prewarm('t_1m_20', 'buffer'); -- median of the three runs select distinct i from t_1m_20 offset 1000; select distinct i from t_1m_20 offset 1000; select distinct i from t_1m_20 offset 1000; results: f2130e77 (before using HLL): 6787ms f1af75c5 (before my recent commit): 7170ms fd734f38 (master now):6990ms My previous results before I committed the patch (and therefore not on the same exact SHA1s) were 6812, 7158, and 6898. So my most recent batch of results is slightly worse, but the most recent commit (fd734f38) still does show an improvement of a couple percent. Regards, Jeff Davis
Re: new heapcheck contrib module
On Mon, Jul 27, 2020 at 1:02 PM Mark Dilger wrote: > Not at all! I appreciate all the reviews. Reviewing 0002, reading through verify_heapam.c: +typedef enum SkipPages +{ + SKIP_ALL_FROZEN_PAGES, + SKIP_ALL_VISIBLE_PAGES, + SKIP_PAGES_NONE +} SkipPages; This looks inconsistent. Maybe just start them all with SKIP_PAGES_. + if (PG_ARGISNULL(0)) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("missing required parameter for 'rel'"))); This doesn't look much like other error messages in the code. Do something like git grep -A4 PG_ARGISNULL | grep -A3 ereport and study the comparables. + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("unrecognized parameter for 'skip': %s", skip), + errhint("please choose from 'all-visible', 'all-frozen', or 'none'"))); Same problem. Check pg_prewarm's handling of the prewarm type, or EXPLAIN's handling of the FORMAT option, or similar examples. Read the message style guidelines concerning punctuation of hint and detail messages. + * Bugs in pg_upgrade are reported (see commands/vacuum.c circa line 1572) + * to have sometimes rendered the oldest xid value for a database invalid. + * It seems unwise to report rows as corrupt for failing to be newer than + * a value which itself may be corrupt. We instead use the oldest xid for + * the entire cluster, which must be at least as old as the oldest xid for + * our database. This kind of reference to another comment will not age well; line numbers and files change a lot. But I think the right thing to do here is just rely on relfrozenxid and relminmxid. If the table is inconsistent with those, then something needs fixing. datfrozenxid and the cluster-wide value can look out for themselves. The corruption detector shouldn't be trying to work around any bugs in setting relfrozenxid itself; such problems are arguably precisely what we're here to find. +/* + * confess + * + * Return a message about corruption, including information + * about where in the relation the corruption was found. + * + * The msg argument is pfree'd by this function. + */ +static void +confess(HeapCheckContext *ctx, char *msg) Contrary to what the comments say, the function doesn't return a message about corruption or anything else. It returns void. I don't really like the name, either. I get that it's probably inspired by Perl, but I think it should be given a less-clever name like report_corruption() or something. + * corrupted table from using workmem worth of memory building up the This kind of thing destroys grep-ability. If you're going to refer to work_mem, you gotta spell it the same way we do everywhere else. + * Helper function to construct the TupleDesc needed by verify_heapam. Instead of saying it's the TupleDesc somebody needs, how about saying that it's the TupleDesc that we'll use to report problems that we find while scanning the heap, or something like that? + * Given a TransactionId, attempt to interpret it as a valid + * FullTransactionId, neither in the future nor overlong in + * the past. Stores the inferred FullTransactionId in *fxid. It really doesn't, because there's no such thing as 'fxid' referenced anywhere here. You should really make the effort to proofread your patches before posting, and adjust comments and so on as you go. Otherwise reviewing takes longer, and if you keep introducing new stuff like this as you fix other stuff, you can fail to ever produce a committable patch. + * Determine whether tuples are visible for verification. 
Similar to + * HeapTupleSatisfiesVacuum, but with critical differences. Not accurate, because it also reports problems, which is not mentioned anywhere in the function header comment that purports to be a detailed description of what the function does. + else if (TransactionIdIsCurrentTransactionId(raw_xmin)) + return true; /* insert or delete in progress */ + else if (TransactionIdIsInProgress(raw_xmin)) + return true; /* HEAPTUPLE_INSERT_IN_PROGRESS */ + else if (!TransactionIdDidCommit(raw_xmin)) + { + return false; /* HEAPTUPLE_DEAD */ + } One of these cases is not punctuated like the others. + pstrdup("heap tuple with XMAX_IS_MULTI is neither LOCKED_ONLY nor has a valid xmax")); 1. I don't think that's very grammatical. 2. Why abbreviate HEAP_XMAX_IS_MULTI to XMAX_IS_MULTI and HEAP_XMAX_IS_LOCKED_ONLY to LOCKED_ONLY? I don't even think you should be referencing C constant names here at all, and if you are I don't think you should abbreviate, and if you do abbreviate I don't think you should omit different numbers of words depending on which constant it is. I wonder what the intended division of responsibility is here, exactly. It seems like you've ended up with some sanity checks in check_tuple() before tuple_is_visible() is called, and others in tuple_is_visible() proper. As far as I can see the comments don't really discuss the logic behind the split, but there's clearly a close relationship between the two sets
Re: PG 13 release notes, first draft
On Thu, Jul 30, 2020 at 10:32 AM Bruce Momjian wrote: > I came up with a more verbose documentation suggestion, attached. I'm okay with this. Thanks -- Peter Geoghegan
Re: PG 13 release notes, first draft
On Wed, Jul 29, 2020 at 08:43:24PM -0700, David G. Johnston wrote: > On Wednesday, July 29, 2020, Peter Geoghegan wrote: > > On Wed, Jul 29, 2020 at 6:30 PM Bruce Momjian wrote: > > > There should be a note about this in the Postgres 13 release notes, > > > for the usual reasons. More importantly, the "Allow hash aggregation > > > to use disk storage for large aggregation result sets" feature should > > > reference the new GUC directly. Users should be advised that the GUC > > > may be useful in cases where they upgrade and experience a performance > > > regression linked to slower hash aggregation. Just including a > > > documentation link for the GUC would be very helpful. > > > > I came up with the attached patch. > > I was thinking something along like the following (after the existing > sentence about avoiding hash aggs in the planner): > > If you find that hash aggregation is slower than in previous major > releases of PostgreSQL, it may be useful to increase the value of > hash_mem_multiplier. This allows hash aggregation to use more memory > without affecting competing query operations that are generally less > likely to put any additional memory to good use. I came up with a more verbose documentation suggestion, attached. > How about adding wording for GROUP BY as well to cater to users who are more > comfortable thinking in terms of SQL statements instead of execution plans? Uh, it is unclear exactly what SQL generates what node types, so I want to avoid this. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee diff --git a/doc/src/sgml/release-13.sgml b/doc/src/sgml/release-13.sgml new file mode 100644 index 18e6497..0b1051b *** a/doc/src/sgml/release-13.sgml --- b/doc/src/sgml/release-13.sgml *** Author: Jeff Davis Previously, hash aggregation was avoided if it was expected to use ! more than memory. --- 649,658 Previously, hash aggregation was avoided if it was expected to use ! more than memory. To reduce the ! likelihood of using disk storage for hash aggregation and attain ! behavior similar to previous Postgres releases, increase .
Re: HyperLogLog.c and pg_leftmost_one_pos32()
On Thu, Jul 30, 2020 at 09:21:23AM -0700, Jeff Davis wrote: On Wed, 2020-07-29 at 17:32 -0700, Peter Geoghegan wrote: How did you test this? What kind of difference are we talking about? Essentially: initHyperLogLog(, 5) for i in 0 .. one billion addHyperLogLog(, hash(i)) estimateHyperLogLog The numbers are the same regardless of bwidth. Before my patch, it takes about 15.6s. After my patch, it takes about 6.6s, so it's more than a 2X speedup (including the hash calculation). Wow. That's a huge improvements. How does the whole test (data + query) look like? Is it particularly rare / special case, or something reasonable to expect in practice? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: HyperLogLog.c and pg_leftmost_one_pos32()
On Wed, 2020-07-29 at 17:32 -0700, Peter Geoghegan wrote: > How did you test this? What kind of difference are we talking about? Essentially: initHyperLogLog(, 5) for i in 0 .. one billion addHyperLogLog(, hash(i)) estimateHyperLogLog The numbers are the same regardless of bwidth. Before my patch, it takes about 15.6s. After my patch, it takes about 6.6s, so it's more than a 2X speedup (including the hash calculation). As a part of HashAgg, when I test the worst case, it goes from a 4-5% penalty to ~1% (within noise). > I think that you should change back the rhs() variable names to match > HyperLogLog upstream (as well as the existing rhs() comments). Done. > > I think it's overall an improvement, but > > addHyperLogLog() itself seemed to show up as a cost, so it can hurt > > spilled-but-still-in-memory cases. I'd also like to backpatch this > > to > > 13 (as I already did for 9878b643), if that's acceptable. > > I still wonder if it was ever necessary to add HLL to abbreviated > keys. It only served to avoid some pretty narrow worse cases, at the > expense of typical cases. Given that the existing users of HLL are > pretty narrow, and given the importance of preserving the favorable > performance characteristics of hash aggregate, I'm inclined to agree > that it's worth backpatching to 13 now. Assuming it is a really > measurable cost in practice. Yes, the difference (at least in a tight loop, on my machine) is not subtle. I went ahead and committed and backpatched. Regards, Jeff Davis
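For anyone trying to reproduce the numbers, here is a rough reconstruction of the tight-loop benchmark sketched above, using the lib/hyperloglog.h API; the variable names and the choice of hash_uint32() as the hash function are guesses, since the original harness was not posted:

#include "postgres.h"
#include "common/hashfn.h"
#include "lib/hyperloglog.h"

/* Add one billion hashed integers to an HLL state with bwidth = 5,
 * then report the cardinality estimate. */
static void
hll_tight_loop(void)
{
    hyperLogLogState hll;
    uint32      i;

    initHyperLogLog(&hll, 5);
    for (i = 0; i < 1000000000; i++)
        addHyperLogLog(&hll, DatumGetUInt32(hash_uint32(i)));
    elog(NOTICE, "HLL estimate: %.0f", estimateHyperLogLog(&hll));
    freeHyperLogLog(&hll);
}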
Re: Missing CFI in hlCover()?
Greetings, * Tom Lane (t...@sss.pgh.pa.us) wrote: > Stephen Frost writes: > > * Tom Lane (t...@sss.pgh.pa.us) wrote: > >> We could hard-code a rule like that, or we could introduce a new > >> explicit parameter for the maximum cover length. The latter would be > >> more flexible, but we need something back-patchable and I'm concerned > >> about the compatibility hazards of adding a new parameter in minor > >> releases. So on the whole I propose hard-wiring a multiplier of, > >> say, 10 for both these cases. > > > That sounds alright to me, though I do think we should probably still > > toss a CFI (or two) in this path somewhere as we don't know how long > > some of these functions might take... > > Yeah, of course. I'm still leaning to doing that in TS_execute_recurse. Works for me. Thanks! Stephen signature.asc Description: PGP signature
Re: Missing CFI in hlCover()?
Stephen Frost writes: > * Tom Lane (t...@sss.pgh.pa.us) wrote: >> We could hard-code a rule like that, or we could introduce a new >> explicit parameter for the maximum cover length. The latter would be >> more flexible, but we need something back-patchable and I'm concerned >> about the compatibility hazards of adding a new parameter in minor >> releases. So on the whole I propose hard-wiring a multiplier of, >> say, 10 for both these cases. > That sounds alright to me, though I do think we should probably still > toss a CFI (or two) in this path somewhere as we don't know how long > some of these functions might take... Yeah, of course. I'm still leaning to doing that in TS_execute_recurse. regards, tom lane
Re: Missing CFI in hlCover()?
Greetings, * Tom Lane (t...@sss.pgh.pa.us) wrote: > We could hard-code a rule like that, or we could introduce a new > explicit parameter for the maximum cover length. The latter would be > more flexible, but we need something back-patchable and I'm concerned > about the compatibility hazards of adding a new parameter in minor > releases. So on the whole I propose hard-wiring a multiplier of, > say, 10 for both these cases. That sounds alright to me, though I do think we should probably still toss a CFI (or two) in this path somewhere as we don't know how long some of these functions might take... Thanks, Stephen signature.asc Description: PGP signature
Re: [BUG] Error in BRIN summarization
On 27.07.2020 20:25, Alvaro Herrera wrote: On 2020-Jul-27, Anastasia Lubennikova wrote: Here is the updated version of the fix. The problem can be reproduced on all supported versions, so I suggest to backpatch it. Code slightly changed in v12, so here are two patches: one for versions 9.5 to 11 and another for versions from 12 to master. (I was also considering whether it needs to be a loop to reobtain root tuples, in case a concurrent transaction can create a new item while we're checking that item; but I don't think that can really happen for one individual tuple.) I don't think we need a recheck for a single tuple, because we only care about finding its root, which simply must exist somewhere on this page, as concurrent pruning is not allowed. We also may catch root_offsets[] for subsequent tuples, but it's okay if we don't. These tuples will do the same recheck on their turn. While testing this fix, Alexander Lakhin spotted another problem. I simplified the test case to this: 1) prepare a table with a brin index create table tbl (i int, b char(1000)) with (fillfactor=10); insert into tbl select i, md5(i::text) from generate_series(0,1000) as i; create index idx on tbl using brin(i, b) with (pages_per_range=1); 2) run brin_desummarize_range() in a loop: echo "-- desummarize all ranges SELECT FROM generate_series(0, pg_relation_size('tbl')/8192 - 1) as i, lateral (SELECT brin_desummarize_range('idx', i)) as d; -- summarize them back VACUUM tbl" > brin_desum_test.sql pgbench postgres -f brin_desum_test.sql -n -T 120 3) run a search that invokes bringetbitmap: set enable_seqscan to off; explain analyze select * from tbl where i>10 and i<100; \watch 1 After a few runs, it will fail with "ERROR: corrupted BRIN index: inconsistent range map" The problem is caused by a race in page locking in brinGetTupleForHeapBlock [1]: (1) bitmapscan locks revmap->rm_currBuf and finds the address of the tuple on a regular page "page", then unlocks revmap->rm_currBuf (2) in another transaction desummarize locks both revmap->rm_currBuf and "page", cleans up the tuple and unlocks both buffers (1) bitmapscan locks the buffer containing "page", attempts to access the tuple and fails to find it At first, I tried to fix it by holding the lock on revmap->rm_currBuf until we locked the regular page, but it causes a deadlock with brinsummarize(). It can be easily reproduced with the same test as above. Is there any rule about the order of locking revmap and regular pages in brin? I haven't found anything in the README. As an alternative, we can leave the locks as is and add a recheck before throwing an error. What do you think? [1] https://github.com/postgres/postgres/blob/master/src/backend/access/brin/brin_revmap.c#L269 -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: INSERT INTO SELECT, Why Parallelism is not selected?
On Thu, Jul 30, 2020 at 12:02 PM Amit Kapila wrote: > > On Wed, Jul 29, 2020 at 7:18 PM Robert Haas wrote: > > > > I still don't agree with this as proposed. > > > > + * For now, we don't allow parallel inserts of any form not even where the > > + * leader can perform the insert. This restriction can be uplifted once > > + * we allow the planner to generate parallel plans for inserts. We can > > > > If I'm understanding this correctly, this logic is completely > > backwards. We don't prohibit inserts here because we know the planner > > can't generate them. We prohibit inserts here because, if the planner > > somehow did generate them, it wouldn't be safe. You're saying that > > it's not allowed because we don't try to do it yet, but actually it's > > not allowed because we want to make sure that we don't accidentally > > try to do it. That's very different. > > > > Right, so how about something like: "To allow parallel inserts, we > need to ensure that they are safe to be performed in workers. We have > the infrastructure to allow parallel inserts in general except for the > case where inserts generate a new commandid (eg. inserts into a table > having a foreign key column)." We can extend this for tuple locking > if required as per the below discussion. Kindly suggest if you prefer > a different wording here. > > > > > + * We should be able to parallelize > > + * the later case if we can ensure that no two parallel processes can ever > > + * operate on the same page. > > > > I don't know whether this is talking about two processes operating on > > the same page at the same time, or ever within a single query > > execution. If it's the former, perhaps we need to explain why that's a > > concern for parallel query but not otherwise; > > > > I am talking about the former case and I know that as per current > design it is not possible that two worker processes try to operate on > the same page but I was trying to be pessimistic so that we can ensure > that via some form of Assert. > I think the two worker processes can operate on the same page for a parallel index scan case but it won't be for same tuple. I am not able to think of any case where we should be worried about tuple locking for Insert's case, so we can probably skip writing anything about it unless someone else can think of such a case. -- With Regards, Amit Kapila.
Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks
On Mon, Jul 20, 2020 at 3:47 PM Robert Haas wrote: > But I also agree that what pg_start_backup() was doing before v13 was > wrong; that's why I committed > 303640199d0436c5e7acdf50b837a027b5726594. The only reason I didn't > back-patch it is because the consequences are so minor I didn't think > it was worth worrying about. We could, though. I'd be somewhat > inclined to both do that and also change LLVM to use on_proc_exit() in > master, but I don't feel super-strongly about it. Unless somebody complains pretty soon, I'm going to go ahead and do what is described above. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
windows config.pl question
Hi I’d like to add icu/openssl support to my postgresql build on windows documentation says that I have to modify config.pl file, however it's not clear what exactly I have to do config-default.pl for example has the following line icu => undef,# --with-icu= so, if I want to add icu support what exactly should I do? should I replace undef with path? icu => , is it correct? if it’s correct does build support UNC paths? thanks in advance Dimitry Markman Dmitry Markman
Re: Doc patch: mention indexes in pg_inherits docs
Michael Paquier writes: > On Wed, Jul 29, 2020 at 03:06:58PM +0900, Michael Paquier wrote: >> This is actually a bug fix, because we include in the docs some >> incorrect information, so it should be backpatched. If there are no >> objections, I'll take care of that. > > And done. Thanks! - ilmari -- "A disappointingly low fraction of the human race is, at any given time, on fire." - Stig Sandbeck Mathisen
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Thu, Jul 30, 2020 at 12:28 PM Ajin Cherian wrote: > > I was running some tests on this patch. I was generally trying to see how the > patch affects logical replication when doing bulk inserts. This issue has > been raised in the past, for eg: this [1]. > My test setup is: > 1. Two postgres servers running - A and B > 2. Create a pgbench setup on A. (pgbench -i -s 5 postgres) > 3. replicate the 3 tables (schema only) on B. > 4. Three publishers on A for the 3 tables of pgbench; pgbench_accounts, > pgbench_branches and pgbench_tellers; > 5. Three subscribers on B for the same tables. (streaming on and off based on > the scenarios described below) > > run pgbench with : pgbench -c 4 -T 100 postgres > While pgbench is running, Do a bulk insert on some other table not in the > publication list (say t1); INSERT INTO t1 (select i FROM > generate_series(1,1000) i); > > Four scenarios: > 1. Pgbench with logical replication enabled without bulk insert > Avg TPS (out of 10 runs): 641 TPS > 2.Pgbench without logical replication enabled with bulk insert (no pub/sub) > Avg TPS (out of 10 runs): 665 TPS > 3, Pgbench with logical replication enabled with bulk insert > Avg TPS (out of 10 runs): 278 TPS > 4. Pgbench with logical replication streaming on with bulk insert > Avg TPS (out of 10 runs): 440 TPS > > As you can see, the bulk inserts, although on a totally unaffected table, > does impact the TPS. But what is good is that, enabling streaming improves > the TPS (about 58% improvement) > Thanks for doing these tests, it is a good win and probably the reason is that after patch we won't serialize such big transactions (as shown in Konstantin's email [1]) and they will be simply skipped. Basically, it will try to stream such transactions and will skip them as they are not required to be sent. [1] - https://www.postgresql.org/message-id/5f5143cc-9f73-3909-3ef7-d3895cc6cc90%40postgrespro.ru -- With Regards, Amit Kapila.
Re: [PATCH] Tab completion for VACUUM of partitioned tables
On Thu, Jul 30, 2020 at 02:16:04PM +0900, Masahiko Sawada wrote: > Yeah, that's the reason why I mentioned about toast tables. "VACUUM > " displays "pg_toast." but, "VACUUM pg_to" doesn't > complement anything. So am I OK with this situation. The same is true as well for CLUSTER and TRUNCATE, and "pg_to" would get completion with the toast tables only if we begin to add RELKIND_TOASTVALUE to the queries. Note that the schema completions come from _complete_from_query() where we would need to be smarter regarding the filtering of pg_namespace rows and I have not looked how to do that, but I feel that it may not be that complicated. For now I have applied the proposed patch. -- Michael signature.asc Description: PGP signature
Re: Fix minor source code comment mistake
On Thu, Jul 30, 2020 at 08:03:09AM +0000, k.jami...@fujitsu.com wrote: > Just found a minor error in source code comment. > src/include/executor/instrument.h > > Attached is the fix. > > - long local_blks_dirtied; /* # of shared blocks dirtied */ > + long local_blks_dirtied; /* # of local blocks dirtied */ Indeed. Let's fix this. -- Michael signature.asc Description: PGP signature
Re: problem with RETURNING and update row movement
On Wed, Jul 22, 2020 at 3:16 PM Amit Langote wrote: > On Mon, Jul 20, 2020 at 8:35 PM Amit Kapila wrote: > > IIUC, here the problem is related to below part of code: > > ExecInsert(..) > > { > > /* Process RETURNING if present */ > > if (resultRelInfo->ri_projectReturning) > > result = ExecProcessReturning(resultRelInfo, slot, planSlot); > > .. > > } > > > > The reason is that planSlot is for foo1 and slot is for foo2 and when > > it tries to access tuple during ExecProcessReturning(), it results in > > an error. Is my understanding correct? > > Yes. Specifically, the problem exists if there are any non-target > relation attributes in RETURNING which are computed by referring to > planSlot, the plan's output tuple, which may be shaped differently > among result relations due to their tuple descriptors being different. > > > If so, then can't we ensure > > someway that planSlot also belongs to foo2 instead of skipping return > > processing in Insert and then later do more work to perform in Update. > > I did consider that option but failed to see a way to make it work. > > I am not sure if there is a way to make a copy of the plan's output > tuple (planSlot) that is compatible with the destination partition. > Simple conversion using execute_attr_map_slot() is okay when we know > the source and the target slots contain relation tuples, but plan's > output tuples will also have other junk attributes. Also, not all > destination partitions have an associated plan and hence a slot to > hold plan tuples. Yeah, I think it might be possible to create planSlot to pass to ExecInsert() so that we can process RETURNING within that function, but even if so, that would be cumbersome not only because partitions can have different rowtypes but because they can have different junk columns as well, because e.g., subplan foreign partitions may have different row ID columns as junk columns. The proposed patch is simple, so I would vote for it. (Note: in case of a foreign partition, we call ExecForeignInsert() with the source partition’s planSlot in ExecInsert(), which is not correct, but that would sbe OK because it seems unlikely that the FDW would look at the planSlot for INSERT.) One thing I noticed is that the patch changes the existing behavior. Here is an example: create table range_parted (a text, b bigint) partition by range (a, b); create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10); create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10); create function trigfunc() returns trigger as $$ begin return null; end; $$ language plpgsql; create trigger trig before insert on part_b_1_b_10 for each row execute function trigfunc(); insert into range_parted values ('a', 1); In HEAD: postgres=# update range_parted r set a = 'b' from (values ('a', 1)) s(x, y) where s.x = r.a and s.y = r.b returning tableoid::regclass, r.*; tableoid | a | b --+---+--- (0 rows) UPDATE 0 But with the patch: postgres=# update range_parted r set a = 'b' from (values ('a', 1)) s(x, y) where s.x = r.a and s.y = r.b returning tableoid::regclass, r.*; tableoid| a | b ---+---+--- part_a_1_a_10 | b | 1 (1 row) UPDATE 1 This produces RETURNING, though INSERT on the destination partition was skipped by the trigger. 
Another thing is that the patch assumes that the tuple slot to pass to ExecInsert() would store the inserted tuple when doing that function, but that’s not always true, because in case of a foreign partition, the FDW may return a slot other than the passed-in slot when called from ExecForeignInsert(), in which case the passed-in slot would not store the inserted tuple anymore. To fix these, I modified the patch so that we 1) add to ExecInsert() an output parameter slot to store the inserted tuple, and 2) compute RETURNING based on the parameter. I also added a regression test case. Attached is an updated version of the patch. Sorry for the long delay. Best regards, Etsuro Fujita v3-0001-Fix-a-bug-with-RETURNING-when-UPDATE-moves-tuple.patch Description: Binary data
Fix minor source code comment mistake
Hi, Just found a minor error in source code comment. src/include/executor/instrument.h Attached is the fix. - long local_blks_dirtied; /* # of shared blocks dirtied */ + long local_blks_dirtied; /* # of local blocks dirtied */ Regards, Kirk Jamison 0001-Fix-comment-in-source-code.patch Description: 0001-Fix-comment-in-source-code.patch
RE: [Patch] Optimize dropping of relation buffers using dlist
On Wednesday, July 29, 2020 4:55 PM, Konstantin Knizhnik wrote: > On 17.06.2020 09:14, k.jami...@fujitsu.com wrote: > > Hi, > > > > Since the last posted version of the patch fails, attached is a rebased version. > > Written upthread were performance results and some benefits and challenges. > > I'd appreciate your feedback/comments. > > > > Regards, > > Kirk Jamison > As far as I understand, this patch can provide significant improvement of > performance only in case of recovery of truncates of a large number of tables. You > have added a shared hash of relation buffers and certainly it adds some extra > overhead. According to your latest results this overhead is quite small. But it will > be hard to prove that there will be no noticeable regression at some workloads. Thank you for taking a look at this. Yes, one of the aims is to speed up recovery of truncations, but at the same time the patch also improves autovacuum, vacuum and relation truncate index executions. I showed pgbench results above for different types of workloads, but I am not sure if those provide enough validation... > I wonder if you have considered the case of a local hash (maintained only during > recovery)? > If there is after-crash recovery, then there will be no concurrent access to shared > buffers and this hash will be up-to-date. > In case of a hot-standby replica we can use some simple invalidation (just one flag > or counter which indicates that buffer cache was updated). > This hash also can be constructed on demand when DropRelFileNodeBuffers is > called the first time (so we have to scan all buffers once, but subsequent drop > operations will be fast). > > I have not thought much about it, but it seems to me that as far as this problem > only affects recovery, we do not need a shared hash for it. > The idea of the patch is to mark the relation buffers to be dropped after scanning the whole shared buffers, and store them into shared memory maintained in a dlist, and traverse the dlist on the next scan. But I understand the point that it is expensive and may cause overhead, which is why I tried to define a macro to limit the number of pages that we can cache for cases where the lookup cost can be problematic (i.e. too many pages of relation). #define BUF_ID_ARRAY_SIZE 100 int buf_id_array[BUF_ID_ARRAY_SIZE]; int forknum_indexes[BUF_ID_ARRAY_SIZE]; In DropRelFileNodeBuffers do { nbufs = CachedBlockLookup(..., forknum_indexes, buf_id_array, lengthof(buf_id_array)); for (i = 0; i < nbufs; i++) { ... } } while (nbufs == lengthof(buf_id_array)); Perhaps the patch adds complexity, so we may want to keep it simpler or commit it piece by piece? I will look further into your suggestion of maintaining a local hash only during recovery. Thank you for the suggestion. Regards, Kirk Jamison
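As a rough illustration of the "local hash maintained only during recovery" suggestion (purely a sketch, not part of the patch; the RecoveryRelEntry type and the function below are hypothetical), the startup process could keep a backend-local dynahash keyed by RelFileNode and consult it from DropRelFileNodeBuffers instead of scanning all of shared buffers for every replayed truncate:

#include "postgres.h"
#include "storage/relfilenode.h"
#include "utils/hsearch.h"

/* Hypothetical entry: one per relation touched during recovery */
typedef struct RecoveryRelEntry
{
    RelFileNode rnode;           /* hash key */
    int         first_buf_id;    /* head of a chain of cached buffer ids */
} RecoveryRelEntry;

static HTAB *recovery_rel_buffers = NULL;

/* Build the recovery-only lookup table lazily, in local memory. */
static void
init_recovery_rel_hash(void)
{
    HASHCTL     ctl;

    MemSet(&ctl, 0, sizeof(ctl));
    ctl.keysize = sizeof(RelFileNode);
    ctl.entrysize = sizeof(RecoveryRelEntry);
    recovery_rel_buffers = hash_create("recovery relation buffers",
                                       1024,    /* initial size, arbitrary */
                                       &ctl,
                                       HASH_ELEM | HASH_BLOBS);
}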
RE: Is it useful to record whether plans are generic or custom?
>> Main purpose is to decide (1) the user interface and (2) the >> way to get the plan type from pg_stat_statements. >> >> (1) the user interface >> I added a new boolean column 'generic_plan' to both >> pg_stat_statements view and the member of the hash key of >> pg_stat_statements. >> >> This is because as Legrand pointed out the feature seems >> useful under the condition of differentiating all the >> counters for a queryid using a generic plan and the one >> using a custom one. > I don't like this because this may double the number of entries in pgss. > Which means that the number of entries can more easily reach > pg_stat_statements.max and some entries will be discarded. Not all the entries will be doubled, only the ones that are prepared. And even if auto prepare was implemented, having 5000, 1 or 2 max entries seems not a problem to me. >> I thought it might be preferable to make a GUC to enable >> or disable this feature, but changing the hash key makes >> it harder. > What happens if the server was running with this option enabled and then > restarted with the option disabled? Firstly two entries for the same query > were stored in pgss because the option was enabled. But when it's disabled > and the server is restarted, those two entries should be merged into one > at the startup of server? If so, that's problematic because it may take > a long time. What would you think about a third value for this flag to handle disabled case (no need to merge entries in this situation) ? > Therefore I think that it's better and simple to just expose the number of > times generic/custom plan was chosen for each query. I thought this feature was mainly needed to identifiy "under optimal generic plans". Without differentiating execution counters, this will be simpler but useless in this case ... isn't it ? > Regards, > -- > Fujii Masao > Advanced Computing Technology Center > Research and Development Headquarters > NTT DATA CORPORATION Regards PAscal
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Wed, Jul 29, 2020 at 3:16 PM Dilip Kumar wrote:
>
> Thanks, please find the rebased patch set.
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com

I was running some tests on this patch, generally trying to see how the patch
affects logical replication when doing bulk inserts. This issue has been
raised in the past, for eg: this [1].

My test setup is:
1. Two postgres servers running - A and B.
2. Create a pgbench setup on A. (pgbench -i -s 5 postgres)
3. Replicate the 3 tables (schema only) on B.
4. Three publishers on A for the 3 tables of pgbench: pgbench_accounts,
   pgbench_branches and pgbench_tellers.
5. Three subscribers on B for the same tables. (streaming on and off based on
   the scenarios described below)

Run pgbench with:
pgbench -c 4 -T 100 postgres

While pgbench is running, do a bulk insert on some other table not in the
publication list (say t1):
INSERT INTO t1 (select i FROM generate_series(1,1000) i);

Four scenarios:
1. Pgbench with logical replication enabled, without bulk insert
   Avg TPS (out of 10 runs): 641 TPS
2. Pgbench without logical replication enabled, with bulk insert (no pub/sub)
   Avg TPS (out of 10 runs): 665 TPS
3. Pgbench with logical replication enabled, with bulk insert
   Avg TPS (out of 10 runs): 278 TPS
4. Pgbench with logical replication streaming on, with bulk insert
   Avg TPS (out of 10 runs): 440 TPS

As you can see, the bulk inserts, although on a totally unaffected table, do
impact the TPS. But what is good is that enabling streaming improves the TPS
(about 58% improvement).

[1] - https://www.postgresql.org/message-id/flat/CAMsr%2BYE6aE6Re6smrMr-xCabRmCr%3DyzXEf2Yuv5upEDY5nMX8g%40mail.gmail.com#dbe51a181dd735eec8bb36f8a07bacf5

regards,
Ajin Cherian
Fujitsu Australia
Re: Doc patch: mention indexes in pg_inherits docs
On Wed, Jul 29, 2020 at 03:06:58PM +0900, Michael Paquier wrote:
> This is actually a bug fix, because we include in the docs some incorrect
> information, so it should be backpatched. If there are no objections, I'll
> take care of that.

And done.
--
Michael

signature.asc Description: PGP signature
Re: INSERT INTO SELECT, Why Parallelism is not selected?
On Wed, Jul 29, 2020 at 7:18 PM Robert Haas wrote:
>
> I still don't agree with this as proposed.
>
> + * For now, we don't allow parallel inserts of any form, not even where the
> + * leader can perform the insert. This restriction can be lifted once
> + * we allow the planner to generate parallel plans for inserts. We can
>
> If I'm understanding this correctly, this logic is completely backwards. We
> don't prohibit inserts here because we know the planner can't generate them.
> We prohibit inserts here because, if the planner somehow did generate them,
> it wouldn't be safe. You're saying that it's not allowed because we don't try
> to do it yet, but actually it's not allowed because we want to make sure that
> we don't accidentally try to do it. That's very different.
>

Right, so how about something like: "To allow parallel inserts, we need to
ensure that they are safe to be performed in workers. We have the
infrastructure to allow parallel inserts in general, except for the case where
inserts generate a new commandid (e.g. inserts into a table having a foreign
key column)."

We can extend this for tuple locking if required as per the discussion below.
Kindly suggest if you prefer a different wording here.

> + * We should be able to parallelize
> + * the latter case if we can ensure that no two parallel processes can ever
> + * operate on the same page.
>
> I don't know whether this is talking about two processes operating on the
> same page at the same time, or ever within a single query execution. If it's
> the former, perhaps we need to explain why that's a concern for parallel
> query but not otherwise;
>

I am talking about the former case, and I know that as per the current design
it is not possible that two worker processes try to operate on the same page,
but I was trying to be pessimistic so that we can ensure that via some form of
Assert. I don't know whether it is important to mention this case or not, but
for the sake of extra safety, I have mentioned it.

--
With Regards,
Amit Kapila.
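The distinction Robert draws (refusing parallelism because it is unsafe, not
merely because it is unimplemented) can be shown with a small stand-alone
sketch. The enum and function below are illustrative stand-ins, not the actual
PostgreSQL planner code.

#include <stdbool.h>
#include <stdio.h>

typedef enum CmdKind
{
    CMD_KIND_SELECT,
    CMD_KIND_INSERT,
    CMD_KIND_UPDATE,
    CMD_KIND_DELETE
} CmdKind;

/*
 * Decide whether entering parallel mode is even considered.  The INSERT case
 * is rejected because workers cannot safely do things such as assigning a new
 * command id (needed e.g. when a foreign-key trigger fires), not because a
 * parallel insert plan simply hasn't been implemented yet.
 */
static bool
parallel_mode_allowed(CmdKind cmd, bool has_modifying_cte)
{
    if (cmd != CMD_KIND_SELECT)
        return false;           /* unsafe in workers today, so refuse outright */
    if (has_modifying_cte)
        return false;           /* writes hidden inside CTEs are unsafe too */
    return true;
}

int
main(void)
{
    printf("SELECT: %d\n", parallel_mode_allowed(CMD_KIND_SELECT, false));
    printf("INSERT ... SELECT: %d\n", parallel_mode_allowed(CMD_KIND_INSERT, false));
    return 0;
}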
Re: Creating a function for exposing memory usage of backend process
Hi,

On Fri, Jul 10, 2020 at 5:32 PM torikoshia wrote:
> - whether information for identifying parent-child relation is necessary
>   or it's an overkill

I think it's important to understand the parent-child relationship of the
context. Personally, I often want to know the following two things:

- In which life cycle is the target context? (Remaining as long as the
  process is living? Per query?)
- Does the target context belong to the correct (parent) context?

> - if this information is necessary, memory address is suitable or other
>   means like assigning unique numbers are required

IMO, if each context can be uniquely identified (or easily guessed) by "name"
and "ident", then I don't think the address information is necessary. Instead,
I like the way that directly shows the context name of the parent, as in the
0005 patch.

Best regards

--
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com
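A stand-alone illustration of that last point: if each node carries its
parent's name, the parent-child relationship can be read directly from the
output, with no memory addresses exposed. The struct and the sample tree
below are made up for this sketch; they are not PostgreSQL's MemoryContext
code or the patch's view definition.

#include <stdio.h>

typedef struct Ctx
{
    const char *name;
    const char *ident;          /* optional detail, may be NULL */
    const struct Ctx *parent;   /* NULL for the top-level context */
} Ctx;

static void
dump_ctx(const Ctx *c)
{
    /* One line per context: its name, its ident, and its parent's name. */
    printf("name=%-18s ident=%-10s parent=%s\n",
           c->name,
           c->ident ? c->ident : "-",
           c->parent ? c->parent->name : "-");
}

int
main(void)
{
    /* A tiny, made-up context tree: top -> per-query -> executor state. */
    Ctx top = {"TopMemoryContext", NULL, NULL};
    Ctx query = {"MessageContext", NULL, &top};
    Ctx executor = {"ExecutorState", "portal", &query};

    dump_ctx(&top);
    dump_ctx(&query);
    dump_ctx(&executor);
    return 0;
}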