Re: Have an encrypted pgpass file
Isaac Morland writes: >>> It would also provide a *very* fertile source of shell-script-injection >>> vulnerabilities. (Whaddya mean, you tried to use a user name with a >>> quote mark in it?) > If I understand the proposal correctly, the pgpass program would run on the > client, invoked by libpq when a password is needed for a connection. So the > risk relates to strange things happening on the client when the client > attempts to connect as a strangely-named user or to a strangely-named > database or host, not to being able to break into the server. Yeah. The most obvious scenario for trouble is that somebody enters a crafted user name on a website, and that results in bad things happening on an application-server machine that tried to pass that user name to a database server. The DB server itself isn't compromised, but the app server could be. If we were putting this sort of feature into psql, it wouldn't be such a risk, but if it's in libpq then I fear it is. libpq underlies a lot of client-side code. regards, tom lane
Re: small development tip: Consider using the gold linker
On Mon, Jul 9, 2018 at 4:40 PM, Andres Freund wrote: > FWIW, I've lately noticed that I spend a fair time waiting for the > linker during edit-compile-test cycles. Due to an independent issue I > just used the gold linker, and the speedup is quite noticable. > > Just relinking the backend, without rebuilding anything else, goes from > 0m3.975s to 0m1.585s. Given ccache commonly prevents actually having to > recompile files, that works out to a noticable benefit. > > For me just adding '-fuse-ld=gold' to CFLAGS works. I tried this out today. It makes quite a noticeable difference for me. Thanks for the tip. > Unfortunately I get some spurious warnings, but I hope that's just a > debian unstable issue: /usr/bin/ld.gold: warning: discarding version > information for __cxa_finalize@GLIBC_2.2.5, defined in unused shared > library /lib/x86_64-linux-gnu/libc.so.6 (linked with --as-needed) I didn't have this problem. -- Peter Geoghegan
Re: Have an encrypted pgpass file
On 20 July 2018 at 17:22, Tels wrote: > Moin, > > > It would also provide a *very* fertile source of shell-script-injection > > vulnerabilities. (Whaddya mean, you tried to use a user name with a > > quote mark in it?) > > Little Bobby Tables, we call him. :) > > I'm also concerned that this would let anybody who can alter the > environment run arbitrary code as user postgres. Is this > something that poses a risk in addition to the current situation? > If I understand the proposal correctly, the pgpass program would run on the client, invoked by libpq when a password is needed for a connection. So the risk relates to strange things happening on the client when the client attempts to connect as a strangely-named user or to a strangely-named database or host, not to being able to break into the server.
Re: patch to allow disable of WAL recycling
Thomas, Thanks for your offer to run some tests on different OSes and filesystems that you have. Anything you can provide here would be much appreciated. I don't have anything other than our native SmartOS/ZFS based configurations, but I might be able to setup some VMs and get results that way. I should be able to setup a VM running FreeBSD. If you have a chance to collect some data, just let me know the exact benchmarks you ran and I'll run the same things on the FreeBSD VM. Obviously you're under no obligation to do any of this, so if you don't have time, just let me know and I'll see what I can do on my own. Thanks again, Jerry On Tue, Jul 17, 2018 at 2:47 PM, Tomas Vondra wrote: > On 07/17/2018 09:12 PM, Peter Eisentraut wrote: > > On 17.07.18 00:04, Jerry Jelinek wrote: > >> There have been quite a few comments since last week, so at this point I > >> am uncertain how to proceed with this change. I don't think I saw > >> anything concrete in the recent emails that I can act upon. > > > > The outcome of this could be multiple orthogonal patches that affect the > > WAL file allocation behavior somehow. I think your original idea of > > skipping recycling on a COW file system is sound. But I would rather > > frame the option as "preallocating files is obviously useless on a COW > > file system" rather than "this will make things mysteriously faster with > > uncertain trade-offs". > > > > Makes sense, I guess. But I think many claims made in this thread are > mostly just assumptions at this point, based on our beliefs how CoW or > non-CoW filesystems work. The results from ZFS (showing positive impact) > are an exception, but that's about it. I'm sure those claims are based > on real-world experience and are likely true, but it'd be good to have > data from a wider range of filesystems / configurations etc. so that we > can give better recommendations to users, for example. > > That's something I can help with, assuming we agree on what tests we > want to do. 
I'd say the usual battery of write-only pgbench tests with > different scales (fits into s_b, fits into RAM, larger than RAM) on > common Linux filesystems (ext4, xfs, btrfs) and zfsonlinux, and > different types of storage would be enough. I don't have any freebsd box > available, unfortunately. > > > regards > > -- > Tomas Vondra http://www.2ndQuadrant.com > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services >
Re: BUG #15182: Canceling authentication due to timeout aka Denial of Service Attack
On Fri, Jul 20, 2018 at 2:17 AM, Jeremy Schneider wrote: > I'd like to bump this old bug that Lloyd filed for more discussion. It > seems serious enough to me that we should at least talk about it. > > Anyone with simply the login privilege and the ability to run SQL can > instantly block all new incoming connections to a DB including new > superuser connections. > So.. don't VACUUM FULL pg_authid without lock_timeout? I can come up with dozens of ways to achieve the same effect, all of them silly. .m
Re: patch to allow disable of WAL recycling
Peter, Thanks for your feedback. I'm happy to change the name of the tunable or to update the man page in any way. I have already posted an updated patch with changes to the man page which I think may address your concerns there, but please let me know if that still needs more work. It looks like Kyotaro already did some exploration, and tuning the min/max for the WAL size won't solve this problem. Just let me know if there is anything else here which you think I should look into. Thanks again, Jerry On Tue, Jul 17, 2018 at 1:12 PM, Peter Eisentraut < peter.eisentr...@2ndquadrant.com> wrote: > On 17.07.18 00:04, Jerry Jelinek wrote: > > There have been quite a few comments since last week, so at this point I > > am uncertain how to proceed with this change. I don't think I saw > > anything concrete in the recent emails that I can act upon. > > The outcome of this could be multiple orthogonal patches that affect the > WAL file allocation behavior somehow. I think your original idea of > skipping recycling on a COW file system is sound. But I would rather > frame the option as "preallocating files is obviously useless on a COW > file system" rather than "this will make things mysteriously faster with > uncertain trade-offs". > > The actual implementation could use another round of consideration. I > wonder how this should interact with min_wal_size. Wouldn't > min_wal_size = 0 already do what we need (if you could set it to 0, > which is currently not possible)? Should the new setting be something > like min_wal_size = -1? Or even if it's a new setting, it might be > better to act on it in XLOGfileslop(), so these things are kept closer > together. > > -- > Peter Eisentraut http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services >
Re: [HACKERS] possible self-deadlock window after bad ProcessStartupPacket
On Fri, Jul 20, 2018 at 1:54 AM Heikki Linnakangas wrote: > On 19/07/18 23:13, Andres Freund wrote: > > On 2018-07-19 14:54:52 -0500, Nico Williams wrote: > >> On Thu, Jul 19, 2018 at 12:20:53PM -0700, Andres Freund wrote: > >>> On 2018-07-19 11:57:25 +0300, Heikki Linnakangas wrote: > Ugh. Yeah, in wal_quickdie, and other aux process *_quickdie() > handlers, I > agree we should just _exit(2). All we want to do is to exit the > process > immediately. > >>> > >>> Indeed. > >>> > bgworker_quickdie() makes some effort to block SIGQUIT during the > exit() > processing, but that doesn't solve the whole problem. The process > could've > been in the middle of a malloc/free when the signal arrived, for > example. > exit() is simply not safe to call from a signal handler. > >>> > >>> Yea. I really don't understand why we have't been able to agree on this > >>> for *years* now. > >> > >> I mean, only calling async-signal-safe functions from signal handlers is > >> a critical POSIX requirement, and exit(3) is NOT async-signal-safe. > > > > There's a few standard requirements that aren't particularly relevant in > > practice (including some functions not being listed as signal > > safe). Just arguing from that isn't really helpful. But there's plenty > > hard evidence, including a few messages upthread!, of it being > > practically problematic. Just looking at the reasoning at why exit (and > > malloc) aren't signal safe however, makes it clear that this particular > > restriction should be adhered to, however. > > ISTM that no-one has any great ideas on what to do about the ereport() > in quickdie(). But I think we have consensus on replacing the exit(2) > calls with _exit(2). If we do just that, it would be better than the > status quo, even if it doesn't completely fix the problem. This would > prevent the case that Asim reported, for starters. > > Any objections to the attached? 
> > With _exit(), I think we wouldn't really need the > "PG_SETMASK();" calls in the aux process *_quickdie handlers > that don't do anything else than call _exit(). But I didn't dare to > remove them yet. > Seems there is an opportunity to actually consolidate all those quickdies as well, as I see no difference between the auxiliary process quickdie implementations. Only the backend `quickdie()` has special code for ereport and the like.
Re: patch to allow disable of WAL recycling
Hi Robert, I'm new to the Postgresql community, so I'm not familiar with how patches are accepted here. Thanks for your detailed explanation. I do want to keep pushing on this. I'll respond separately to Peter and to Tomas regarding their emails. Thanks again, Jerry On Wed, Jul 18, 2018 at 1:43 PM, Robert Haas wrote: > On Wed, Jul 18, 2018 at 3:22 PM, Jerry Jelinek > wrote: > > I've gotten a wide variety of feedback on the proposed patch. The > comments > > range from rough approval through various discussion about alternative > > solutions. At this point I am unsure if this patch is rejected or if it > > would be accepted once I had the updated man page changes that were > > discussed last week. > > > > I have attached an updated patch which does incorporate man page > changes, in > > case that is the blocker. However, if this patch is simply rejected, I'd > > appreciate it if I could get a definitive statement to that effect. > > 1. There's no such thing as a definitive statement of the community's > opinion, generally speaking, because as a rule the community consists > of many different people who rarely all agree on anything but the most > uncontroversial of topics. We could probably all agree that the sun > rises in the East, or at least has historically done so, and that, > say, typos are bad. > > 2. You can't really expect somebody else to do the work of forging > consensus on your behalf. Sure, that may happen, if somebody else > takes an interest in the problem. But, really, since you started the > thread, most likely you're the one most interested. If you're not > willing to take the time to discuss the issues with the individual > people who have responded, promote your own views, investigate > proposed alternatives, etc., it's unlikely anybody else is going to do > it. > > 3. It's not unusual for a patch of this complexity to take months to > get committed; it's only been a few weeks. If it's important to you, > don't give up now. 
> > It seems to me that there are several people in favor of this patch, > some others with questions and concerns, and pretty much nobody > adamantly opposed. So I would guess that this has pretty good odds in > the long run. But you're not going to get anywhere by pushing for a > commit-or-reject-right-now. It's been less than 24 hours since Tomas > proposed to do further benchmarking if we could agree on what to test > (you haven't made any suggestions in response) and it's also been less > than 24 hours since Peter and I both sent emails about whether it > should be controlled by its own GUC or in some other way. The > discussion is very much actively continuing. It's too soon to decide > on the conclusion, but it would be a good idea for you to keep > participating. > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company >
Re: [HACKERS] Two pass CheckDeadlock in contentent case
11.07.2018 17:01, Ashutosh Bapat writes: > The patch still applies and it's part of this commitfest. > > On Tue, Oct 3, 2017 at 8:36 PM, Sokolov Yura wrote: >> On 2017-10-03 17:30, Sokolov Yura wrote: >>> >>> Good day, hackers. >>> >>> During hard workload sometimes process reaches deadlock timeout >>> even if no real deadlock occurred. It is easily reproducible with >>> pg_xact_advisory_lock on same value + some time consuming >>> operation (update) and many clients. >>> >>> When backend reaches deadlock timeout, it calls CheckDeadlock, >>> which locks all partitions of lock hash in exclusive mode to >>> walk through graph and search for deadlock. >>> >>> If hundreds of backends reaches this timeout trying to acquire >>> advisory lock on a same value, it leads to hard-stuck for many >>> seconds, cause they all traverse same huge lock graph under >>> exclusive lock. >>> During this stuck there is no possibility to do any meaningful >>> operations (no new transaction can begin). >>> >>> Attached patch makes CheckDeadlock to do two passes: >>> - first pass uses LW_SHARED on partitions of lock hash. >>> DeadLockCheck is called with flag "readonly", so it doesn't >>> modify anything. >>> - If there is possibility of "soft" or "hard" deadlock detected, >>> ie if there is need to modify lock graph, then partitions >>> relocked with LW_EXCLUSIVE, and DeadLockCheck is called again. >>> >>> It fixes hard-stuck, cause backends walk lock graph under shared >>> lock, and found that there is no real deadlock. 
>>> >>> Test on 4 socket xeon machine: >>> pgbench_zipf -s 10 -c 800 -j 100 -M prepared -T 450 -f >>> ./ycsb_read_zipf.sql@50 -f ./ycsb_update_lock2_zipf.sql@50 -P 5 >>> >>> ycsb_read_zipf.sql: >>> \set i random_zipfian(1, 10 * :scale, 1.01) >>> SELECT abalance FROM pgbench_accounts WHERE aid = :i; >>> ycsb_update_lock2_zipf.sql: >>> \set i random_zipfian(1, 10 * :scale, 1.01) >>> select lock_and_update( :i ); >>> >>> CREATE OR REPLACE FUNCTION public.lock_and_update(i integer) >>> RETURNS void >>> LANGUAGE sql >>> AS $function$ >>> SELECT pg_advisory_xact_lock( $1 ); >>> UPDATE pgbench_accounts SET abalance = 1 WHERE aid = $1; >>> $function$ >>> >>> Without attached patch: >>> >>> progress: 5.0 s, 45707.0 tps, lat 15.599 ms stddev 83.757 >>> progress: 10.0 s, 51124.4 tps, lat 15.681 ms stddev 78.353 >>> progress: 15.0 s, 52293.8 tps, lat 15.327 ms stddev 77.017 >>> progress: 20.0 s, 51280.4 tps, lat 15.603 ms stddev 78.199 >>> progress: 25.0 s, 47278.6 tps, lat 16.795 ms stddev 83.570 >>> progress: 30.0 s, 41792.9 tps, lat 18.535 ms stddev 93.697 >>> progress: 35.0 s, 12393.7 tps, lat 33.757 ms stddev 169.116 >>> progress: 40.0 s, 0.0 tps, lat -nan ms stddev -nan >>> progress: 45.0 s, 0.0 tps, lat -nan ms stddev -nan >>> progress: 50.0 s, 1.2 tps, lat 2497.734 ms stddev 5393.166 >>> progress: 55.0 s, 0.0 tps, lat -nan ms stddev -nan >>> progress: 60.0 s, 27357.9 tps, lat 160.622 ms stddev 1866.625 >>> progress: 65.0 s, 38770.8 tps, lat 20.829 ms stddev 104.453 >>> progress: 70.0 s, 40553.2 tps, lat 19.809 ms stddev 99.741 >>> >>> (autovacuum led to trigger deadlock timeout, >>> and therefore, to stuck) >>> >>> Patched: >>> >>> progress: 5.0 s, 56264.7 tps, lat 12.847 ms stddev 66.980 >>> progress: 10.0 s, 55389.3 tps, lat 14.329 ms stddev 71.997 >>> progress: 15.0 s, 50757.7 tps, lat 15.730 ms stddev 78.222 >>> progress: 20.0 s, 50797.3 tps, lat 15.736 ms stddev 79.296 >>> progress: 25.0 s, 48485.3 tps, lat 16.432 ms stddev 82.720 >>> progress: 30.0 s, 45202.1 
tps, lat 17.733 ms stddev 88.554 >>> progress: 35.0 s, 40035.8 tps, lat 19.343 ms stddev 97.787 >>> progress: 40.0 s, 14240.1 tps, lat 47.625 ms stddev 265.465 >>> progress: 45.0 s, 30150.6 tps, lat 31.140 ms stddev 196.114 >>> progress: 50.0 s, 38975.0 tps, lat 20.543 ms stddev 104.281 >>> progress: 55.0 s, 40573.9 tps, lat 19.770 ms stddev 99.487 >>> progress: 60.0 s, 38361.1 tps, lat 20.693 ms stddev 103.141 >>> progress: 65.0 s, 39793.3 tps, lat 20.216 ms stddev 101.080 >>> progress: 70.0 s, 38387.9 tps, lat 20.886 ms stddev 104.482 > > I am confused about interpreting these numbers. Does it mean that > without the patch at the end of 70s, we have completed 40553.2 * 70 > transactions and with the patch there are 70 * 38387.9 transactions > completed in 70 seconds? It is regular pgbench output, so there is no source for confusion: the tps numbers are the number of transactions completed in that particular 5-second interval. That is why there are 0 tps and 1 tps intervals without the patch; if tps were an average since the start, there could be no 0 tps and 1 tps intervals at all. It means that without the patch there were 46k+51k+52k+47k+42k+12k+0k+0k+0k+0k+27k+39k+41k = 357 thousand transactions completed, and with the patch there were 56k+55k+51k+51k+48k+45k+40k+14k+30k+39k+41k+38k+40k+38k = 586 thousand transactions completed. > I agree that with the patch there is smoother degradation when a lot > of backends
Re: Have an encrypted pgpass file
Moin, On Wed, July 18, 2018 7:25 pm, Tom Lane wrote: > Alvaro Herrera writes: >> Seems to me that passing %-specifiers to the command would make it more >> useful (%u for "user", "host" etc) -- your command could refuse to give >> you a password for the superuser account for instance but grant one for >> a read-only user. > > It would also provide a *very* fertile source of shell-script-injection > vulnerabilities. (Whaddya mean, you tried to use a user name with a > quote mark in it?) Little Bobby Tables, we call him. :) I'm also concerned that that would let anybody who could alter the environment then let arbitrary code be run as user postgres. Is this something that poses a risk in addition to the current situation? Best regards, Tels
Re: Non-portable shell code in pg_upgrade tap tests
Moin, On Fri, July 20, 2018 10:55 am, Victor Wagner wrote: > On Fri, 20 Jul 2018 10:25:47 -0400 > Tom Lane wrote: > >> Victor Wagner writes: >> > I've discovered that in the branch REL_11_STABLE there is shell >> > script src/bin/pg_upgrade/test.sh which doesn't work under Solaris >> > 10. (it uses $(command) syntax which is not compatible with original >> > Solaris /bin/sh) > >> >> Please send a patch. Most of us do not have access to old shells > > Here it goes. Previous letter was written before fixed tests were > completed, because this old machine is slow. + *) if [ `find ${PGDATA} -type f ! -perm 640 | wc -l` -ne 0 ]; then Shouldn't ${PGDATA} in the above, as an argument to find, be quoted? Otherwise the shell would get confused if it contains spaces or other special characters. Regards, Tels
Re: pread() and pwrite()
> On 20 Jul 2018, at 17:34, Tom Lane wrote: > > Heikki Linnakangas writes: >> No objections, if you want to make the effort. But IMHO the lseek+read >> fallback is good enough on Windows. Unless you were thinking that we >> could then remove the !HAVE_PREAD fallback altogether. Are there any >> other platforms out there that don't have pread/pwrite that we care about? > > AFAICT, macOS has them as far back as we care about (prairiedog does). > HPUX 10.20 (gaur/pademelon) does not, so personally I'd like to keep > the lseek+read workaround. Don't know about the oldest Solaris critters > we have in the buildfarm. FreeBSD has had 'em at least since 4.0 (1994); > didn't check the other BSDen. The OpenBSD box I have access to has pwrite/pread, and have had for some time if I read the manpage right. cheers ./daniel
Re: Possible performance regression in version 10.1 with pgbench read-write tests.
Hi, On 2018-07-20 16:43:33 -0400, Tom Lane wrote: > Andres Freund writes: > > On 2018-07-20 15:35:39 -0400, Tom Lane wrote: > >> In any case, I strongly resist making performance-based changes on > >> the basis of one test on one kernel and one hardware platform. > > > Sure, it'd be good to do more of that. But from a theoretical POV it's > > quite logical that posix semas sharing cachelines is bad for > > performance, if there's any contention. When backed by futexes - > > i.e. all non ancient linux machines - the hot path just does a cmpxchg > > of the *userspace* data (I've copied the relevant code below). > > Here's the thing: the hot path is of little or no interest, because > if we are in the sema code at all, we are expecting to block. Note that we're also using semas for ProcArrayGroupClearXid(), which is pretty commonly hot for pgbench style workloads, and where the expected wait times are very short. > It's possible that the bigger picture here is that the kernel boys > optimized for the "uncontended" path to the point where they broke > performance of the blocking path. It's hard to see how they could > have broke it to the point of being slower than the SysV sema API, > though. I don't see how this is a likely proposition, given that adding padding to the *userspace* portion of futexes increased the performance quite significantly. > On my RHEL6 machine, with unmodified HEAD and 8 sessions (since I've > only got 8 cores) but other parameters matching Mithun's example, > I just got It's *really* common to have more actual clients than cpus for oltp workloads, so I don't think it's insane to test with more clients. Greetings, Andres Freund
Re: Possible performance regression in version 10.1 with pgbench read-write tests.
Andres Freund writes: > On 2018-07-20 15:35:39 -0400, Tom Lane wrote: >> In any case, I strongly resist making performance-based changes on >> the basis of one test on one kernel and one hardware platform. > Sure, it'd be good to do more of that. But from a theoretical POV it's > quite logical that posix semas sharing cachelines is bad for > performance, if there's any contention. When backed by futexes - > i.e. all non ancient linux machines - the hot path just does a cmpxchg > of the *userspace* data (I've copied the relevant code below). Here's the thing: the hot path is of little or no interest, because if we are in the sema code at all, we are expecting to block. The only case where we wouldn't block is if the lock manager decided the current process needs to sleep, but some other process already released us by the time we reach the futex/kernel call. Certainly that will happen some of the time, but it's not likely to be the way to bet. So I'm very dubious of any arguments based on the speed of the "uncontended" path. It's possible that the bigger picture here is that the kernel boys optimized for the "uncontended" path to the point where they broke performance of the blocking path. It's hard to see how they could have broke it to the point of being slower than the SysV sema API, though. Anyway, I think we need to test first and patch second. I'm working on getting some numbers on my own machines now. 
On my RHEL6 machine, with unmodified HEAD and 8 sessions (since I've only got 8 cores) but other parameters matching Mithun's example, I just got

transaction type:
scaling factor: 300
query mode: prepared
number of clients: 8
number of threads: 8
duration: 1800 s
number of transactions actually processed: 29001016
latency average = 0.497 ms
tps = 16111.575661 (including connections establishing)
tps = 16111.623329 (excluding connections establishing)

which is interesting because vmstat was pretty consistently reporting around 500K context swaps/second during the run, or circa 30 cs/transaction. We'd have a minimum of 14 cs/transaction just between client and server (due to seven SQL commands per transaction in TPC-B) so that seems on the low side; not a lot of lock contention here it seems. I wonder what the corresponding ratio was in Mithun's runs. regards, tom lane
Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes
On 2018-07-20 16:15:14 -0400, Tom Lane wrote: > We've seen several occurrences of $subject in the buildfarm in the past > month or so. Scraping the logs, I find > > coypu| 2018-06-14 21:17:49 | HEAD | Check | 2018-06-14 > 23:31:44.505 CEST [5b22deb8.30e1:124] ERROR: could not read block 3 in file > "base/16384/2662": read only 0 of 8192 bytes > coypu| 2018-06-14 21:17:49 | HEAD | Check | 2018-06-14 > 23:31:44.653 CEST [5b22deb8.6d4d:132] ERROR: could not read block 3 in file > "base/16384/2662": read only 0 of 8192 bytes > gull | 2018-06-21 04:02:11 | HEAD | Check | 2018-06-20 > 21:27:06.843 PDT [31138:3] FATAL: could not read block 3 in file > "base/16384/2662": read only 0 of 8192 bytes > mandrill | 2018-06-22 16:14:16 | HEAD | Check | 2018-06-22 > 16:46:24.138 UTC [14353240:43] pg_regress/create_table_like ERROR: could not > read block 3 in file "base/16384/2662": read only 0 of 8192 bytes > mandrill | 2018-06-22 16:14:16 | HEAD | Check | 2018-06-22 > 16:46:24.137 UTC [20710034:1] ERROR: could not read block 3 in file > "base/16384/2662": read only 0 of 8192 bytes > lapwing | 2018-07-05 07:20:02 | HEAD | Check | 2018-07-05 > 07:21:45.585 UTC [24022:1] ERROR: could not read block 3 in file > "base/16384/2662": read only 0 of 8192 bytes > lapwing | 2018-07-05 07:20:02 | HEAD | Check | 2018-07-05 > 07:21:45.591 UTC [23941:39] pg_regress/roleattributes ERROR: could not read > block 3 in file "base/16384/2662": read only 0 of 8192 bytes > mantid | 2018-07-17 01:20:04 | REL_11_STABLE | Check | 2018-07-16 > 21:21:32.557 EDT [26072:1] ERROR: could not read block 3 in file > "base/16384/2662": read only 0 of 8192 bytes > mantid | 2018-07-17 01:20:04 | REL_11_STABLE | Check | 2018-07-16 > 21:21:32.557 EDT [25990:82] ERROR: could not read block 3 in file > "base/16384/2662": read only 0 of 8192 bytes > coypu| 2018-07-17 01:27:39 | HEAD | Check | 2018-07-17 > 03:47:26.264 CEST [5b4d4aa4.3ac4:71] ERROR: could not read block 3 in file > "base/16384/2662": read only 0 of 8192 
bytes > coypu| 2018-07-20 11:18:02 | HEAD | Check | 2018-07-20 > 13:32:14.104 CEST [5b51c833.4884:131] ERROR: could not read block 3 in file > "base/16384/2662": read only 0 of 8192 bytes > > While I didn't go back indefinitely far, there are no other occurrences of > "could not read block" failures in the last three months. This suggests > strongly that we broke something in early June. Don't know what though. > Ideas? > > In case anyone's wondering, 2662 is pg_class_oid_index. While I do not immediately see how, and the affected versions don't really point that way, the timing does make me think of http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=a54e1f1587793b5bf926630ec9ce142e4578d7a0 HEAD/REL_11_STABLE apparently solely being affected points elsewhere, but I don't immediately know where. Greetings, Andres Freund
Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory
On 01/03/18 12:40, Heikki Linnakangas wrote: On 16/01/18 15:00, Yoshimi Ichiyanagi wrote: These patches enable to use Persistent Memory Development Kit(PMDK)[1] for reading/writing WAL logs on persistent memory(PMEM). PMEM is next generation storage and it has a number of nice features: fast, byte-addressable and non-volatile. Interesting. How does this compare with using good old mmap()? I think just doing that would allow eliminating much of the complexity around managing the shared_buffers. And if the OS is smart about persistent memory (I don't know what the state of the art on that is), presumably msync() and fsync() on a file that lives in persistent memory is lightning fast. I briefly looked at the docs at pmem.io. pmem_map_file() uses mmap() under the hood, but it does some extra checks to test if the file is on a persistent memory device, and makes a note of it. I think the way forward with this patch would be to map WAL segments with plain old mmap(), and use msync(). If that's faster than the status quo, great. If not, it would still be a good stepping stone for actually using PMDK. If nothing else, it would provide a way to test most of the code paths, without actually having a persistent memory device, or libpmem. The examples at http://pmem.io/pmdk/libpmem/ actually suggest doing exactly that: use libpmem to map a file to memory, and check if it lives on persistent memory using libpmem's pmem_is_pmem() function. If it returns yes, use pmem_drain(); if it returns false, fall back to using msync(). - Heikki
buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes
We've seen several occurrences of $subject in the buildfarm in the past month or so. Scraping the logs, I find coypu| 2018-06-14 21:17:49 | HEAD | Check | 2018-06-14 23:31:44.505 CEST [5b22deb8.30e1:124] ERROR: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes coypu| 2018-06-14 21:17:49 | HEAD | Check | 2018-06-14 23:31:44.653 CEST [5b22deb8.6d4d:132] ERROR: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes gull | 2018-06-21 04:02:11 | HEAD | Check | 2018-06-20 21:27:06.843 PDT [31138:3] FATAL: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes mandrill | 2018-06-22 16:14:16 | HEAD | Check | 2018-06-22 16:46:24.138 UTC [14353240:43] pg_regress/create_table_like ERROR: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes mandrill | 2018-06-22 16:14:16 | HEAD | Check | 2018-06-22 16:46:24.137 UTC [20710034:1] ERROR: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes lapwing | 2018-07-05 07:20:02 | HEAD | Check | 2018-07-05 07:21:45.585 UTC [24022:1] ERROR: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes lapwing | 2018-07-05 07:20:02 | HEAD | Check | 2018-07-05 07:21:45.591 UTC [23941:39] pg_regress/roleattributes ERROR: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes mantid | 2018-07-17 01:20:04 | REL_11_STABLE | Check | 2018-07-16 21:21:32.557 EDT [26072:1] ERROR: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes mantid | 2018-07-17 01:20:04 | REL_11_STABLE | Check | 2018-07-16 21:21:32.557 EDT [25990:82] ERROR: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes coypu| 2018-07-17 01:27:39 | HEAD | Check | 2018-07-17 03:47:26.264 CEST [5b4d4aa4.3ac4:71] ERROR: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes coypu| 2018-07-20 11:18:02 | HEAD | Check | 2018-07-20 13:32:14.104 CEST [5b51c833.4884:131] ERROR: could 
not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes While I didn't go back indefinitely far, there are no other occurrences of "could not read block" failures in the last three months. This suggests strongly that we broke something in early June. Don't know what though. Ideas? In case anyone's wondering, 2662 is pg_class_oid_index. regards, tom lane
Re: Possible performance regression in version 10.1 with pgbench read-write tests.
On 2018-07-20 15:35:39 -0400, Tom Lane wrote: > Andres Freund writes: > > On 2018-07-21 00:53:28 +0530, Mithun Cy wrote: > >> I did a quick test applying the patch with same settings as initial mail I > >> have reported (On postgresql 10 latest code) > >> 72 clients > >> > >> CASE 1: > >> Without Patch : TPS 29269.823540 > >> > >> With Patch : TPS 36005.544960.--- 23% jump > >> > >> Just Disabling using unnamed POSIX semaphores: TPS 34481.207959 > > >> So it seems that is the issue as the test is being run on 8 node numa > >> machine. > > > Cool. I think we should just backpatch that then. Does anybody want to > > argue against? > > Not entirely clear to me what change is being proposed here? Adding padding to struct PGSemaphoreData, so multiple semas don't share a cacheline. > In any case, I strongly resist making performance-based changes on > the basis of one test on one kernel and one hardware platform. > We should reproduce the results on a few different machines before > we even think of committing anything. I'm happy to test on what > I have, although I'd be the first to agree that what I'm checking > is relatively low-end cases. (Too bad hydra's gone.) Sure, it'd be good to do more of that. But from a theoretical POV it's quite logical that posix semas sharing cachelines is bad for performance, if there's any contention. When backed by futexes - i.e. all non ancient linux machines - the hot path just does a cmpxchg of the *userspace* data (I've copied the relevant code below). Given that we don't have a large number of semas these days, that there's reasons to make the change even without measuring it, that we have benchmark results, and that it's hard to see how it'd cause regressions, I don't think going for a fix quickly is unreasonable. You could argue it'd be better to make semaphores be embeddable in bigger structures like PGPROC, rather than allocated in an array. 
While I suspect you'd get a bit of a performance benefit from that, it seems like a far bigger change than we'd want to do in a minor release. int __new_sem_wait (sem_t *sem) { /* We need to check whether we need to act upon a cancellation request here because POSIX specifies that cancellation points "shall occur" in sem_wait and sem_timedwait, which also means that they need to check this regardless whether they block or not (unlike "may occur" functions). See the POSIX Rationale for this requirement: Section "Thread Cancellation Overview" [1] and austin group issue #1076 [2] for thoughts on why this may be a suboptimal design. [1] http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xsh_chap02.html [2] http://austingroupbugs.net/view.php?id=1076 */ __pthread_testcancel (); if (__new_sem_wait_fast ((struct new_sem *) sem, 0) == 0) return 0; else return __new_sem_wait_slow((struct new_sem *) sem, NULL); } /* Fast path: Try to grab a token without blocking. */ static int __new_sem_wait_fast (struct new_sem *sem, int definitive_result) { /* We need acquire MO if we actually grab a token, so that this synchronizes with all token providers (i.e., the RMW operation we read from or all those before it in modification order; also see sem_post). We do not need to guarantee any ordering if we observed that there is no token (POSIX leaves it unspecified whether functions that fail synchronize memory); thus, relaxed MO is sufficient for the initial load and the failure path of the CAS. If the weak CAS fails and we need a definitive result, retry. 
*/ #if __HAVE_64B_ATOMICS uint64_t d = atomic_load_relaxed (&sem->data); do { if ((d & SEM_VALUE_MASK) == 0) break; if (atomic_compare_exchange_weak_acquire (&sem->data, &d, d - 1)) return 0; } while (definitive_result); return -1; #else unsigned int v = atomic_load_relaxed (&sem->value); do { if ((v >> SEM_VALUE_SHIFT) == 0) break; if (atomic_compare_exchange_weak_acquire (&sem->value, &v, v - (1 << SEM_VALUE_SHIFT))) return 0; } while (definitive_result); return -1; #endif } /* See sem_wait for an explanation of the algorithm. */ int __new_sem_post (sem_t *sem) { struct new_sem *isem = (struct new_sem *) sem; int private = isem->private; #if __HAVE_64B_ATOMICS /* Add a token to the semaphore. We use release MO to make sure that a thread acquiring this token synchronizes with us and other threads that added tokens before (the release sequence includes atomic RMW operations by other threads). */ /* TODO Use atomic_fetch_add to make it scale better than a CAS loop? */ uint64_t d = atomic_load_relaxed (&isem->data); do { if ((d & SEM_VALUE_MASK) == SEM_VALUE_MAX) { __set_errno (EOVERFLOW); return -1; } } while (!atomic_compare_exchange_weak_release (&isem->data, &d, d + 1)); /* If there is any potentially
Re: Possible performance regression in version 10.1 with pgbench read-write tests.
Andres Freund writes: > On 2018-07-21 00:53:28 +0530, Mithun Cy wrote: >> I did a quick test applying the patch with same settings as initial mail I >> have reported (On postgresql 10 latest code) >> 72 clients >> >> CASE 1: >> Without Patch : TPS 29269.823540 >> >> With Patch : TPS 36005.544960.--- 23% jump >> >> Just Disabling using unnamed POSIX semaphores: TPS 34481.207959 >> So it seems that is the issue as the test is being run on 8 node numa >> machine. > Cool. I think we should just backpatch that then. Does anybody want to > argue against? Not entirely clear to me what change is being proposed here? In any case, I strongly resist making performance-based changes on the basis of one test on one kernel and one hardware platform. We should reproduce the results on a few different machines before we even think of committing anything. I'm happy to test on what I have, although I'd be the first to agree that what I'm checking is relatively low-end cases. (Too bad hydra's gone.) regards, tom lane
Re: Possible performance regression in version 10.1 with pgbench read-write tests.
On Fri, Jul 20, 2018 at 10:52 AM, Thomas Munro < thomas.mu...@enterprisedb.com> wrote: > On Fri, Jul 20, 2018 at 7:56 AM, Tom Lane wrote: > > > > It's not *that* noticeable, as I failed to demonstrate any performance > > difference before committing the patch. I think some more investigation > > is warranted to find out why some other people are getting different > > results > Maybe false sharing is a factor, since sizeof(sem_t) is 32 bytes on > Linux/amd64 and we're probably hitting elements clustered at one end > of the array? Let's see... I tried sticking padding into > PGSemaphoreData and I got ~8% more TPS (72 client on multi socket > box, pgbench scale 100, only running for a minute but otherwise the > same settings that Mithun showed). > > --- a/src/backend/port/posix_sema.c > +++ b/src/backend/port/posix_sema.c > @@ -45,6 +45,7 @@ > typedef struct PGSemaphoreData > { > sem_t pgsem; > + char padding[PG_CACHE_LINE_SIZE - sizeof(sem_t)]; > } PGSemaphoreData; > > That's probably not the right idiom and my tests probably weren't long > enough, but there seems to be some effect here. > I did a quick test applying the patch with same settings as initial mail I have reported (On postgresql 10 latest code) 72 clients CASE 1: Without Patch : TPS 29269.823540 With Patch : TPS 36005.544960 --- 23% jump Just Disabling using unnamed POSIX semaphores: TPS 34481.207959 So it seems that is the issue as the test is being run on 8 node numa machine. I also came across a presentation [1] : slide 20 which says one of those futex architecture is bad for NUMA machine. I am not sure the new fix for same is included as part of Linux version 3.10.0-693.5.2.el7.x86_64 which is on my test machine. [1] https://www.slideshare.net/davidlohr/futex-scaling-for-multicore-systems -- Thanks and Regards Mithun C Y EnterpriseDB: http://www.enterprisedb.com
Re: Segfault logical replication PG 10.4
On 2018-Jul-19, Quan TRAN wrote: > Hello, > > In worker.c: > > - in apply_handle_insert, slot_store_cstrings is called before > PushActiveSnapshot > - in apply_handle_update & apply_handle_delete, slot_store_cstrings is > called after PushActiveSnapshot > > /* Process and store remote tuple in the slot */ > oldctx = MemoryContextSwitchTo(GetPerTupleMemoryContext(estate)); > slot_store_cstrings(remoteslot, rel, newtup.values); > slot_fill_defaults(rel, estate, remoteslot); > MemoryContextSwitchTo(oldctx); > > PushActiveSnapshot(GetTransactionSnapshot()); > > Should this be the cause? Yeah, moving the PushActiveSnapshot call to just before slot_store_cstrings sounds right. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Add constraint in a Materialized View
On Wed, Jul 18, 2018 at 09:28:19AM +0200, Kaye Ann Ignacio wrote: > I'm trying to add a foreign constraint in my local table to reference a > column in a materialized view, is it possible to alter this materialized > view by adding a primary key constraint? It's not, but I'm working on a patch for that and much more. Basically, I want to integrate something like https://github.com/twosigma/postgresql-contrib/blob/master/mat_views.sql I currently use that code instead of PG native materialized views. Nico --
Re: Making "COPY partitioned_table FROM" faster
The following review has been posted through the commitfest application: make installcheck-world: tested, passed Implements feature: tested, passed Spec compliant: not tested Documentation: not tested This patch applied cleanly and worked as expected. Patch description: This patch implements the use of heap_multi_insert() for partition tables when using the COPY FROM command, benefiting the performance of COPY FROM in cases in which multiple tuples in the file are destined for the same partition. Tests: All tests passed make check-world and TAP tests Functionality: Specifically, we tried the cases in the bottom section "Exercising the code" and found all behavior as described and expected. Note that we did conduct performance testing for our test cases enumerated in this section using COPY FROM PROGRAM comparing the patched and unpatched code using one example with tuples all destined for the same partition and one example with tuples destined for alternating partitions. We saw a similar performance improvement to the one recorded by David in his email, including a decrease in performance for copying data with tuples alternating between destination partitions as compared to the unpatched code. However, in our cases below, we focus on exercising the code added as opposed to on performance. 
Feature feedback from a user perspective: There will be two differences in performance that may be perceptible to the user after the inclusion of this patch: 1) the relatively small decrease in performance (as compared to master) for copying data from a file or program in which the destination partition changes frequently 2) the stark contrast between the performance of copying data destined for the same partition and data destined for alternating partitions Based on the numbers in David's email, the performance difference is 27721.054 ms patched for copying data destined for the same partition vs 46151.844 ms patched for copying data destined for two different partitions alternating each row. Because both differences could impact users in data loading, it is worth considering making this transparent to the user in some way. Because this will be noticeable to the user, and, furthermore, because it is potentially within the power of the user to make changes to their data copying strategy given this information, it might be prudent to either create a GUC allowing the user to disable heap_multi_insert for COPY FROM (if the user knows the data he/she is copying over is very varied) or to add a comment to the docs that when using COPY FROM for a partition table, it is advisable to sort the data in some manner which would optimize the number of consecutive tuples destined for the same partition. Code Review: Naming of newly introduced enum: The CopyInsertMethod enum value CIM_MULTI_PARTITION is potentially confusing, since, for a partition table with BEFORE or INSTEAD INSERT triggers on child partitions, only heap_insert is valid, requiring the additional variable part_can_multi_insert to determine if heap_multi_insert can be used or not. It might be more clear to name it something that indicates it is a candidate for multi-insertion, pending further review. This decouples the state of potentially applicable multi-insertion from the table being a partition table. 
In the future, perhaps, some other feature might make a table conditionally eligible for multi-insertion of tuples, so, it may be desirable to use this intermediate state for other kinds of tables as well. Alternatively, it might make it more clear if the variable part_can_multi_insert was clearly for a leaf partition table, since this can change from leaf to leaf, maybe named leaf_part_can_multi_insert Code comment potential typo corrections: In the following comment, in the patched code line 2550 /* ... * Note: It does not matter if any partitions have any volatile default * expression as we use the defaults from the target of the COPY command. */ It seems like "volatile default expression" should be "volatile default expressions" In the following comment, in the patched code starting on line 2582 /* ... * the same partition as the previous tuple, and since we flush the * previous partitions buffer once the new tuple has already been * built, we're unable to reset the estate since we'd free the memory * which the new tuple is stored. To work around this we maintain a * secondary expression context and alternate between these when the ... */ "previous partitions" should probably be "previous partition's" and "since we'd free the memory which the new tuple is stored" should be "since we'd free the memory in which the new tuple is stored" In the following comment, in the patched code starting on line 2769 /* * We'd better make the bulk insert mechanism gets a new *
Add constraint in a Materialized View
Hi, I'm trying to add a foreign constraint in my local table to reference a column in a materialized view, is it possible to alter this materialized view by adding a primary key constraint? Thank you. -- Kaye Ann Ignacio, Programmer proceedit "the BPaaS company" kaye.igna...@proceedit.com +34 679188011 (mobile)
Re: [WIP] [B-Tree] Retail IndexTuple deletion
On Thu, Jul 19, 2018 at 4:29 AM, Masahiko Sawada wrote: >> One area that might be worth investigating is retail index tuple >> deletion performed within the executor in the event of non-HOT >> updates. Maybe LP_REDIRECT could be repurposed to mean "ghost record", >> at least in unique index tuples with no NULL values. The idea is that >> MVCC index scans can skip over those if they've already found a >> visible tuple with the same value. > > I think that's a good idea. The overhead of marking it as ghost seems > small and it would speed up index scans. If MVCC index scans have > already found a visible tuples with the same value they can not only > skip scanning but also kill them? If can, we can kill index tuples > without checking the heap. I think you're talking about making LP_REDIRECT marking in nbtree represent a "recently dead" hint: the deleting transaction has committed, and so we are 100% sure that the tuple is about to become garbage, but it cannot be LP_DEAD just yet because it needs to stay around for the benefit of at least one existing snapshot/long running transaction. That's a different idea to what I talked about, since it could perhaps work in a way that's much closer to LP_DEAD/kill_prior_tuple (no extra executor stuff is required). I'm not sure if your idea is better or worse than what I suggested, though. It would certainly be easier to implement. -- Peter Geoghegan
Re: Negotiating the SCRAM channel binding type
On 12/07/18 16:08, Michael Paquier wrote: On Thu, Jul 12, 2018 at 12:34:51PM +0300, Heikki Linnakangas wrote: Hmm. That is actually in a section called "Default Channel Binding". And the first paragraph says "A default channel binding type agreement process for all SASL application protocols that do not provide their own channel binding type agreement is provided as follows". Given that, it's not entirely clear to me if the requirement to support tls-unique is for all implementations of SCRAM, or only those applications that don't provide their own channel binding type agreement. I am not sure, but I get that as tls-unique must be the default if available, so if it is technically possible to have it we should have it in priority. If not, then a channel binding type which is supported by both the server and the client can be chosen. Another interesting piece of legalese is in RFC 5929 Channel Bindings for TLS: For many applications, there may be two or more potentially applicable TLS channel binding types. Existing security frameworks (such as the GSS-API [RFC2743] or the SASL [RFC4422] GS2 framework [RFC5801]) and security mechanisms generally do not support negotiation of channel binding types. Therefore, application peers need to agree a priori as to what channel binding type to use (or agree to rules for deciding what channel binding type to use). The specifics of whether and how to negotiate channel binding types are beyond the scope of this document. However, it is RECOMMENDED that application protocols making use of TLS channel bindings, use 'tls-unique' exclusively, except, perhaps, where server-side proxies are common in deployments of an application protocol. In the latter case an application protocol MAY specify that 'tls-server-end-point' channel bindings must be used when available, with 'tls-unique' being used when 'tls-server-end-point' channel bindings are not available. 
Alternatively, the application may negotiate which channel binding type to use, or may make the choice of channel binding type configurable. In any case, I'm quite convinced now that we should just remove tls-unique, and stick to tls-server-end-point. Patch attached, please take a look. - Heikki >From 4e2f010692aaef97d4e16822359e86073a53e96b Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Fri, 20 Jul 2018 20:02:17 +0300 Subject: [PATCH 1/2] Remove support for tls-unique channel binding. There are some problems with the tls-unique channel binding type. It's not supported by all SSL libraries, and strictly speaking it's not defined for TLS 1.3 at all, even though at least in OpenSSL, the functions used for it still seem to work with TLS 1.3 connections. And since we had no mechanism to negotiate what channel binding type to use, there would be awkward interoperability issues if a server only supported some channel binding types. tls-server-end-point seems feasible to support with any SSL library, so let's just stick to that. This removes the scram_channel_binding libpq option altogether, since there is now only one supported channel binding type. This also removes all the channel binding tests from the SSL test suite. It would be good to actually have some tests for channel binding, but these tests were really just testing the scram_channel_binding option, which is now gone. I also removed the SCRAM_CHANNEL_BINDING_TLS_END_POINT constant. This is a matter of taste, but IMO it's more readable to just use the "tls-server-end-point" string. Refactor the checks on whether the SSL library supports the functions needed for tls-server-end-point channel binding. Now the server won't advertise, and the client won't choose, the SCRAM-SHA-256-PLUS variant, if compiled with an OpenSSL version too old to support it. 
In passing, add some sanity checks to check that the chosen SASL mechanism, SCRAM-SHA-256 or SCRAM-SHA-256-PLUS, matches whether the SCRAM exchange used channel binding or not. For example, if the client selects the non-channel-binding variant SCRAM-SHA-256, but in the SCRAM message uses channel binding anyway. It's harmless from a security point of view, I believe, and I'm not sure if there are some other conditions that would cause the connection to fail, but it seems better to be strict about these things and check explicitly. Discussion: https://www.postgresql.org/message-id/ec787074-2305-c6f4-86aa-6902f98485a4%40iki.fi --- doc/src/sgml/libpq.sgml | 28 doc/src/sgml/protocol.sgml | 28 ++-- doc/src/sgml/release-11.sgml | 5 +- src/backend/libpq/auth-scram.c | 218 +-- src/backend/libpq/auth.c | 61 +++-- src/backend/libpq/be-secure-openssl.c | 27 +--- src/include/common/scram-common.h | 4 - src/include/libpq/libpq-be.h | 12 +- src/include/libpq/scram.h | 4 +-
Re: Faster str to int conversion (was Table with large number of int columns, very slow COPY FROM)
Hi, On 2018-07-20 08:27:34 -0400, Robert Haas wrote: > On Thu, Jul 19, 2018 at 4:32 PM, Andres Freund wrote: > >> 1. Why the error message changes? If there's a good reason, it should > >> be done as a separate commit, or at least well-documented in the > >> commit message. > > > > Because there's a lot of "invalid input syntax for type %s: \"%s\"", > > error messages, and we shouldn't force translators to have separate > > version that inlines the first %s. But you're right, it'd be worthwhile > > to point that out in the commit message. > > It just seems weird that they're bundled together in one commit like this. I'll push it separately. > Nothing else from me... Thanks for looking! Greetings, Andres Freund
Re: [bug fix] Produce a crash dump before main() on Windows
On Wed, Jul 18, 2018 at 4:38 AM, Tsunakawa, Takayuki wrote: > IIRC, the pop up doesn't appear under Windows service. If you start the > database server with pg_ctl start on the command prompt, the pop up will > appear, which I think is not bad. Hmm. I think that might be bad. What makes you think it isn't? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Expression errors with "FOR UPDATE" and postgres_fdw with partition wise join enabled.
On Fri, Jul 20, 2018 at 8:38 AM, Robert Haas wrote: > This looks like a terrible design to me. If somebody in future > invents a new plan type that is not projection-capable, then this > could fail an assertion here and there won't be any simple fix. And > if you reach here and the child targetlists somehow haven't been > adjusted, then you won't even fail an assertion, you'll just crash (or > maybe return wrong answers). To elaborate on this thought a bit, it is currently the case that all scan and join types are projection-capable, and most likely we would make any such node types added in the future projection-capable as well, but it still doesn't seem like a good idea to me to have tangentially-related parts of the system depend on that in subtle ways. Also, even today, this would fail if the subplan happened to be a Sort, and it's not very obvious why that couldn't happen. If it can't happen today, it's not obvious why a future code change couldn't introduce cases where it can happen. More broadly, the central idea of this patch is that we can set the target list for a child rel to something other than the target list that we expect it will eventually produce, and then at plan creation time fix it. I think that's a bad idea. The target list affects lots of things, like costing. If we don't insert a ConvertRowTypeExpr into the child's target list, the costing will be wrong to the extent that ConvertRowTypeExpr has any cost, which it does. Any other current or future property that is computed based on the target list -- parallel-safety, width, security-level, whatever -- could also potentially end up with a wrong value if the target list as written does not match the target list as will actually be produced. It's true that, in all of these areas, the ConvertRowTypeExpr isn't likely to have a lot of impact; it probably won't have a big effect on costing and may not actually cause a problem for the other things that I mentioned. 
On the other hand, if it does cause a serious problem, what options would we have for fixing it other than to back out your entire patch and solve the problem some other way? The idea that derived properties won't be correct unless they are based on the real target list is pretty fundamental, and I think deviating from the idea that we get the correct target list first and then compute the derived properties afterward is likely to cause real headaches for future developers. In short, plan creation time is just the wrong time to change the plan. It's just a time to translate the plan from the form needed by the planner to the form needed by the executor. It would be fine if the change you were making were only cosmetic, but it's not. After looking over both patches, I think Ashutosh Bapat has basically the right idea. What he is proposing is a localized fix that doesn't seem to make any changes to the way things work overall. I do think his patches need to be fixed up a bit to avoid conflating ConvertRowtypeExpr with "converted whole-row reference." The two are certainly not equivalent; the latter is a narrow special case. Some of the comments could use some more wordsmithing, too, I think. Apart from those gripes, though, I think it's solving the problem the correct way: don't build the wrong plan and try to fix it, just build the right plan from the beginning. There are definitely some things not to like about this approach. In particular, I definitely agree that treating a converted whole-row reference specially is not very nice. It feels like it might be substantially cleaner to be able to represent this idea as a single node rather than a combination of ConvertRowtypeExpr and var with varattno = 0. 
Perhaps in the future we ought to consider either (a) trying to use a RowExpr for this rather than ConvertRowtypeExpr or (b) inventing a new WholeRowExpr node that stores two RTIs, one for the relation that's generating the whole-row reference and the other for the relation that is controlling the column ordering of the result or (c) allowing a Var to represent a whole-row expression that has to produce its outputs according to the ordering of some other RTE. But I don't think it's wise or necessary to whack that around just to fix this bug; it is refactoring or improvement work best left to a future release. Also, it is definitely a real shame that we have to build attr_needed data for each child separately. Right now, we've got to build a whole lot of things for each child individually, and that's slow and inefficient. We're not going to scale nicely to large partitioning hierarchies unless we can figure out some way of minimizing the work we do for each partition. So, the fact that Fujii-san's approach avoids needing to compute attr_needed in some cases is very appealing. However, in my opinion, it's nowhere near appealing enough to justify trying to do surgery on the target list at plan-creation time. I think we need to leave optimizing this to
Re: Internal error XX000 with enable_partition_pruning=on, pg 11 beta1 on Debian
On 20 July 2018 at 21:44, Amit Langote wrote: > But I don't think the result of make_partition_pruneinfo itself has to be > List of PartitionedRelPruneInfo nested under PartitionPruneInfo. I gather > that each PartitionPruneInfo corresponds to each root partitioned table > and a PartitionedRelPruneInfo contains the actual pruning information, > which is created for every partitioned table (non-leaf tables), including > the root tables. I don't think such nesting is necessary. I think that > just like flattened partitioned_rels list, we should put flattened list of > PartitionedRelPruneInfo into the Append or MergeAppend plan. No need for > nesting PartitionedRelPruneInfo under PartitionPruneInfo. To do that properly you'd need to mark the target partitioned table of each hierarchy. Your test of pg_class.relispartition is not going to work as you're assuming the query is always going to the root. The user might query some intermediate partitioned table (which will have relispartition = true). Your patch will fall flat in that case. You could work around that by having some array that points to the target partitioned table of each hierarchy, but I don't see why that's better than having the additional struct. There's also some code inside ExecFindInitialMatchingSubPlans() which does a backward scan over the partitions. This must process children before their parents. Unsure how well that's going to work if we start mixing the hierarchies. I'm sure it can be made to work providing each hierarchy is stored together consecutively in the array, but it just seems pretty fragile to me. That code is already pretty hard to follow. What's the reason you don't like the additional level to represent multiple hierarchies? -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Non-portable shell code in pg_upgrade tap tests
Victor Wagner writes: > Tom Lane wrote: >> Please send a patch. Most of us do not have access to old shells > Here it goes. Previous letter was written before fixed tests were > completed, because this old machine is slow. Thanks. Will check on my own dinosaurs, and push if I don't find a problem. regards, tom lane
Re: [HACKERS] possible self-deadlock window after bad ProcessStartupPacket
On 2018-07-20 11:53:32 +0300, Heikki Linnakangas wrote: > ISTM that no-one has any great ideas on what to do about the ereport() in > quickdie(). I think we actually have some decent ideas how to make that less problematic in a number of cases. Avoid translation (thereby avoiding direct malloc()) and rely on the fact that the error message evaluation happens in a memory context with pre-allocated memory. That still leaves openssl, but is better than nothing. If we wanted, we additionally could force ssl allocations to happen through another context with preallocated memory. > But I think we have consensus on replacing the exit(2) calls > with _exit(2). If we do just that, it would be better than the status quo, > even if it doesn't completely fix the problem. This would prevent the case > that Asim reported, for starters. Right. You're planning to backpatch? > With _exit(), I think we wouldn't really need the "PG_SETMASK();" > calls in the aux process *_quickdie handlers that don't do anything else > than call _exit(). But I didn't dare to remove them yet. I think we should remove it at least in master. > diff --git a/src/backend/postmaster/bgwriter.c > b/src/backend/postmaster/bgwriter.c > index 960d3de204..010d53a6ce 100644 > --- a/src/backend/postmaster/bgwriter.c > +++ b/src/backend/postmaster/bgwriter.c > @@ -404,22 +404,21 @@ bg_quickdie(SIGNAL_ARGS) > /* >* We DO NOT want to run proc_exit() callbacks -- we're here because >* shared memory may be corrupted, so we don't want to try to clean up > our > - * transaction. Just nail the windows shut and get out of town. Now > that > - * there's an atexit callback to prevent third-party code from breaking > - * things by calling exit() directly, we have to reset the callbacks > - * explicitly to make this work as intended. > - */ > - on_exit_reset(); > - > - /* > - * Note we do exit(2) not exit(0). 
This is to force the postmaster > into a > - * system reset cycle if some idiot DBA sends a manual SIGQUIT to a > random > - * backend. This is necessary precisely because we don't clean up our > - * shared memory state. (The "dead man switch" mechanism in pmsignal.c > - * should ensure the postmaster sees this as a crash, too, but no harm > in > - * being doubly sure.) > + * transaction. Just nail the windows shut and get out of town. > + * > + * There's an atexit callback to prevent third-party code from breaking > + * things by calling exit() directly. We don't want to trigger that, so > + * we use _exit(), which doesn't run atexit callbacks, rather than > exit(). > + * And exit() wouldn't be safe to run from a signal handler, anyway. It's much less the exit() that's unsafe, than the callbacks themselves, right? Perhaps just restate that we wouldn't want to trigger atexit processing due to signal safety? > + * Note we use _exit(2) not _exit(0). This is to force the postmaster > + * into a system reset cycle if some idiot DBA sends a manual SIGQUIT > to a > + * random backend. This is necessary precisely because we don't clean > up > + * our shared memory state. (The "dead man switch" mechanism in > + * pmsignal.c should ensure the postmaster sees this as a crash, too, > but > + * no harm in being doubly sure.) >*/ > - exit(2); > + _exit(2); > } Greetings, Andres Freund
Re: [HACKERS] logical decoding of two-phase transactions
On 2018-07-20 12:13:19 +0530, Nikhil Sontakke wrote: > Hi Andres, > > > > So what if we, at the begin / end of cache miss handling, re-check if > > the to-be-decoded transaction is still in-progress (or has > > committed). And we throw an error if that happened. That error is then > > caught in reorderbuffer, the in-progress-xact aborted callback is > > called, and processing continues (there's a couple nontrivial details > > here, but it should be doable). > > > > The biggest issue is what constitutes a "cache miss". It's fairly > > trivial to do this for syscache / relcache, but that's not sufficient: > > there's plenty cases where catalogs are accessed without going through > > either. But as far as I can tell if we declared that all historic > > accesses have to go through systable_beginscan* - which'd imo not be a > > crazy restriction - we could put the checks at that layer. > > > > Documenting that historic accesses go through systable_* APIs does > seem reasonable. In our earlier discussions, we felt asking plugin > writers to do anything along these lines was too onerous and > cumbersome to expect. But they don't really need to do that - in just about all cases access "automatically" goes through systable_* or layers above. If you call output functions, do syscache lookups, etc you're good. > > That'd require that an index lookup can't crash if the corresponding > > heap entry doesn't exist (etc), but that's something we need to handle > > anyway. The issue that multiple separate catalog lookups need to be > > coherent (say Robert's pg_class exists, but pg_attribute doesn't > > example) is solved by virtue of the the pg_attribute lookups failing if > > the transaction aborted. > > > > Am I missing something here? > > > > Are you suggesting we have a: > > PG_TRY() > { > Catalog_Access(); > } > PG_CATCH() > { > Abort_Handling(); > } > > here? Not quite, no. 
Basically, in a simplified manner, the logical decoding loop is like:

    while (true)
        record = readRecord()
        logical = decodeRecord()

        PG_TRY():
            StartTransactionCommand();

            switch (TypeOf(logical))
                case INSERT:
                    insert_callback(logical);
                    break;
                ...

            CommitTransactionCommand();
        PG_CATCH():
            AbortCurrentTransaction();
            PG_RE_THROW();

What I'm proposing is that the various catalog access functions throw a
new class of error, something like "decoding aborted transactions". The
PG_CATCH() above would then not unconditionally re-throw, but set a flag
and continue iff that class of error was detected:

    while (true)
        if (in_progress_xact_abort_pending)
            StartTransactionCommand();
            in_progress_xact_abort_callback(made_up_record);
            in_progress_xact_abort_pending = false;
            CommitTransactionCommand();

        record = readRecord()
        logical = decodeRecord()

        PG_TRY():
            StartTransactionCommand();

            switch (TypeOf(logical))
                case INSERT:
                    insert_callback(logical);
                    break;
                ...

            CommitTransactionCommand();
        PG_CATCH():
            AbortCurrentTransaction();
            if (errclass == DECODING_ABORTED_XACT)
                in_progress_xact_abort_pending = true;
                continue;
            else
                PG_RE_THROW();

Now obviously that's just pseudo code with lotsa things missing, but I
think the basic idea should come through?

Greetings,

Andres Freund
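To make the proposed control flow concrete, here is a minimal, self-contained C sketch with setjmp/longjmp standing in for PG_TRY/PG_CATCH; all names (ErrClass, catalog_access, decode_loop) are invented for illustration, and this is not PostgreSQL code:

```c
#include <setjmp.h>
#include <stdbool.h>

/* Hypothetical error classes standing in for elog's; names are invented. */
typedef enum { ERR_NONE, ERR_DECODING_ABORTED_XACT } ErrClass;

static jmp_buf catch_buf;
static ErrClass errclass = ERR_NONE;

/* Stand-in for a systable_* access that notices the to-be-decoded
 * transaction aborted underneath it and "throws". */
static void catalog_access(bool xact_aborted)
{
    if (xact_aborted)
    {
        errclass = ERR_DECODING_ABORTED_XACT;
        longjmp(catch_buf, 1);          /* emulates ereport(ERROR, ...) */
    }
}

/* Returns how many records were decoded successfully. */
static int decode_loop(const bool *aborted, int nrecords)
{
    volatile int  decoded = 0;
    volatile bool abort_pending = false;

    for (int i = 0; i < nrecords; i++)
    {
        if (abort_pending)
        {
            /* in_progress_xact_abort_callback() would run here */
            abort_pending = false;
        }
        if (setjmp(catch_buf) == 0)     /* PG_TRY() */
        {
            catalog_access(aborted[i]); /* the re-check on "cache miss" */
            decoded++;                  /* insert_callback() etc. */
        }
        else if (errclass == ERR_DECODING_ABORTED_XACT)
        {
            abort_pending = true;       /* swallow only this error class */
        }
        /* any other error class would be re-thrown here (PG_RE_THROW) */
    }
    return decoded;
}
```

The point of the sketch is only the shape of the catch: one error class is absorbed and turned into a pending-abort flag, everything else would propagate.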
Re: Making "COPY partitioned_table FROM" faster
(This email seemed to only come to me and somehow missed the mailing
list. Including the message here in its entirety.)

On 20 July 2018 at 13:26, Karen Huddleston wrote:
> The following review has been posted through the commitfest application:
> make installcheck-world: tested, passed
> Implements feature:      tested, passed
> Spec compliant:          not tested
> Documentation:           not tested
>
> This patch applied cleanly and worked as expected.
>
> Patch description:
> This patch implements the use of heap_multi_insert() for partition
> tables when using the COPY FROM command, benefiting the performance of
> COPY FROM in cases in which multiple tuples in the file are destined
> for the same partition.
>
> Tests:
> All tests passed make check-world and TAP tests.
>
> Functionality:
> Specifically, we tried the cases in the bottom section "Exercising the
> code" and found all behavior as described and expected.
> Note that we did conduct performance testing for our test cases
> enumerated in that section using COPY FROM PROGRAM, comparing the
> patched and unpatched code using one example with tuples all destined
> for the same partition and one example with tuples destined for
> alternating partitions. We saw a similar performance improvement to the
> one recorded by David in his email, including a decrease in performance
> for copying data with tuples alternating between destination partitions
> as compared to the unpatched code.
> However, in our cases below, we focus on exercising the added code
> rather than on performance.
> Feature feedback from a user perspective:
> There will be two differences in performance that may be perceptible to
> the user after the inclusion of this patch:
> 1) the relatively small decrease in performance (as compared to master)
> for copying data from a file or program in which the destination
> partition changes frequently
> 2) the stark contrast between the performance of copying data destined
> for the same partition and data destined for alternating partitions
> Based on the numbers in David's email, the performance difference is
> 27721.054 ms patched for copying data destined for the same partition vs
> 46151.844 ms patched for copying data destined for two different
> partitions alternating each row.
> Because both differences could impact users in data loading, it is worth
> considering making this transparent to the user in some way. Because
> this will be noticeable to the user, and, furthermore, because it is
> potentially within the power of the user to make changes to their data
> copying strategy given this information, it might be prudent to either
> create a GUC allowing the user to disable heap_multi_insert for COPY
> FROM (if the user knows the data he/she is copying over is very varied)
> or to add a comment to the docs that when using COPY FROM for a
> partitioned table, it is advisable to sort the data in some manner which
> would optimize the number of consecutive tuples destined for the same
> partition.

I'd rather steer clear of any new GUCs. Instead, in the patch I've
attached, I've included some adaptive code which chooses whether or not
to use multi-inserts based on the average number of tuples that have
been in the batches.

I also did some tests and it does appear that the benefits of
multi-inserts come pretty early, so I ended up coding it to enable
multi-inserts when the average number of tuples per batch is at least
1.3. I originally guessed 1.5, but one of my tests was faster with
multi-inserts and the average batch size there was 1.33.
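The adaptive check described above can be sketched roughly as follows (plain C; CopyBatchStats, record_batch and use_multi_insert are hypothetical names, and the real patch tracks these counters inside COPY's own state rather than in a standalone struct):

```c
#include <stdbool.h>

/* Illustrative threshold from the discussion: only keep using
 * multi-inserts while the average batch holds >= 1.3 tuples. */
#define AVG_TUPLES_PER_BATCH_THRESHOLD 1.3

typedef struct CopyBatchStats
{
    long ntuples;   /* tuples seen so far */
    long nbatches;  /* batches flushed so far */
} CopyBatchStats;

/* Account for one flushed batch of batch_tuples rows. */
static void record_batch(CopyBatchStats *stats, long batch_tuples)
{
    stats->ntuples += batch_tuples;
    stats->nbatches++;
}

/* Decide whether the next batch should use heap_multi_insert(). */
static bool use_multi_insert(const CopyBatchStats *stats)
{
    if (stats->nbatches == 0)
        return true;            /* no history yet: start optimistic */
    return (double) stats->ntuples / stats->nbatches
           >= AVG_TUPLES_PER_BATCH_THRESHOLD;
}
```

With data alternating between two partitions every row, the average settles at 1.0 and the check turns multi-inserts off; with runs of same-partition rows it stays above the threshold.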
> Code Review:
> Naming of newly introduced enum:
> The CopyInsertMethod enum value CIM_MULTI_PARTITION is potentially
> confusing, since, for a partition table with BEFORE or INSTEAD INSERT
> triggers on child partitions only heap_insert is valid, requiring the
> additional variable part_can_multi_insert to determine if
> heap_multi_insert can be used or not.
> It might be more clear to name it something that indicates it is a
> candidate for multi-insertion, pending further review. This decouples
> the state of potentially applicable multi-insertion from the table
> being a partition table. In the future, perhaps, some other feature
> might make a table conditionally eligible for multi-insertion of
> tuples, so it may be desirable to use this intermediate state for other
> kinds of tables as well.

I've renamed this to CIM_MULTI_CONDITIONAL.

> Alternatively, it might make it more clear if the variable
> part_can_multi_insert was clearly for a leaf partition table, since
> this can change from leaf to leaf, maybe named
> leaf_part_can_multi_insert

Renamed to leafpart_use_multi_insert. I changed from "can" to "use"
because the adaptive code might just choose not to, even if it actually
"can" use multi-inserts.

> Code comment potential typo corrections:
> In the following comment, in the patched code line 2550
> /* ...
> * Note: It does not matter if any partitions
Re: Non-portable shell code in pg_upgrade tap tests
On Fri, 20 Jul 2018 10:25:47 -0400 Tom Lane wrote:
> Victor Wagner writes:
> > I've discovered that in the branch REL_11_STABLE there is shell
> > script src/bin/pg_upgrade/test.sh which doesn't work under Solaris
> > 10. (It uses $(command) syntax, which is not compatible with the
> > original Solaris /bin/sh.)
>
> Please send a patch. Most of us do not have access to old shells

Here it goes. My previous letter was written before the fixed tests had
completed, because this old machine is slow.

> we could test this on. (The oldest machine I have, and it's old
> enough to vote, does run that script ... I doubt very many other
> developers have anything older.)
>
> regards, tom lane

commit efb0bbcefeccd826ba951ac31057689e752b79d0
Author: Victor Wagner
Date:   Fri Jul 20 16:06:50 2018 +0300

    Fixed incompatibility of src/bin/pg_upgrade/test.sh with solaris /bin/sh

diff --git a/src/bin/pg_upgrade/test.sh b/src/bin/pg_upgrade/test.sh
index 45ccd8fa66..775dd5729d 100644
--- a/src/bin/pg_upgrade/test.sh
+++ b/src/bin/pg_upgrade/test.sh
@@ -234,7 +234,7 @@ pg_upgrade $PG_UPGRADE_OPTS -d "${PGDATA}.old" -D "${PGDATA}" -b "$oldbindir" -B
 # Windows hosts don't support Unix-y permissions.
 case $testhost in
 	MINGW*) ;;
-	*)	if [ $(find ${PGDATA} -type f ! -perm 640 | wc -l) -ne 0 ]; then
+	*)	if [ `find ${PGDATA} -type f ! -perm 640 | wc -l` -ne 0 ]; then
 			echo "files in PGDATA with permission != 640";
 			exit 1;
 		fi ;;
@@ -242,7 +242,7 @@ esac
 case $testhost in
 	MINGW*) ;;
-	*)	if [ $(find ${PGDATA} -type d ! -perm 750 | wc -l) -ne 0 ]; then
+	*)	if [ `find ${PGDATA} -type d ! -perm 750 | wc -l` -ne 0 ]; then
 			echo "directories in PGDATA with permission != 750";
 			exit 1;
 		fi ;;
Re: Fw: Windows 10 got stuck with PostgreSQL at starting up. Adding delay lets it avoid.
Yugo Nagata writes: > Recently, one of our clients reported a problem that Windows 10 sometime > (approximately once in 300 tries) hung up at OS starting up while PostgreSQL > 9.3.x service is starting up. My co-worker analyzed this and found that > PostgreSQL's auxiliary process and Windows' logon processes are in a dead-lock > situation. Really? What would they deadlock on? Why is there any connection whatsoever? Why has nobody else run into this? > He reported this problem to pgsql-general list as below. Also, he created a > patch > to add a build-time option for adding 0.5 or 3.0 seconds delay after each sub > process starts. This seems like an ugly hack that probably doesn't reliably resolve whatever the problem is, but does manage to kill postmaster responsiveness :-(. It'd be especially awful to insert such a delay after forking parallel worker processes, which would be a problem in anything much newer than 9.3. I think we need more investigation; and to start with, reproducing the problem in a branch that's not within hailing distance of its EOL would be a good idea. (Not that I have reason to think PG's behavior has changed much here ... but 9.3 is just not a good basis for asking us to do anything now.) regards, tom lane
Re: Non-portable shell code in pg_upgrade tap tests
Victor Wagner writes:
> I've discovered that in the branch REL_11_STABLE there is shell script
> src/bin/pg_upgrade/test.sh which doesn't work under Solaris 10.
> (It uses $(command) syntax, which is not compatible with the original
> Solaris /bin/sh.)

OK ...

> It is quite easy to replace $() syntax with backticks. Backticks are
> not nestable and considered unsafe by modern shell scripting style
> guides, but they do work with older shells.

Please send a patch. Most of us do not have access to old shells we
could test this on. (The oldest machine I have, and it's old enough to
vote, does run that script ... I doubt very many other developers have
anything older.)

			regards, tom lane
Non-portable shell code in pg_upgrade tap tests
Colleagues,

I've discovered that in the branch REL_11_STABLE there is a shell
script, src/bin/pg_upgrade/test.sh, which doesn't work under Solaris 10.
(It uses $(command) syntax, which is not compatible with the original
Solaris /bin/sh.)

I was quite surprised that this problem went unnoticed on the buildfarm,
but after some investigation I found out that both Solaris machines in
the buildfarm don't configure postgres with --enable-tap-tests.

The offending code is:

# make sure all directories and files have group permissions, on Unix hosts
# Windows hosts don't support Unix-y permissions.
case $testhost in
	MINGW*) ;;
	*)	if [ $(find ${PGDATA} -type f ! -perm 640 | wc -l) -ne 0 ]; then
			echo "files in PGDATA with permission != 640";
			exit 1;
		fi ;;
esac

case $testhost in
	MINGW*) ;;
	*)	if [ $(find ${PGDATA} -type d ! -perm 750 | wc -l) -ne 0 ]; then
			echo "directories in PGDATA with permission != 750";
			exit 1;
		fi ;;
esac

It is quite easy to replace $() syntax with backticks. Backticks are not
nestable and are considered unsafe by modern shell scripting style
guides, but they do work with older shells.

--
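For reference, the backtick form of the check can be exercised on its own. This is an illustrative sketch, runnable under a plain Bourne shell; count_bad_files is a hypothetical helper, not part of test.sh itself:

```shell
#!/bin/sh
# Portable permission check: backticks instead of $(...), so it also runs
# under the original Solaris 10 /bin/sh.
count_bad_files () {
    # echoing the backtick result also strips wc's leading whitespace
    echo `find "$1" -type f ! -perm 640 | wc -l`
}

PGDATA=`mktemp -d`
touch "$PGDATA/postgresql.conf"
chmod 640 "$PGDATA/postgresql.conf"

if [ `count_bad_files "$PGDATA"` -ne 0 ]; then
    echo "files in PGDATA with permission != 640"
else
    echo "permissions ok"
fi

rm -rf "$PGDATA"
```

The only shell features used are backtick substitution, `case`, and `[ ... ]`, all of which predate POSIX `$(...)`.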
Re: Expression errors with "FOR UPDATE" and postgres_fdw with partition wise join enabled.
On Wed, Jul 18, 2018 at 8:00 AM, Etsuro Fujita wrote: > [ new patch ] + /* + * If the child plan is an Append or MergeAppend, the targetlists of its + * subplans would have already been adjusted before we get here, so need + * to work here. + */ + if (IsA(subplan, Append) || IsA(subplan, MergeAppend)) + return; + + /* The child plan should be a scan or join */ + Assert(is_projection_capable_plan(subplan)); This looks like a terrible design to me. If somebody in future invents a new plan type that is not projection-capable, then this could fail an assertion here and there won't be any simple fix. And if you reach here and the child targetlists somehow haven't been adjusted, then you won't even fail an assertion, you'll just crash (or maybe return wrong answers). I'm going to study this some more now, but I really think this is going in the wrong direction. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Faster str to int conversion (was Table with large number of int columns, very slow COPY FROM)
On Thu, Jul 19, 2018 at 4:32 PM, Andres Freund wrote:
>> 1. Why the error message changes? If there's a good reason, it should
>> be done as a separate commit, or at least well-documented in the
>> commit message.
>
> Because there's a lot of "invalid input syntax for type %s: \"%s\""
> error messages, and we shouldn't force translators to have a separate
> version that inlines the first %s. But you're right, it'd be worthwhile
> to point that out in the commit message.

It just seems weird that they're bundled together in one commit like
this.

>> 2. Does the likely/unlikely stuff make a noticeable difference?
>
> Yes. It's also largely a copy from existing code (scanint8), so I don't
> really want to differ here.

OK.

>> 3. If this is a drop-in replacement for pg_atoi, why not just recode
>> pg_atoi this way -- or have it call this -- and leave the callers
>> unchanged?
>
> Because pg_atoi supports a variable 'terminator'.

OK.

>> 4. Are we sure this is faster on all platforms, or could it work out
>> the other way on, say, BSD?
>
> I'd be *VERY* surprised if any would be faster. It's not easy to write
> a faster implementation than what I've proposed, and especially not so
> if you use strtol() as the API (variable bases, a bit of locale
> support).

OK. Nothing else from me...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
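For readers curious what such a hand-rolled parse looks like, here is a simplified sketch in the spirit of scanint8-style code. It deliberately omits the overflow checks the real patch has, and likely/unlikely are just the usual __builtin_expect wrappers; none of this is the committed implementation:

```c
#include <stdint.h>
#include <stdbool.h>

/* likely/unlikely in the style of c.h; no-ops off gcc/clang */
#if defined(__GNUC__)
#define likely(x)   __builtin_expect((x) != 0, 1)
#define unlikely(x) __builtin_expect((x) != 0, 0)
#else
#define likely(x)   (x)
#define unlikely(x) (x)
#endif

/*
 * Simplified base-10 parse: skip leading whitespace, handle an optional
 * sign, accumulate digits.  NOTE: no overflow detection, unlike the real
 * code, so this is a sketch of the branch structure only.
 */
static bool parse_int32(const char *s, int32_t *result)
{
    int32_t val = 0;
    bool    neg = false;

    while (*s == ' ' || *s == '\t')
        s++;
    if (*s == '-')
    {
        neg = true;
        s++;
    }
    else if (*s == '+')
        s++;
    if (unlikely(*s < '0' || *s > '9'))
        return false;               /* no digits: invalid input syntax */
    while (likely(*s >= '0' && *s <= '9'))
        val = val * 10 + (*s++ - '0');
    if (unlikely(*s != '\0'))
        return false;               /* trailing junk */
    *result = neg ? -val : val;
    return true;
}
```

The hot path (a run of digits) is the only loop, and every failure case sits behind an unlikely() branch, which is the property the benchmark numbers in this thread are about.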
Re: partition tree inspection functions
Hi Amit,

On 07/19/2018 10:27 PM, Amit Langote wrote:
On 2018/07/19 23:18, Jesper Pedersen wrote:

I'm thinking about how to best use these functions to generate a graph
that represents the partition hierarchy. What about renaming
pg_partition_tree_tables() to pg_partition_children(), and having it
work like

  select * from pg_partition_children('p', true);
  ------
   p
   p0
   p1
   p00
   p01
   p10
   p11
  (7 rows)

  select * from pg_partition_children('p', false);
  ------
   p0
   p1
  (2 rows)

e.g. if 'bool include_all' is true, all nodes under the node, including
itself, are fetched. With false, only nodes directly under the node,
excluding itself, are returned. If there are no children, NULL is
returned.

That's a big change to make to what this function does, but if that's
what's useful we could make it. As an alternative, wouldn't it help to
implement the idea that Dilip mentioned upthread of providing a function
to report the level of a given table in the partition hierarchy -- 0 for
root, 1 for its partitions and so on?

Yes, Dilip's idea could work. I just don't think that
pg_partition_tree_tables() as is would have a benefit over time.

Basically, as also discussed before, users can already use SQL to get
the information they want out of the relevant catalogs (pg_inherits,
etc.). But such user queries might not be very future-proof, as we might
want to change the catalog organization in the future, so we'd like to
provide users a proper interface to begin with. Keeping that in mind,
it'd be better to think carefully about what we ought to be doing here.
Input like yours is greatly helpful for that.

We could have the patch include pg_partition_root_parent and
pg_partition_parent, and leave the rest for a future CommitFest such
that more people could provide feedback on what they would like to see
in this space.

Maybe a function like pg_partition_number_of_partitions() could be of
benefit to count the number of actual partitions in a tree.
Especially useful in complex scenarios:

  select pg_partition_number_of_partitions('p') as number;
   number
  --------
        4
  (1 row)

Okay, adding one more function at this point may not be asking for too
much. Although "select count(*) from pg_partition_tree_tables('p')"
would give you the count, a special function seems nice.

Yeah, but I was thinking that the function would only return the number
of actual tables that contain data, e.g. not include 'p', 'p0' and 'p1'
in the count; otherwise you could use "select count(*) from
pg_partition_children('p', true)" like you said.

Maybe call it pg_partition_tree_leaf_count() or some such then?

That could work.

Best regards,
Jesper
Re: Flexible configuration for full-text search
Hi, Aleksandr!

On Mon, Jul 9, 2018 at 10:26 AM Aleksandr Parfenov
<a.parfe...@postgrespro.ru> wrote:
> A new version of the patch in the attachment. There are no changes since
> the last version except refreshing it to current HEAD.

I took a look at this patch. It applied cleanly, but didn't pass
regression tests.

*** /Users/smagen/projects/postgresql/env/master/src/src/test/regress/expected/misc_sanity.out	2018-07-20 13:44:54.0 +0300
--- /Users/smagen/projects/postgresql/env/master/src/src/test/regress/results/misc_sanity.out	2018-07-20 13:47:00.0 +0300
***************
*** 105,109 ****
  pg_index                | indpred  | pg_node_tree
  pg_largeobject          | data     | bytea
  pg_largeobject_metadata | lomacl   | aclitem[]
! (11 rows)
--- 105,110 ----
  pg_index                | indpred  | pg_node_tree
  pg_largeobject          | data     | bytea
  pg_largeobject_metadata | lomacl   | aclitem[]
! pg_ts_config_map        | mapdicts | jsonb
! (12 rows)

It seems to be related to recent patches which add toast tables to the
majority of system tables with varlena columns. The regression diff
indicates that the mapdicts field of pg_ts_config_map can't be toasted.
I think we should add a toast table to pg_ts_config_map.

Also, I see that you add extra grammar rules for ALTER TEXT SEARCH
CONFIGURATION to the documentation, which allow specifying config
instead of dictionary_name.

+ALTER TEXT SEARCH CONFIGURATION name
+    ADD MAPPING FOR token_type [, ... ] WITH config
 ALTER TEXT SEARCH CONFIGURATION name
     ADD MAPPING FOR token_type [, ... ] WITH dictionary_name [, ... ]
+ALTER TEXT SEARCH CONFIGURATION name
+    ALTER MAPPING FOR token_type [, ... ] WITH config
 ALTER TEXT SEARCH CONFIGURATION name
     ALTER MAPPING FOR token_type [, ... ] WITH dictionary_name [, ... ]
 ALTER TEXT SEARCH CONFIGURATION name

At the same time, you declare config as follows.
+
+Formally config is one of:
+
+* dictionary_name
+
+* config { UNION | INTERSECT | EXCEPT | MAP } config
+
+* CASE config
+    WHEN [ NO ] MATCH THEN { KEEP | config }
+    [ ELSE config ]
+  END
+

That is, config itself could be a dictionary_name. I think this makes
the extra grammar rules for ALTER TEXT SEARCH CONFIGURATION redundant.
We can specify those rules to always expect config, assuming that it can
actually be a dictionary name.

+ if (fout->remoteVersion >= 11)

PostgreSQL 11 has already passed feature freeze. Thus, we should aim at
PostgreSQL 12.

That's all for now, but I'm going to do a more detailed code review.

-- 
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: Fw: Windows 10 got stuck with PostgreSQL at starting up. Adding delay lets it avoid.
On Fri, Jul 20, 2018 at 05:58:13PM +0900, Yugo Nagata wrote:
> He reported this problem to the pgsql-general list as below. Also, he
> created a patch to add a build-time option for adding a 0.5 or 3.0
> second delay after each sub-process starts. The attached is the same
> one. Our client confirmed that this patch resolves the dead-lock
> problem. Is it acceptable to add this option to PostgreSQL? Any comment
> would be appreciated.

If the OS startup gets slower, then an arbitrary delay is not going to
solve things and you would end up facing the same problem. It seems to
me that we need to understand which low-level locks get stuck, whether
it is possible to make their acquisition conditional, and then loop
conditionally with multiple retries.

-- 
Michael

signature.asc
Description: PGP signature
Re: Internal error XX000 with enable_partition_pruning=on, pg 11 beta1 on Debian
On 2018/07/19 22:03, David Rowley wrote:
> v3-0001-Fix-run-time-partition-pruning-for-UNION-ALL-pare.patch

Thanks for updating the patch. I studied this patch today and concluded
that it's going a bit too far by carrying the nested partition pruning
info structures from the planner all the way into the executor.

I understand the root cause of this issue to be that
make_partition_pruneinfo trips when UNION ALL's parent subquery, instead
of the actual individual partitioned root tables, is treated as the root
parent rel when converting prunequals using appendrel_adjust_*. That
happens because of a flattened partitioned_rels list whose members are
all assumed by make_partition_pruneinfo to have the same root parent,
and that parent to be an actual partitioned table. That assumption fails
in this case because the parent is really the UNION ALL subquery rel.

I think the fix implemented in the patch by modifying allpaths.c is
correct, whereby the partition hierarchies are preserved by having
nested Lists of partitioned rels. So, the partitioned_rels List
corresponding to the UNION ALL subquery parent itself contains Lists
corresponding to the partitioned tables appearing under it. With that,
make_partition_pruneinfo (actually, make_partitionedrel_pruneinfo in the
patch) can correctly perform its work for every sub-List, because for
each sub-List it can identify the correct root partitioned table parent
to use.

But I don't think the result of make_partition_pruneinfo itself has to
be a List of PartitionedRelPruneInfo nested under PartitionPruneInfo. I
gather that each PartitionPruneInfo corresponds to each root partitioned
table, and a PartitionedRelPruneInfo contains the actual pruning
information, which is created for every partitioned table (non-leaf
tables), including the root tables. I don't think such nesting is
necessary. I think that, just like the flattened partitioned_rels list,
we should put a flattened list of PartitionedRelPruneInfo into the
Append or MergeAppend plan.
No need for nesting PartitionedRelPruneInfo under PartitionPruneInfo.

We create a relid_subplan_map from the flattened list of sub-plans,
where sub-plans of leaf partitions of different partitioned tables
appear in the same list. Similarly, I think we should create
relid_subpart_map from the flattened list of partitioned_rels, where
partitioned rel RT indexes coming from different partitioned tables
appear in the same list.

Currently relid_subpart_map seems to be constructed separately for each
sub-List of the nested partitioned_rels list, so while the subplan_map
of each PartitionedRelPruneInfo contains indexes into a global array of
leaf partition sub-plans, subpart_map contains indexes into a local
array of PartitionedRelPruneInfo for that partitioned table. But I think
there is not a big hurdle in making even the latter contain indexes into
a global array of PartitionedRelPruneInfos of *all* partitioned tables.

On the executor side, instead of having PartitionedRelPruningData be
nested under PartitionPruningData, which in turn are stored in the
top-level PartitionPruneState, store them directly in
PartitionPruneState, since we're making the planner put global indexes
into subpart_map. A slight adjustment seems to be needed to make
ExecInitFindMatchingSubPlans and ExecFindMatchingSubPlans skip the
PartitionedRelPruningData of non-root tables, because
find_matching_subplans_recurse takes care of recursing to non-root ones.
Actually, not skipping them seems to cause wrong results.

To verify that such an approach would actually work, I modified the
relevant parts of your patch and confirmed that it does. See the
attached delta patch.

Thanks,
Amit

PS: Other than the main patch, I think it would be nice to add an RT
index field to PartitionedRelPruneInfo in addition to the existing Oid
field. It would help to identify and fetch the Relation from a
hypothetical executor-local array of Relation pointers which is
addressable by RT indexes.
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index dac789d414..f81ff672ca 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -48,7 +48,7 @@ static char *ExecBuildSlotPartitionKeyDescription(Relation rel,
 					  bool *isnull,
 					  int maxfieldlen);
 static List *adjust_partition_tlist(List *tlist, TupleConversionMap *map);
-static void find_matching_subplans_recurse(PartitionPruningData *pprune,
+static void find_matching_subplans_recurse(PartitionPruneState *prunestate,
+					   PartitionedRelPruningData *prelprune,
 					   bool initial_prune,
 					   Bitmapset **validsubplans);
@@ -1396,14 +1396,10 @@ adjust_partition_tlist(List *tlist, TupleConversionMap
Re: project updates
Hello Charles,

> Here is my current working status.
> 1. Completed the thrift_binary_in and thrift_binary_out functions, so
> that users can express their thrift struct using json. These two
> functions support both simple data structs and complex data structures
> like struct and map.
> 2. Added test cases and documentation to cover thrift_binary_in and
> thrift_binary_out.
> 3. Made the code compilable from 9.4 to 11.0.

I see multiple warnings during compilation with 11.0, e.g.:

```
pg_thrift.c:1120:72: warning: implicit declaration of function ‘JsonbGetDatum’; did you mean ‘JsonbPGetDatum’? [-Wimplicit-function-declaration]
pg_thrift.c:1206:13: warning: the address of ‘is_big_endian’ will always evaluate as ‘true’ [-Waddress]
pg_thrift.c:1227:18: warning: implicit declaration of function ‘PG_GETARG_JSONB’; did you mean ‘PG_GETARG_JSONB_P’? [-Wimplicit-function-declaration]
pg_thrift.c:1227:18: warning: initialization of ‘Jsonb *’ {aka ‘struct *’} from ‘int’ makes pointer from integer without a cast [-Wint-conversion]
```

Also, tests (make clean && make && make install && make installcheck)
don't pass.

> One question though: for custom types, it seems rich operations are
> also very important besides storing and expressing using the thrift
> type. What are the most important operators I should support? Any
> suggestions? Here is the updated repo:
> https://github.com/charles-cui/pg_thrift

I believe get field / set field are the most common ones. Naturally
there are also enumerations, projections, you name it. You can check out
the list of operations supported by JSONB (or the list of operations on
dicts in Python) to get a full picture.

-- 
Best regards,
Aleksander Alekseev
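The first and third warnings come from macros that were renamed in PostgreSQL 11 (PG_GETARG_JSONB became PG_GETARG_JSONB_P, JsonbGetDatum became JsonbPGetDatum). One common way to stay compilable from 9.4 through 11 is a small compatibility shim along these lines; the COMPAT_* names are hypothetical, and PG_VERSION_NUM is faked so the sketch stands alone rather than including postgres.h:

```c
#include <string.h>

/* Stand-in for the value normally supplied by the server headers. */
#ifndef PG_VERSION_NUM
#define PG_VERSION_NUM 110000
#endif

/* Map one spelling onto whichever macro the target version provides. */
#if PG_VERSION_NUM >= 110000
#define COMPAT_GETARG_JSONB(n)  PG_GETARG_JSONB_P(n)
#define COMPAT_JSONB_DATUM(j)   JsonbPGetDatum(j)
#define COMPAT_JSONB_API        "v11"
#else
#define COMPAT_GETARG_JSONB(n)  PG_GETARG_JSONB(n)
#define COMPAT_JSONB_DATUM(j)   JsonbGetDatum(j)
#define COMPAT_JSONB_API        "pre-v11"
#endif
```

In the extension the shim would live in a private header included after postgres.h, and call sites would use only the COMPAT_* spellings.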
Adding TCP_USER_TIMEOUT support for libpq/psqlodbc
Hello PostgreSQL Community!

Not long ago I faced a situation in which an ODBC/libpq client hangs in
case of certain network problems. I had a discussion regarding this
issue on pgsql-o...@postgresql.org and got some suggestions. Here is the
discussion:
https://www.postgresql.org/message-id/OF33DF00A3.D6444835-ON432582C3.003EA7C5-432582C3.0045562B%40iba.by

In a few words, the suggestion was to use pqopt keepalive options in the
ODBC configuration, for example as follows:

pqopt = keepalives=1 keepalives_idle=5 keepalives_count=1 keepalives_interval=1

But under some circumstances this may not be reliable: the connection
can lose its keepalive properties and remain hanging. Here is a quote
from our discussion:

    Hmm it seems keepalive stops while waiting for ack. Therefore it's a
    matter of retransmission control. You can use TCP_USER_TIMEOUT on
    linux but the option is not used in libpq.

In my opinion it makes sense to add support for the TCP_USER_TIMEOUT
socket option to libpq/psqlodbc connections. The attachment contains a
patch with the source code changes regarding this issue. In my
implementation it can be configured by a new keepalives_user_timeout
option within the pqopt parameter.

Best regards,
Andrei Yahorau

0001_TCP_USER_TIMEOUT_libpq-int.patch
Description: Binary data

0001_TCP_USER_TIMEOUT_fe-connect.patch
Description: Binary data
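Stripped of libpq's option plumbing, the proposal boils down to one setsockopt() call on the connection's socket. A minimal sketch, with set_user_timeout as a hypothetical helper; TCP_USER_TIMEOUT is Linux-specific, so the #ifdef makes the call a no-op elsewhere, mirroring how libpq guards optional socket options:

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/*
 * Set TCP_USER_TIMEOUT (milliseconds of unacknowledged transmitted data
 * before the kernel aborts the connection).  Returns 0 on success or when
 * the platform lacks the option, -1 on error.
 */
static int set_user_timeout(int sock, unsigned int timeout_ms)
{
#ifdef TCP_USER_TIMEOUT
    return setsockopt(sock, IPPROTO_TCP, TCP_USER_TIMEOUT,
                      &timeout_ms, sizeof(timeout_ms));
#else
    (void) sock;
    (void) timeout_ms;
    return 0;                   /* option not available: silently skip */
#endif
}
```

Unlike keepalives, which only probe an idle connection, this timeout also bounds how long the kernel keeps retransmitting data that was sent but never acknowledged, which is exactly the waiting-for-ack hang quoted above.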
Fw: Windows 10 got stuck with PostgreSQL at starting up. Adding delay lets it avoid.
Hi,

Recently, one of our clients reported a problem in which Windows 10
sometimes (approximately once in 300 tries) hung at OS startup while the
PostgreSQL 9.3.x service was starting up. My co-worker analyzed this and
found that PostgreSQL's auxiliary processes and Windows' logon processes
are in a deadlock situation. Although this problem has been found only
with PostgreSQL 9.3.x and Windows 10 in our client's environment so far,
the same problem may occur with other versions of PostgreSQL.

He reported this problem to the pgsql-general list as below. Also, he
created a patch to add a build-time option for adding a 0.5 or 3.0
second delay after each sub-process starts. The attached is the same
one. Our client confirmed that this patch resolves the deadlock problem.
Is it acceptable to add this option to PostgreSQL? Any comment would be
appreciated.

Regards,

Begin forwarded message:

Date: Fri, 29 Jun 2018 15:03:10 +0900
From: TAKATSUKA Haruka
To: pgsql-gene...@postgresql.org
Subject: Windows 10 got stuck with PostgreSQL at starting up. Adding delay lets it avoid.

I ran into trouble with PostgreSQL 9.3.x on Windows 10. I would like to
add new delay code as an official build option.

Windows 10 sometimes (approximately once in 300 tries) hung at OS
startup. The logs say it happened while the PostgreSQL service was
starting. When the OS stopped, some postgres auxiliary processes had
been started and some had not started yet. The Windows dump says some
threads of the postgres auxiliary processes are waiting on OS-level
locks, and the logon processes' threads are also waiting on a lock. The
MS help desk said that PostgreSQL's OS-level deadlock caused the OS
freeze. I think it is a strange story. But, in fact, it did not happen
in repeated tests when I removed PostgreSQL from the initial
auto-starting services.

I tweaked PostgreSQL 9.3.x (the newest from the repository) to add a 0.5
or 3.0 second delay after each sub-process starts, and then the hang was
gone. This test patch is attached.
It is only implemented for Windows. Also, I did not use the existing
pg_usleep because it contains locking code (e.g. WaitForSingleObject and
Enter/LeaveCriticalSection).

Although the Windows OS may have some problems, I think we should have a
means to avoid them. Can PostgreSQL accept such delay code as a
build-time option controlled by preprocessor variables?

Thanks,
Takatsuka Haruka

-- 
Yugo Nagata

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index d6fc2ed..ff03ebd 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -398,6 +398,30 @@ extern int optreset;	/* might not be declared by system headers */
 static DNSServiceRef bonjour_sdref = NULL;
 #endif
 
+#define USE_AFTER_AUX_FORK_SLEEP 3000
+
+#ifdef USE_AFTER_AUX_FORK_SLEEP
+#ifndef WIN32
+#define AFTER_AUX_FORK_SLEEP()
+#else
+#define AFTER_AUX_FORK_SLEEP() do { SleepEx(USE_AFTER_AUX_FORK_SLEEP, FALSE); } while(0)
+#endif
+#else
+#define AFTER_AUX_FORK_SLEEP()
+#endif
+
+#define USE_AFTER_BACKEND_FORK_SLEEP 500
+
+#ifdef USE_AFTER_BACKEND_FORK_SLEEP
+#ifndef WIN32
+#define AFTER_BACKEND_FORK_SLEEP()
+#else
+#define AFTER_BACKEND_FORK_SLEEP() do { SleepEx(USE_AFTER_BACKEND_FORK_SLEEP, FALSE); } while(0)
+#endif
+#else
+#define AFTER_BACKEND_FORK_SLEEP()
+#endif
+
 /*
  * postmaster.c - function prototypes
  */
@@ -1709,6 +1733,7 @@ ServerLoop(void)
 				 */
 				StreamClose(port->sock);
 				ConnFree(port);
+				AFTER_BACKEND_FORK_SLEEP();
 			}
 		}
 	}
@@ -2801,11 +2826,20 @@ reaper(SIGNAL_ARGS)
 		 * situation, some of them may be alive already.
 		 */
 		if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
+		{
 			AutoVacPID = StartAutoVacLauncher();
+			AFTER_AUX_FORK_SLEEP();
+		}
 		if (XLogArchivingActive() && PgArchPID == 0)
+		{
 			PgArchPID = pgarch_start();
+			AFTER_AUX_FORK_SLEEP();
+		}
 		if (PgStatPID == 0)
+		{
 			PgStatPID = pgstat_start();
+			AFTER_AUX_FORK_SLEEP();
+		}
 
 		/* some workers may be scheduled to start now */
 		maybe_start_bgworker();
@@ -5259,6 +5293,7 @@ StartChildProcess(AuxProcType type)
 	/*
 	 * in parent, successful fork
 	 */
+	AFTER_AUX_FORK_SLEEP();
 	return pid;
 }
Re: de-deduplicate code in DML execution hooks in postgres_fdw
(2018/07/20 13:49), Michael Paquier wrote: On Thu, Jul 19, 2018 at 05:35:11PM +0900, Etsuro Fujita wrote: +1 for the general idea. (Actually, I also thought the same thing before.) But since this is definitely a matter of PG12, ISTM that it's wise to work on this after addressing the issue in [1]. My concern is: if we do this refactoring now, we might need two patches for fixing the issue in case of backpatching as the fix might need to change those executor functions. FWIW, I would think that if some cleanup of the code is obvious, we should make it without waiting for the other issues to settle down because there is no way to know when those are done, I would agree to that if we were late in the development cycle for PG12. and this patch could be forgotten. I won't forget this patch. :) Looking at the proposed patch, moving the new routine closer to execute_dml_stmt and renaming it execute_dml_single_row would be nicer. Or execute_parameterized_dml_stmt? Best regards, Etsuro Fujita
Re: [HACKERS] possible self-deadlock window after bad ProcessStartupPacket
On 19/07/18 23:13, Andres Freund wrote:
> On 2018-07-19 14:54:52 -0500, Nico Williams wrote:
>> On Thu, Jul 19, 2018 at 12:20:53PM -0700, Andres Freund wrote:
>>> On 2018-07-19 11:57:25 +0300, Heikki Linnakangas wrote:
>>>> Ugh. Yeah, in wal_quickdie, and other aux process *_quickdie()
>>>> handlers, I agree we should just _exit(2). All we want to do is to
>>>> exit the process immediately.
>>>
>>> Indeed.
>>>
>>>> bgworker_quickdie() makes some effort to block SIGQUIT during the
>>>> exit() processing, but that doesn't solve the whole problem. The
>>>> process could've been in the middle of a malloc/free when the
>>>> signal arrived, for example. exit() is simply not safe to call
>>>> from a signal handler.
>>>
>>> Yea. I really don't understand why we haven't been able to agree on
>>> this for *years* now.
>>
>> I mean, only calling async-signal-safe functions from signal handlers
>> is a critical POSIX requirement, and exit(3) is NOT async-signal-safe.
>
> There's a few standard requirements that aren't particularly relevant
> in practice (including some functions not being listed as signal
> safe). Just arguing from that isn't really helpful. But there's plenty
> of hard evidence, including a few messages upthread!, of it being
> practically problematic. Just looking at the reasoning for why exit()
> (and malloc) aren't signal-safe, however, makes it clear that this
> particular restriction should be adhered to.

ISTM that no-one has any great ideas on what to do about the ereport()
in quickdie(). But I think we have consensus on replacing the exit(2)
calls with _exit(2). If we do just that, it would be better than the
status quo, even if it doesn't completely fix the problem. This would
prevent the case that Asim reported, for starters.

Any objections to the attached?

With _exit(), I think we wouldn't really need the "PG_SETMASK(&BlockSig);"
calls in the aux process *_quickdie handlers that don't do anything else
than call _exit(). But I didn't dare to remove them yet.
- Heikki

diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index f651bb49b1..ac8693fd61 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -650,22 +650,21 @@ bgworker_quickdie(SIGNAL_ARGS)
 	/*
 	 * We DO NOT want to run proc_exit() callbacks -- we're here because
 	 * shared memory may be corrupted, so we don't want to try to clean up our
-	 * transaction. Just nail the windows shut and get out of town. Now that
-	 * there's an atexit callback to prevent third-party code from breaking
-	 * things by calling exit() directly, we have to reset the callbacks
-	 * explicitly to make this work as intended.
-	 */
-	on_exit_reset();
-
-	/*
-	 * Note we do exit(2) not exit(0). This is to force the postmaster into a
-	 * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
-	 * backend. This is necessary precisely because we don't clean up our
-	 * shared memory state. (The "dead man switch" mechanism in pmsignal.c
-	 * should ensure the postmaster sees this as a crash, too, but no harm in
-	 * being doubly sure.)
+	 * transaction. Just nail the windows shut and get out of town.
+	 *
+	 * There's an atexit callback to prevent third-party code from breaking
+	 * things by calling exit() directly. We don't want to trigger that, so
+	 * we use _exit(), which doesn't run atexit callbacks, rather than exit().
+	 * And exit() wouldn't be safe to run from a signal handler, anyway.
+	 *
+	 * Note we use _exit(2) not _exit(0). This is to force the postmaster
+	 * into a system reset cycle if some idiot DBA sends a manual SIGQUIT to a
+	 * random backend. This is necessary precisely because we don't clean up
+	 * our shared memory state. (The "dead man switch" mechanism in
+	 * pmsignal.c should ensure the postmaster sees this as a crash, too, but
+	 * no harm in being doubly sure.)
 	 */
-	exit(2);
+	_exit(2);
 }
 
 /*
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 960d3de204..010d53a6ce 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -404,22 +404,21 @@ bg_quickdie(SIGNAL_ARGS)
 	/*
 	 * We DO NOT want to run proc_exit() callbacks -- we're here because
 	 * shared memory may be corrupted, so we don't want to try to clean up our
-	 * transaction. Just nail
Re: [HACKERS] possible self-deadlock window after bad ProcessStartupPacket
On 19/07/18 23:04, Nico Williams wrote:
> On Thu, Jun 22, 2017 at 03:10:31PM -0400, Tom Lane wrote:
>> Andres Freund writes:
>>> Or, probably more robust: Simply _exit(2) without further ado, and
>>> rely on postmaster to output an appropriate error message. Arguably
>>> it's not actually useful to see hundreds of "WARNING: terminating
>>> connection because of crash of another server process" messages in
>>> the log anyway.
>>
>> At that point you might as well skip the entire mechanism and go
>> straight to SIGKILL. IMO the only reason quickdie() exists at all is
>> to try to send a helpful message to the client. And it is helpful ---
>> your attempt to claim that it isn't sounds very much like wishful
>> thinking.
>
> I dunno if it is or isn't helpful. But I do know that this must be
> done in an async-signal-safe way. Besides making ereport()
> async-signal-safe, which is tricky, you could write(2) the arguments
> to a pipe that another thread in the same process is reading from and
> which will then call ereport() and exit(3).

I don't see how that helps. It still wouldn't be safe for the other
thread to call ereport(), because the main thread might be in the middle
of calling ereport() itself.

- Heikki
Re: pread() and pwrite()
On Thu, Jul 12, 2018 at 01:55:31PM +1200, Thomas Munro wrote:
> A couple of years ago, Oskari Saarenmaa proposed a patch[1] to adopt
> $SUBJECT. Last year, Tobias Oberstein argued again that we should do
> that[2]. At the end of that thread there was a +1 from multiple
> committers in support of getting it done for PostgreSQL 12. Since
> Oskari hasn't returned, I decided to dust off his patch and start a
> new thread.

Thanks for picking this up - I was meaning to get back to this, but have
unfortunately been too busy with other projects.

> 1. FileRead() and FileWrite() are replaced with FileReadAt() and
> FileWriteAt(), taking a new offset argument. Now we can do one
> syscall instead of two for random reads and writes.
> [...] I'm not even sure I'd bother adding "At" to the
> function names. If there are any extensions that want the old
> interface they will fail to compile either way. Note that the BufFile
> interface remains the same, so hopefully that covers many use cases.

IIRC I used the "At" suffixes in my first version of the patch before
completely removing the functions which didn't take an offset argument.
Now that they're gone I agree that we could just drop the "At" suffix;
an "at" suffix is also used by various POSIX functions to operate in a
specific directory, which may just add to confusion.

> I guess the only remaining reason to use FileSeek() is to get the file
> size? So I wonder why SEEK_SET remains valid in the patch... if my
> suspicion is correct that only SEEK_END still has a reason to exist,
> perhaps we should just kill FileSeek() and add FileSize() or something
> instead?

I see you did this in your updated patch :+1:

Happy to see this patch move forward.

/ Oskari
Re: pread() and pwrite()
On 20/07/18 01:50, Thomas Munro wrote:
> An idea for how to handle Windows, in a follow-up patch: add a file
> src/backend/port/win32/file.c that defines pgwin32_pread() and
> pgwin32_pwrite(), wrapping WriteFile()/ReadFile(), passing in an
> "OVERLAPPED" struct with the offset, and setting errno on error; then
> set up the macros so that Windows can use them as pread(), pwrite().
> It might also be necessary to open all files with FILE_FLAG_OVERLAPPED.
> Does any Windows hacker have a better idea, and/or want to try to
> write that patch? Otherwise I'll eventually try to do some long
> distance hacking on AppVeyor.

No objections, if you want to make the effort. But IMHO the lseek+read
fallback is good enough on Windows. Unless you were thinking that we
could then remove the !HAVE_PREAD fallback altogether. Are there any
other platforms out there that don't have pread/pwrite that we care
about?

- Heikki
Re: [HACKERS] logical decoding of two-phase transactions
Hi Andres,

> So what if we, at the begin / end of cache miss handling, re-check if
> the to-be-decoded transaction is still in-progress (or has
> committed). And we throw an error if that happened. That error is then
> caught in reorderbuffer, the in-progress-xact aborted callback is
> called, and processing continues (there's a couple nontrivial details
> here, but it should be doable).
>
> The biggest issue is what constitutes a "cache miss". It's fairly
> trivial to do this for syscache / relcache, but that's not sufficient:
> there's plenty cases where catalogs are accessed without going through
> either. But as far as I can tell if we declared that all historic
> accesses have to go through systable_beginscan* - which'd imo not be a
> crazy restriction - we could put the checks at that layer.

Documenting that historic accesses go through the systable_* APIs does
seem reasonable. In our earlier discussions, we felt that asking plugin
writers to do anything along these lines was too onerous and cumbersome
to expect.

> That'd require that an index lookup can't crash if the corresponding
> heap entry doesn't exist (etc), but that's something we need to handle
> anyway. The issue that multiple separate catalog lookups need to be
> coherent (say Robert's pg_class exists, but pg_attribute doesn't
> example) is solved by virtue of the pg_attribute lookups failing if
> the transaction aborted.
>
> Am I missing something here?

Are you suggesting we have a:

PG_TRY()
{
    Catalog_Access();
}
PG_CATCH()
{
    Abort_Handling();
}

here?

Regards,
Nikhils
Re: More consistency for some file-related error message
On Thu, Jul 19, 2018 at 12:33:30PM +0900, Michael Paquier wrote:
> On Wed, Jul 18, 2018 at 11:24:05PM -0400, Tom Lane wrote:
>> read() is required by spec to set errno when returning a negative
>> result. I think the previous coding paid attention to errno regardless
>> of the sign of the result, which would justify pre-zeroing it ... but
>> the new coding definitely doesn't.
>
> Yes, my point is a bit different though.. Do you think that we need to
> bother about the case where errno is not 0 before calling read(), in
> the case where it returns a positive result? This would mean that errno
> would still have a previous errno set, still it returned a number of
> bytes read. For the code paths discussed here that visibly does not
> matter so you are right, we could remove them, still patterns get
> easily copy-pasted around...

Okay, I just looked again at this point, and among the new messages only
what's in XLogFileCopy has been bothering setting errno to 0 (see
811b6e3), so let's remove it in this case. Thoughts about the previous
patch set?
--
Michael