Re: Performance implications of 8K pread()s
On Wed, Jul 12, 2023 at 1:11 AM Dimitrios Apostolou wrote:
> So would it make sense for postgres to perform reads in bigger blocks? Is it
> easy-ish to implement (where would one look for that)? Or must the I/O unit be
> tied to postgres' page size?

FYI as of last week we can do a little bit of that on the master branch:

postgres=# select count(*) from t;

preadv(46, ..., 8, 256237568) = 131072
preadv(46, ..., 5, 256368640) = 131072
preadv(46, ..., 8, 256499712) = 131072
preadv(46, ..., 5, 256630784) = 131072

postgres=# set io_combine_limit = '256k';
postgres=# select count(*) from t;

preadv(47, ..., 5, 613728256) = 262144
preadv(47, ..., 5, 613990400) = 262144
preadv(47, ..., 5, 614252544) = 262144
preadv(47, ..., 5, 614514688) = 262144

Here's hoping the commits implementing this stick, for the PostgreSQL 17
release.  It's just the beginning though, we can only do this for full table
scans so far (plus a couple of other obscure places).  Hopefully in the coming
year we'll get the "streaming I/O" mechanism that powers this hooked up to
lots more places... index scans and other stuff.  And writing.  Then
eventually pushing the I/O into the background.

Your questions actually triggered us to talk about why we couldn't switch a
few things around in our project and get the I/O combining piece done sooner.
Thanks!
Re: pgsql 10.23 , different systems, same table , same plan, different Buffers: shared hit
On Sat, Sep 16, 2023 at 7:42 AM Tom Lane wrote:
> Sadly, this proves very little about Linux's behavior.  glibc's idea
> of en_US involves some very complicated multi-pass sort rules.
> AFAICT from the FreeBSD sort(1) man page, FreeBSD defines en_US
> as "same as C except case-insensitive", whereas I'm pretty sure
> that underscores and other punctuation are nearly ignored in
> glibc's interpretation; they'll only be taken into account if the
> alphanumeric parts of the strings sort equal.

Achilleas didn't mention the glibc version, but based on the kernel vintage
mentioned I guess that must be the "old" (pre 2.28) glibc sorting.  In 2.28
they did a big sync-up with ISO 14651, while FreeBSD follows the UCA, a
closely related standard[1].  I think newer Linux/glibc systems should agree
with FreeBSD's libc in more cases (and also agree with ICU).

[1] https://unicode.org/reports/tr10/#Synch_ISO14651
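[Editor's note: a toy sketch of the multi-level idea described above -- punctuation is ignored at the primary level and only breaks ties at a later level.  This is a deliberate simplification, not glibc's actual ISO 14651 implementation.]

```python
# Toy multi-level collation: sort primarily on alphanumeric characters
# (ignoring case and punctuation), and use the full string only to break
# ties.  Illustrative only; real collations have more levels and rules.

def collation_key(s: str):
    primary = [c.lower() for c in s if c.isalnum()]  # punctuation ignored here
    return (primary, s)  # full string breaks primary-level ties

words = ["a_c", "ab", "a-b", "ac"]
print(sorted(words))                     # plain codepoint order ("C" locale)
print(sorted(words, key=collation_key))  # multi-level, "en_US"-ish order
```

Note how "a-b" sorts next to "ab" under the multi-level key, while plain codepoint order groups the punctuated strings first.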
Re: Performance implications of 8K pread()s
On Thu, Jul 13, 2023 at 6:50 AM Dimitrios Apostolou wrote:
> Interesting and kind of sad that the last update on the wiki page is from
> 2021. What is the latest prototype? I'm not sure I'm up to the task of
> putting my database to the test. ;-)

It works pretty well, certainly well enough to try out, and work is happening.
I'll try to update the wiki with some more up-to-date information soon.
Basically, compare these two slides (you could also look at slide 11, which is
the one most people are probably interested in, but then you can't really see
what's going on with system call-level tools):

https://speakerdeck.com/macdice/aio-and-dio-for-postgresql-on-freebsd?slide=7
https://speakerdeck.com/macdice/aio-and-dio-for-postgresql-on-freebsd?slide=9

Not only are the IOs converted into 128KB preadv() calls, they are issued
concurrently and ahead of time while your backend is chewing on the last lot
of pages.  So even if your file system completely fails at prefetching, we'd
have a fighting chance at getting closer to device/line speed.  That's
basically what you have to do to support direct I/O, where there is no
system-provided prefetching.
Re: Performance implications of 8K pread()s
On Wed, Jul 12, 2023 at 5:12 AM Thomas Munro wrote:
> "gathering"

(Oops, for reads, that's "scattering".  As in scatter/gather I/O, but I picked
the wrong one...)
Re: Performance implications of 8K pread()s
On Wed, Jul 12, 2023 at 1:11 AM Dimitrios Apostolou wrote:
> Note that I suspect my setup being related, (btrfs compression behaving
> suboptimally) since the raw device can give me up to 1GB/s rate. It is however
> evident that reading in bigger chunks would mitigate such setup
> inefficiencies.
> On a system that reads are already optimal and the read rate remains the same,
> then bigger block size would probably reduce the sys time postgresql consumes
> because of the fewer system calls.

I don't know about btrfs but maybe it can be tuned to prefetch sequential
reads better...

> So would it make sense for postgres to perform reads in bigger blocks? Is it
> easy-ish to implement (where would one look for that)? Or must the I/O unit be
> tied to postgres' page size?

It is hard to implement.  But people are working on it.  One of the problems
is that the 8KB blocks that we want to read data into aren't necessarily
contiguous, so you can't just do bigger pread() calls without solving a lot
more problems first.  The project at https://wiki.postgresql.org/wiki/AIO aims
to deal with the "clustering" you seek plus the "gathering" required for
non-contiguous buffers by allowing multiple block-sized reads to be prepared
and collected on a pending list up to some size that triggers merging and
submission to the operating system at a sensible rate, so we can build
something like a single large preadv() call.  In the current prototype, if
io_method=worker then that becomes a literal preadv() call running in a
background "io worker" process, but it could also be OS-specific stuff
(io_uring, ...) that starts an asynchronous IO depending on settings.  If you
take that branch and run your test you should see 128KB-sized preadv() calls.
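[Editor's note: the combining step can be sketched roughly like this.  A hypothetical model, not PostgreSQL's code: block-sized read requests accumulate, and runs of adjacent blocks are merged into single large I/Os up to a combine limit.]

```python
BLOCK_SIZE = 8192          # PostgreSQL's page size
IO_COMBINE_LIMIT = 16      # max blocks per combined I/O (16 * 8KB = 128KB)

def combine_reads(block_numbers):
    """Merge contiguous block numbers into (offset, length) I/Os,
    capped at IO_COMBINE_LIMIT blocks per combined read."""
    ios = []
    run_start = prev = None
    for b in sorted(block_numbers):
        if prev is not None and b == prev + 1 and (b - run_start) < IO_COMBINE_LIMIT:
            prev = b                      # extend the current run
        else:
            if run_start is not None:     # flush the finished run
                ios.append((run_start * BLOCK_SIZE,
                            (prev - run_start + 1) * BLOCK_SIZE))
            run_start = prev = b
    if run_start is not None:
        ios.append((run_start * BLOCK_SIZE,
                    (prev - run_start + 1) * BLOCK_SIZE))
    return ios

# 32 contiguous blocks become two 128KB preadv()-sized calls:
print(combine_reads(range(32)))  # -> [(0, 131072), (131072, 131072)]
```

A non-contiguous request list, e.g. `[0, 1, 5]`, produces one 16KB read and one 8KB read, which is why the scatter/gather machinery for non-contiguous buffers matters.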
Re: Fsync IO issue
On Fri, May 5, 2023 at 8:37 AM ProfiVPS Support wrote:
> I feel like ANYTHING would be better than this. Even risking loosing _some_
> of the latest data in case of a server crash (if it crashes we lose data
> anyways until restart, ofc we could have HA I know and we will when there'll
> be a need).

Try synchronous_commit=off:

https://www.postgresql.org/docs/current/wal-async-commit.html
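[Editor's note: a minimal sketch of how one might apply that setting; unlike fsync=off, asynchronous commit risks only a short window of recently committed transactions on crash, not corruption.]

```sql
-- Server-wide (survives restart; takes effect on reload):
ALTER SYSTEM SET synchronous_commit = off;
SELECT pg_reload_conf();

-- Or scoped to one session or even one transaction:
SET synchronous_commit = off;
```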
Re: High QPS, random index writes and vacuum
On Tue, Apr 18, 2023 at 2:43 PM peter plachta wrote:
> I was trying to understand whether there are any known workarounds for random
> access + index vacuums. Are my vacuum times 'normal' ?

Ah, it's not going to help on the old versions you mentioned, but for what
it's worth: I remember noticing that I could speed up vacuum of uncorrelated
indexes using parallel vacuum (v13), huge_pages=on,
maintenance_work_mem=BIG, min_dynamic_shared_memory=BIG (v14), because then
the memory that is binary-searched in random order avoids thrashing the TLB.
Re: Getting an index scan to be a parallel index scan
On Thu, Feb 2, 2023 at 1:54 PM Alex Kaiser wrote:
> Thanks for the explanation. Yes, that is the query plan I was imagining. I do
> see how chopping it up could result in an unfair distribution. But my counter
> to that would be that wouldn't chopping it up still be better than not. If
> things do happen to work out to be fair, now it's X times as fast, if things
> are very unfair, then you haven't really lost much (besides the parallel
> overhead) compared to the non-parallel query. Or maybe it should be possible
> to do the parallel query if there were some statistics (either normal ones or
> extended ones) that told the planner that the result would probably be fair?

Maybe, but unfairness multiplies if it's part of a larger plan; what if the
output of those nodes is the input to much more work, but now THAT work is
being done by one process?  But yeah, statistics could help with that.  I'm
vaguely aware that other systems that do more partition-based parallelism
spend a lot of effort on that sort of thinking.

> Though I do agree that the "work stealing" option would be the most
> efficient, but would be a lot more complicated to code up.

Yeah.  I probably used the wrong word; what I was describing is (something
like) page-based parallelism, where input gets chopped up into arbitrary
chunks and handed out to consumers on demand, but we don't know anything about
the values in those chunks; that allows for many interesting kinds of plans,
and it's nice because it's fair.  Another kind of parallelism is
partition-based, which PostgreSQL can do in a limited sense: we can send
workers into different partitions of a table (what we can't do is partition
the table on-the-fly, which is central to most parallelism in some other
systems).
Let's see:

CREATE TABLE testing(
  id INT,
  info INT,
  data_one TEXT,
  data_two TEXT,
  primary key(id, info)
) partition by hash (id);

create table testing_p0 partition of testing for values with (modulus 2, remainder 0);
create table testing_p1 partition of testing for values with (modulus 2, remainder 1);

INSERT INTO testing(id, info, data_one, data_two)
SELECT idx, idx, md5(random()::text), md5(random()::text)
FROM generate_series(1,1000) idx;

analyze;

explain select count(*) from testing where id in
(1608377,5449811,5334677,5458230,2053195,3572313,1949724,3559988,5061560,8479775, ...);

 Aggregate
   ->  Append
         ->  Index Only Scan using testing_p0_pkey on testing_p0 testing_1
         ->  Index Only Scan using testing_p1_pkey on testing_p1 testing_2

Hmph.  I can't seem to convince it to use Parallel Append.  I think it might
be because the planner is not smart enough to chop down the =ANY lists to
match the partitions.  One sec...

Ok, I hacked my copy of PostgreSQL to let me set parallel_setup_cost to
negative numbers, and then I told it that parallelism is so awesome that it
makes your queries cost -100 timerons before they even start.  Now I see a
plan like:

 Gather  (cost=-99.57..-987689.45 rows=2000 width=74)
   Workers Planned: 2
   ->  Parallel Append  (cost=0.43..12110.55 rows=832 width=74)
         ->  Parallel Index Scan using testing_p0_pkey on testing_p0 testing_1
         ->  Parallel Index Scan using testing_p1_pkey on testing_p1 testing_2

But it's probing every index for every one of the values in the big list, not
just the ones that have a non-zero chance of finding a match, which is a waste
of cycles.  I think if the planner were smarter about THAT (as it is for plain
old "="), then the costing would have chosen parallelism naturally by cost.
But it's probably not as cool as page-based parallelism, because parallelism
is limited by your partitioning scheme.  If I had more timerons myself, I'd
like to try to make parallel function scans, or parallel CTE scans, work...
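[Editor's note: the =ANY pruning the planner isn't doing could be sketched like this.  A toy model using a simple modulus in place of PostgreSQL's partition hash functions: split the probe list by partition so each index is only probed with values that could possibly match there.]

```python
def split_any_list(values, nparts=2):
    """Assign each =ANY probe value to the one hash partition that
    could contain it, so other partitions' indexes skip it entirely."""
    per_partition = {p: [] for p in range(nparts)}
    for v in values:
        # toy stand-in for PostgreSQL's hash partition routing
        per_partition[v % nparts].append(v)
    return per_partition

ids = [1608377, 5449811, 5334677, 5458230, 2053195]
for part, vals in sorted(split_any_list(ids).items()):
    print(f"testing_p{part}: probe {len(vals)} of {len(ids)} values")
```

With a big IN list split this way, each Parallel Index Scan would do roughly half the probes in the two-partition example above.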
Re: Getting an index scan to be a parallel index scan
On Wed, Feb 1, 2023 at 6:39 PM Alex Kaiser wrote:
> select * from testing where id in (1608377,5449811, ... <1000 random ids>
> ,4654284,3558460);
>
> Essentially I have a list of 1000 ids and I would like the rows for all of
> those ids.
>
> This seems like it would be pretty easy to parallelize, if you have X threads
> then you would split the list of IDs into 1000/X sub lists and give one to
> each thread to go find the rows for ids in the given list. Even when I use
> the following configs I don't get a query plan that actually uses any
> parallelism:

It sounds like the plan you are imagining is something like:

 Gather
   Nested Loop Join
     Outer side:
     Inner side: Index scan of your big table

Such a plan would only give the right answer if each process has a
non-overlapping subset of the constant values to probe the index with, and
together they have the whole set.  Hypothetically, a planner could chop that
set up beforehand and give a different subset to each process (just as you
could do that yourself using N connections and separate queries), but that
might be unfair: one process might find lots of matches, and the others might
find none, because of the distribution of data.  So you'd ideally want some
kind of "work stealing" scheme, where each worker can take more values to
probe from whenever it needs more, so that they all keep working until the
values run out.  We don't have a thing that can do that.

You might imagine that a CTE could do it, so WITH keys_to_look_up AS (VALUES
(1), (2), ...) SELECT ... JOIN ON ..., but that also doesn't work because we
don't have a way to do "partial" scans of CTEs either (though someone could
invent that).  Likewise for temporary tables: they are invisible to parallel
workers, so they can't help us.
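[Editor's note: the "work stealing" scheme described above, sketched with a shared queue and threads.  Purely illustrative, nothing like PostgreSQL's executor: each worker pulls the next probe value whenever it's free, so a worker that finds many matches doesn't leave the others idle.]

```python
import queue
import threading

def parallel_probe(values, index, nworkers=4):
    """Probe `index` (a dict standing in for an index scan) with `values`.
    No fixed pre-split: workers take the next value on demand."""
    work = queue.Queue()
    for v in values:
        work.put(v)
    results, lock = [], threading.Lock()

    def worker():
        while True:
            try:
                v = work.get_nowait()   # "steal" the next value
            except queue.Empty:
                return                  # values ran out; worker finishes
            rows = index.get(v, [])     # stands in for one index probe
            with lock:
                results.extend(rows)

    threads = [threading.Thread(target=worker) for _ in range(nworkers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Even if one value matches thousands of rows and the rest match none, all workers stay busy until the queue drains, which a static 1000/X split can't guarantee.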
I have contemplated "partial function scans" for set-returning functions, where a function could be given a bit of shared memory and various other infrastructure to be able to be "parallel aware" (= able to coordinate across processes so that each process gets a subset of the data), and one could imagine that that would allow various solutions to the problem, but that's vapourware. But you can get a plan like that if you insert all those values into a regular table, depending on various settings, stats and min_parallel_table_scan_size (try 0, I guess that'll definitely do it). Which probably isn't the answer you wanted to hear.
Re: Fwd: temp_file_limit?
On Tue, Dec 20, 2022 at 8:59 AM Frits Jalvingh wrote:
> @ranier
> These files ONLY exist during the query. They get deleted as soon as the
> query terminates, by Postgres itself. Once the query terminates pgsql_tmp is
> completely empty. Considering what Thomas said (and the actual occurrence of
> the files he mentioned) this does seem to be the more likely cause to me.

I'm working on some bug fixes near this area at the moment, so I'll also see
if I can figure out how to implement the missing eager cleanup of earlier
generations.  It's still a pretty bad scenario once you reach it
(repartitioning repeatedly, that is) and the solution to that is probably much
harder, but it's obviously not great to waste temporary disk space like that.

BTW you can disable just parallel hash with enable_parallel_hash=false.
Re: Fwd: temp_file_limit?
On Mon, Dec 19, 2022 at 1:51 PM Thomas Munro wrote:
> It's really the limit for a single file

Oops, sorry, I take that back.  It should be per process.
Re: Fwd: temp_file_limit?
On Mon, Dec 19, 2022 at 9:11 AM Justin Pryzby wrote:
> On Sun, Dec 18, 2022 at 06:29:41PM +0100, Frits Jalvingh wrote:
> > Just to be clear: my real question is: why is temp_file_limit not
> > working at the specified size? Because this is my real problem: when a
> > query is dying like this it will also kill other queries because these
> > are also running out of space. Even when the limit is per-process it
> > should not have exceeded 200GB imo.

It's really the limit for a single file (or virtual file, because we split
them on 1GB boundaries, probably well past time we stopped doing that), but we
create many temporary files for various reasons.  One possibility is that
you've hit a case that needs several rounds of repartitioning (because of a
failure to estimate the number of tuples well), but we can't see that because
you didn't show EXPLAIN (ANALYZE) output (understandably, if it runs out of
disk space before completing...).

The parallel hash code doesn't free up the previous generations' temporary
files; it really only needs two generations' worth concurrently (the one it's
reading from and the one it's writing to).  In rare cases where more
generations are needed it could unlink the older ones -- that hasn't been
implemented yet.  If you set log_temp_files = 0 to log temporary file names,
it should be clear if it's going through multiple rounds of repartitioning,
from the names (...of32..., ...of64..., ...of128..., ...of256..., ...).
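[Editor's note: a back-of-envelope sketch of why the missing cleanup hurts.  This is my arithmetic, not code from PostgreSQL: each repartitioning round spills roughly the whole input again, so keeping every generation's files multiplies peak temp usage, while freeing old generations would bound it at about two generations' worth.]

```python
def peak_temp_usage(data_bytes, rounds, free_old_generations):
    """Rough model: every repartitioning generation spills about the
    whole input again (...of32..., ...of64..., ...of128..., ...)."""
    if free_old_generations:
        # only the generation being read plus the one being written
        return 2 * data_bytes
    # all generations' files kept until the query ends
    return (rounds + 1) * data_bytes

GB = 1024**3
for rounds in (1, 4):
    kept = peak_temp_usage(50 * GB, rounds, free_old_generations=False)
    print(f"{rounds} repartitioning round(s): ~{kept // GB} GB of temp files")
```

Under this model, a 50GB spill that repartitions four times can pin roughly 250GB of temporary space, which is how a query can blow far past what temp_file_limit intuitively suggests.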
Re: pgbench: could not connect to server: Resource temporarily unavailable
On Wed, Aug 24, 2022 at 3:06 PM Tom Lane wrote:
> Thomas Munro writes:
> > Oh, one comment there is actually obsolete now AFAIK.  Unless there is
> > some reason to think personality(ADDR_NO_RANDOMIZE) might not work in
> > some case where sysctl -w kernel.randomize_va_space=0 will, I think we
> > can just remove that.
>
> AFAICS, f3e78069db7 silently does nothing on platforms lacking
> ADDR_NO_RANDOMIZE and PROC_ASLR_FORCE_DISABLE.  Are you asserting
> there are no such platforms?

That's a Linux-only sysctl.  ADDR_NO_RANDOMIZE is also Linux-only.  Both
controls are old enough to be in any kernel that anyone's developing on.  On
further reflection, though, I guess the comment is still useful:
ADDR_NO_RANDOMIZE only helps you with clusters launched by pg_ctl and
pg_regress.  A developer trying to run "postgres" directly might still want to
know about the sysctl, so I withdraw that idea.

As for whether there are platforms where it does nothing: definitely.  These
are highly OS-specific, and we've only tackled Linux and FreeBSD (with other
solutions for macOS and Windows elsewhere in the tree), but I doubt it
matters: these are just the OSes that have ASLR on by default, that someone in
our community uses as a daily driver to hack PostgreSQL on, and that has
annoyed them enough to look up how to turn it off :-)
Re: pgbench: could not connect to server: Resource temporarily unavailable
On Tue, Aug 23, 2022 at 2:42 PM Thomas Munro wrote:
> > 0002 isn't quite related, but while writing 0001 I noticed a nearby
> > use of /proc/sys/... which I thought should be converted to sysctl.
> > IMO /proc/sys pretty much sucks, at least for documentation purposes,
> > for multiple reasons:

Oh, one comment there is actually obsolete now AFAIK.  Unless there is some
reason to think personality(ADDR_NO_RANDOMIZE) might not work in some case
where sysctl -w kernel.randomize_va_space=0 will, I think we can just remove
that.

From 68e05cb3e4ae3b064af38241ac4ba8f9fe997b39 Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Wed, 24 Aug 2022 14:46:11 +1200
Subject: [PATCH] Remove obsolete comment.

Since commit f3e78069db7, it shouldn't be necessary to use sysctl to
disable ASLR on Linux.  Remove comment about that.

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 1664fcee2a..51748ec735 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -4887,16 +4887,6 @@ SubPostmasterMain(int argc, char *argv[])
 	 * to do this before going any further to ensure that we can attach at the
 	 * same address the postmaster used.  On the other hand, if we choose not
 	 * to re-attach, we may have other cleanup to do.
-	 *
-	 * If testing EXEC_BACKEND on Linux, you should run this as root before
-	 * starting the postmaster:
-	 *
-	 * sysctl -w kernel.randomize_va_space=0
-	 *
-	 * This prevents using randomized stack and code addresses that cause the
-	 * child process's memory map to be different from the parent's, making it
-	 * sometimes impossible to attach to shared memory at the desired address.
-	 * Return the setting to its old value (usually '1' or '2') when finished.
 	 */
 	if (strcmp(argv[1], "--forkbackend") == 0 ||
 		strcmp(argv[1], "--forkavlauncher") == 0 ||
--
2.37.1
Re: pgbench: could not connect to server: Resource temporarily unavailable
On Tue, Aug 23, 2022 at 3:53 PM Tom Lane wrote:
> Thomas Munro writes:
> > On Tue, Aug 23, 2022 at 4:57 AM Tom Lane wrote:
> > +service the requests, with those clients receiving unhelpful
> > +connection failure errors such as Resource temporarily
> > +unavailable.
>
> > LGTM but I guess I would add "... or Connection refused"?
>
> Is that the spelling that appears on FreeBSD?  Happy to add it.

Yep.
Re: pgbench: could not connect to server: Resource temporarily unavailable
On Tue, Aug 23, 2022 at 4:57 AM Tom Lane wrote:
> 0001 adds a para about how to raise the listen queue length.

+service the requests, with those clients receiving unhelpful
+connection failure errors such as Resource temporarily
+unavailable.

LGTM but I guess I would add "... or Connection refused"?

> 0002 isn't quite related, but while writing 0001 I noticed a nearby
> use of /proc/sys/... which I thought should be converted to sysctl.
> IMO /proc/sys pretty much sucks, at least for documentation purposes,
> for multiple reasons:

+1

> 0003 removes PG_SOMAXCONN.  While doing that I noticed that this
> computation hadn't been touched throughout all the various
> changes fooling with exactly what gets counted in MaxBackends.
> I think the most appropriate definition for the listen queue
> length is now MaxConnections * 2, not MaxBackends * 2, because
> the other processes counted in MaxBackends don't correspond to
> incoming connections.

+1

> I propose 0003 for HEAD only, but the docs changes could be
> back-patched.

+1
Re: pgbench: could not connect to server: Resource temporarily unavailable
On Mon, Aug 22, 2022 at 2:18 PM Tom Lane wrote:
> Thomas Munro writes:
> > On Mon, Aug 22, 2022 at 12:20 PM Tom Lane wrote:
> >> Hmm.  It'll be awhile till the 128 default disappears entirely
> >> though, especially if assorted BSDen use that too.  Probably
> >> worth the trouble to document.
>
> > I could try to write a doc patch if you aren't already on it.
>
> I haven't done anything about it yet, but could do so tomorrow or so.

Cool.  BTW, small correction to something I said about FreeBSD: it'd be better
to document the new name kern.ipc.soacceptqueue (see listen(2) HISTORY) even
though the old name still works and matches OpenBSD and macOS.
Re: pgbench: could not connect to server: Resource temporarily unavailable
On Mon, Aug 22, 2022 at 12:20 PM Tom Lane wrote:
> Thomas Munro writes:
> > Yeah retrying doesn't seem that nice.  +1 for a bit of documentation,
> > which I guess belongs in the server tuning part where we talk about
> > sysctls, perhaps with a link somewhere near max_connections?  More
> > recent Linux kernels bumped it to 4096 by default so I doubt it'll
> > come up much in the future, though.
>
> Hmm.  It'll be awhile till the 128 default disappears entirely
> though, especially if assorted BSDen use that too.  Probably
> worth the trouble to document.

I could try to write a doc patch if you aren't already on it.

> > Note that we also call listen()
> > with a backlog value capped to our own PG_SOMAXCONN which is 1000.  I
> > doubt many people benchmark with higher numbers of connections but
> > it'd be nicer if it worked when you do...
>
> Actually it's 1.  Still, I wonder if we couldn't just remove
> that limit now that we've desupported a bunch of stone-age kernels.
> It's hard to believe any modern kernel can't defend itself against
> silly listen-queue requests.

Oh, right.  Looks like that was just paranoia in commit 153f4006763, back when
you got away from using the (very conservative) SOMAXCONN macro.  Looks like
that was 5 on ancient systems going back to the original sockets stuff, and
later 128 was a popular number.  Yeah, I'd say +1 for removing our cap.  I'm
pretty sure every system will internally cap whatever value we pass in if it
doesn't like it, as POSIX explicitly says it can freely do with this "hint".

The main thing I learned today is that Linux's connect(AF_UNIX) implementation
doesn't refuse connections when the listen backlog is full, unlike other OSes.
Instead, for blocking sockets, it sleeps and wakes with everyone else to fight
over space.
I *guess* for non-blocking sockets that introduced a small contradiction -- there isn't the state space required to give you a working EINPROGRESS with the same sort of behaviour (if you reified a secondary queue for that you might as well make the primary one larger...), but they also didn't want to give you ECONNREFUSED just because you're non-blocking, so they went with EAGAIN, because you really do need to call again with the sockaddr. The reason I wouldn't want to call it again is that I guess it'd be a busy CPU burning loop until progress can be made, which isn't nice, and failing with "Resource temporarily unavailable" to the user does in fact describe the problem, if somewhat vaguely. Hmm, maybe we could add a hint to the error, though?
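[Editor's note: the behaviour described above can be observed directly.  A hedged sketch, Linux-specific by design, since other OSes return ECONNREFUSED instead: fill an AF_UNIX socket's listen queue and watch a non-blocking connect() fail with EAGAIN.]

```python
# Linux-only demonstration: non-blocking connect() to an AF_UNIX socket
# whose listen backlog is full fails with EAGAIN ("Resource temporarily
# unavailable"), not ECONNREFUSED as on FreeBSD.
import errno
import os
import socket
import tempfile

path = os.path.join(tempfile.mkdtemp(), "sock")
server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
server.bind(path)
server.listen(0)              # tiny backlog; we never call accept()

last_errno = None
clients = []                  # keep sockets alive so the queue stays full
for _ in range(16):
    c = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    c.setblocking(False)
    err = c.connect_ex(path)  # returns an errno instead of raising
    clients.append(c)
    if err != 0:
        last_errno = err
        break

print(errno.errorcode.get(last_errno))
```

On a Linux box this prints EAGAIN once the queue fills, which is exactly the unhelpful error pgbench clients were seeing.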
Re: pgbench: could not connect to server: Resource temporarily unavailable
On Mon, Aug 22, 2022 at 10:55 AM Tom Lane wrote:
> Not sure what I think at this point about making libpq retry after
> EAGAIN.  It would make sense for this particular undocumented use
> of EAGAIN, but I'm worried about others, especially the documented
> reason.  On the whole I'm inclined to leave the code alone;
> but is there sufficient reason to add something about adjusting
> somaxconn to our documentation?

My Debian system apparently has a newer man page:

       EAGAIN For nonblocking UNIX domain sockets, the socket is nonblocking,
              and the connection cannot be completed immediately.  For other
              socket families, there are insufficient entries in the routing
              cache.

Yeah, retrying doesn't seem that nice.  +1 for a bit of documentation, which I
guess belongs in the server tuning part where we talk about sysctls, perhaps
with a link somewhere near max_connections?  More recent Linux kernels bumped
it to 4096 by default so I doubt it'll come up much in the future, though.
Note that we also call listen() with a backlog value capped to our own
PG_SOMAXCONN which is 1000.  I doubt many people benchmark with higher numbers
of connections but it'd be nicer if it worked when you do...

I was curious and checked how FreeBSD would handle this.  Instead of EAGAIN
you get ECONNREFUSED here, until you crank up kern.ipc.somaxconn, which also
defaults to 128 like older Linux.
Re: pgbench: could not connect to server: Resource temporarily unavailable
On Mon, Aug 22, 2022 at 9:48 AM Tom Lane wrote:
> It's also pretty unclear why the kernel would want to return
> EAGAIN instead of letting the nonblock connection path do the
> waiting, which is why I'm suspecting a bug rather than designed
> behavior.

Could it be that it fails like that if the listen queue is full on the other
side?

https://github.com/torvalds/linux/blob/master/net/unix/af_unix.c#L1493

If it's something like that, maybe increasing /proc/sys/net/core/somaxconn
would help?  I think older kernels only had 128 here.
Re: Need help identifying a periodic performance issue.
On Fri, Nov 19, 2021 at 6:03 AM Robert Creager wrote:
> Which would be better? Discard plans or forcing custom plans? Seems like
> wrapping a copy might be better than the Postgres.conf change as that would
> affect all statements. What kind of performance hit would we be taking with
> that do you estimate? Microseconds per statement? Yeah, hard to say,
> depends on hardware and such. Would there be any benefit overall to doing
> that? Forcing the replan?

Just to understand what's going on, it'd be interesting to know if the problem
goes away if you *just* inject the DISCARD PLANS statement before running your
COPYs, but if that doesn't help it'd also be interesting to know what happens
if you ANALYZE each table after each COPY.  Are you running any explicit
ANALYZE commands?  How long do your sessions/connections live for?

I'm wondering if the thing that changed between 9.6 and 13 might be the
heuristics for when autovacuum's background ANALYZE is triggered, creating the
unlucky timing required to get your system into this state occasionally.

For a while now I have been wondering how we could teach the planner/stats
system about "volatile" tables (as DB2 calls them), that is, ones that are
frequently empty, which often come up in job queue workloads.  I've seen
problems like this with user queries (I used to work on big job queue systems
across different relational database vendors, which is why I finished up
writing the SKIP LOCKED patch for 9.5), but this is the first time I've
contemplated FK check queries being negatively affected by this kind of stats
problem.  I don't have a good concrete idea, though (various dumb ideas: don't
let auto analyze run on an empty table if it's marked VOLATILE, or ignore
apparently empty stats on tables marked VOLATILE (and use what?), ...).
Re: Need help identifying a periodic performance issue.
On Thu, Nov 18, 2021 at 1:18 PM Robert Creager wrote:
> So, how do I go about capturing more information for the big brains (you
> guys) to help figure this out? I have all our resources at mine (and hence
> your) disposal.

As a workaround, does it help if you issue DISCARD PLANS before your COPY
jobs, or alternatively start with a fresh connection?

I'm guessing that something like this is happening:

-- set up the auto_explain extension to show the internal
-- foreign key check queries' plans
load 'auto_explain';
set auto_explain.log_nested_statements = true;
set auto_explain.log_min_duration = 0;
set auto_explain.log_analyze = true;

drop table if exists r, s cascade;
create table r (i int primary key);
create table s (i int references r(i));

-- collect stats showing r as empty
analyze r;

-- execute RI query 6 times to lock the plan (inserts fail, log shows seq scan)
insert into s values (42);
insert into s values (42);
insert into s values (42);
insert into s values (42);
insert into s values (42);
insert into s values (42);

insert into r select generate_series(1, 100);

-- once more, we still get a seq scan, which is by now a bad idea
insert into s values (42);

discard plans;

-- once more, now we get an index scan
insert into s values (42);
Re: Need help identifying a periodic performance issue.
On Thu, Nov 18, 2021 at 8:28 AM Tom Lane wrote:
> Justin Pryzby writes:
> > It shows that the process is running FK triggers.
>
> Indeed, and doing a seqscan therein.  Normally I'd suppose that
> this reflects a lack of an index, but RI_FKey_check should always
> be doing something that matches the referenced table's unique
> constraint, so why isn't it using that?

I wonder if the reference tables are empty sometimes, and there's an unlucky
sequence of events that results in cached RI plans with seq scans being used
later in the same session after the tables are populated.
Re: Need help identifying a periodic performance issue.
On Wed, Nov 17, 2021 at 11:51 AM Thomas Munro wrote:
> It's the right output format, but isn't /pid == '$PID'/ only going to
> match one single process called "postgres"?  Maybe /execname ==
> "postgres"/ to catch them all?

Oh, duh, it's the top CPU one.  Makes sense.  Never mind :-)
Re: Need help identifying a periodic performance issue.
On Wed, Nov 17, 2021 at 11:40 AM Robert Creager wrote:
> Presuming this is the type of output you are expecting:
>
> CPU     ID                    FUNCTION:NAME
>   0  58709                        :tick-10s
>
>               postgres`AtEOXact_LargeObject+0x11
>               postgres`CommitTransaction+0x127
>               postgres`CommitTransactionCommand+0xf2
>               postgres`PostgresMain+0x1fef
>               postgres`process_startup_packet_die
>               postgres`0x73055b
>               postgres`PostmasterMain+0xf36
>               postgres`0x697837
>               postgres`_start+0x100
>               `0x80095f008
>                 1

It's the right output format, but isn't /pid == '$PID'/ only going to match
one single process called "postgres"?  Maybe /execname == "postgres"/ to catch
them all?  Hopefully it'll be obvious what's going on from an outlier stack
with a high sample count.  Can also be useful to convert the output to
flamegraph format if CPU time is distributed over many distinct stacks.
Re: Need help identifying a periodic performance issue.
On Tue, Nov 16, 2021 at 5:43 PM Robert Creager wrote:
> One CPU is pegged, the data has been sent over STDIN, so Postgres is not
> waiting for more, there are no other queries running using this select:

So PostgreSQL is eating 100% CPU, with no value shown in wait_event_type, and
small numbers of system calls are counted.  In that case, is there an
interesting user stack that jumps out with a profiler during the slowdown (or
the kernel version, stack())?

sudo dtrace -n 'profile-99 /arg0/ { @[ustack()] = count(); } tick-10s { exit(0); }'
Re: FreeBSD UFS & fsync
On Fri, Mar 12, 2021 at 10:09 PM Luca Ferrari wrote:
>         fdatasync              16269.365 ops/sec      61 usecs/op
>         fsync                   8471.429 ops/sec     118 usecs/op
> Non-sync'ed 8kB writes:
>         write                 278484.510 ops/sec       4 usecs/op
> tps = 136.963971 (excluding connections establishing)

It looks like your system is performing very badly for some other reason, so
that synchronous I/O waits are only a small proportion of the time, and thus
fsync=off doesn't speed things up very much.  I'd look into profiling the
system to try to figure out what it's doing... maybe it's suffering from super
slow hypercalls for gettimeofday(), or something like that?
Re: FreeBSD UFS & fsync
On Tue, Feb 23, 2021 at 5:49 AM Luca Ferrari wrote:
> I'm running a virtual machine with FreeBSD 12.2, PostgreSQL 12.5 and
> UFS as filesystem.
> I was experimenting with fsync = off and pgbench, and I see no
> particular difference in tps having fsync enabled or disabled.
> Now, the same tiny test on a linux box provides a 10x tps, while on
> FreeBSD is a 1% increase.
> I'm trying to figure out why, and I suspect there is something related
> to how UFS handles writes.

Do you have WCE enabled?  In that case, modern Linux file systems would do a
synchronous SYNCHRONIZE CACHE for our WAL fdatasync(), but FreeBSD UFS
wouldn't as far as I know.  It does know how to do that (there's a BIO_FLUSH
operation, also used by ZFS), but as far as I can see UFS uses it just for its
own file system meta-data crash safety currently (see softdep_synchronize()).
(There is also no FUA flag for O_[D]SYNC writes, an even more modern
invention.)
Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED
On Fri, Aug 21, 2020 at 9:58 AM Jim Jarvie wrote:
> However, as I scale up the number of concurrent connections, I see a spike in
> CPU (to 100% across 80 cores) when the SELECT FOR UPDATE SKIP LOCKED executes
> and the select processes wait for multiple minutes (10-20 minutes) before
> completing. My use case requires around 256 concurrent processors for the
> queue but I've been unable to scale beyond 128 without everything grinding to
> a halt.

Maybe it's just getting algorithmically ugly. To claim some job rows, you have to skip all the dead/non-matching tuples left behind so far at the start of the table by all the other sessions, and then also all the currently locked tuples, and you have to do update-chain walks on some of them too. It all gets a bit explosive once you have such high numbers of workers.

I think I'd experiment with splitting the job table up into N tables and feeding jobs into all of them about evenly (by hashing, at random, whatever), and then I'd assign each consumer a "preferred" table where it looks for jobs first (perhaps my_worker_id % ntables), before trying the others in round-robin order. Then they won't trample on each other's toes so much.

In the past I've wondered about a hypothetical circular_seqscan option, which would cause table scans to start where they left off last time in each backend, so SELECT * FROM t LIMIT 1 repeated would show you a different row each time until you got all the way around to the start again (as we're entirely within our rights to do for a query with no ORDER BY). That'd give the system a chance to vacuum and start refilling the start of the table before you get around to it again, instead of repeatedly having to step over the same useless pages every time you need a new job. Combined with the N-tables idea, you'd be approaching a sweet spot for contention and dead-tuple avoidance. The synchronized_seqscans setting is related to this idea, but more expensive, different, and probably not useful. Hmm.
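The "preferred table first, then the others in round-robin order" probe sequence suggested above can be sketched as follows (a hypothetical helper with invented names, not anything in PostgreSQL -- each consumer would try the tables in this order until it claims enough jobs):

```python
# Sketch of the suggested probe order: a consumer starts at its
# "preferred" job table (my_worker_id % ntables) and then tries the
# remaining tables round-robin, so contention is spread out.
def probe_order(worker_id: int, ntables: int) -> list[int]:
    preferred = worker_id % ntables
    return [(preferred + i) % ntables for i in range(ntables)]
```

For example, with 4 job tables, worker 5 would probe table 1 first, then 2, 3, and 0; neighbouring workers start at different tables, so they mostly stay out of each other's way until the queue runs dry.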
I guess another way to avoid colliding with others' work would be to try to use

SELECT * FROM t TABLESAMPLE SYSTEM (10) WHERE ... FOR UPDATE SKIP LOCKED LIMIT ...

It's less cache-friendly, and less order-preserving, but way more contention-friendly. That has another complication though: how do you pick 10? And if it doesn't return any rows, or not enough rows, that doesn't mean there aren't enough in the table, so you may need to be ready to fall back to the plain approach if having 250 rows is really important to you and TABLESAMPLE doesn't give you enough. Or something.

By the way, when working with around 64 consumer processes I was also annoyed by the thundering herd problem when using NOTIFY. I found various workaround solutions to that, but ultimately I think we need more precise wakeups for that sort of thing, which I hope to revisit one day.
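One answer to "how do you pick 10?" is to over-sample relative to the batch size you want. This is an invented heuristic, not something from the thread or from PostgreSQL itself: estimate the live-row count, then choose a sampling percentage that should return several times the target batch, clamped to 100.

```python
# Invented heuristic for choosing a TABLESAMPLE SYSTEM percentage:
# over-sample by a safety factor so the sample usually contains at
# least `target` candidate rows, falling back to a full scan (100%)
# when the table is small. Purely illustrative.
def sample_percent(estimated_rows: int, target: int, safety: float = 4.0) -> float:
    if estimated_rows <= 0:
        return 100.0  # no estimate: just scan everything
    pct = 100.0 * target * safety / estimated_rows
    return min(100.0, pct)
```

With ~10,000 queued rows and a batch of 250, this yields 10 percent, matching the example above; because TABLESAMPLE SYSTEM is block-based and probabilistic, a caller still needs the fallback path described in the email when the sample comes back short.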
Re: dsa_allocate() faliure
On Mon, Feb 4, 2019 at 6:52 PM Jakub Glapa wrote:
> I see the error showing up every night on 2 different servers. But it's a bit
> of a heisenbug because if I go there now it won't be reproducible.

Huh. Ok, well that's a lot more frequent than I thought. Is it always the same query? Any chance you can get the plan? Are there more things going on on the server, like perhaps concurrent parallel queries?

> It was suggested by Justin Pryzby that I recompile pg src with his patch that
> would cause a coredump.

Small correction to Justin's suggestion: don't abort() after elog(ERROR, ...), it'll never be reached.

> But I don't feel comfortable doing this especially if I would have to run
> this with prod data.
> My question is. Can I do anything like increasing logging level or enable
> some additional options?
> It's a production server but I'm willing to sacrifice a bit of it's
> performance if that would help.

If you're able to run a throwaway copy of your production database on another system that you don't have to worry about crashing, you could just replace ERROR with PANIC and run a high-speed loop of the query that crashed in production, or something. This might at least tell us whether it reaches that condition via something dereferencing a dsa_pointer or via something manipulating the segment lists while allocating/freeing. In my own 100% unsuccessful attempts to reproduce this I was mostly running the same query (based on my guess at what ingredients are needed), but perhaps it requires a particular allocation pattern that will require more randomness to reach... hmm.

-- 
Thomas Munro
http://www.enterprisedb.com
Re: dsa_allocate() faliure
On Tue, Jan 29, 2019 at 10:32 PM Fabio Isabettini wrote:
> we are facing a similar issue on a Production system using a Postgresql 10.6:
>
> org.postgresql.util.PSQLException: ERROR: EXCEPTION on getstatistics ; ID:
> EXCEPTION on getstatistics_media ; ID: uidatareader.
> run_query_media(2): [a1] REMOTE FATAL: dsa_allocate could not find 7 free
> pages
>
> We would like not to stop the Production system and upgrade it to PG11. And
> even though would this guarantee a permanent fix?
> Any suggestion?

Hi Fabio,

Thanks for your report. Could you please also show the query plan that runs on the "remote" node (where the error occurred)?

There is no indication that upgrading to PG11 would help here. It seems we have an undiagnosed bug (in 10 and 11), and so far no one has been able to reproduce it at will. I personally have chewed a lot of CPU time on several machines trying various plan shapes and not seen this, or the possibly related symptom from bug #15585, even once. But we have about three reports of each of the two symptoms. One reporter wrote to me off-list to say that they'd seen #15585 twice, the second time by running the same query in a tight loop for 8 hours, and then not seen it again in the past 3 weeks. Clearly there is an issue needing a fix here, but I don't yet know what it is.

-- 
Thomas Munro
http://www.enterprisedb.com
Re: dsa_allocate() faliure
On Tue, Jan 29, 2019 at 2:50 AM Arne Roland wrote:
> does anybody have any idea what goes wrong here? Is there some additional
> information that could be helpful?

Hi Arne,

This seems to be a bug; that error should not be reached. I wonder if it is a different manifestation of the bug reported as #15585 (ie some type of rare corruption). Are you able to reproduce this consistently? Can you please show the query plan?

-- 
Thomas Munro
http://www.enterprisedb.com
Re: dsa_allocate() faliure
On Wed, Aug 29, 2018 at 5:48 PM Sand Stone wrote:
> I attached a query (and its query plan) that caused the crash: "dsa_allocate
> could not find 13 free pages" on one of the worker nodes. I anonymised the
> query text a bit. Interestingly, this time only one (same one) of the nodes
> is crashing. Since this is a production environment, I cannot get the stack
> trace. Once I turned off parallel execution for this node, the whole query
> finished just fine. So the parallel query plan is from one of the nodes that
> didn't crash; hopefully the same plan would have been executed on the crashed
> node. In theory, every worker node has the same bits, and very similar data.

I wonder if this was a different symptom of the problem fixed here:

https://www.postgresql.org/message-id/flat/194c0706-c65b-7d81-ab32-2c248c3e2344%402ndquadrant.com

Can you still reproduce it on current master, REL_11_STABLE or REL_10_STABLE?

-- 
Thomas Munro
http://www.enterprisedb.com
Re: Why the sql is not executed in parallel mode
On Wed, Sep 19, 2018 at 1:53 PM jimmy wrote:
> Why the sql is not executed in parallel mode, does the sql has some problem?
> with sql1 as

Hello Jimmy,

WITH is the problem. From the manual[1]: "The following operations are always parallel restricted. Scans of common table expressions (CTEs). ...". That means that these CTEs can only be scanned in the leader process. If you rewrite the query using sub-selects it might do better. FWIW there is a project to make WITH work like sub-selects automatically in a future release of PostgreSQL:

https://www.postgresql.org/message-id/flat/87sh48ffhb@news-spur.riddles.org.uk

[1] https://www.postgresql.org/docs/10/static/parallel-safety.html

-- 
Thomas Munro
http://www.enterprisedb.com
Re: dsa_allocate() faliure
On Tue, Jan 30, 2018 at 5:37 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Rick Otten <rottenwindf...@gmail.com> writes:
>> I'm wondering if there is anything I can tune in my PG 10.1 database to
>> avoid these errors:
>
>> $ psql -f failing_query.sql
>> psql:failing_query.sql:46: ERROR: dsa_allocate could not find 7 free pages
>> CONTEXT: parallel worker
>
> Hmm. There's only one place in the source code that emits that message
> text:
>
>     /*
>      * Ask the free page manager for a run of pages.  This should always
>      * succeed, since both get_best_segment and make_new_segment should
>      * only return a non-NULL pointer if it actually contains enough
>      * contiguous freespace.  If it does fail, something in our backend
>      * private state is out of whack, so use FATAL to kill the process.
>      */
>     if (!FreePageManagerGet(segment_map->fpm, npages, &first_page))
>         elog(FATAL,
>              "dsa_allocate could not find %zu free pages", npages);
>
> Now maybe that comment is being unreasonably optimistic, but it sure
> appears that this is supposed to be a can't-happen case, in which case
> you've found a bug.

This is probably the bug fixed here:

https://www.postgresql.org/message-id/E1eQzIl-0004wM-K3%40gemulon.postgresql.org

That was back-patched, so 10.2 will contain the fix. The bug was not in dsa.c itself, but in the parallel query code that mixed up DSA areas, corrupting them. The problem comes up when the query plan has multiple Gather nodes (and a particular execution pattern) -- is that the case here, in the EXPLAIN output? That seems plausible given the description of a 50-branch UNION. The only workaround until 10.2 would be to set max_parallel_workers_per_gather to 0 to prevent parallelism completely for this query.

-- 
Thomas Munro
http://www.enterprisedb.com