Re: Performance implications of 8K pread()s

2024-04-11 Thread Thomas Munro
On Wed, Jul 12, 2023 at 1:11 AM Dimitrios Apostolou  wrote:
> So would it make sense for postgres to perform reads in bigger blocks? Is it
> easy-ish to implement (where would one look for that)? Or must the I/O unit be
> tied to postgres' page size?

FYI as of last week we can do a little bit of that on the master branch:

postgres=# select count(*) from t;

preadv(46, ..., 8, 256237568) = 131072
preadv(46, ..., 5, 256368640) = 131072
preadv(46, ..., 8, 256499712) = 131072
preadv(46, ..., 5, 256630784) = 131072

postgres=# set io_combine_limit = '256k';
postgres=# select count(*) from t;

preadv(47, ..., 5, 613728256) = 262144
preadv(47, ..., 5, 613990400) = 262144
preadv(47, ..., 5, 614252544) = 262144
preadv(47, ..., 5, 614514688) = 262144

Here's hoping the commits implementing this stick, for the PostgreSQL
17 release.  It's just the beginning though, we can only do this for
full table scans so far (plus a couple of other obscure places).
Hopefully in the coming year we'll get the "streaming I/O" mechanism
that powers this hooked up to lots more places... index scans and
other stuff.  And writing.  Then eventually pushing the I/O into the
background.  Your questions actually triggered us to talk about why we
couldn't switch a few things around in our project and get the I/O
combining piece done sooner.  Thanks!




Re: pgsql 10.23 , different systems, same table , same plan, different Buffers: shared hit

2023-09-15 Thread Thomas Munro
On Sat, Sep 16, 2023 at 7:42 AM Tom Lane  wrote:
> Sadly, this proves very little about Linux's behavior.  glibc's idea
> of en_US involves some very complicated multi-pass sort rules.
> AFAICT from the FreeBSD sort(1) man page, FreeBSD defines en_US
> as "same as C except case-insensitive", whereas I'm pretty sure
> that underscores and other punctuation are nearly ignored in
> glibc's interpretation; they'll only be taken into account if the
> alphanumeric parts of the strings sort equal.

Achilleas didn't mention the glibc version, but based on the kernel
vintage mentioned I guess that must be the "old" (pre 2.28) glibc
sorting.  In 2.28 they did a big sync-up with ISO 14651, while FreeBSD
follows the UCA, a closely related standard[1].  I think newer
Linux/glibc systems should agree with FreeBSD's libc in more cases
(and also agree with ICU).

[1] https://unicode.org/reports/tr10/#Synch_ISO14651
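To make the difference concrete, here is a toy sketch (my illustration, not glibc's or FreeBSD's actual implementation) of the two behaviours described above: a "C except case-insensitive" comparison versus a multi-pass comparison where punctuation only breaks ties between otherwise-equal alphanumeric parts.

```python
# Toy illustration of the two collation rules described above (an
# assumption-laden simplification, NOT the real libc algorithms).

def c_like_key(s):
    # FreeBSD-style en_US per the description: byte order, case-insensitive.
    return s.lower()

def multipass_key(s):
    # glibc-style: compare alphanumeric characters first; the full
    # string (punctuation included) is only a tiebreaker.
    return (''.join(ch for ch in s.lower() if ch.isalnum()), s)

words = ["foo_bar", "fooa", "foo-baz", "foob"]
print(sorted(words, key=c_like_key))    # ['foo-baz', 'foo_bar', 'fooa', 'foob']
print(sorted(words, key=multipass_key)) # ['fooa', 'foob', 'foo_bar', 'foo-baz']
```

The punctuated strings sort to opposite ends of the list depending on which rule applies, which is exactly the kind of divergence that shows up when comparing index order across the two systems.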




Re: Performance implications of 8K pread()s

2023-07-16 Thread Thomas Munro
On Thu, Jul 13, 2023 at 6:50 AM Dimitrios Apostolou  wrote:
> Interesting and kind of sad that the last update on the wiki page is from
> 2021. What is the latest prototype? I'm not sure I'm up to the task of
> putting my database to the test. ;-)

It works pretty well, certainly well enough to try out, and work is
happening.  I'll try to update the wiki with some more up-to-date
information soon.  Basically, compare these two slides (you could also
look at slide 11, which is the one most people are probably
interested in, but then you can't really see what's going on with
system call-level tools):


https://speakerdeck.com/macdice/aio-and-dio-for-postgresql-on-freebsd?slide=7
https://speakerdeck.com/macdice/aio-and-dio-for-postgresql-on-freebsd?slide=9

Not only are the IOs converted into 128KB preadv() calls, they are
issued concurrently and ahead of time while your backend is chewing on
the last lot of pages.  So even if your file system completely fails
at prefetching, we'd have a fighting chance at getting closer to
device/line speed.  That's basically what you have to do to support
direct I/O, where there is no system-provided prefetching.




Re: Performance implications of 8K pread()s

2023-07-11 Thread Thomas Munro
On Wed, Jul 12, 2023 at 5:12 AM Thomas Munro  wrote:
> "gathering"

(Oops, for reads, that's "scattering".  As in scatter/gather I/O but I
picked the wrong one...).




Re: Performance implications of 8K pread()s

2023-07-11 Thread Thomas Munro
On Wed, Jul 12, 2023 at 1:11 AM Dimitrios Apostolou  wrote:
> Note that I suspect my setup being related, (btrfs compression behaving
> suboptimally) since the raw device can give me up to 1GB/s rate. It is however
> evident that reading in bigger chunks would mitigate such setup 
> inefficiencies.
> On a system that reads are already optimal and the read rate remains the same,
> then bigger block size would probably reduce the sys time postgresql consumes
> because of the fewer system calls.

I don't know about btrfs but maybe it can be tuned to prefetch
sequential reads better...

> So would it make sense for postgres to perform reads in bigger blocks? Is it
> easy-ish to implement (where would one look for that)? Or must the I/O unit be
> tied to postgres' page size?

It is hard to implement.  But people are working on it.  One of the
problems is that the 8KB blocks that we want to read data into aren't
necessarily contiguous so you can't just do bigger pread() calls
without solving a lot more problems first.  The project at
https://wiki.postgresql.org/wiki/AIO aims to deal with the
"clustering" you seek plus the "gathering" required for non-contiguous
buffers by allowing multiple block-sized reads to be prepared and
collected on a pending list up to some size that triggers merging and
submission to the operating system at a sensible rate, so we can build
something like a single large preadv() call.  In the current
prototype, if io_method=worker then that becomes a literal preadv()
call running in a background "io worker" process, but it could also be
OS-specific stuff (io_uring, ...) that starts an asynchronous IO
depending on settings.  If you take that branch and run your test you
should see 128KB-sized preadv() calls.
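The merging step described above can be sketched in a few lines. This is a hypothetical simplification for illustration, not PostgreSQL's actual implementation: block-sized reads accumulate on a pending list, consecutive blocks are merged into runs, and each run becomes one large preadv()-style call, capped at a combine limit.

```python
# Sketch of I/O combining: merge pending 8KB block reads into larger
# contiguous reads (illustration only, not PostgreSQL's code).

BLKSZ = 8192

def combine(block_numbers, io_combine_limit=16):
    """Merge consecutive block numbers into (offset, length) byte runs,
    capped at io_combine_limit blocks per run."""
    runs = []
    start = prev = None
    for b in sorted(block_numbers):
        if prev is not None and b == prev + 1 and (b - start) < io_combine_limit:
            prev = b  # extend the current run
        else:
            if start is not None:
                runs.append((start * BLKSZ, (prev - start + 1) * BLKSZ))
            start = prev = b  # begin a new run
    if start is not None:
        runs.append((start * BLKSZ, (prev - start + 1) * BLKSZ))
    return runs

# A sequential scan's next 32 blocks become two 128KB reads:
print(combine(range(0, 32)))  # [(0, 131072), (131072, 131072)]
```

Non-contiguous buffers simply break the run, which is why the harder part of the real project is all the machinery for preparing and submitting the reads, not the merging itself.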




Re: Fsync IO issue

2023-05-04 Thread Thomas Munro
On Fri, May 5, 2023 at 8:37 AM ProfiVPS Support  wrote:
>  I feel like ANYTHING would be better than this. Even risking losing _some_ 
> of the latest data in case of a server crash (if it crashes we lose data 
> anyways until restart, ofc we could have HA I know and we will when there'll 
> be a need) .

Try synchronous_commit=off:

https://www.postgresql.org/docs/current/wal-async-commit.html




Re: High QPS, random index writes and vacuum

2023-04-17 Thread Thomas Munro
On Tue, Apr 18, 2023 at 2:43 PM peter plachta  wrote:
> I was trying to understand whether there are any known workarounds for random 
> access + index vacuums. Are my vacuum times 'normal' ?

Ah, it's not going to help on the old versions you mentioned, but for
what it's worth: I remember noticing that I could speed up vacuum of
uncorrelated indexes using parallel vacuum (v13), huge_pages=on,
maintenance_work_mem=BIG, min_dynamic_shared_memory=BIG (v14),
because then the memory that is binary-searched in random order avoids
thrashing the TLB.
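As a rough sketch, the combination described above might look like this in postgresql.conf (sizes are placeholders to adjust for your machine, not recommendations; huge_pages and min_dynamic_shared_memory require a restart and v14+ for the latter):

```
# Hypothetical settings sketch based on the observations above:
huge_pages = on
maintenance_work_mem = '8GB'          # big TID store for index vacuum
min_dynamic_shared_memory = '8GB'     # lets parallel vacuum memory use huge pages
max_parallel_maintenance_workers = 4  # parallel index vacuuming (v13+)
```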




Re: Getting an index scan to be a parallel index scan

2023-02-01 Thread Thomas Munro
On Thu, Feb 2, 2023 at 1:54 PM Alex Kaiser  wrote:
> Thanks for the explanation. Yes, that is the query plan I was imagining. I do 
> see how chopping it up could result in an unfair distribution. But my counter 
> to that would be that wouldn't chopping it up still be better than not. If 
> things do happen to work out to be fair, now it's X times as fast, if things 
> are very unfair, then you haven't really lost much (besides the parallel 
> overhead) compared to the non-parallel query. Or maybe it should be possible 
> to do the parallel query if there were some statistics (either normal ones or 
> extended ones) that told the planner that the result would probably be fair?

Maybe, but unfairness multiplies if it's part of a larger plan; what
if the output of those nodes is the input to much more work, but now
THAT work is being done by one process?  But yeah, statistics could
help with that.  I'm vaguely aware that other systems that do more
partition-based parallelism spend a lot of effort on that sort of
thinking.

> Though I do agree that the "work stealing" option would be the most 
> efficient, but would be a lot more complicated to code up.

Yeah.  I probably used the wrong word; what I was describing is
(something like) page-based parallelism, where input gets chopped up
into arbitrary chunks and handed out to consumers on demand, but we
don't know anything about the values in those chunks; that allows for
many interesting kinds of plans, and it's nice because it's fair.

Another kind of parallelism is partition-based, which PostgreSQL can
do in a limited sense: we can send workers into different partitions
of a table (what we can't do is partition the table on-the-fly, which
is central to most parallelism in some other systems).  Let's see:

CREATE TABLE testing(
   id INT,
   info INT,
   data_one TEXT,
   data_two TEXT,
   primary key(id, info)
) partition by hash (id);
create table testing_p0 partition of testing for values with (modulus
2, remainder 0);
create table testing_p1 partition of testing for values with (modulus
2, remainder 1);
INSERT INTO testing(id, info, data_one, data_two)
SELECT idx, idx, md5(random()::text), md5(random()::text)
FROM generate_series(1,1000) idx;
analyze;

explain select count(*) from testing where id in
(1608377,5449811,5334677,5458230,2053195,3572313,1949724,3559988,5061560,8479775,
...);

 Aggregate
   ->  Append
 ->  Index Only Scan using testing_p0_pkey on testing_p0 testing_1
 ->  Index Only Scan using testing_p1_pkey on testing_p1 testing_2

Hmph.  I can't seem to convince it to use Parallel Append.  I think it
might be because the planner is not smart enough to chop down the =ANY
lists to match the partitions.  One sec...

Ok I hacked my copy of PostgreSQL to let me set parallel_setup_costs
to negative numbers, and then I told it that parallelism is so awesome
that it makes your queries cost -100 timerons before they even
start.  Now I see a plan like:

 Gather  (cost=-99.57..-987689.45 rows=2000 width=74)
   Workers Planned: 2
   ->  Parallel Append  (cost=0.43..12110.55 rows=832 width=74)
 ->  Parallel Index Scan using testing_p0_pkey on testing_p0 testing_1
 ->  Parallel Index Scan using testing_p1_pkey on testing_p1 testing_2

But it's probing every index for every one of the values in the big
list, not just the ones that have a non-zero chance of finding a
match, which is a waste of cycles.  I think if the planner were
smarter about THAT (as it is for plain old "="), then the costing
would have chosen parallelism naturally by cost.

But it's probably not as cool as page-based parallelism, because
parallelism is limited by your partitioning scheme.

If I had more timerons myself, I'd like to try to make parallel
function scans, or parallel CTE scans, work...




Re: Getting an index scan to be a parallel index scan

2023-02-01 Thread Thomas Munro
On Wed, Feb 1, 2023 at 6:39 PM Alex Kaiser  wrote:
> select * from testing where id in (1608377,5449811, ... <1000 random ids> 
> ,4654284,3558460);
>
> Essentially I have a list of 1000 ids and I would like the rows for all of 
> those ids.
>
> This seems like it would be pretty easy to parallelize, if you have X threads 
> then you would split the list of IDs into 1000/X sub lists and give one to 
> each thread to go find the rows for ids in the given list.  Even when I use 
> the following configs I don't get a query plan that actually uses any 
> parallelism:

It sounds like the plan you are imagining is something like:

Gather
  Nested Loop Join
Outer side: 
Inner side: Index scan of your big table

Such a plan would only give the right answer if each process has a
non-overlapping subset of the constant values to probe the index with,
and together they have the whole set.  Hypothetically, a planner could
chop that set up beforehand and give a different subset to each
process (just as you could do that yourself using N connections and
separate queries), but that might be unfair: one process might find
lots of matches, and the others might find none, because of the
distribution of data.  So you'd ideally want some kind of "work
stealing" scheme, where each worker can take more values to probe from
whenever it needs more, so that they all keep working until the values
run out.  We don't have a thing that can do that.  You might imagine
that a CTE could do it, so WITH keys_to_look_up AS (VALUES (1), (2),
...) SELECT ... JOIN ON ..., but that also doesn't work because we
don't have a way to do "partial" scans of CTEs either (though someone
could invent that).  Likewise for temporary tables: they are invisible
to parallel workers, so they can't help us.  I have contemplated
"partial function scans" for set-returning functions, where a function
could be given a bit of shared memory and various other infrastructure
to be able to be "parallel aware" (= able to coordinate across
processes so that each process gets a subset of the data), and one
could imagine that that would allow various solutions to the problem,
but that's vapourware.

But you can get a plan like that if you insert all those values into a
regular table, depending on various settings, stats and
min_parallel_table_scan_size (try 0, I guess that'll definitely do
it).  Which probably isn't the answer you wanted to hear.
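The "work stealing" idea sketched above can be illustrated outside the database. In this toy model (my illustration; PostgreSQL has no such executor node), N workers pull probe values from a shared queue on demand, so a worker whose values happen to be cheap simply takes more, and everyone stays busy until the values run out:

```python
# Toy model of work stealing: workers pull values from a shared queue
# on demand instead of being handed fixed, possibly unfair subsets.

import queue
import threading

def probe(value):
    # Stand-in for an index probe of one constant value.
    return value * value

def run_workers(values, nworkers=4):
    q = queue.Queue()
    for v in values:
        q.put(v)
    results = []
    counts = [0] * nworkers          # how much work each worker did
    lock = threading.Lock()

    def worker(i):
        while True:
            try:
                v = q.get_nowait()   # "steal" the next value
            except queue.Empty:
                return               # values ran out; this worker is done
            r = probe(v)
            with lock:
                results.append(r)
                counts[i] += 1

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(nworkers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results), counts

results, counts = run_workers(range(100))
print(len(results), sum(counts))     # every value probed exactly once
```

A static split would instead assign 25 values to each worker up front, which is exactly the unfairness risk described above when the matches cluster in one worker's share.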




Re: Fwd: temp_file_limit?

2022-12-19 Thread Thomas Munro
On Tue, Dec 20, 2022 at 8:59 AM Frits Jalvingh  wrote:
> @ranier
> These files ONLY exist during the query. They get deleted as soon as the 
> query terminates, by Postgres itself. Once the query terminates pgsql_tmp is 
> completely empty. Considering what Thomas said (and the actual occurrence of 
> the files he mentioned) this does seem to be the more likely cause to me.

I'm working on some bug fixes near this area at the moment, so I'll
also see if I can figure out how to implement the missing eager
cleanup of earlier generations.  It's still a pretty bad scenario once
you reach it (repartitioning repeatedly, that is) and the solution to
that is probably much harder, but it's obviously not great to waste
temporary disk space like that.  BTW you can disable just parallel
hash with enable_parallel_hash=false.




Re: Fwd: temp_file_limit?

2022-12-18 Thread Thomas Munro
On Mon, Dec 19, 2022 at 1:51 PM Thomas Munro  wrote:
> It's really the limit for a single file

Oops, sorry I take that back.  It should be per process.




Re: Fwd: temp_file_limit?

2022-12-18 Thread Thomas Munro
On Mon, Dec 19, 2022 at 9:11 AM Justin Pryzby  wrote:
> On Sun, Dec 18, 2022 at 06:29:41PM +0100, Frits Jalvingh wrote:
> > Just to be clear: my real question is: why is temp_file_limit not
> > working at the specified size? Because this is my real problem: when a
> > query is dying like this it will also kill other queries because these
> > are also running out of space. Even when the limit is per-process it
> > should not have exceeded 200GB imo.

It's really the limit for a single file (or virtual file because we
split them on 1GB boundaries, probably well past time we stopped doing
that), but we create many temporary files for various reasons.  One
possibility is that you've hit a case that needs several rounds of
repartitioning (because of a failure to estimate the number of tuples
well), but we can't see that because you didn't show EXPLAIN (ANALYZE)
output (understandably if it runs out of disk space before
completing...).  The parallel hash code doesn't free up the previous
generations' temporary files; it really only needs two generations'
worth concurrently (the one it's reading from and the one it's writing
to).  In rare cases where more generations are needed it could unlink
the older ones -- that hasn't been implemented yet.  If you set
log_temp_files = 0 to log temporary file names, it should be clear if
it's going through multiple rounds of repartitioning, from the names
(...of32..., ...of64..., ...of128..., ...of256..., ...).
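If you do enable log_temp_files, a small script can summarize the generations. The name pattern matched here is a guess for illustration based on the "...ofN..." fragments mentioned above, not the exact server file-name format:

```python
# Hedged sketch: detect parallel hash repartitioning rounds from the
# "...ofN..." part of logged temp file names (pattern is illustrative).

import re

def repartition_rounds(filenames):
    """Return the distinct partition counts seen, ascending.
    Multiple values (32, 64, 128, ...) suggest repeated repartitioning."""
    counts = set()
    for name in filenames:
        m = re.search(r'of(\d+)', name)
        if m:
            counts.add(int(m.group(1)))
    return sorted(counts)

logged = [
    "i3of32.p0.0", "i17of32.p1.0",
    "i5of64.p0.0", "i40of128.p1.0", "i200of256.p0.0",
]
print(repartition_rounds(logged))  # [32, 64, 128, 256]
```

Four distinct partition counts like this would point at the bad tuple-count estimate scenario described above.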




Re: pgbench: could not connect to server: Resource temporarily unavailable

2022-08-23 Thread Thomas Munro
On Wed, Aug 24, 2022 at 3:06 PM Tom Lane  wrote:
> Thomas Munro  writes:
> > Oh, one comment there is actually obsolete now AFAIK.  Unless there is
> > some reason to think personality(ADDR_NO_RANDOMIZE) might not work in
> > some case where sysctl -w kernel.randomize_va_space=0 will, I think we
> > can just remove that.
>
> AFAICS, f3e78069db7 silently does nothing on platforms lacking
> ADDR_NO_RANDOMIZE and PROC_ASLR_FORCE_DISABLE.  Are you asserting
> there are no such platforms?

That's a Linux-only sysctl.  ADDR_NO_RANDOMIZE is also Linux-only.
Both controls are old enough to be in any kernel that anyone's
developing on.  On further reflection, though, I guess the comment is
still useful.  ADDR_NO_RANDOMIZE only helps you with clusters launched
by pg_ctl and pg_regress.  A developer trying to run "postgres"
directly might still want to know about the sysctl, so I withdraw that
idea.

As for whether there are platforms where it does nothing: definitely.
These are highly OS-specific, and we've only tackled Linux and FreeBSD
(with other solutions for macOS and Windows elsewhere in the tree),
but I doubt it matters: these are just the OSes that have ASLR on by
default, that someone in our community uses as a daily driver to hack
PostgreSQL on, that has been annoyed enough to look up how to turn it
off :-)




Re: pgbench: could not connect to server: Resource temporarily unavailable

2022-08-23 Thread Thomas Munro
On Tue, Aug 23, 2022 at 2:42 PM Thomas Munro  wrote:
> > 0002 isn't quite related, but while writing 0001 I noticed a nearby
> > use of /proc/sys/... which I thought should be converted to sysctl.
> > IMO /proc/sys pretty much sucks, at least for documentation purposes,
> > for multiple reasons:

Oh, one comment there is actually obsolete now AFAIK.  Unless there is
some reason to think personality(ADDR_NO_RANDOMIZE) might not work in
some case where sysctl -w kernel.randomize_va_space=0 will, I think we
can just remove that.
From 68e05cb3e4ae3b064af38241ac4ba8f9fe997b39 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Wed, 24 Aug 2022 14:46:11 +1200
Subject: [PATCH] Remove obsolete comment.

Since commit f3e78069db7, it shouldn't be necessary to use sysctl to
disable ASLR on Linux.  Remove comment about that.

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 1664fcee2a..51748ec735 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -4887,16 +4887,6 @@ SubPostmasterMain(int argc, char *argv[])
 	 * to do this before going any further to ensure that we can attach at the
 	 * same address the postmaster used.  On the other hand, if we choose not
 	 * to re-attach, we may have other cleanup to do.
-	 *
-	 * If testing EXEC_BACKEND on Linux, you should run this as root before
-	 * starting the postmaster:
-	 *
-	 * sysctl -w kernel.randomize_va_space=0
-	 *
-	 * This prevents using randomized stack and code addresses that cause the
-	 * child process's memory map to be different from the parent's, making it
-	 * sometimes impossible to attach to shared memory at the desired address.
-	 * Return the setting to its old value (usually '1' or '2') when finished.
 	 */
 	if (strcmp(argv[1], "--forkbackend") == 0 ||
 		strcmp(argv[1], "--forkavlauncher") == 0 ||
-- 
2.37.1



Re: pgbench: could not connect to server: Resource temporarily unavailable

2022-08-22 Thread Thomas Munro
On Tue, Aug 23, 2022 at 3:53 PM Tom Lane  wrote:
> Thomas Munro  writes:
> > On Tue, Aug 23, 2022 at 4:57 AM Tom Lane  wrote:
> > +service the requests, with those clients receiving unhelpful
> > +connection failure errors such as Resource temporarily
> > +unavailable.
>
> > LGTM but I guess I would add "... or Connection refused"?
>
> Is that the spelling that appears on FreeBSD?  Happy to add it.

Yep.




Re: pgbench: could not connect to server: Resource temporarily unavailable

2022-08-22 Thread Thomas Munro
On Tue, Aug 23, 2022 at 4:57 AM Tom Lane  wrote:
> 0001 adds a para about how to raise the listen queue length.

+service the requests, with those clients receiving unhelpful
+connection failure errors such as Resource temporarily
+unavailable.

LGTM but I guess I would add "... or Connection refused"?

> 0002 isn't quite related, but while writing 0001 I noticed a nearby
> use of /proc/sys/... which I thought should be converted to sysctl.
> IMO /proc/sys pretty much sucks, at least for documentation purposes,
> for multiple reasons:

+1

> 0003 removes PG_SOMAXCONN.  While doing that I noticed that this
> computation hadn't been touched throughout all the various
> changes fooling with exactly what gets counted in MaxBackends.
> I think the most appropriate definition for the listen queue
> length is now MaxConnections * 2, not MaxBackends * 2, because
> the other processes counted in MaxBackends don't correspond to
> incoming connections.

+1

> I propose 0003 for HEAD only, but the docs changes could be
> back-patched.

+1




Re: pgbench: could not connect to server: Resource temporarily unavailable

2022-08-21 Thread Thomas Munro
On Mon, Aug 22, 2022 at 2:18 PM Tom Lane  wrote:
> Thomas Munro  writes:
> > On Mon, Aug 22, 2022 at 12:20 PM Tom Lane  wrote:
> >> Hmm.  It'll be awhile till the 128 default disappears entirely
> >> though, especially if assorted BSDen use that too.  Probably
> >> worth the trouble to document.
>
> > I could try to write a doc patch if you aren't already on it.
>
> I haven't done anything about it yet, but could do so tomorrow or so.

Cool.  BTW small correction to something I said about FreeBSD: it'd be
better to document the new name kern.ipc.soacceptqueue (see listen(2)
HISTORY) even though the old name still works and matches OpenBSD and
macOS.




Re: pgbench: could not connect to server: Resource temporarily unavailable

2022-08-21 Thread Thomas Munro
On Mon, Aug 22, 2022 at 12:20 PM Tom Lane  wrote:
> Thomas Munro  writes:
> > Yeah retrying doesn't seem that nice.  +1 for a bit of documentation,
> > which I guess belongs in the server tuning part where we talk about
> > sysctls, perhaps with a link somewhere near max_connections?  More
> > recent Linux kernels bumped it to 4096 by default so I doubt it'll
> > come up much in the future, though.
>
> Hmm.  It'll be awhile till the 128 default disappears entirely
> though, especially if assorted BSDen use that too.  Probably
> worth the trouble to document.

I could try to write a doc patch if you aren't already on it.

> > Note that we also call listen()
> > with a backlog value capped to our own PG_SOMAXCONN which is 1000.  I
> > doubt many people benchmark with higher numbers of connections but
> > it'd be nicer if it worked when you do...
>
> Actually it's 1.  Still, I wonder if we couldn't just remove
> that limit now that we've desupported a bunch of stone-age kernels.
> It's hard to believe any modern kernel can't defend itself against
> silly listen-queue requests.

Oh, right.  Looks like that was just paranoia in commit 153f4006763,
back when you got away from using the (very conservative) SOMAXCONN
macro.  Looks like that was 5 on ancient systems going back to the
original sockets stuff, and later 128 was a popular number.  Yeah I'd
say +1 for removing our cap.  I'm pretty sure every system will
internally cap whatever value we pass in if it doesn't like it, as
POSIX explicitly says it can freely do with this "hint".

The main thing I learned today is that Linux's connect(AF_UNIX)
implementation doesn't refuse connections when the listen backlog is
full, unlike other OSes.  Instead, for blocking sockets, it sleeps and
wakes with everyone else to fight over space.  I *guess* for
non-blocking sockets that introduced a small contradiction -- there
isn't the state space required to give you a working EINPROGRESS with
the same sort of behaviour (if you reified a secondary queue for that
you might as well make the primary one larger...), but they also
didn't want to give you ECONNREFUSED just because you're non-blocking,
so they went with EAGAIN, because you really do need to call again
with the sockaddr.  The reason I wouldn't want to call it again is
that I guess it'd be a busy CPU burning loop until progress can be
made, which isn't nice, and failing with "Resource temporarily
unavailable" to the user does in fact describe the problem, if
somewhat vaguely.  Hmm, maybe we could add a hint to the error,
though?
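The behaviour described above is easy to reproduce. This sketch (assuming Linux; FreeBSD would give ECONNREFUSED instead, per the earlier message) fills a Unix-domain socket's listen queue that nobody accepts from, then attempts one more non-blocking connect and inspects the errno:

```python
# Demonstrate EAGAIN from a non-blocking connect() to a Unix socket
# whose listen queue is full (Linux behaviour; other OSes differ).

import errno
import os
import socket
import tempfile

def overflow_backlog(path, attempts=64):
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(path)
    server.listen(1)              # tiny backlog; we never call accept()
    clients, result = [], None
    for _ in range(attempts):
        c = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        c.setblocking(False)      # a blocking connect would sleep instead
        try:
            c.connect(path)       # succeeds while queue has space
            clients.append(c)
        except OSError as e:
            result = e.errno      # queue full: the errno under discussion
            c.close()
            break
    for c in clients:
        c.close()
    server.close()
    return result

path = os.path.join(tempfile.mkdtemp(), "s")
print(overflow_backlog(path) == errno.EAGAIN)  # True on Linux
```

The EAGAIN here carries no "call again later will succeed soon" guarantee, which is why retrying in libpq would amount to a busy loop, as noted above.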




Re: pgbench: could not connect to server: Resource temporarily unavailable

2022-08-21 Thread Thomas Munro
On Mon, Aug 22, 2022 at 10:55 AM Tom Lane  wrote:
> Not sure what I think at this point about making libpq retry after
> EAGAIN.  It would make sense for this particular undocumented use
> of EAGAIN, but I'm worried about others, especially the documented
> reason.  On the whole I'm inclined to leave the code alone;
> but is there sufficient reason to add something about adjusting
> somaxconn to our documentation?

My Debian system apparently has a newer man page:

   EAGAIN For  nonblocking UNIX domain sockets, the socket is nonblocking,
  and the connection cannot be completed immediately.   For  other
  socket  families,  there are insufficient entries in the routing
  cache.

Yeah retrying doesn't seem that nice.  +1 for a bit of documentation,
which I guess belongs in the server tuning part where we talk about
sysctls, perhaps with a link somewhere near max_connections?  More
recent Linux kernels bumped it to 4096 by default so I doubt it'll
come up much in the future, though.  Note that we also call listen()
with a backlog value capped to our own PG_SOMAXCONN which is 1000.  I
doubt many people benchmark with higher numbers of connections but
it'd be nicer if it worked when you do...

I was curious and checked how FreeBSD would handle this.  Instead of
EAGAIN you get ECONNREFUSED here, until you crank up
kern.ipc.somaxconn, which also defaults to 128 like older Linux.




Re: pgbench: could not connect to server: Resource temporarily unavailable

2022-08-21 Thread Thomas Munro
On Mon, Aug 22, 2022 at 9:48 AM Tom Lane  wrote:
> It's also pretty unclear why the kernel would want to return
> EAGAIN instead of letting the nonblock connection path do the
> waiting, which is why I'm suspecting a bug rather than designed
> behavior.

Could it be that it fails like that if the listen queue is full on the
other side?

https://github.com/torvalds/linux/blob/master/net/unix/af_unix.c#L1493

If it's something like that, maybe increasing
/proc/sys/net/core/somaxconn would help?  I think older kernels only
had 128 here.
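A quick way to check the current limit programmatically (Linux-only path, as the sysctl above is Linux-specific; the function returns None elsewhere):

```python
# Read the kernel's listen-queue cap mentioned above (Linux only).

def somaxconn():
    try:
        with open("/proc/sys/net/core/somaxconn") as f:
            return int(f.read())
    except OSError:
        return None  # not Linux, or /proc not mounted

print(somaxconn())  # e.g. 128 on older kernels
```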




Re: Need help identifying a periodic performance issue.

2021-11-18 Thread Thomas Munro
On Fri, Nov 19, 2021 at 6:03 AM Robert Creager  wrote:
> Which would be better?  Discard plans or forcing custom plans?  Seems like 
> wrapping a copy might be better than the Postgres.conf change as that would 
> affect all statements.  What kind of performance hit would we be taking with 
> that do you estimate?  Microseconds per statement?  Yeah, hard to say, 
> depends on hardware and such.  Would there be any benefit overall to doing 
> that?  Forcing the replan?

Just to understand what's going on, it'd be interesting to know if the
problem goes away if you *just* inject the DISCARD PLANS statement
before running your COPYs, but if that doesn't help it'd also be
interesting to know what happens if you ANALYZE each table after each
COPY.  Are you running any explicit ANALYZE commands?  How long do
your sessions/connections live for?

I'm wondering if the thing that changed between 9.6 and 13 might be
the heuristics for when auto vacuum's background ANALYZE is triggered,
creating the unlucky timing required to get your system to this state
occasionally.

For a while now I have been wondering how we could teach the
planner/stats system about "volatile" tables (as DB2 calls them), that
is, ones that are frequently empty, which often come up in job queue
workloads.  I've seen problems like this with user queries (I used to
work on big job queue systems across different relational database
vendors, which is why I finished up writing the SKIP LOCKED patch for
9.5), but this is the first time I've contemplated FK check queries
being negatively affected by this kind of stats problem.  I don't have
a good concrete idea, though (various dumb ideas: don't let auto
analyze run on an empty table if it's marked VOLATILE, or ignore
apparently empty stats on tables marked VOLATILE (and use what?),
...).




Re: Need help identifying a periodic performance issue.

2021-11-17 Thread Thomas Munro
On Thu, Nov 18, 2021 at 1:18 PM Robert Creager  wrote:
> So, how do I go about capturing more information for the big brains (you 
> guys) to help figure this out?  I have all our resources at mine (and hence 
> your) disposal.

As a workaround, does it help if you issue DISCARD PLANS before your
COPY jobs, or alternatively start with a fresh connection?  I'm
guessing that something like this is happening.

-- set up the auto_explain extension to show the internal foreign key
check queries' plans
load 'auto_explain';
set auto_explain.log_nested_statements = true;
set auto_explain.log_min_duration = 0;
set auto_explain.log_analyze = true;

drop table if exists r, s cascade;
create table r (i int primary key);
create table s (i int references r(i));

-- collect stats showing r as empty
analyze r;

-- execute RI query 6 times to lock the plan (inserts fail, log shows seq scan)
insert into s values (42);
insert into s values (42);
insert into s values (42);
insert into s values (42);
insert into s values (42);
insert into s values (42);

insert into r select generate_series(1, 100);

-- once more, we still get a seq scan, which is by now a bad idea
insert into s values (42);

discard plans;

-- once more, now we get an index scan
insert into s values (42);




Re: Need help identifying a periodic performance issue.

2021-11-17 Thread Thomas Munro
On Thu, Nov 18, 2021 at 8:28 AM Tom Lane  wrote:
> Justin Pryzby  writes:
> > It shows that the process is running FK triggers.
>
> Indeed, and doing a seqscan therein.  Normally I'd suppose that
> this reflects a lack of an index, but RI_FKey_check should always
> be doing something that matches the referenced table's unique
> constraint, so why isn't it using that?

I wonder if the reference tables are empty sometimes, and there's an
unlucky sequence of events that results in cached RI plans with seq
scans being used later in the same session after the tables are
populated.




Re: Need help identifying a periodic performance issue.

2021-11-16 Thread Thomas Munro
On Wed, Nov 17, 2021 at 11:51 AM Thomas Munro  wrote:
> It's the right output format, but isn't /pid == '$PID'/ only going to
> match one single process called "postgres"?   Maybe /execname ==
> "postgres"/ to catch them all?

Oh, duh, it's the top CPU one.  Makes sense.  Never mind :-)




Re: Need help identifying a periodic performance issue.

2021-11-16 Thread Thomas Munro
On Wed, Nov 17, 2021 at 11:40 AM Robert Creager
 wrote:
> Presuming this is the type of output you are expecting:
>
> CPU IDFUNCTION:NAME
>   0  58709:tick-10s
>
>
>   postgres`AtEOXact_LargeObject+0x11
>   postgres`CommitTransaction+0x127
>   postgres`CommitTransactionCommand+0xf2
>   postgres`PostgresMain+0x1fef
>   postgres`process_startup_packet_die
>   postgres`0x73055b
>   postgres`PostmasterMain+0xf36
>   postgres`0x697837
>   postgres`_start+0x100
>   `0x80095f008
> 1

It's the right output format, but isn't /pid == '$PID'/ only going to
match one single process called "postgres"?   Maybe /execname ==
"postgres"/ to catch them all?  Hopefully it'll be obvious what's
going on from an outlier stack with a high sample count.  Can also be
useful to convert the output to flamegraph format if CPU time is
distributed over many distinct stacks.




Re: Need help identifying a periodic performance issue.

2021-11-15 Thread Thomas Munro
On Tue, Nov 16, 2021 at 5:43 PM Robert Creager  wrote:
> One CPU is pegged, the data has been sent over STDIN, so Postgres is not 
> waiting for more, there are no other queries running using this select:

So PostgreSQL is eating 100% CPU, with no value shown in
wait_event_type, and small numbers of system calls are counted.  In
that case, is there an interesting user stack that jumps out with a
profiler during the slowdown (or, for the kernel-side view, stack())?

sudo dtrace -n 'profile-99 /arg0/ { @[ustack()] = count(); } tick-10s
{ exit(0); }'




Re: FreeBSD UFS & fsync

2021-03-12 Thread Thomas Munro
On Fri, Mar 12, 2021 at 10:09 PM Luca Ferrari  wrote:
> fdatasync                16269.365 ops/sec      61 usecs/op
> fsync                     8471.429 ops/sec     118 usecs/op

> Non-sync'ed 8kB writes:
> write                   278484.510 ops/sec       4 usecs/op

> tps = 136.963971 (excluding connections establishing)

It looks like your system is performing very badly for some other
reason, so that synchronous I/O waits are only a small proportion of
the time, and thus fsync=off doesn't speed things up very much.  I'd
look into profiling the system to try to figure out what it's doing...
maybe it's suffering from super slow hypercalls for gettimeofday(), or
something like that?




Re: FreeBSD UFS & fsync

2021-02-22 Thread Thomas Munro
On Tue, Feb 23, 2021 at 5:49 AM Luca Ferrari  wrote:
> I'm running a virtual machine with FreeBSD 12.2, PostgreSQL 12.5 and
> UFS as filesystem.
> I was experimenting with fsync = off and pgbench, and I see no
> particular difference in tps having fsync enabled or disabled.
> Now, the same tiny test on a linux box provides a 10x tps, while on
> FreeBSD is a 1% increase.
> I'm trying to figure out why, and I suspect there is something related
> to how UFS handles writes.

Do you have WCE enabled?  In that case, modern Linux file systems
would do a synchronous SYNCHRONIZE CACHE for our WAL fdatasync(), but
FreeBSD UFS wouldn't as far as I know.  It does know how to do that
(there's a BIO_FLUSH operation, also used by ZFS), but as far as I can
see UFS uses it just for its own file system meta-data crash safety
currently (see softdep_synchronize()).  (There is also no FUA flag for
O_[D]SYNC writes, an even more modern invention.)




Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED

2020-08-20 Thread Thomas Munro
On Fri, Aug 21, 2020 at 9:58 AM Jim Jarvie  wrote:
> However, as I scale up the number of concurrent connections, I see a spike in 
> CPU (to 100% across 80 cores) when the SELECT FOR UPDATE SKIP LOCKED executes 
> and the select processes wait for multiple minutes (10-20 minutes) before 
> completing.  My use case requires around 256 concurrent processors for the 
> queue but I've been unable to scale beyond 128 without everything grinding to 
> a halt.

Maybe it's just getting algorithmically ugly.  To claim some job rows,
you have to skip all dead/non-matching tuples left behind so far at
the start of the table by all the other sessions, and then also all
currently locked tuples, and you have to do update-chain walks on some
of them too.  It all gets a bit explosive once you have such high
numbers of workers.

I think I'd experiment with splitting the job table up into N tables
and feed jobs into all of them about evenly (by hashing, at random,
whatever), and then I'd assign each consumer a "preferred" table where
it looks for jobs first (perhaps my_worker_id % ntables), before
trying the others in round robin order.  Then they won't trample on
each other's toes so much.
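A minimal sketch of that layout, with hypothetical table and column names (the real producer would hash a job key or pick a table at random):

```sql
-- Split the queue over N tables, here N = 4:
CREATE TABLE job_0 (id bigserial PRIMARY KEY, payload text);
CREATE TABLE job_1 (id bigserial PRIMARY KEY, payload text);
CREATE TABLE job_2 (id bigserial PRIMARY KEY, payload text);
CREATE TABLE job_3 (id bigserial PRIMARY KEY, payload text);

-- Producer: spread jobs about evenly across the tables.
INSERT INTO job_2 (payload) VALUES ('work item');

-- Consumer with my_worker_id = 6 tries its preferred table first
-- (6 % 4 = 2), then falls back to job_3, job_0, job_1 in turn:
SELECT id, payload
FROM job_2
FOR UPDATE SKIP LOCKED
LIMIT 250;
```

The point of the preferred-table assignment is that most consumers spend most of their time scanning a table few others are touching, so they step over far fewer locked or dead tuples.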

In the past I've wondered about a hypothetical circular_seqscan
option, which would cause table scans to start where they left off
last time in each backend, so SELECT * FROM t LIMIT 1 repeated would
show you a different row each time until you get all the way around to
the start again (as we're entirely within our rights to do for a query
with no ORDER BY).  That'd give the system a chance to vacuum and
start refilling the start of the table before you get around to it
again, instead of repeatedly having to step over the same useless
pages every time you need a new job.  Combined with the N tables
thing, you'd be approaching a sweet spot for contention and dead tuple
avoidance.  The synchronized_seqscans setting is related to this idea,
but more expensive, different, and probably not useful.

Hmm.  I guess another way to avoid colliding with others' work would
be to try to use SELECT * FROM t TABLESAMPLE SYSTEM (10) WHERE ... FOR
UPDATE SKIP LOCKED LIMIT ...  It's less cache-friendly, and less
order-preserving, but way more contention-friendly.  That has another
complication though: how do you pick 10?  And if it doesn't return any
rows, or enough rows, that doesn't mean there aren't more in the
table, so you may need to be ready to fall back to the plain approach
if having 250 rows is really important to you and TABLESAMPLE doesn't
give you enough.  Or something.
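Concretely, the sampling variant might look like this (filter and table names hypothetical; the sampling percentage is the knob you'd have to tune):

```sql
-- Sample roughly 10% of t's pages, then try to claim matching rows.
-- SYSTEM sampling is block-based, so it's cheap but not order-preserving.
SELECT id
FROM t TABLESAMPLE SYSTEM (10)
WHERE status = 'pending'        -- hypothetical job filter
FOR UPDATE SKIP LOCKED
LIMIT 250;

-- If fewer than 250 rows came back, fall back to the plain scan:
-- SELECT id FROM t WHERE status = 'pending'
-- FOR UPDATE SKIP LOCKED LIMIT 250;
```

Because each consumer samples a different random set of blocks, collisions on the same hot pages become much rarer, at the cost of the occasional fallback scan.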

By the way, when working with around 64 consumer processes I was also
annoyed by the thundering herd problem when using NOTIFY.  I found
various workaround solutions to that, but ultimately I think we need
more precise wakeups for that sort of thing, which I hope to revisit
one day.




Re: dsa_allocate() faliure

2019-02-04 Thread Thomas Munro
On Mon, Feb 4, 2019 at 6:52 PM Jakub Glapa  wrote:
> I see the error showing up every night on 2 different servers. But it's a bit 
> of a heisenbug because If I go there now it won't be reproducible.

Huh.  Ok well that's a lot more frequent than I thought.  Is it always
the same query?  Any chance you can get the plan?  Are there more
things going on on the server, like perhaps concurrent parallel
queries?

> It was suggested by Justin Pryzby that I recompile pg src with his patch that 
> would cause a coredump.

Small correction to Justin's suggestion: don't abort() after
elog(ERROR, ...), it'll never be reached.

> But I don't feel comfortable doing this especially if I would have to run 
> this with prod data.
> My question is. Can I do anything like increasing logging level or enable 
> some additional options?
> It's a production server but I'm willing to sacrifice a bit of it's 
> performance if that would help.

If you're able to run a throwaway copy of your production database on
another system that you don't have to worry about crashing, you could
just replace ERROR with PANIC and run a high-speed loop of the query
that crashed in production, or something.  This might at least tell us
whether that condition is reached via something dereferencing a
dsa_pointer or something manipulating the segment lists while
allocating/freeing.

In my own 100% unsuccessful attempts to reproduce this I was mostly
running the same query (based on my guess at what ingredients are
needed), but perhaps it requires a particular allocation pattern that
will require more randomness to reach... hmm.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: dsa_allocate() faliure

2019-01-29 Thread Thomas Munro
On Tue, Jan 29, 2019 at 10:32 PM Fabio Isabettini
 wrote:
>  we are facing a similar issue on a Production system using a Postgresql 10.6:
>
> org.postgresql.util.PSQLException: ERROR: EXCEPTION on getstatistics ; ID: 
> EXCEPTION on getstatistics_media ; ID: uidatareader.
> run_query_media(2): [a1] REMOTE FATAL: dsa_allocate could not find 7 free 
> pages

> We would like not to stop the Production system and upgrade it to PG11. And 
> even though would this guarantee a permanent fix?
> Any suggestion?

Hi Fabio,

Thanks for your report.  Could you please also show the query plan
that runs on the "remote" node (where the error occurred)?

There is no indication that upgrading to PG11 would help here.  It
seems we have an undiagnosed bug (in 10 and 11), and so far no one has
been able to reproduce it at will.  I personally have chewed a lot of
CPU time on several machines trying various plan shapes and not seen
this or the possibly related symptom from bug #15585 even once.  But
we have about three reports of each of the two symptoms.  One reporter
wrote to me off-list to say that they'd seen #15585 twice, the second
time by running the same query in a tight loop for 8 hours, and then
not seen it again in the past 3 weeks.  Clearly there is an issue
needing a fix here, but I don't yet know what it is.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: dsa_allocate() faliure

2019-01-28 Thread Thomas Munro
On Tue, Jan 29, 2019 at 2:50 AM Arne Roland  wrote:
> does anybody have any idea what goes wrong here? Is there some additional 
> information that could be helpful?

Hi Arne,

This seems to be a bug; that error should not be reached.  I wonder if
it is a different manifestation of the bug reported as #15585 (ie some
type of rare corruption).  Are you able to reproduce this
consistently?  Can you please show the query plan?

-- 
Thomas Munro
http://www.enterprisedb.com



Re: dsa_allocate() faliure

2018-10-04 Thread Thomas Munro
On Wed, Aug 29, 2018 at 5:48 PM Sand Stone  wrote:
> I attached a query (and its query plan) that caused the crash: "dsa_allocate 
> could not find 13 free pages" on one of the worker nodes. I anonymised the 
> query text a bit.  Interestingly, this time only one (same one) of the nodes 
> is crashing. Since this is a production environment, I cannot get the stack 
> trace. Once turned off parallel execution for this node. The whole query 
> finished just fine. So the parallel query plan is from one of the nodes not 
> crashed, hopefully the same plan would have been executed on the crashed 
> node. In theory, every worker node has the same bits, and very similar data.

I wonder if this was a different symptom of the problem fixed here:

https://www.postgresql.org/message-id/flat/194c0706-c65b-7d81-ab32-2c248c3e2344%402ndquadrant.com

Can you still reproduce it on current master, REL_11_STABLE or REL_10_STABLE?

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Why the sql is not executed in parallel mode

2018-09-18 Thread Thomas Munro
On Wed, Sep 19, 2018 at 1:53 PM jimmy  wrote:
>
> Why the sql is not executed in parallel mode, does the sql has some problem?
> with sql1 as

Hello Jimmy,

WITH is the problem.  From the manual[1]:  "The following operations
are always parallel restricted.  Scans of common table expressions
(CTEs). ...".  That means that these CTEs can only be scanned in the
leader process.

If you rewrite the query using sub selects it might do better.  FWIW
there is a project to make WITH work like subselects automatically in
a future release of PostgreSQL:

https://www.postgresql.org/message-id/flat/87sh48ffhb@news-spur.riddles.org.uk

[1] https://www.postgresql.org/docs/10/static/parallel-safety.html
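To make the difference concrete, here is a hedged before/after sketch with a hypothetical table (in PG 10 the CTE form is parallel restricted; the subselect form is eligible for a parallel plan, assuming the query is otherwise parallel safe):

```sql
-- Parallel restricted in PG 10: the CTE can only be scanned by the leader.
WITH sql1 AS (
    SELECT * FROM big_table WHERE x > 0
)
SELECT count(*) FROM sql1;

-- Rewritten with a subselect, which the planner can parallelize:
SELECT count(*)
FROM (SELECT * FROM big_table WHERE x > 0) AS sql1;
```

Checking EXPLAIN output for a Gather node over a Parallel Seq Scan is the quickest way to confirm the rewrite helped.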

-- 
Thomas Munro
http://www.enterprisedb.com



Re: dsa_allocate() faliure

2018-01-29 Thread Thomas Munro
On Tue, Jan 30, 2018 at 5:37 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Rick Otten <rottenwindf...@gmail.com> writes:
>> I'm wondering if there is anything I can tune in my PG 10.1 database to
>> avoid these errors:
>
>> $  psql -f failing_query.sql
>> psql:failing_query.sql:46: ERROR:  dsa_allocate could not find 7 free pages
>> CONTEXT:  parallel worker
>
> Hmm.  There's only one place in the source code that emits that message
> text:
>
> /*
>  * Ask the free page manager for a run of pages.  This should always
>  * succeed, since both get_best_segment and make_new_segment should
>  * only return a non-NULL pointer if it actually contains enough
>  * contiguous freespace.  If it does fail, something in our backend
>  * private state is out of whack, so use FATAL to kill the process.
>  */
> if (!FreePageManagerGet(segment_map->fpm, npages, &first_page))
> elog(FATAL,
>  "dsa_allocate could not find %zu free pages", npages);
>
> Now maybe that comment is being unreasonably optimistic, but it sure
> appears that this is supposed to be a can't-happen case, in which case
> you've found a bug.

This is probably the bug fixed here:

https://www.postgresql.org/message-id/E1eQzIl-0004wM-K3%40gemulon.postgresql.org

That was back patched, so 10.2 will contain the fix.  The bug was not
in dsa.c itself, but in the parallel query code that mixed up DSA
areas, corrupting them.  The problem comes up when the query plan has
multiple Gather nodes (and a particular execution pattern) -- is that
the case here, in the EXPLAIN output?  That seems plausible given the
description of a 50-branch UNION.  The only workaround until 10.2
would be to reduce max_parallel_workers_per_gather to 0 to prevent
parallelism completely for this query.
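The workaround can be applied per session so the rest of the system keeps its parallel plans, for example:

```sql
-- Session-level workaround until 10.2: disable parallelism for the
-- affected query only, then restore the default.
SET max_parallel_workers_per_gather = 0;
-- ... run the failing multi-branch UNION query here ...
RESET max_parallel_workers_per_gather;
```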

-- 
Thomas Munro
http://www.enterprisedb.com