Re: stress test for parallel workers

2020-10-15 Thread Thomas Munro
On Tue, Aug 25, 2020 at 1:43 AM Tom Lane wrote: > I wrote: > > For our archives' sake: today I got seemingly-automated mail informing me > > that this patch has been merged into the 4.19-stable, 5.4-stable, > > 5.7-stable, and 5.8-stable kernel branches; but not 4.4-stable, > > 4.9-stable, or

Re: stress test for parallel workers

2020-08-24 Thread Tom Lane
I wrote: > For our archives' sake: today I got seemingly-automated mail informing me > that this patch has been merged into the 4.19-stable, 5.4-stable, > 5.7-stable, and 5.8-stable kernel branches; but not 4.4-stable, > 4.9-stable, or 4.14-stable, because it failed to apply. And this morning's

Re: stress test for parallel workers

2020-08-19 Thread Tom Lane
Thomas Munro writes: > On Tue, Jul 28, 2020 at 3:27 PM Tom Lane wrote: >> Anyway, I guess the interesting question for us is how long it >> will take for this fix to propagate into real-world systems. >> I don't have much of a clue about the Linux kernel workflow, >> anybody want to venture a

Re: stress test for parallel workers

2020-08-10 Thread Thomas Munro
On Tue, Jul 28, 2020 at 3:27 PM Tom Lane wrote: > Anyway, I guess the interesting question for us is how long it > will take for this fix to propagate into real-world systems. > I don't have much of a clue about the Linux kernel workflow, > anybody want to venture a guess? Me neither. It just

Re: stress test for parallel workers

2020-07-27 Thread Tom Lane
Thomas Munro writes: > Hehe, the dodgy looking magic numbers *were* wrong: > - * The kernel signal delivery code writes up to about 1.5kB > + * The kernel signal delivery code writes a bit over 4KB >

Re: stress test for parallel workers

2020-07-27 Thread Thomas Munro
On Wed, Dec 11, 2019 at 3:22 PM Thomas Munro wrote: > On Tue, Oct 15, 2019 at 4:50 AM Tom Lane wrote: > > > Filed at > > > https://bugzilla.kernel.org/show_bug.cgi?id=205183 > > For the curious-and-not-subscribed, there's now a kernel patch > proposed for this. We guessed pretty close, but the

Re: stress test for parallel workers

2019-12-10 Thread Thomas Munro
On Tue, Oct 15, 2019 at 4:50 AM Tom Lane wrote: > > Filed at > > https://bugzilla.kernel.org/show_bug.cgi?id=205183 For the curious-and-not-subscribed, there's now a kernel patch proposed for this. We guessed pretty close, but the problem wasn't those dodgy looking magic numbers, it was that

Re: stress test for parallel workers

2019-10-22 Thread Justin Pryzby
I want to give some conclusion to our occurrence of this, which now I think was neither an instance nor indicative of any bug. Summary: postgres was being killed with kill -9 by a deployment script, after it "timed out". Thus no log messages. I initially experienced this while testing migration of a

Re: stress test for parallel workers

2019-10-14 Thread Tom Lane
I wrote: > Filed at > https://bugzilla.kernel.org/show_bug.cgi?id=205183 > We'll see what happens ... Further to this --- I went back and looked at the outlier events where we saw an infinite_recurse failure on a non-Linux-PPC64 platform. There were only three: mereswine| ARMv7

Re: stress test for parallel workers

2019-10-14 Thread Andres Freund
Hi, On 2019-10-13 13:44:59 +1300, Thomas Munro wrote: > On Sun, Oct 13, 2019 at 1:06 PM Tom Lane wrote: > > I don't think any further proof is required that this is > > a kernel bug. Where would be a good place to file it? > > linuxppc-...@lists.ozlabs.org might be the right place. > >

Re: stress test for parallel workers

2019-10-14 Thread Tom Lane
Andres Freund writes: > Probably requires reproducing on a pretty recent kernel first, to have a > decent chance of being investigated... How recent do you think it needs to be? The machine I was testing on yesterday is under a year old: uname -m = ppc64le uname -r = 4.18.19-100.fc27.ppc64le

Re: stress test for parallel workers

2019-10-13 Thread Tom Lane
Filed at https://bugzilla.kernel.org/show_bug.cgi?id=205183 We'll see what happens ... regards, tom lane

Re: stress test for parallel workers

2019-10-13 Thread Tom Lane
Andres Freund writes: > On 2019-10-13 10:29:45 -0400, Tom Lane wrote: >> How recent do you think it needs to be? > My experience reporting kernel bugs is that the latest released version, > or even just the tip of the git tree, is your best bet :/. Considering that we're going to point them at

Re: stress test for parallel workers

2019-10-13 Thread Andres Freund
Hi, On 2019-10-13 10:29:45 -0400, Tom Lane wrote: > Andres Freund writes: > > Probably requires reproducing on a pretty recent kernel first, to have a > > decent chance of being investigated... > > How recent do you think it needs to be? The machine I was testing on > yesterday is under a year

Re: stress test for parallel workers

2019-10-12 Thread Thomas Munro
On Sun, Oct 13, 2019 at 1:06 PM Tom Lane wrote: > I don't think any further proof is required that this is > a kernel bug. Where would be a good place to file it? linuxppc-...@lists.ozlabs.org might be the right place. https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: stress test for parallel workers

2019-10-12 Thread Tom Lane
I've now also been able to reproduce the "infinite_recurse" segfault on wobbegong's host (or, since I was using a gcc build, I guess I should say vulpes' host). The first-order result is that it's the same problem with the kernel not giving us as much stack space as we expect: there's only

Re: stress test for parallel workers

2019-10-11 Thread Tom Lane
I wrote: > It's not very clear how those things would lead to an intermittent > failure though. In the case of the postmaster crashes, we now see > that timing of signal receipts is relevant. For infinite_recurse, > maybe it only fails if an sinval interrupt happens at the wrong time? > (This

Re: stress test for parallel workers

2019-10-11 Thread Thomas Munro
On Sat, Oct 12, 2019 at 9:40 AM Tom Lane wrote: > Andres Freund writes: > > On 2019-10-11 14:56:41 -0400, Tom Lane wrote: > >> ... So it's really hard to explain > >> that as anything except a kernel bug: sometimes, the kernel > >> doesn't give us as much stack as it promised it would. And the

Re: stress test for parallel workers

2019-10-11 Thread Tom Lane
Andres Freund writes: > On 2019-10-11 14:56:41 -0400, Tom Lane wrote: >> ... So it's really hard to explain >> that as anything except a kernel bug: sometimes, the kernel >> doesn't give us as much stack as it promised it would. And the >> machine is not loaded enough for there to be any

Re: stress test for parallel workers

2019-10-11 Thread Andres Freund
Hi, On 2019-10-11 14:56:41 -0400, Tom Lane wrote: > I still don't have a good explanation for why this only seems to > happen in the pg_upgrade test sequence. However, I did notice > something very interesting: the postmaster crashes after consuming > only about 1MB of stack space. This is
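To put the "only about 1MB of stack" observation in context, it helps to see how much stack the process was promised. A minimal sketch, assuming a POSIX system; the helper names format_limit and stack_limits are illustrative and do not come from the thread:

```python
import resource


def format_limit(limit_bytes):
    """Render an rlimit value in kB, or 'unlimited' for RLIM_INFINITY."""
    if limit_bytes == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{limit_bytes // 1024} kB"


def stack_limits():
    """Return the (soft, hard) stack size limits of this process as strings."""
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    return format_limit(soft), format_limit(hard)


if __name__ == "__main__":
    soft, hard = stack_limits()
    print(f"stack rlimit: soft={soft} hard={hard}")
```

The soft limit is what the kernel is supposed to let the stack grow to; a crash well below it is what points away from ordinary stack exhaustion.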

Re: stress test for parallel workers

2019-10-11 Thread Mark Wong
On Sat, Oct 12, 2019 at 08:41:12AM +1300, Thomas Munro wrote: > On Sat, Oct 12, 2019 at 7:56 AM Tom Lane wrote: > > This matches up with the intermittent infinite_recurse failures > > we've been seeing in the buildfarm. Those are happening across > > a range of systems, but they're (almost) all

Re: stress test for parallel workers

2019-10-11 Thread Tom Lane
Thomas Munro writes: > Yeah, I don't know anything about this stuff, but I was also beginning > to wonder if something is busted in the arch-specific fault.c code > that checks if stack expansion is valid[1], in a way that fails with a > rapidly growing stack, well timed incoming signals, and

Re: stress test for parallel workers

2019-10-11 Thread Thomas Munro
On Sat, Oct 12, 2019 at 7:56 AM Tom Lane wrote: > This matches up with the intermittent infinite_recurse failures > we've been seeing in the buildfarm. Those are happening across > a range of systems, but they're (almost) all Linux-based ppc64, > suggesting that there's a longstanding

Re: stress test for parallel workers

2019-10-11 Thread Tom Lane
Andrew Dunstan writes: > On 10/11/19 11:45 AM, Tom Lane wrote: >> FWIW, I'm not excited about that as a permanent solution. It requires >> root privilege, and it affects the whole machine not only the buildfarm, >> and making it persist across reboots is even more invasive. > OK, but I'm not

Re: stress test for parallel workers

2019-10-11 Thread Tom Lane
I wrote: > What we've apparently got here is that signals were received > so fast that the postmaster ran out of stack space. I remember > Andres complaining about this as a theoretical threat, but I > hadn't seen it in the wild before. > I haven't finished investigating though, as there are

Re: stress test for parallel workers

2019-10-11 Thread Andrew Dunstan
On 10/11/19 11:45 AM, Tom Lane wrote: > Andrew Dunstan writes: >>> At least on F29 I have set /proc/sys/kernel/core_pattern and it works. > FWIW, I'm not excited about that as a permanent solution. It requires > root privilege, and it affects the whole machine not only the buildfarm, > and

Re: stress test for parallel workers

2019-10-11 Thread Tom Lane
Andrew Dunstan writes: >> At least on F29 I have set /proc/sys/kernel/core_pattern and it works. FWIW, I'm not excited about that as a permanent solution. It requires root privilege, and it affects the whole machine not only the buildfarm, and making it persist across reboots is even more
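For readers checking where cores end up on their own machine, here is a minimal sketch that reads the current core_pattern setting and reports whether cores go to a file name template or are piped to a helper program (a leading "|" means a pipe). The function names are illustrative, not from the thread, and reading the file needs no special privilege; only changing it requires root:

```python
def read_core_pattern(path="/proc/sys/kernel/core_pattern"):
    """Return the raw core_pattern setting (Linux only)."""
    with open(path) as f:
        return f.read()


def describe_core_pattern(pattern):
    """Classify a core_pattern value as ('pipe', command) or ('file', template)."""
    pattern = pattern.strip()
    if pattern.startswith("|"):
        # The kernel pipes the core image to this user-space helper.
        return ("pipe", pattern[1:].strip())
    # Otherwise the value is a file name template, e.g. plain "core".
    return ("file", pattern)


if __name__ == "__main__":
    print(describe_core_pattern(read_core_pattern()))
```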

Re: stress test for parallel workers

2019-10-11 Thread Andrew Dunstan
On 10/10/19 6:01 PM, Andrew Dunstan wrote: > On 10/10/19 5:34 PM, Tom Lane wrote: >> I wrote: >>>>> Yeah, I've been wondering whether pg_ctl could fork off a subprocess >>>>> that would fork the postmaster, wait for the postmaster to exit, and then >>>>> report the exit status. >>> [ pushed at

Re: stress test for parallel workers

2019-10-10 Thread Andrew Dunstan
On 10/10/19 5:34 PM, Tom Lane wrote: > I wrote: >>>> Yeah, I've been wondering whether pg_ctl could fork off a subprocess >>>> that would fork the postmaster, wait for the postmaster to exit, and then >>>> report the exit status. >> [ pushed at 6a5084eed ] >> Given wobbegong's recent failure

Re: stress test for parallel workers

2019-10-10 Thread Mark Wong
On Thu, Oct 10, 2019 at 05:34:51PM -0400, Tom Lane wrote: > A nearer-term solution would be to reproduce this manually and > dig into the core. Mark, are you in a position to give somebody > ssh access to wobbegong's host, or another similarly-configured VM? > > (While at it, it'd be nice to

Re: stress test for parallel workers

2019-10-10 Thread Tom Lane
I wrote: >>> Yeah, I've been wondering whether pg_ctl could fork off a subprocess >>> that would fork the postmaster, wait for the postmaster to exit, and then >>> report the exit status. > [ pushed at 6a5084eed ] > Given wobbegong's recent failure rate, I don't think we'll have to wait > long.

Re: stress test for parallel workers

2019-10-07 Thread Robert Haas
On Tue, Jul 23, 2019 at 7:29 PM Tom Lane wrote: > Parallel workers aren't ever allowed to write, in the current > implementation, so it's not real obvious why they'd have any > WAL log files open at all. Parallel workers are not forbidden to write WAL, nor are they forbidden to modify blocks.

Re: stress test for parallel workers

2019-10-06 Thread Tom Lane
Thomas Munro writes: > On Wed, Aug 7, 2019 at 4:29 PM Tom Lane wrote: >> Yeah, I've been wondering whether pg_ctl could fork off a subprocess >> that would fork the postmaster, wait for the postmaster to exit, and then >> report the exit status. Where to report it *to* seems like the hard part,

Re: stress test for parallel workers

2019-08-07 Thread Heikki Linnakangas
On 07/08/2019 17:45, Tom Lane wrote: > Heikki Linnakangas writes: >> On 07/08/2019 16:57, Tom Lane wrote: >>> Also, if you're using systemd or something else that thinks it >>> ought to interfere with where cores get dropped, that could be >>> a problem. >> I think they should just go to a file called "core",

Re: stress test for parallel workers

2019-08-07 Thread Tom Lane
Heikki Linnakangas writes: > On 07/08/2019 16:57, Tom Lane wrote: >> Also, if you're using systemd or something else that thinks it >> ought to interfere with where cores get dropped, that could be >> a problem. > I think they should just go to a file called "core", I don't think I've > changed

Re: stress test for parallel workers

2019-08-07 Thread Heikki Linnakangas
On 07/08/2019 16:57, Tom Lane wrote: > Heikki Linnakangas writes: >> On 07/08/2019 02:57, Thomas Munro wrote: >>> On Wed, Jul 24, 2019 at 5:15 PM Tom Lane wrote: >>>> So I think I've got to take back the assertion that we've got >>>> some lurking generic problem. This pattern looks way more >>>> like a

Re: stress test for parallel workers

2019-08-07 Thread Tom Lane
Heikki Linnakangas writes: > On 07/08/2019 02:57, Thomas Munro wrote: >> On Wed, Jul 24, 2019 at 5:15 PM Tom Lane wrote: >>> So I think I've got to take back the assertion that we've got >>> some lurking generic problem. This pattern looks way more >>> like a platform-specific issue.

Re: stress test for parallel workers

2019-08-07 Thread Heikki Linnakangas
On 07/08/2019 02:57, Thomas Munro wrote: > On Wed, Jul 24, 2019 at 5:15 PM Tom Lane wrote: >> So I think I've got to take back the assertion that we've got >> some lurking generic problem. This pattern looks way more >> like a platform-specific issue. Overaggressive OOM killer would fit the facts on

Re: stress test for parallel workers

2019-08-06 Thread Thomas Munro
On Wed, Aug 7, 2019 at 5:07 PM Tom Lane wrote: > Thomas Munro writes: > > Another question is whether the build farm should be setting the Linux > > oom score adjust thing. > > AFAIK you can't do that without being root. Rats, yeah you need CAP_SYS_RESOURCE or root to lower it. -- Thomas
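The privilege point above can be checked directly. A minimal sketch, assuming Linux and its /proc/<pid>/oom_score_adj interface; the helper names are illustrative and not from the thread. An attempt to lower the value without CAP_SYS_RESOURCE is expected to fail with a permission error, which the helper reports as False rather than raising:

```python
OOM_SCORE_ADJ_MIN = -1000
OOM_SCORE_ADJ_MAX = 1000


def read_oom_score_adj(path="/proc/self/oom_score_adj"):
    """Return the current OOM score adjustment as an integer."""
    with open(path) as f:
        return int(f.read().strip())


def try_set_oom_score_adj(value, path="/proc/self/oom_score_adj"):
    """Try to write a new adjustment; return False if the kernel refuses."""
    if not OOM_SCORE_ADJ_MIN <= value <= OOM_SCORE_ADJ_MAX:
        raise ValueError(f"oom_score_adj must be in [-1000, 1000], got {value}")
    try:
        with open(path, "w") as f:
            f.write(str(value))
    except PermissionError:
        # Typically seen when an unprivileged process tries to lower the value.
        return False
    return True


if __name__ == "__main__":
    current = read_oom_score_adj()
    print("current:", current)
    print("lowering allowed:", try_set_oom_score_adj(max(current - 1, OOM_SCORE_ADJ_MIN)))
```

The path parameter exists only so the helpers can be pointed at another process's entry (or at a scratch file when experimenting).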

Re: stress test for parallel workers

2019-08-06 Thread Tom Lane
Thomas Munro writes: > Another question is whether the build farm should be setting the Linux > oom score adjust thing. AFAIK you can't do that without being root. regards, tom lane

Re: stress test for parallel workers

2019-08-06 Thread Thomas Munro
On Wed, Aug 7, 2019 at 4:29 PM Tom Lane wrote: > Thomas Munro writes: > > I wondered if the build farm should try to report OOM kill -9 or other > > signal activity affecting the postmaster. > > Yeah, I've been wondering whether pg_ctl could fork off a subprocess > that would fork the

Re: stress test for parallel workers

2019-08-06 Thread Tom Lane
Thomas Munro writes: > I wondered if the build farm should try to report OOM kill -9 or other > signal activity affecting the postmaster. Yeah, I've been wondering whether pg_ctl could fork off a subprocess that would fork the postmaster, wait for the postmaster to exit, and then report the exit
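The idea of a wrapper that starts a process, waits for it, and reports how it ended can be sketched as follows. This is only an illustration of the technique in Python, not the pg_ctl change itself; the function names are made up for the example, and it assumes a POSIX system, where subprocess reports death by signal as a negative return code:

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a human-readable exit report."""
    if returncode < 0:
        # On POSIX, a negative return code means the child died from a signal.
        signum = -returncode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown"
        return f"killed by signal {signum} ({name})"
    return f"exited with status {returncode}"


def run_and_report(argv):
    """Start argv as a child process, wait for it, and describe its exit."""
    child = subprocess.Popen(argv)
    return describe_exit(child.wait())


if __name__ == "__main__":
    # Child that kills itself with SIGKILL, mimicking an external kill -9.
    suicide = "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"
    print(run_and_report([sys.executable, "-c", suicide]))
```

A kill -9 leaves nothing in the child's own log, so the waiting parent is the only process in a position to record it.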

Re: stress test for parallel workers

2019-08-06 Thread Thomas Munro
On Wed, Jul 24, 2019 at 5:15 PM Tom Lane wrote: > Thomas Munro writes: > > On Wed, Jul 24, 2019 at 10:11 AM Tom Lane wrote: > > Do you have an example to hand? Is this > > failure always happening on Linux? > > I dug around a bit further, and while my recollection of a lot of > "postmaster

Re: stress test for parallel workers

2019-07-23 Thread Tom Lane
Thomas Munro writes: > On Wed, Jul 24, 2019 at 10:11 AM Tom Lane wrote: >> In any case, the evidence from the buildfarm is pretty clear that >> there is *some* connection. We've seen a lot of recent failures >> involving "postmaster exited during a parallel transaction", while >> the number of

Re: stress test for parallel workers

2019-07-23 Thread Justin Pryzby
On Wed, Jul 24, 2019 at 11:32:30AM +1200, Thomas Munro wrote: > On Wed, Jul 24, 2019 at 11:04 AM Justin Pryzby wrote: > > I ought to have remembered that it *was* in fact out of space this AM when > > this > > core was dumped (due to having not touched it since scheduling transition to > > this

Re: stress test for parallel workers

2019-07-23 Thread Alvaro Herrera
On 2019-Jul-23, Justin Pryzby wrote: > I want to say I'm almost certain it wasn't ENOSPC in other cases, since, > failing to find log output, I ran df right after the failure. I'm not sure that this proves much, since I expect temporary files to be deleted on failure; by the time you run 'df'

Re: stress test for parallel workers

2019-07-23 Thread Thomas Munro
On Wed, Jul 24, 2019 at 10:11 AM Tom Lane wrote: > Thomas Munro writes: > > *I suspect that the only thing implicating parallelism in this failure > > is that parallel leaders happen to print out that message if the > > postmaster dies while they are waiting for workers; most other places > >

Re: stress test for parallel workers

2019-07-23 Thread Thomas Munro
On Wed, Jul 24, 2019 at 11:04 AM Justin Pryzby wrote: > I ought to have remembered that it *was* in fact out of space this AM when > this > core was dumped (due to having not touched it since scheduling transition to > this VM last week). > > I want to say I'm almost certain it wasn't ENOSPC in

Re: stress test for parallel workers

2019-07-23 Thread Tom Lane
Justin Pryzby writes: > I want to say I'm almost certain it wasn't ENOSPC in other cases, since, > failing to find log output, I ran df right after the failure. The fact that you're not finding log output matching what was reported to the client seems to me to be a mighty strong indication that

Re: stress test for parallel workers

2019-07-23 Thread Justin Pryzby
On Wed, Jul 24, 2019 at 10:46:42AM +1200, Thomas Munro wrote: > On Wed, Jul 24, 2019 at 10:42 AM Justin Pryzby wrote: > > On Wed, Jul 24, 2019 at 10:03:25AM +1200, Thomas Munro wrote: > > > On Wed, Jul 24, 2019 at 5:42 AM Justin Pryzby > > > wrote: > > > > #2 0x0085ddff in errfinish

Re: stress test for parallel workers

2019-07-23 Thread Thomas Munro
On Wed, Jul 24, 2019 at 10:42 AM Justin Pryzby wrote: > On Wed, Jul 24, 2019 at 10:03:25AM +1200, Thomas Munro wrote: > > On Wed, Jul 24, 2019 at 5:42 AM Justin Pryzby wrote: > > > #2 0x0085ddff in errfinish (dummy=) at > > > elog.c:555 > > > edata = > > > > If you have that

Re: stress test for parallel workers

2019-07-23 Thread Justin Pryzby
On Wed, Jul 24, 2019 at 10:03:25AM +1200, Thomas Munro wrote: > On Wed, Jul 24, 2019 at 5:42 AM Justin Pryzby wrote: > > #2 0x0085ddff in errfinish (dummy=) at > > elog.c:555 > > edata = > > If you have that core, it might be interesting to go to frame 2 and > print *edata or

Re: stress test for parallel workers

2019-07-23 Thread Thomas Munro
On Wed, Jul 24, 2019 at 10:03 AM Thomas Munro wrote: > > edata = > If you have that core, it might be interesting to go to frame 2 and > print *edata or edata->saved_errno. ... Rats. We already saw that it's optimised out so unless we can find that somewhere else in a variable that's

Re: stress test for parallel workers

2019-07-23 Thread Tom Lane
Thomas Munro writes: > *I suspect that the only thing implicating parallelism in this failure > is that parallel leaders happen to print out that message if the > postmaster dies while they are waiting for workers; most other places > (probably every other backend in your cluster) just quietly

Re: stress test for parallel workers

2019-07-23 Thread Thomas Munro
On Wed, Jul 24, 2019 at 5:42 AM Justin Pryzby wrote: > #2 0x0085ddff in errfinish (dummy=) at > elog.c:555 > edata = > elevel = 22 > oldcontext = 0x27e15d0 > econtext = 0x0 > __func__ = "errfinish" > #3 0x006f7e94 in

Re: stress test for parallel workers

2019-07-23 Thread Thomas Munro
On Wed, Jul 24, 2019 at 4:27 AM Justin Pryzby wrote: > < 2019-07-23 10:33:51.552 CDT postgres >FATAL: postmaster exited during a > parallel transaction > < 2019-07-23 10:33:51.552 CDT postgres >STATEMENT: CREATE UNIQUE INDEX > unused0_huawei_umts_nodeb_locell_201907_unique_idx ON >

Re: stress test for parallel workers

2019-07-23 Thread Justin Pryzby
On Tue, Jul 23, 2019 at 01:28:47PM -0400, Tom Lane wrote: > ... you'd think an OOM kill would show up in the kernel log. > (Not necessarily in dmesg, though. Did you try syslog?) Nothing in /var/log/messages (nor dmesg ring). I enabled abrtd while trying to reproduce it last week. Since you

Re: stress test for parallel workers

2019-07-23 Thread Tom Lane
Justin Pryzby writes: > Does anyone have a stress test for parallel workers ? > On a customer's new VM, I got this several times while (trying to) migrate > their DB: > < 2019-07-23 10:33:51.552 CDT postgres >FATAL: postmaster exited during a > parallel transaction We've been seeing this