I wrote:
> For our archives' sake: today I got seemingly-automated mail informing me
> that this patch has been merged into the 4.19-stable, 5.4-stable,
> 5.7-stable, and 5.8-stable kernel branches; but not 4.4-stable,
> 4.9-stable, or 4.14-stable, because it failed to apply.
And this morning's
On Tue, Jul 28, 2020 at 3:27 PM Tom Lane wrote:
> Anyway, I guess the interesting question for us is how long it
> will take for this fix to propagate into real-world systems.
> I don't have much of a clue about the Linux kernel workflow,
> anybody want to venture a guess?
Me neither. It just
Thomas Munro writes:
> Hehe, the dodgy looking magic numbers *were* wrong:
> - * The kernel signal delivery code writes up to about 1.5kB
> + * The kernel signal delivery code writes a bit over 4KB
>
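The corrected comment above says a kernel signal frame needs "a bit over 4KB" of stack, so any process recursing on its main stack has to stop short of RLIMIT_STACK by at least that much. A minimal Python sketch of that guard-margin idea (not PostgreSQL's actual code; the slop constant and the 8MB fallback are assumptions):

```python
import resource

# Assumed safety margin for a kernel signal frame; PostgreSQL's real
# value differs and is platform-specific.
SIGNAL_FRAME_SLOP = 64 * 1024

def usable_stack_bytes(default=8 * 1024 * 1024):
    # Ask the kernel how much stack it promised us, then reserve
    # headroom so a signal delivered near the limit can't overflow.
    soft, _hard = resource.getrlimit(resource.RLIMIT_STACK)
    if soft == resource.RLIM_INFINITY:
        soft = default  # assume a common 8MB default when unlimited
    return max(soft - SIGNAL_FRAME_SLOP, 0)
```

The bug discussed in this thread was precisely that the kernel sometimes failed to honor even this promise while expanding the stack.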
On Tue, Oct 15, 2019 at 4:50 AM Tom Lane wrote:
> > Filed at
> > https://bugzilla.kernel.org/show_bug.cgi?id=205183
For the curious-and-not-subscribed, there's now a kernel patch
proposed for this. We guessed pretty close, but the problem wasn't
those dodgy looking magic numbers, it was that
database system is ready to accept connections
I wrote:
> Filed at
> https://bugzilla.kernel.org/show_bug.cgi?id=205183
> We'll see what happens ...
Further to this --- I went back and looked at the outlier events
where we saw an infinite_recurse failure on a non-Linux-PPC64
platform. There were only three:
mereswine | ARMv7
Hi,
On 2019-10-13 13:44:59 +1300, Thomas Munro wrote:
> On Sun, Oct 13, 2019 at 1:06 PM Tom Lane wrote:
> > I don't think any further proof is required that this is
> > a kernel bug. Where would be a good place to file it?
>
> linuxppc-...@lists.ozlabs.org might be the right place.
>
Probably requires reproducing on a pretty recent kernel first, to have a
decent chance of being investigated...
Andres Freund writes:
> Probably requires reproducing on a pretty recent kernel first, to have a
> decent chance of being investigated...
How recent do you think it needs to be? The machine I was testing on
yesterday is under a year old:
uname -m = ppc64le
uname -r = 4.18.19-100.fc27.ppc64le
Filed at
https://bugzilla.kernel.org/show_bug.cgi?id=205183
We'll see what happens ...
regards, tom lane
Andres Freund writes:
> On 2019-10-13 10:29:45 -0400, Tom Lane wrote:
>> How recent do you think it needs to be?
> My experience reporting kernel bugs is that the latest released version,
> or even just the tip of the git tree, is your best bet :/.
Considering that we're going to point them at
Hi,
On 2019-10-13 10:29:45 -0400, Tom Lane wrote:
> Andres Freund writes:
> > Probably requires reproducing on a pretty recent kernel first, to have a
> > decent chance of being investigated...
>
> How recent do you think it needs to be? The machine I was testing on
> yesterday is under a year old:
My experience reporting kernel bugs is that the latest released version,
or even just the tip of the git tree, is your best bet :/.
On Sun, Oct 13, 2019 at 1:06 PM Tom Lane wrote:
> I don't think any further proof is required that this is
> a kernel bug. Where would be a good place to file it?
linuxppc-...@lists.ozlabs.org might be the right place.
https://lists.ozlabs.org/listinfo/linuxppc-dev
I've now also been able to reproduce the "infinite_recurse" segfault
on wobbegong's host (or, since I was using a gcc build, I guess I
should say vulpes' host). The first-order result is that it's the
same problem with the kernel not giving us as much stack space as
we expect: there's only
I wrote:
> It's not very clear how those things would lead to an intermittent
> failure though. In the case of the postmaster crashes, we now see
> that timing of signal receipts is relevant. For infinite_recurse,
> maybe it only fails if an sinval interrupt happens at the wrong time?
> (This
Andres Freund writes:
> On 2019-10-11 14:56:41 -0400, Tom Lane wrote:
>> ... So it's really hard to explain
>> that as anything except a kernel bug: sometimes, the kernel
>> doesn't give us as much stack as it promised it would. And the
>> machine is not loaded enough for there to be any
Hi,
On 2019-10-11 14:56:41 -0400, Tom Lane wrote:
> I still don't have a good explanation for why this only seems to
> happen in the pg_upgrade test sequence. However, I did notice
> something very interesting: the postmaster crashes after consuming
> only about 1MB of stack space. This is
Thomas Munro writes:
> Yeah, I don't know anything about this stuff, but I was also beginning
> to wonder if something is busted in the arch-specific fault.c code
> that checks if stack expansion is valid[1], in a way that fails with a
> rapidly growing stack, well timed incoming signals, and
On Sat, Oct 12, 2019 at 7:56 AM Tom Lane wrote:
> This matches up with the intermittent infinite_recurse failures
> we've been seeing in the buildfarm. Those are happening across
> a range of systems, but they're (almost) all Linux-based ppc64,
> suggesting that there's a longstanding
Andrew Dunstan writes:
> On 10/11/19 11:45 AM, Tom Lane wrote:
>> FWIW, I'm not excited about that as a permanent solution. It requires
>> root privilege, and it affects the whole machine not only the buildfarm,
>> and making it persist across reboots is even more invasive.
> OK, but I'm not
I wrote:
> What we've apparently got here is that signals were received
> so fast that the postmaster ran out of stack space. I remember
> Andres complaining about this as a theoretical threat, but I
> hadn't seen it in the wild before.
> I haven't finished investigating though, as there are
Andrew Dunstan writes:
>> At least on F29 I have set /proc/sys/kernel/core_pattern and it works.
FWIW, I'm not excited about that as a permanent solution. It requires
root privilege, and it affects the whole machine not only the buildfarm,
and making it persist across reboots is even more
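The systemd interference mentioned earlier in the thread is visible without root: whether cores land in a plain file or get piped to a handler such as systemd-coredump or abrt is recorded in /proc/sys/kernel/core_pattern. A small illustrative sketch (function name is made up):

```python
from pathlib import Path

def core_pattern():
    # Read the kernel's core-dump disposition; a leading "|" means
    # cores are piped to a handler (e.g. systemd-coredump) instead
    # of being written to a file like "core".
    p = Path("/proc/sys/kernel/core_pattern")
    if not p.exists():  # not Linux
        return None
    pattern = p.read_text().strip()
    return {"pattern": pattern, "piped": pattern.startswith("|")}
```

Reading costs nothing, so a buildfarm script could at least report where cores are expected to go, even though changing the setting still needs root.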
On 10/10/19 5:34 PM, Tom Lane wrote:
> I wrote:
>>>> Yeah, I've been wondering whether pg_ctl could fork off a subprocess
>>>> that would fork the postmaster, wait for the postmaster to exit, and then
>>>> report the exit status.
>> [ pushed at 6a5084eed ]
>> Given wobbegong's recent failure
On Thu, Oct 10, 2019 at 05:34:51PM -0400, Tom Lane wrote:
> A nearer-term solution would be to reproduce this manually and
> dig into the core. Mark, are you in a position to give somebody
> ssh access to wobbegong's host, or another similarly-configured VM?
>
> (While at it, it'd be nice to
I wrote:
>>> Yeah, I've been wondering whether pg_ctl could fork off a subprocess
>>> that would fork the postmaster, wait for the postmaster to exit, and then
>>> report the exit status.
> [ pushed at 6a5084eed ]
> Given wobbegong's recent failure rate, I don't think we'll have to wait
> long.
On Tue, Jul 23, 2019 at 7:29 PM Tom Lane wrote:
> Parallel workers aren't ever allowed to write, in the current
> implementation, so it's not real obvious why they'd have any
> WAL log files open at all.
Parallel workers are not forbidden to write WAL, nor are they
forbidden to modify blocks.
Thomas Munro writes:
> On Wed, Aug 7, 2019 at 4:29 PM Tom Lane wrote:
>> Yeah, I've been wondering whether pg_ctl could fork off a subprocess
>> that would fork the postmaster, wait for the postmaster to exit, and then
>> report the exit status. Where to report it *to* seems like the hard part,
On 07/08/2019 17:45, Tom Lane wrote:
> Heikki Linnakangas writes:
>> On 07/08/2019 16:57, Tom Lane wrote:
>>> Also, if you're using systemd or something else that thinks it
>>> ought to interfere with where cores get dropped, that could be
>>> a problem.
>> I think they should just go to a file called "core",
Heikki Linnakangas writes:
> On 07/08/2019 16:57, Tom Lane wrote:
>> Also, if you're using systemd or something else that thinks it
>> ought to interfere with where cores get dropped, that could be
>> a problem.
> I think they should just go to a file called "core", I don't think I've
> changed
On 07/08/2019 16:57, Tom Lane wrote:
> Heikki Linnakangas writes:
>> On 07/08/2019 02:57, Thomas Munro wrote:
>>> On Wed, Jul 24, 2019 at 5:15 PM Tom Lane wrote:
>>>> So I think I've got to take back the assertion that we've got
>>>> some lurking generic problem. This pattern looks way more
>>>> like a
Heikki Linnakangas writes:
> On 07/08/2019 02:57, Thomas Munro wrote:
>> On Wed, Jul 24, 2019 at 5:15 PM Tom Lane wrote:
>>> So I think I've got to take back the assertion that we've got
>>> some lurking generic problem. This pattern looks way more
>>> like a platform-specific issue.
On 07/08/2019 02:57, Thomas Munro wrote:
> On Wed, Jul 24, 2019 at 5:15 PM Tom Lane wrote:
>> So I think I've got to take back the assertion that we've got
>> some lurking generic problem. This pattern looks way more
>> like a platform-specific issue. Overaggressive OOM killer
>> would fit the facts on
On Wed, Aug 7, 2019 at 5:07 PM Tom Lane wrote:
> Thomas Munro writes:
> > Another question is whether the build farm should be setting the Linux
> > oom score adjust thing.
>
> AFAIK you can't do that without being root.
Rats, yeah you need CAP_SYS_RESOURCE or root to lower it.
--
Thomas
Thomas Munro writes:
> Another question is whether the build farm should be setting the Linux
> oom score adjust thing.
AFAIK you can't do that without being root.
regards, tom lane
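The privilege asymmetry discussed here is easy to demonstrate: any process may raise its own OOM score, but lowering it needs CAP_SYS_RESOURCE (in practice, root). A minimal sketch, assuming Linux; the function name is made up for illustration:

```python
from pathlib import Path

OOM_SCORE_ADJ = Path("/proc/self/oom_score_adj")

def try_set_oom_score_adj(value):
    # Writing a lower value than the current one fails with EPERM
    # unless the process has CAP_SYS_RESOURCE; raising it always works.
    # Returns False on non-Linux systems too.
    try:
        OOM_SCORE_ADJ.write_text(f"{value}\n")
        return True
    except (PermissionError, FileNotFoundError):
        return False
```

So an unprivileged buildfarm member cannot protect its postmaster from the OOM killer this way; only a root-owned wrapper or systemd unit setting (e.g. OOMScoreAdjust) can.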
Thomas Munro writes:
> I wondered if the build farm should try to report OOM kill -9 or other
> signal activity affecting the postmaster.
Yeah, I've been wondering whether pg_ctl could fork off a subprocess
that would fork the postmaster, wait for the postmaster to exit, and then
report the exit
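The pg_ctl idea sketched above, a watcher that forks the real child, waits for it, and reports how it died, looks roughly like this in Python (names are invented for illustration; pg_ctl's real implementation is in C and differs in detail):

```python
import os
import signal

def run_and_report(child_main):
    # Fork a child to run the payload; the parent stays behind
    # purely to observe and report the child's fate.
    pid = os.fork()
    if pid == 0:
        child_main()
        os._exit(0)  # child never returns to the caller
    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status):
        return f"child killed by signal {os.WTERMSIG(status)}"
    return f"child exited with status {os.WEXITSTATUS(status)}"
```

With this in place, a postmaster SIGKILLed by the OOM killer would be reported explicitly instead of simply vanishing, which is exactly the buildfarm diagnostic gap being discussed.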
On Wed, Jul 24, 2019 at 5:15 PM Tom Lane wrote:
> Thomas Munro writes:
> > On Wed, Jul 24, 2019 at 10:11 AM Tom Lane wrote:
> > Do you have an example to hand? Is this
> > failure always happening on Linux?
>
> I dug around a bit further, and while my recollection of a lot of
> "postmaster
Thomas Munro writes:
> On Wed, Jul 24, 2019 at 10:11 AM Tom Lane wrote:
>> In any case, the evidence from the buildfarm is pretty clear that
>> there is *some* connection. We've seen a lot of recent failures
>> involving "postmaster exited during a parallel transaction", while
>> the number of
On Wed, Jul 24, 2019 at 11:32:30AM +1200, Thomas Munro wrote:
> On Wed, Jul 24, 2019 at 11:04 AM Justin Pryzby wrote:
> > I ought to have remembered that it *was* in fact out of space this AM when this
> > core was dumped (due to having not touched it since scheduling transition to this VM last week).
On 2019-Jul-23, Justin Pryzby wrote:
> I want to say I'm almost certain it wasn't ENOSPC in other cases, since,
> failing to find log output, I ran df right after the failure.
I'm not sure that this proves much, since I expect temporary files to be
deleted on failure; by the time you run 'df'
On Wed, Jul 24, 2019 at 11:04 AM Justin Pryzby wrote:
> I ought to have remembered that it *was* in fact out of space this AM when this
> core was dumped (due to having not touched it since scheduling transition to this VM last week).
>
> I want to say I'm almost certain it wasn't ENOSPC in
Justin Pryzby writes:
> I want to say I'm almost certain it wasn't ENOSPC in other cases, since,
> failing to find log output, I ran df right after the failure.
The fact that you're not finding log output matching what was reported
to the client seems to me to be a mighty strong indication that
On Wed, Jul 24, 2019 at 10:42 AM Justin Pryzby wrote:
> On Wed, Jul 24, 2019 at 10:03:25AM +1200, Thomas Munro wrote:
> > On Wed, Jul 24, 2019 at 5:42 AM Justin Pryzby wrote:
> > > #2 0x0085ddff in errfinish (dummy=<value optimized out>) at
> > > elog.c:555
> > > edata = <value optimized out>
> >
> > If you have that
On Wed, Jul 24, 2019 at 10:03 AM Thomas Munro wrote:
> > edata = <value optimized out>
> If you have that core, it might be interesting to go to frame 2 and
> print *edata or edata->saved_errno. ...
Rats. We already saw that it's optimised out so unless we can find
that somewhere else in a variable that's
Thomas Munro writes:
> *I suspect that the only thing implicating parallelism in this failure
> is that parallel leaders happen to print out that message if the
> postmaster dies while they are waiting for workers; most other places
> (probably every other backend in your cluster) just quietly
On Wed, Jul 24, 2019 at 5:42 AM Justin Pryzby wrote:
> #2 0x0085ddff in errfinish (dummy=<value optimized out>) at
> elog.c:555
> edata = <value optimized out>
> elevel = 22
> oldcontext = 0x27e15d0
> econtext = 0x0
> __func__ = "errfinish"
> #3 0x006f7e94 in
On Wed, Jul 24, 2019 at 4:27 AM Justin Pryzby wrote:
> < 2019-07-23 10:33:51.552 CDT postgres >FATAL: postmaster exited during a
> parallel transaction
> < 2019-07-23 10:33:51.552 CDT postgres >STATEMENT: CREATE UNIQUE INDEX
> unused0_huawei_umts_nodeb_locell_201907_unique_idx ON
>
On Tue, Jul 23, 2019 at 01:28:47PM -0400, Tom Lane wrote:
> ... you'd think an OOM kill would show up in the kernel log.
> (Not necessarily in dmesg, though. Did you try syslog?)
Nothing in /var/log/messages (nor dmesg ring).
I enabled abrtd while trying to reproduce it last week. Since you
Justin Pryzby writes:
> Does anyone have a stress test for parallel workers ?
> On a customer's new VM, I got this several times while (trying to) migrate
> their DB:
> < 2019-07-23 10:33:51.552 CDT postgres >FATAL: postmaster exited during a
> parallel transaction
Does anyone have a stress test for parallel workers ?
On a customer's new VM, I got this several times while (trying to) migrate
their DB:
< 2019-07-23 10:33:51.552 CDT postgres >FATAL: postmaster exited during a
parallel transaction
< 2019-07-23 10:33:51.552 CDT postgres >STATEMENT: