Re: Non-reproducible AIO failure

2025-06-19 Thread Nico Williams
On Thu, Jun 19, 2025 at 05:05:25PM +0200, Daniel Gustafsson wrote: > I also dug out an archeologically old MacBook Pro running macOS High Sierra > 10.13.6 with an i5 using Apple LLVM version 10.0.0 (clang-1000.10.44.4), and > it > too fails to reproduce any issue. It's not going to be reproducibl

Re: Non-reproducible AIO failure

2025-06-19 Thread Daniel Gustafsson
> On 19 Jun 2025, at 16:36, Andres Freund wrote: > So for some reason this apparently can only be reproduced on older macos - we > know it's not the older compiler, because I couldn't reproduce it on the same > compile version as alexander, on an m1 that was running sequoia. That's really > reall

Re: Non-reproducible AIO failure

2025-06-19 Thread Andres Freund
Hi, On 2025-06-19 17:02:18 +0300, Konstantin Knizhnik wrote: > On 18/06/2025 7:08 pm, Andres Freund wrote: > > Hi, > > > > On 2025-06-18 10:32:08 +0300, Konstantin Knizhnik wrote: > > > On 17/06/2025 6:08 pm, Andres Freund wrote: > > > > I don't think it can - this must be an independent bug from

Re: Non-reproducible AIO failure

2025-06-19 Thread Daniel Gustafsson
> On 19 Jun 2025, at 16:02, Konstantin Knizhnik wrote: > By the way - still not been able to reproduce assertion failure at most > recent MacPro (Apple M4 Pro) with Sequoia 15.5. I tried to reproduce this on an older quad core i7 MacBook Pro running Sonoma 14.7.5 using Apple clang version 15.0.

Re: Non-reproducible AIO failure

2025-06-19 Thread Konstantin Knizhnik
On 18/06/2025 7:08 pm, Andres Freund wrote: Hi, On 2025-06-18 10:32:08 +0300, Konstantin Knizhnik wrote: On 17/06/2025 6:08 pm, Andres Freund wrote: I don't think it can - this must be an independent bug from the one that Tom and I were encountering. I see... It's a pity. Indeed. Konstant

Re: Non-reproducible AIO failure

2025-06-18 Thread Thomas Munro
On Thu, Jun 19, 2025 at 4:08 AM Andres Freund wrote: > Konstantin, Alexander, can you share what commit you're testing and what > precise changes have been applied to the source? I've now tested this on a > significant number of apple machines for many many days without being able to > reproduce

Re: Non-reproducible AIO failure

2025-06-18 Thread Andres Freund
Hi, On 2025-06-18 10:32:08 +0300, Konstantin Knizhnik wrote: > On 17/06/2025 6:08 pm, Andres Freund wrote: > > > > I don't think it can - this must be an independent bug from the one that Tom > > and I were encountering. > I see... It's a pity. Indeed. Konstantin, Alexander, can you share what

Re: Non-reproducible AIO failure

2025-06-18 Thread Konstantin Knizhnik
On 17/06/2025 6:08 pm, Andres Freund wrote: I don't think it can - this must be an independent bug from the one that Tom and I were encountering. I see... It's a pity. By the way, I have a question concerning the use of interrupts in AIO. The comments say: pgaio_io_release(PgAioHandle *ioh)

Re: Non-reproducible AIO failure

2025-06-17 Thread Andres Freund
On 2025-06-17 18:08:30 +0300, Konstantin Knizhnik wrote: > > On 17/06/2025 4:47 pm, Andres Freund wrote: > > > I and Alexandr are using completely different devices with different > > > hardware, OS and clang version. > > Both of you are running Ventura, right? > > > No, Alexandr is using darwin2

Re: Non-reproducible AIO failure

2025-06-17 Thread Andres Freund
On 2025-06-17 17:54:12 +0300, Konstantin Knizhnik wrote: > > On 12/06/2025 4:57 pm, Andres Freund wrote: > > The problem appears to be in that switch between "when submitted, by the IO > > worker" and "then again by the backend". It's not concurrent access in the > > sense of two processes writin

Re: Non-reproducible AIO failure

2025-06-17 Thread Konstantin Knizhnik
On 17/06/2025 4:47 pm, Andres Freund wrote: I and Alexandr are using completely different devices with different hardware, OS and clang version. Both of you are running Ventura, right? No, Alexandr is using darwin23.5 Alexandr also noticed that he can reproduce the problem only with --with-l

Re: Non-reproducible AIO failure

2025-06-17 Thread Konstantin Knizhnik
On 12/06/2025 4:57 pm, Andres Freund wrote: The problem appears to be in that switch between "when submitted, by the IO worker" and "then again by the backend". It's not concurrent access in the sense of two processes writing to the same value, it's that when switching from the worker updating

Re: Non-reproducible AIO failure

2025-06-17 Thread Tom Lane
Andres Freund writes: > Both of you are running Ventura, right? FTR, the machines I'm trying this on are all running current Sequoia: [tgl@minim4 ~]$ uname -a Darwin minim4.sss.pgh.pa.us 24.5.0 Darwin Kernel Version 24.5.0: Tue Apr 22 19:53:27 PDT 2025; root:xnu-11417.121.6~2/RELEASE_ARM64_T604

Re: Non-reproducible AIO failure

2025-06-17 Thread Andres Freund
Hi, On 2025-06-17 16:43:05 +0300, Konstantin Knizhnik wrote: > On 17/06/2025 4:35 pm, Andres Freund wrote: > > Konstantin, Alexander - are you using the same device to reproduce this or > > different ones? I wonder if this somehow depends on some MDM / corporate > > enforcement tooling running or

Re: Non-reproducible AIO failure

2025-06-17 Thread Konstantin Knizhnik
On 17/06/2025 4:35 pm, Andres Freund wrote: Konstantin, Alexander - are you using the same device to reproduce this or different ones? I wonder if this somehow depends on some MDM / corporate enforcement tooling running or such. What does: - profiles status -type enrollment - kextstat -l show?

Re: Non-reproducible AIO failure

2025-06-17 Thread Andres Freund
Hi, On 2025-06-16 20:22:00 -0400, Tom Lane wrote: > Konstantin Knizhnik writes: > > On 16/06/2025 6:11 pm, Andres Freund wrote: > >> I unfortunately can't repro this issue so far. > > > But unfortunately it means that the problem is not fixed. > > FWIW, I get similar results to Andres' on a Mac M

Re: Non-reproducible AIO failure

2025-06-17 Thread Konstantin Knizhnik
On 17/06/2025 3:22 am, Tom Lane wrote: Konstantin Knizhnik writes: On 16/06/2025 6:11 pm, Andres Freund wrote: I unfortunately can't repro this issue so far. But unfortunately it means that the problem is not fixed. FWIW, I get similar results to Andres' on a Mac Mini M4 Pro using MacPorts'

Re: Non-reproducible AIO failure

2025-06-16 Thread Tom Lane
Konstantin Knizhnik writes: > On 16/06/2025 6:11 pm, Andres Freund wrote: >> I unfortunately can't repro this issue so far. > But unfortunately it means that the problem is not fixed. FWIW, I get similar results to Andres' on a Mac Mini M4 Pro using MacPorts' current compiler release (clang vers

Re: Non-reproducible AIO failure

2025-06-16 Thread Konstantin Knizhnik
On 16/06/2025 6:11 pm, Andres Freund wrote: Hi, On 2025-06-16 14:11:39 +0300, Konstantin Knizhnik wrote: One more update: with the proposed patch (memory barrier before `ConditionVariableBroadcast` in `pgaio_io_process_completion` I don't see how that barrier could be required for correctness

Re: Non-reproducible AIO failure

2025-06-16 Thread Andres Freund
Hi, On 2025-06-16 14:11:39 +0300, Konstantin Knizhnik wrote: > One more update: with the proposed patch (memory barrier before > `ConditionVariableBroadcast` in `pgaio_io_process_completion` I don't see how that barrier could be required for correctness - ConditionVariableBroadcast() is a barrier

Re: Non-reproducible AIO failure

2025-06-16 Thread Konstantin Knizhnik
One more update: with the proposed patch (memory barrier before `ConditionVariableBroadcast` in `pgaio_io_process_completion` and replacing bit fields with `uint8`) the problem is not reproduced on my system during 5 seconds.

Re: Non-reproducible AIO failure

2025-06-15 Thread Konstantin Knizhnik
With these two additional changes: diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c index 6c6c0a908e2..6dd2816bea9 100644 --- a/src/backend/storage/aio/aio.c +++ b/src/backend/storage/aio/aio.c @@ -538,6 +538,9 @@ pgaio_io_process_completion(PgAioHandle *ioh, int result)

Re: Non-reproducible AIO failure

2025-06-15 Thread Konstantin Knizhnik
On 13/06/2025 11:20 pm, Andres Freund wrote: Attached is a patch that fixes the problem for me. Alexander, Konstantin, could you verify that it also fixes the problem for you? Given that it does address the problem for me, I'm inclined to push this fairly soon, the barrier is pretty obviously r

Re: Non-reproducible AIO failure

2025-06-14 Thread Konstantin Knizhnik
On 13/06/2025 11:20 pm, Andres Freund wrote: Hi, On 2025-06-12 12:23:13 -0400, Andres Freund wrote: On 2025-06-12 11:52:31 -0400, Andres Freund wrote: On 2025-06-12 17:22:22 +0300, Konstantin Knizhnik wrote: On 12/06/2025 4:57 pm, Andres Freund wrote: The problem appears to be in that swit

Re: Non-reproducible AIO failure

2025-06-13 Thread Andres Freund
Hi, On 2025-06-12 12:23:13 -0400, Andres Freund wrote: > On 2025-06-12 11:52:31 -0400, Andres Freund wrote: > > On 2025-06-12 17:22:22 +0300, Konstantin Knizhnik wrote: > > > On 12/06/2025 4:57 pm, Andres Freund wrote: > > > > The problem appears to be in that switch between "when submitted, by >

Re: Non-reproducible AIO failure

2025-06-12 Thread Andres Freund
Hi, On 2025-06-12 11:52:31 -0400, Andres Freund wrote: > On 2025-06-12 17:22:22 +0300, Konstantin Knizhnik wrote: > > On 12/06/2025 4:57 pm, Andres Freund wrote: > > > The problem appears to be in that switch between "when submitted, by the > > > IO > > > worker" and "then again by the backend".

Re: Non-reproducible AIO failure

2025-06-12 Thread Andres Freund
Hi, On 2025-06-12 17:22:22 +0300, Konstantin Knizhnik wrote: > On 12/06/2025 4:57 pm, Andres Freund wrote: > > The problem appears to be in that switch between "when submitted, by the IO > > worker" and "then again by the backend". It's not concurrent access in the > > sense of two processes writ

Re: Non-reproducible AIO failure

2025-06-12 Thread Konstantin Knizhnik
On 12/06/2025 4:57 pm, Andres Freund wrote: The problem appears to be in that switch between "when submitted, by the IO worker" and "then again by the backend". It's not concurrent access in the sense of two processes writing to the same value, it's that when switching from the worker updating

Re: Non-reproducible AIO failure

2025-06-12 Thread Andres Freund
Hi, On 2025-06-12 16:30:54 +0300, Konstantin Knizhnik wrote: > On 12/06/2025 4:13 pm, Andres Freund wrote: > > On 2025-06-12 15:12:00 +0300, Konstantin Knizhnik wrote: > > I'm reasonably certain I found the issue, I think it's a missing memory > > barrier on the read side. The CPU is reordering th
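The read-side barrier mentioned in the message above can be illustrated with a minimal sketch. It uses C11 atomics as a stand-in; PostgreSQL itself has its own primitives (`pg_write_barrier()` / `pg_read_barrier()`), and the `DemoIo` type and both function names here are hypothetical, not taken from the AIO code. The point is only the pairing: the completer publishes the payload before a release store of the state, and the reader must perform an acquire load of the state before touching the payload, otherwise a weakly ordered CPU may satisfy the payload read early.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical handle, NOT PgAioHandle: a payload plus a state flag. */
typedef struct DemoIo
{
    int        result;  /* payload, ordinary memory */
    atomic_int state;   /* 0 = in flight, 1 = completed */
} DemoIo;

/* Completer side: write the payload, then publish the state change. */
void
demo_io_complete(DemoIo *io, int result)
{
    io->result = result;
    /* release: the payload store above cannot be reordered after this store */
    atomic_store_explicit(&io->state, 1, memory_order_release);
}

/* Reader side: check the state first, read the payload only afterwards. */
bool
demo_io_try_read(DemoIo *io, int *result)
{
    /* acquire: the payload load below cannot be reordered before this load */
    if (atomic_load_explicit(&io->state, memory_order_acquire) != 1)
        return false;
    *result = io->result;
    return true;
}
```

Called from a single thread this only exercises the interface; the ordering guarantee matters when `demo_io_complete()` and `demo_io_try_read()` run in different threads or processes sharing the handle.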

Re: Non-reproducible AIO failure

2025-06-12 Thread Konstantin Knizhnik
On 12/06/2025 4:13 pm, Andres Freund wrote: Hi, On 2025-06-12 15:12:00 +0300, Konstantin Knizhnik wrote: Reproduced it once again with write-protected io handle. But once again - no access violation, just assert failure. Previously "op" field was overwritten somewhere between `pgaio_io_

Re: Non-reproducible AIO failure

2025-06-12 Thread Andres Freund
Hi, On 2025-06-12 15:12:00 +0300, Konstantin Knizhnik wrote: > Reproduced it once again with write-protected io handle. > But once again - no access violation, just assert failure. > > Previously "op" field was overwritten somewhere between `pgaio_io_reclaim` > and `AsyncReadBuffers`: > > !

Re: Non-reproducible AIO failure

2025-06-12 Thread Konstantin Knizhnik
Reproduced it once again with write-protected io handle. But once again - no access violation, just assert failure. Previously "op" field was overwritten somewhere between `pgaio_io_reclaim` and `AsyncReadBuffers`: !!!pgaio_io_reclaim [20376]| ioh: 0x1019bc000, ioh->op: 0, ioh->generatio

Re: Non-reproducible AIO failure

2025-06-11 Thread Konstantin Knizhnik
I tried to catch the moment when the memory is changed using mprotect. I aligned PgAioHandle on a page boundary (16 kB on macOS) and disabled writes in `pgaio_io_reclaim`: ``` static void pgaio_io_reclaim(PgAioHandle *ioh) { RESUME_INTERRUPTS(); rc = mprotect(ioh, sizeof(*ioh), PROT_READ);

Re: Non-reproducible AIO failure

2025-06-10 Thread Andres Freund
Hi, On 2025-06-10 21:09:18 +0300, Konstantin Knizhnik wrote: > > On 10/06/2025 8:41 pm, Andres Freund wrote: > > I was able to reproduce it with gcc, too. > > I've reproduced it without that bitfield, unfortunately :(. > But also only at MacOS? Correct. > I wonder if it is possible to set har

Re: Non-reproducible AIO failure

2025-06-10 Thread Konstantin Knizhnik
On 10/06/2025 8:41 pm, Andres Freund wrote: I was able to reproduce it with gcc, too. I've reproduced it without that bitfield, unfortunately :(. But also only at MacOS? I wonder if it is possible to set hardware watchpoint from the program itself (not using gdb)? I.e. using ptrace? Looks lik

Re: Non-reproducible AIO failure

2025-06-10 Thread Andres Freund
Hi, On 2025-06-10 17:28:11 +0300, Konstantin Knizhnik wrote: > On 09/06/2025 2:05 am, Thomas Munro wrote: > > On Sat, Jun 7, 2025 at 6:47 AM Andres Freund wrote: > > > On 2025-06-06 14:03:12 +0300, Konstantin Knizhnik wrote: > > > > There is really essential difference in code generated by clang

Re: Non-reproducible AIO failure

2025-06-10 Thread Konstantin Knizhnik
On 09/06/2025 2:05 am, Thomas Munro wrote: On Sat, Jun 7, 2025 at 6:47 AM Andres Freund wrote: On 2025-06-06 14:03:12 +0300, Konstantin Knizhnik wrote: There is really essential difference in code generated by clang 15 (working) and 16 (not working). There also are code gen differences betw

Re: Non-reproducible AIO failure

2025-06-08 Thread Thomas Munro
On Sat, Jun 7, 2025 at 6:47 AM Andres Freund wrote: > On 2025-06-06 14:03:12 +0300, Konstantin Knizhnik wrote: > > There is really essential difference in code generated by clang 15 (working) > > and 16 (not working). > > There also are code gen differences between upstream clang 17 and apple's >

Re: Non-reproducible AIO failure

2025-06-08 Thread Tom Lane
Andres Freund writes: > The symptoms I can reproduce are slightly different than Alexander's - it's > the assertion failure reported upthread by Tom. > > FWIW, I can continue to repro the assertion after removing the use of the > bitfield in PgAioHandle. So the problem indeed seems to be indepe

Re: Non-reproducible AIO failure

2025-06-08 Thread Andres Freund
Hi, On 2025-06-06 15:37:45 -0400, Andres Freund wrote: > There shouldn't be any concurrent accesses here, so I don't really see how the > above would explain the problem (the IO can only ever be modified by one > backend, initially the "owning backend", then, when submitted, by the IO > worker, an

Re: Non-reproducible AIO failure

2025-06-07 Thread Konstantin Knizhnik
On 06/06/2025 10:21 pm, Tom Lane wrote: Konstantin Knizhnik writes: There is really essential difference in code generated by clang 15 (working) and 16 (not working). It's a mistake to think that this is a compiler bug. The C standard explicitly allows compilers to use word-wide operations

Re: Non-reproducible AIO failure

2025-06-06 Thread Konstantin Knizhnik
On 06/06/2025 9:47 pm, Andres Freund wrote: Hi, On 2025-06-06 14:03:12 +0300, Konstantin Knizhnik wrote: There is really essential difference in code generated by clang 15 (working) and 16 (not working). There also are code gen differences between upstream clang 17 and apple's clang, which is

Re: Non-reproducible AIO failure

2025-06-06 Thread Nico Williams
On Fri, Jun 06, 2025 at 03:37:45PM -0400, Andres Freund wrote: > On 2025-06-06 15:21:13 -0400, Tom Lane wrote: > > So it's our code that is busted. No doubt, what is happening is > > that process A is fetching two fields, modifying one of them, > > and storing the word back (with the observed valu

Re: Non-reproducible AIO failure

2025-06-06 Thread Alexander Lakhin
Hello Andres and Tom, 06.06.2025 22:37, Andres Freund wrote: On 2025-06-06 15:21:13 -0400, Tom Lane wrote: It's a mistake to think that this is a compiler bug. The C standard explicitly allows compilers to use word-wide operations to access bit-field struct members. Such accesses may fetch or

Re: Non-reproducible AIO failure

2025-06-06 Thread Andres Freund
Hi, On 2025-06-06 15:21:13 -0400, Tom Lane wrote: > Konstantin Knizhnik writes: > > There is really essential difference in code generated by clang 15 > > (working) and 16 (not working). > > It's a mistake to think that this is a compiler bug. The C standard > explicitly allows compilers to use

Re: Non-reproducible AIO failure

2025-06-06 Thread Tom Lane
Konstantin Knizhnik writes: > There is really essential difference in code generated by clang 15 > (working) and 16 (not working). It's a mistake to think that this is a compiler bug. The C standard explicitly allows compilers to use word-wide operations to access bit-field struct members. Suc
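The hazard described in the message above (a store to one bit-field performed as a load and store of the whole containing word) can be emulated deterministically. The sketch below is an illustration under stated assumptions: the field layout, the masks and the function names are hypothetical, and the word-wide read-modify-write is written out by hand rather than relying on what any particular compiler emits. Other messages in this thread report that the assertion was also reproduced without the bit-field, so this shows only the general mechanism, not a diagnosis.

```c
#include <stdint.h>

/* Hypothetical packing: "state" in bits 0..7 and "op" in bits 8..15. */
#define DEMO_STATE_MASK 0x000000ffu
#define DEMO_OP_MASK    0x0000ff00u
#define DEMO_OP_SHIFT   8

/*
 * Store "state" the way a word-wide bit-field store works: start from a
 * previously loaded copy of the whole word, patch the state bits, and
 * return the full word to be written back.
 */
static uint32_t
demo_rmw_set_state(uint32_t loaded_word, uint8_t state)
{
    return (loaded_word & ~DEMO_STATE_MASK) | state;
}

/*
 * Interleave two writers by hand and return the final "op" value.
 * Writer B's update to "op" is lost because writer A stores back the
 * stale word it loaded earlier.
 */
uint8_t
demo_lost_update(void)
{
    uint32_t shared = 0;        /* state = 0, op = 0 */
    uint32_t a_copy = shared;   /* A loads the whole word */

    /* B sets op = 1 in the shared word */
    shared = (shared & ~DEMO_OP_MASK) | (1u << DEMO_OP_SHIFT);

    /* A finishes its store of state = 2 using the stale copy */
    shared = demo_rmw_set_state(a_copy, 2);

    return (uint8_t) ((shared & DEMO_OP_MASK) >> DEMO_OP_SHIFT);
}
```

Giving each field its own `uint8_t` member avoids this particular interleaving, because separate non-bit-field members are distinct memory locations in the C11 memory model.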

Re: Non-reproducible AIO failure

2025-06-06 Thread Andres Freund
Hi, On 2025-06-06 14:03:12 +0300, Konstantin Knizhnik wrote: > There is really essential difference in code generated by clang 15 (working) > and 16 (not working). There also are code gen differences between upstream clang 17 and apple's clang, which is based on llvm 17 as well (I've updated the

Re: Non-reproducible AIO failure

2025-06-06 Thread Konstantin Knizhnik
There is really essential difference in code generated by clang 15 (working) and 16 (not working). ``` pgaio_io_stage(PgAioHandle *ioh, PgAioOp op) { ... HOLD_INTERRUPTS();     ioh->op = op;     ioh->result = 0;     pgaio_io_update_state(ioh, PGAIO_HS_DEFINED);     ... } ``` c

Re: Non-reproducible AIO failure

2025-06-05 Thread Konstantin Knizhnik
On 06/06/2025 2:31 am, Tom Lane wrote: Matthias van de Meent writes: I have a very wild guess that's probably wrong in a weird way, but here goes anyway: Did anyone test if interleaving the enum-typed bitfield fields of PgAioHandle with the uint8 fields might solve the issue? Ugh. I think y

Re: Non-reproducible AIO failure

2025-06-05 Thread Alexander Lakhin
Hello, 05.06.2025 22:00, Alexander Lakhin wrote: Thank you for your attention to this and for the tip! Today I tried the following: --- a/src/include/storage/aio.h +++ b/src/include/storage/aio.h @@ -89,8 +89,8 @@ typedef enum PgAioOp     /* intentionally the zero value, to help catch zeroed

Re: Non-reproducible AIO failure

2025-06-05 Thread Tom Lane
Matthias van de Meent writes: > I have a very wild guess that's probably wrong in a weird way, but > here goes anyway: > Did anyone test if interleaving the enum-typed bitfield fields of > PgAioHandle with the uint8 fields might solve the issue? Ugh. I think you probably nailed it. IMO all thos

Re: Non-reproducible AIO failure

2025-06-05 Thread Matthias van de Meent
On Thu, 5 Jun 2025 at 21:00, Alexander Lakhin wrote: > > Hello Thomas and Andres, > > 04.06.2025 23:32, Thomas Munro wrote: > > On Thu, Jun 5, 2025 at 8:02 AM Andres Freund wrote: > >> On 2025-06-03 08:00:01 +0300, Alexander Lakhin wrote: > >>> 2025-06-03 00:19:09.282 EDT [25175:1] LOG: !!!pgaio_

Re: Non-reproducible AIO failure

2025-06-05 Thread Alexander Lakhin
Hello Thomas and Andres, 04.06.2025 23:32, Thomas Munro wrote: On Thu, Jun 5, 2025 at 8:02 AM Andres Freund wrote: On 2025-06-03 08:00:01 +0300, Alexander Lakhin wrote: 2025-06-03 00:19:09.282 EDT [25175:1] LOG: !!!pgaio_io_before_start| ioh: 0x104c3e1a0, ioh->op: 1, ioh->state: 1, ioh->resul

Re: Non-reproducible AIO failure

2025-06-04 Thread Thomas Munro
On Thu, Jun 5, 2025 at 8:02 AM Andres Freund wrote: > On 2025-06-03 08:00:01 +0300, Alexander Lakhin wrote: > > 2025-06-03 00:19:09.282 EDT [25175:1] LOG: !!!pgaio_io_before_start| ioh: > > 0x104c3e1a0, ioh->op: 1, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: > > 2, ioh->generation: 21694 >

Re: Non-reproducible AIO failure

2025-06-04 Thread Andres Freund
Hi, Thanks for working on investigating this. On 2025-06-03 08:00:01 +0300, Alexander Lakhin wrote: > 02.06.2025 09:00, Alexander Lakhin wrote: > > With additional logging (the patch is attached), I can see the following: > > ... > > !!!pgaio_io_reclaim [63817]| ioh: 0x1046b5660, ioh->op: 1, ioh

Re: Non-reproducible AIO failure

2025-06-02 Thread Alexander Lakhin
Hello, 02.06.2025 09:00, Alexander Lakhin wrote: With additional logging (the patch is attached), I can see the following: ... !!!pgaio_io_reclaim [63817]| ioh: 0x1046b5660, ioh->op: 1, ioh->state: 6, ioh->result: 8192, ioh->num_callbacks: 2 !!!AsyncReadBuffers [63817] (1)| blocknum: 18, ioh: 0

Re: Non-reproducible AIO failure

2025-06-01 Thread Alexander Lakhin
31.05.2025 06:00, Alexander Lakhin wrote: Hello Thomas, It looks like I managed to restore all the conditions needed to reproduce that Assert more or less reliably (within a couple of hours), so I can continue experiments. I've added the following debugging: ... With additional logging (the p

Re: Non-reproducible AIO failure

2025-05-30 Thread Alexander Lakhin
Hello Thomas, 25.05.2025 05:45, Thomas Munro wrote: TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 161, PID: 32355 Can you get a core and print *ioh in the debugger? It looks like I managed to restore all the conditions needed to reproduce that Assert more or less

Re: Non-reproducible AIO failure

2025-05-27 Thread Tom Lane
Andres Freund writes: > I'll see if being graphically logged in somehow indeed increased the repro > rate, and if so I'll expand the debugging somewhat, or if this was just an > absurd coincidence. Hmm. Now that you mention it, the one repro on the M1 came just as I was about to give up and manu

Re: Non-reproducible AIO failure

2025-05-27 Thread Robert Haas
On Sun, May 25, 2025 at 8:25 PM Tom Lane wrote: > The fact that I can trace through this Assert failure but not the > AIO one strongly suggests some system-level problem in the latter. > There is something rotten in the state of Denmark. I have been quite frustrated with lldb on macOS for a while

Re: Non-reproducible AIO failure

2025-05-27 Thread Andres Freund
Hi, On 2025-05-27 14:43:14 -0400, Tom Lane wrote: > Andres Freund writes: > > I just meant that it seems that I can't reproduce it for some as of yet > > unknown reason. I've now been through 3k+ runs of 027_stream_regress, > > without > > a single failure, so there has to be *something* differe

Re: Non-reproducible AIO failure

2025-05-27 Thread Tom Lane
Andres Freund writes: > I just meant that it seems that I can't reproduce it for some as of yet > unknown reason. I've now been through 3k+ runs of 027_stream_regress, without > a single failure, so there has to be *something* different about my > environment than yours. > Darwin m4-dev 24.1.0 Da

Re: Non-reproducible AIO failure

2025-05-27 Thread Andres Freund
Hi, On 2025-05-27 10:12:28 -0400, Tom Lane wrote: > Andres Freund writes: > > This is on a m4 mac mini. I'm wondering if there's some hardware specific > > memory ordering issue or disk speed based timing issue that I'm just not > > hitting. > > I dunno, I've seen it on three different physical

Re: Non-reproducible AIO failure

2025-05-27 Thread Alexander Lakhin
Hello hackers, 27.05.2025 16:35, Andres Freund wrote: On 2025-05-25 20:05:49 -0400, Tom Lane wrote: Thomas Munro writes: Could you guys please share your exact repro steps? I've just been running 027_stream_regress.pl over and over. It's not a recommendable answer though because the failure p

Re: Non-reproducible AIO failure

2025-05-27 Thread Alexander Lakhin
Hello Tomas, 27.05.2025 16:26, Tomas Vondra wrote: I'm interested in how you run these tests in parallel. Can you share the patch/script? Yeah, sure. I'm running the test as follows: rm -rf src/test/recovery_*; for i in `seq 40`; do cp -r src/test/recovery/ src/test/recovery_$i/; sed -i .bak

Re: Non-reproducible AIO failure

2025-05-27 Thread Tom Lane
Andres Freund writes: > This is on a m4 mac mini. I'm wondering if there's some hardware specific > memory ordering issue or disk speed based timing issue that I'm just not > hitting. I dunno, I've seen it on three different physical machines now (one M1, two M4 Pros). But it is darn hard to re

Re: Non-reproducible AIO failure

2025-05-27 Thread Tom Lane
Thomas Munro writes: > Could you please share your configure options? The failures on indri and sifaka were during ordinary buildfarm runs, you can check the animals' details on the website. (Note those are same host machine, the difference is that indri uses some MacPorts packages while sifaka i

Re: Non-reproducible AIO failure

2025-05-27 Thread Tomas Vondra
On 5/24/25 23:00, Alexander Lakhin wrote: > ... > > I'm yet to see the Assert triggered on the buildfarm, but this one looks > interesting too. > > (I can share the complete patch + script for such testing, if it can be > helpful.) > I'm interested in how you run these tests in parallel. Can

Re: Non-reproducible AIO failure

2025-05-27 Thread Andres Freund
Hi, On 2025-05-25 20:05:49 -0400, Tom Lane wrote: > Thomas Munro writes: > > Could you guys please share your exact repro steps? > > I've just been running 027_stream_regress.pl over and over. > It's not a recommendable answer though because the failure > probability is tiny, under 1%. It sound

Re: Non-reproducible AIO failure

2025-05-27 Thread Thomas Munro
On Mon, May 26, 2025 at 12:05 PM Tom Lane wrote: > Thomas Munro writes: > > Could you guys please share your exact repro steps? > > I've just been running 027_stream_regress.pl over and over. > It's not a recommendable answer though because the failure > probability is tiny, under 1%. It sounded

Re: Non-reproducible AIO failure

2025-05-25 Thread Tom Lane
Thomas Munro writes: > On Sun, May 25, 2025 at 3:22 PM Tom Lane wrote: >> So far, I've failed to get anything useful out of core files >> from this failure. The trace goes back no further than >> (lldb) bt >> * thread #1 >> * frame #0: 0x00018de39388 libsystem_kernel.dylib`__pthread_kill + 8

Re: Non-reproducible AIO failure

2025-05-25 Thread Tom Lane
Thomas Munro writes: > Could you guys please share your exact repro steps? I've just been running 027_stream_regress.pl over and over. It's not a recommendable answer though because the failure probability is tiny, under 1%. It sounded like Alexander had a better way. re

Re: Non-reproducible AIO failure

2025-05-25 Thread Thomas Munro
On Sun, May 25, 2025 at 3:22 PM Tom Lane wrote: > Thomas Munro writes: > > Can you get a core and print *ioh in the debugger? > > So far, I've failed to get anything useful out of core files > from this failure. The trace goes back no further than > > (lldb) bt > * thread #1 > * frame #0: 0x00

Re: Non-reproducible AIO failure

2025-05-24 Thread Tom Lane
Thomas Munro writes: > Can you get a core and print *ioh in the debugger? So far, I've failed to get anything useful out of core files from this failure. The trace goes back no further than (lldb) bt * thread #1 * frame #0: 0x00018de39388 libsystem_kernel.dylib`__pthread_kill + 8 That's

Re: Non-reproducible AIO failure

2025-05-24 Thread Thomas Munro
On Sun, May 25, 2025 at 9:00 AM Alexander Lakhin wrote: > Hello Thomas, > 24.05.2025 14:42, Thomas Munro wrote: > > On Sat, May 24, 2025 at 3:17 PM Tom Lane wrote: > >> So it seems that "very low-probability issue in our Mac AIO code" is > >> the most probable description. > > There isn't any mac

Re: Non-reproducible AIO failure

2025-05-24 Thread Alexander Lakhin
Hello Thomas, 24.05.2025 14:42, Thomas Munro wrote: On Sat, May 24, 2025 at 3:17 PM Tom Lane wrote: So it seems that "very low-probability issue in our Mac AIO code" is the most probable description. There isn't any macOS-specific AIO code so my first guess would be that it might be due to aa

Re: Non-reproducible AIO failure

2025-05-24 Thread Thomas Munro
On Sat, May 24, 2025 at 3:17 PM Tom Lane wrote: > So it seems that "very low-probability issue in our Mac AIO code" is > the most probable description. There isn't any macOS-specific AIO code so my first guess would be that it might be due to aarch64 weak memory reordering (though Andres speculat

Re: Non-reproducible AIO failure

2025-05-23 Thread Tom Lane
Alexander Lakhin writes: > FWIW, that Assert has just triggered on another mac: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=indri&dt=2025-05-23%2020%3A30%3A07 Yeah, I was just looking at that too. There is a corefile from that crash, but lldb seems unable to extract anything from

Re: Non-reproducible AIO failure

2025-05-23 Thread Alexander Lakhin
Hello Tom and Andres, 24.04.2025 01:58, Tom Lane wrote: Andres Freund writes: On 2025-04-23 17:17:01 -0400, Tom Lane wrote: My buildfarm animal sifaka just failed like this [1]: There's nothing really special about sifaka, is there? I see -std=gnu99 and a few debug -D cppflags, but they don'

Non-reproducible AIO failure

2025-04-23 Thread Tom Lane
My buildfarm animal sifaka just failed like this [1]: TRAP: failed Assert("aio_ret->result.status != PGAIO_RS_UNKNOWN"), File: "bufmgr.c", Line: 1605, PID: 79322 0 postgres 0x000100e3df2c ExceptionalCondition + 108 1 postgres 0x00

Re: Non-reproducible AIO failure

2025-04-23 Thread Tom Lane
Andres Freund writes: > On 2025-04-23 17:17:01 -0400, Tom Lane wrote: >> My buildfarm animal sifaka just failed like this [1]: > There's nothing really special about sifaka, is there? I see -std=gnu99 and a > few debug -D cppflags, but they don't look like they could really be relevant here. No, it's

Re: Non-reproducible AIO failure

2025-04-23 Thread Andres Freund
Hi, On 2025-04-23 17:17:01 -0400, Tom Lane wrote: > My buildfarm animal sifaka just failed like this [1]: There's nothing really special about sifaka, is there? I see -std=gnu99 and a few debug -D cppflags, but they don't look like they could really be relevant here. > TRAP: failed Assert("aio_ret->