Re: Why is parula failing?
On Tue, 14 May 2024 at 08:55, David Rowley wrote:
> I've not seen any recent failures from Parula that relate to this
> issue. The last one seems to have been about 4 weeks ago.
>
> I'm now wondering if it's time to revert the debugging code added in
> 1db689715. Does anyone think differently?

Thanks for keeping an eye on this. Sadly the older machine was decommissioned, and thus parula hasn't been sending results to the buildfarm for the past few days. I'll try to build a similar machine (but with a newer gcc etc.) and reopen this thread in case I hit something similar.

- robins
Re: Why is parula failing?
On Mon, 15 Apr 2024 at 16:02, Tom Lane wrote:
> David Rowley writes:
> > If GetNowFloat() somehow was returning a negative number then we could
> > end up with a large delay. But if gettimeofday() was so badly broken
> > then wouldn't there be some evidence of this in the log timestamps on
> > failing runs?
>
> And indeed that too. I'm finding the "compiler bug" theory
> palatable. Robins mentioned having built the compiler from
> source, which theoretically should work, but maybe something
> went wrong? Or it's missing some important bug fix?
>
> It might be interesting to back the animal's CFLAGS down
> to -O0 and see if things get more stable.

The last 25 consecutive runs have passed [1] after switching REL_12_STABLE to -O0! So I am wondering whether that confirms that the compiler version is to blame. While we're still here, is there anything else I could try? If not, by Sunday I am considering switching parula to gcc v12 (or even v14 experimental, given that massasauga [2] has been pretty stable since its upgrade a few days back).

Reference:
1. https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=parula&br=REL_12_STABLE
2. https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=massasauga&br=REL_12_STABLE

- robins
Re: Why is parula failing?
On Mon, 15 Apr 2024 at 14:55, David Rowley wrote:
> If GetNowFloat() somehow was returning a negative number then we could
> end up with a large delay. But if gettimeofday() was so badly broken
> then wouldn't there be some evidence of this in the log timestamps on
> failing runs?

3 things stand out for me here, unsure if they're related somehow:
1. Issue where reltuples=48 (in essence runs complete, but a few tests fail)
2. SIGABRT - most of which are DDLs (runs complete, but engine crashes + many tests fail)
3. pg_sleep() stuck - (runs never complete, IIUC never gets reported to buildfarm)

For #3, one thing I had done earlier (and then reverted) was to set the 'wait_timeout' from the current undef to 2 hours. I'll set it again to 2hrs in hopes that #3 starts getting reported to the buildfarm too.

> I'm not that familiar with the buildfarm config, but I do see some
> Valgrind related setting in there. Is PostgreSQL running under
> Valgrind on these runs?

Not yet. I was tempted, but valgrind has not yet been enabled on this member. IIUC by default it's disabled.

'use_valgrind' => undef,

- robins
Re: Why is parula failing?
On Sun, 14 Apr 2024 at 00:12, Tom Lane wrote:
> If we were only supposed to sleep 0.1 seconds, how is it waiting
> for 60 ms (and, presumably, repeating that)? The logic in
> pg_sleep is pretty simple, and it's hard to think of anything except
> the system clock jumping (far) backwards that would make this
> happen. Any chance of extracting the local variables from the
> pg_sleep stack frame?

- I now have 2 separate runs stuck on pg_sleep() - HEAD / REL_16_STABLE
- I'll keep them (stuck) for this week, in case there's more we can get from them (and to see how long they take)
- Attached are 'bt full' outputs for both (b.txt - HEAD / a.txt - REL_16_STABLE)

A few things to add:
- To reiterate, this instance has gcc v13.2 compiled without any flags (my first time ever TBH). IIRC 'make -k check' came out okay, so at this point I don't think I did something obviously wrong when building gcc from git.
- I installed gcc v14.0.1 experimental on massasauga (also an aarch64, and built from git) and despite multiple runs it seems to be doing okay [1].
- Next week (if I'm still scratching my head - and unless someone advises otherwise), I'll upgrade parula to gcc 14 experimental to see if this is about gcc maturity on graviton (for some reason). I don't expect much to come out of it though (given Tomas' testing on rpi5, but it doesn't hurt).

Ref:
1. https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=massasauga&br=REL_12_STABLE

- robins

[postgres@ip-172-31-18-25 ~]$ pstack 26147
#0  0xadeda954 in epoll_pwait () from /lib64/libc.so.6
#1  0x00842888 in WaitEventSetWaitBlock (nevents=1, occurred_events=<optimized out>, cur_timeout=60, set=0x3148fac0) at latch.c:1570
#2  WaitEventSetWait (set=0x3148fac0, timeout=timeout@entry=60, occurred_events=occurred_events@entry=0xd1194748, nevents=nevents@entry=1, wait_event_info=wait_event_info@entry=150994946) at latch.c:1516
#3  0x00842c44 in WaitLatch (latch=<optimized out>, wakeEvents=wakeEvents@entry=41, timeout=60, wait_event_info=wait_event_info@entry=150994946) at latch.c:538
#4  0x0090b7b4 in pg_sleep (fcinfo=<optimized out>) at misc.c:406
#5  0x00698430 in ExecInterpExpr (state=0x316a6040, econtext=0x316a5e38, isnull=<optimized out>) at execExprInterp.c:764
#6  0x006d0898 in ExecEvalExprSwitchContext (isNull=0xd11948bf, econtext=0x316a5e38, state=<optimized out>) at ../../../src/include/executor/executor.h:356
#7  ExecProject (projInfo=<optimized out>) at ../../../src/include/executor/executor.h:390
#8  ExecResult (pstate=<optimized out>) at nodeResult.c:135
#9  0x006b92ec in ExecProcNode (node=0x316a5d28) at ../../../src/include/executor/executor.h:274
#10 gather_getnext (gatherstate=0x316a5b38) at nodeGather.c:287
#11 ExecGather (pstate=0x316a5b38) at nodeGather.c:222
#12 0x0069c36c in ExecProcNode (node=0x316a5b38) at ../../../src/include/executor/executor.h:274
#13 ExecutePlan (execute_once=<optimized out>, dest=0x31641e90, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x316a5b38, estate=0x316a5910) at execMain.c:1646
#14 standard_ExecutorRun (queryDesc=0x316459c0, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:363
#15 0x00871564 in PortalRunSelect (portal=portal@entry=0x31512fb0, forward=forward@entry=true, count=0, count@entry=9223372036854775807, dest=dest@entry=0x31641e90) at pquery.c:924
#16 0x00872d80 in PortalRun (portal=portal@entry=0x31512fb0, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x31641e90, altdest=altdest@entry=0x31641e90, qc=qc@entry=0xd1194c70) at pquery.c:768
#17 0x0086ea54 in exec_simple_query (query_string=query_string@entry=0x31493c90 "SELECT pg_sleep(0.1);") at postgres.c:1274
#18 0x0086f590 in PostgresMain (dbname=<optimized out>, username=<optimized out>) at postgres.c:4680
#19 0x0086ab20 in BackendMain (startup_data=<optimized out>, startup_data_len=<optimized out>) at backend_startup.c:105
#20 0x007c54d8 in postmaster_child_launch (child_type=child_type@entry=B_BACKEND, startup_data=startup_data@entry=0xd1195138 "", startup_data_len=startup_data_len@entry=4, client_sock=client_sock@entry=0xd1195140) at launch_backend.c:265
#21 0x007c8ec0 in BackendStartup (client_sock=0xd1195140) at postmaster.c:3593
#22 ServerLoop () at postmaster.c:1674
#23 0x007cab68 in PostmasterMain (argc=argc@entry=8, argv=argv@entry=0x3148f320) at postmaster.c:1372
#24 0x00496cb8 in main (argc=8, argv=0x3148f320) at main.c:197
[postgres@ip-172-31-18-25 ~]$

 2072 root      20  0  117M  4376  3192 S 0.0 0.0 0:00.00 │  └─ /usr/sbin/CROND -n
 2087 postgres  20  0 20988  6496  5504 S 0.0 0.0 0:00.00 │     └─ /usr/sbin/sendmail -FCronDaemon -i -odi -oem -oi -t -f postgres
 2092 postgres  20  0 20960  6328  5336 S 0.0 0.0 0:00.00 │     │  └─ /usr/sbin/postdrop -r
 2074 postgres  20  0  111M  2660  2488 S 0.0 0.0 0:00.00 │  └─ /bin/sh -c cd
Re: Why is parula failing?
On Mon, 8 Apr 2024 at 21:25, Robins Tharakan wrote:
>
> I'll keep an eye on this instance more often for the next few days.
> (Let me know if I could capture more if a run gets stuck again)

HEAD is stuck again on pg_sleep(), no CPU for the past hour or so. The stack trace seems to be similar to last time.

$ pstack 24930
#0  0xb8280954 in epoll_pwait () from /lib64/libc.so.6
#1  0x00843408 in WaitEventSetWaitBlock (nevents=1, occurred_events=<optimized out>, cur_timeout=60, set=0x3b38dac0) at latch.c:1570
#2  WaitEventSetWait (set=0x3b38dac0, timeout=timeout@entry=60, occurred_events=occurred_events@entry=0xfd1d66c8, nevents=nevents@entry=1, wait_event_info=wait_event_info@entry=150994946) at latch.c:1516
#3  0x008437c4 in WaitLatch (latch=<optimized out>, wakeEvents=wakeEvents@entry=41, timeout=60, wait_event_info=wait_event_info@entry=150994946) at latch.c:538
#4  0x0090c384 in pg_sleep (fcinfo=<optimized out>) at misc.c:406
#5  0x00699350 in ExecInterpExpr (state=0x3b5a41a0, econtext=0x3b5a3f98, isnull=<optimized out>) at execExprInterp.c:764
#6  0x006d1668 in ExecEvalExprSwitchContext (isNull=0xfd1d683f, econtext=0x3b5a3f98, state=<optimized out>) at ../../../src/include/executor/executor.h:356
#7  ExecProject (projInfo=<optimized out>) at ../../../src/include/executor/executor.h:390
#8  ExecResult (pstate=<optimized out>) at nodeResult.c:135
#9  0x006ba26c in ExecProcNode (node=0x3b5a3e88) at ../../../src/include/executor/executor.h:274
#10 gather_getnext (gatherstate=0x3b5a3c98) at nodeGather.c:287
#11 ExecGather (pstate=0x3b5a3c98) at nodeGather.c:222
#12 0x0069d28c in ExecProcNode (node=0x3b5a3c98) at ../../../src/include/executor/executor.h:274
#13 ExecutePlan (execute_once=<optimized out>, dest=0x3b5ae8e0, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x3b5a3c98, estate=0x3b5a3a70) at execMain.c:1646
#14 standard_ExecutorRun (queryDesc=0x3b59c250, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:363
#15 0x008720e4 in PortalRunSelect (portal=portal@entry=0x3b410fb0, forward=forward@entry=true, count=0, count@entry=9223372036854775807, dest=dest@entry=0x3b5ae8e0) at pquery.c:924
#16 0x00873900 in PortalRun (portal=portal@entry=0x3b410fb0, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x3b5ae8e0, altdest=altdest@entry=0x3b5ae8e0, qc=qc@entry=0xfd1d6bf0) at pquery.c:768
#17 0x0086f5d4 in exec_simple_query (query_string=query_string@entry=0x3b391c90 "SELECT pg_sleep(0.1);") at postgres.c:1274
#18 0x00870110 in PostgresMain (dbname=<optimized out>, username=<optimized out>) at postgres.c:4680
#19 0x0086b6a0 in BackendMain (startup_data=<optimized out>, startup_data_len=<optimized out>) at backend_startup.c:105
#20 0x007c6268 in postmaster_child_launch (child_type=child_type@entry=B_BACKEND, startup_data=startup_data@entry=0xfd1d70b8 "", startup_data_len=startup_data_len@entry=4, client_sock=client_sock@entry=0xfd1d70c0) at launch_backend.c:265
#21 0x007c9c50 in BackendStartup (client_sock=0xfd1d70c0) at postmaster.c:3593
#22 ServerLoop () at postmaster.c:1674
#23 0x007cb8f8 in PostmasterMain (argc=argc@entry=8, argv=argv@entry=0x3b38d320) at postmaster.c:1372
#24 0x00496e18 in main (argc=8, argv=0x3b38d320) at main.c:197

 CPU% MEM%   TIME+  Command
 .
 .
 0.0  0.0  0:00.00 │  └─ /bin/sh -c cd /opt/postgres/build-farm-14 && PATH=/opt/gcc/home/ec2-user/proj/gcc/target/bin/
 0.0  0.1  0:00.07 │     └─ /usr/bin/perl ./run_build.pl --config=build-farm.conf HEAD --verbose
 0.0  0.0  0:00.00 │        └─ sh -c { cd pgsql.build/src/test/regress && make NO_LOCALE=1 check; echo $? > /opt/postg
 0.0  0.0  0:00.00 │           └─ make NO_LOCALE=1 check
 0.0  0.0  0:00.00 │              └─ /bin/sh -c echo "# +++ regress check in src/test/regress +++" && PATH="/opt/postg
 0.0  0.0  0:00.10 │                 └─ ../../../src/test/regress/pg_regress --temp-instance=./tmp_check --inputdir=.
 0.0  0.0  0:00.01 │                    ├─ psql -X -a -q -d regression -v HIDE_TABLEAM=on -v HIDE_TOAST_COMPRESSION=on
 0.0  0.1  0:02.64 │                    └─ postgres -D /opt/postgres/build-farm-14/buildroot/HEAD/pgsql.build/src/test
 0.0  0.2  0:00.05 │                       ├─ postgres: postgres regression [local] SELECT
 0.0  0.0  0:00.06 │                       ├─ postgres: logical replication launcher
 0.0  0.1  0:00.36 │                       ├─ postgres: autovacuum launcher
 0.0  0.1  0:00.34 │                       ├─ postgres: walwriter
 0.0  0.0  0:00.32 │                       ├─ postgres: background writer
 0.0  0.3  0:00.05 │                       └─ postgres: checkpointer

- robins
Re: Why is parula failing?
On Wed, 10 Apr 2024 at 10:24, David Rowley wrote:
>
> Master failed today for the first time since the compiler upgrade.
> Again reltuples == 48.

Here's what I can add over the past few days:
- Almost all failures are either reltuples=48 or SIGABRTs
- Almost all SIGABRTs are DDLs - CREATE INDEX / CREATE AGGREGATEs / CTAS - a little too coincidental? Recent crashes have stack-traces if interested.

Barring the initial failures (during the move to gcc 13.2), in the past week:
- v15 somehow hasn't had a failure yet
- v14 / v16 have got only 1 failure each
- but v12 / v13 are lit up - they failed multiple times.

- robins
Re: Why is parula failing?
On Wed, 10 Apr 2024 at 10:24, David Rowley wrote:
> Master failed today for the first time since the compiler upgrade.
> Again reltuples == 48.

From the buildfarm members page, parula seems to be the only aarch64 + gcc 13.2 combination today, which makes me suspect gcc v13.2's maturity on aarch64. I'll try to upgrade one of the other aarch64s I have (massasauga or snakefly) and see if this is more about gcc 13.2 maturity on this architecture.

- robins
Re: Why is parula failing?
On Tue, 2 Apr 2024 at 15:01, Tom Lane wrote:
> "Tharakan, Robins" writes:
> > So although HEAD ran fine, but I saw multiple failures (v12, v13, v16) all of which passed on subsequent tries,
> > of which some were even "signal 6: Aborted".
>
> Ugh...

parula didn't send any reports to the buildfarm for the past 44 hours. Logged in to see that postgres was stuck on pg_sleep(), which was quite odd! I captured the backtrace and triggered another run on HEAD, which came out okay. I'll keep an eye on this instance more often for the next few days. (Let me know if I could capture more if a run gets stuck again)

(gdb) bt
#0  0x952ae954 in epoll_pwait () from /lib64/libc.so.6
#1  0x0083e9c8 in WaitEventSetWaitBlock (nevents=1, occurred_events=<optimized out>, cur_timeout=297992, set=0x2816dac0) at latch.c:1570
#2  WaitEventSetWait (set=0x2816dac0, timeout=timeout@entry=60, occurred_events=occurred_events@entry=0xc395ed28, nevents=nevents@entry=1, wait_event_info=wait_event_info@entry=150994946) at latch.c:1516
#3  0x0083ed84 in WaitLatch (latch=<optimized out>, wakeEvents=wakeEvents@entry=41, timeout=60, wait_event_info=wait_event_info@entry=150994946) at latch.c:538
#4  0x00907404 in pg_sleep (fcinfo=<optimized out>) at misc.c:406
#5  0x00696b10 in ExecInterpExpr (state=0x28384040, econtext=0x28383e38, isnull=<optimized out>) at execExprInterp.c:764
#6  0x006ceef8 in ExecEvalExprSwitchContext (isNull=0xc395ee9f, econtext=0x28383e38, state=<optimized out>) at ../../../src/include/executor/executor.h:356
#7  ExecProject (projInfo=<optimized out>) at ../../../src/include/executor/executor.h:390
#8  ExecResult (pstate=<optimized out>) at nodeResult.c:135
#9  0x006b7aec in ExecProcNode (node=0x28383d28) at ../../../src/include/executor/executor.h:274
#10 gather_getnext (gatherstate=0x28383b38) at nodeGather.c:287
#11 ExecGather (pstate=0x28383b38) at nodeGather.c:222
#12 0x0069aa4c in ExecProcNode (node=0x28383b38) at ../../../src/include/executor/executor.h:274
#13 ExecutePlan (execute_once=<optimized out>, dest=0x2831ffb0, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x28383b38, estate=0x28383910) at execMain.c:1646
#14 standard_ExecutorRun (queryDesc=0x283239c0, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:363
#15 0x0086d454 in PortalRunSelect (portal=portal@entry=0x281f0fb0, forward=forward@entry=true, count=0, count@entry=9223372036854775807, dest=dest@entry=0x2831ffb0) at pquery.c:924
#16 0x0086ec70 in PortalRun (portal=portal@entry=0x281f0fb0, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x2831ffb0, altdest=altdest@entry=0x2831ffb0, qc=qc@entry=0xc395f250) at pquery.c:768
#17 0x0086a944 in exec_simple_query (query_string=query_string@entry=0x28171c90 "SELECT pg_sleep(0.1);") at postgres.c:1274
#18 0x0086b480 in PostgresMain (dbname=<optimized out>, username=<optimized out>) at postgres.c:4680
#19 0x00866a0c in BackendMain (startup_data=<optimized out>, startup_data_len=<optimized out>) at backend_startup.c:101
#20 0x007c1738 in postmaster_child_launch (child_type=child_type@entry=B_BACKEND, startup_data=startup_data@entry=0xc395f718 "", startup_data_len=startup_data_len@entry=4, client_sock=client_sock@entry=0xc395f720) at launch_backend.c:265
#21 0x007c5120 in BackendStartup (client_sock=0xc395f720) at postmaster.c:3593
#22 ServerLoop () at postmaster.c:1674
#23 0x007c6dc8 in PostmasterMain (argc=argc@entry=8, argv=argv@entry=0x2816d320) at postmaster.c:1372
#24 0x00496bb8 in main (argc=8, argv=0x2816d320) at main.c:197

> The update_personality.pl script in the buildfarm client distro
> is what to use to adjust OS version or compiler version data.

Thanks. Fixed that.

- robins
Re: pg_upgrade failing for 200+ million Large Objects
On Thu, 28 Dec 2023 at 01:48, Tom Lane wrote:
> Robins Tharakan writes:
> > Applying all 4 patches, I also see good performance improvement.
> > With more Large Objects, although pg_dump improved significantly,
> > pg_restore is now comfortably an order of magnitude faster.
>
> Yeah. The key thing here is that pg_dump can only parallelize
> the data transfer, while (with 0004) pg_restore can parallelize
> large object creation and owner-setting as well as data transfer.
> I don't see any simple way to improve that on the dump side,
> but I'm not sure we need to. Zillions of empty objects is not
> really the use case to worry about. I suspect that a more realistic
> case with moderate amounts of data in the blobs would make pg_dump
> look better.

Thanks for elaborating, and yes, the pg_dump times do reflect that expectation.

The first test involved a fixed number (32k) of Large Objects (LOs) with varying sizes - I chose that number intentionally since this was being tested on a 32vCPU instance and the patch employs 1k batches. We again see that pg_restore is an order of magnitude faster.

LO Size (bytes)  restore-HEAD (s)  restore-patched (s)  improvement (Nx)
1                24.182            1.4                  17x
10               24.741            1.5                  17x
100              24.574            1.6                  15x
1,000            25.314            1.7                  15x
10,000           25.644            1.7                  15x
100,000          50.046            4.3                  12x
1,000,000        281.549           30.0                 9x

pg_dump also sees improvements. Really small sized LOs see a decent ~20% improvement, which grows considerably as LOs get bigger (beyond ~10-100kb).

LO Size (bytes)  dump-HEAD (s)  dump-patched (s)  improvement (%)
1                12.9           10.7              18%
10               12.9           10.4              19%
100              12.8           10.3              20%
1,000            13.0           10.3              21%
10,000           14.2           10.3              27%
100,000          32.8           11.5              65%
1,000,000        211.8          23.6              89%

To test pg_restore scaling, 1 million LOs (100kb each) were created and pg_restore times were tested for increasing concurrency (on a 192vCPU instance). We see major speedup up to -j64 and the best time was at -j96, after which performance decreases slowly - see attached image.

Concurrency  pg_restore-patched (s)
384          75.87
352          75.63
320          72.11
288          70.05
256          70.98
224          66.98
192          63.04
160          61.37
128          58.82
96           58.55
64           60.46
32           77.29
16           115.51
8            203.48
4            366.33

Test details:
- Command used to generate the SQL - create 1k LOs of 1kb each:
  echo "SELECT lo_from_bytea(0, '\x`printf 'ff%.0s' {1..1000}`') FROM generate_series(1,1000);" > /tmp/tempdel
- Verify the LO size: select pg_column_size(lo_get(oid));
- Only GUC changed: max_connections=1000 (for the last test)

- Robins Tharakan
Amazon Web Services
Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier
x55d919752fb0) at fe-connect.c:4112
#5  0x7f96da543d55 in PQfinish (conn=0x55d919752fb0) at fe-connect.c:4134
#6  0x7f96d9ebd42b in libpqsrv_disconnect (conn=0x55d919752fb0) at ../../src/include/libpq/libpq-be-fe-helpers.h:117
#7  0x7f96d9ebddf1 in dblink_disconnect (fcinfo=0x55d91f2692a8) at dblink.c:357

Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x7f5f6b632859 in __GI_abort () at abort.c:79
#2  0x7f5f6b69d26e in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7f5f6b7c7298 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#3  0x7f5f6b6a52fc in malloc_printerr (str=str@entry=0x7f5f6b7c91e0 "munmap_chunk(): invalid pointer") at malloc.c:5347
#4  0x7f5f6b6a554c in munmap_chunk (p=<optimized out>) at malloc.c:2830
#5  0x7f5f50085efd in pqDropConnection (conn=0x55d12ebcd100, flushInput=true) at fe-connect.c:495
#6  0x7f5f5008bcb3 in closePGconn (conn=0x55d12ebcd100) at fe-connect.c:4112
#7  0x7f5f5008bd55 in PQfinish (conn=0x55d12ebcd100) at fe-connect.c:4134
#8  0x7f5f5006c42b in libpqsrv_disconnect (conn=0x55d12ebcd100) at ../../src/include/libpq/libpq-be-fe-helpers.h:117

Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x7f5f6b632859 in __GI_abort () at abort.c:79
#2  0x7f5f6b69d26e in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7f5f6b7c7298 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#3  0x7f5f6b6a52fc in malloc_printerr (str=str@entry=0x7f5f6b7c54c1 "free(): invalid pointer") at malloc.c:5347
#4  0x7f5f6b6a6b2c in _int_free (av=<optimized out>, p=<optimized out>, have_lock=0) at malloc.c:4173
#5  0x7f5f500fe6ed in freePGconn (conn=0x55d142273000) at fe-connect.c:3977
#6  0x7f5f500fed61 in PQfinish (conn=0x55d142273000) at fe-connect.c:4135
#7  0x7f5f501de42b in libpqsrv_disconnect (conn=0x55d142273000) at ../../src/include/libpq/libpq-be-fe-helpers.h:117
#8  0x7f5f501dedf1 in dblink_disconnect (fcinfo=0x55d1527998f8) at dblink.c:357

Core was generated by `postgres: e4602483e9@(HEAD detached at e4602483e9)@sqith: u73 postgres 127.0.0.'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  __GI___libc_realloc (oldmem=0x7f7f7f7f7f7f7f7f, bytes=2139070335) at malloc.c:3154
#0  __GI___libc_realloc (oldmem=0x7f7f7f7f7f7f7f7f, bytes=2139070335) at malloc.c:3154
#1  0x7fb7bc0a580a in pqCheckOutBufferSpace (bytes_needed=2139062148, conn=0x55b191aa9380) at fe-misc.c:329
#2  0x7fb7bc0a5b1c in pqPutMsgStart (msg_type=88 'X', conn=0x55b191aa9380) at fe-misc.c:476
#3  0x7fb7bc097c60 in sendTerminateConn (conn=0x55b191aa9380) at fe-connect.c:4076
#4  0x7fb7bc097c97 in closePGconn (conn=0x55b191aa9380) at fe-connect.c:4096
#5  0x7fb7bc097d55 in PQfinish (conn=0x55b191aa9380) at fe-connect.c:4134
#6  0x7fb7bc14a42b in libpqsrv_disconnect (conn=0x55b191aa9380) at ../../src/include/libpq/libpq-be-fe-helpers.h:117
#7  0x00007fb7bc14adf1 in dblink_disconnect (fcinfo=0x55b193894f00) at dblink.c:357

Thanks to SQLSmith for helping with this find.

- Robins Tharakan
Amazon Web Services
Missing CFI in iterate_word_similarity()
Hi,

For long strings, iterate_word_similarity() can run into long-running tight loops without honouring interrupts or statement_timeout. For example:

postgres=# set statement_timeout='1s';
SET
postgres=# select 1 where repeat('1.1',8) %>> 'Lorem ipsum dolor sit amet';
 ?column?
----------
(0 rows)

Time: 29615.842 ms (00:29.616)

The associated perf report:

+   99.98%  0.00%  postgres  postgres           [.] ExecQual
+   99.98%  0.00%  postgres  postgres           [.] ExecEvalExprSwitchContext
+   99.98%  0.00%  postgres  pg_trgm.so         [.] strict_word_similarity_commutator_op
+   99.98%  0.00%  postgres  pg_trgm.so         [.] calc_word_similarity
+   99.68% 99.47%  postgres  pg_trgm.so         [.] iterate_word_similarity
     0.21%  0.03%  postgres  postgres           [.] pg_qsort
     0.16%  0.00%  postgres  [kernel.kallsyms]  [k] asm_sysvec_apic_timer_interrupt
     0.16%  0.00%  postgres  [kernel.kallsyms]  [k] sysvec_apic_timer_interrupt
     0.16%  0.11%  postgres  [kernel.kallsyms]  [k] __softirqentry_text_start
     0.16%  0.00%  postgres  [kernel.kallsyms]  [k] irq_exit_rcu

Adding CHECK_FOR_INTERRUPTS() ensures that such queries respond to statement_timeout & Ctrl-C signals. With the patch applied, the above query is interrupted much more quickly:

postgres=# select 1 where repeat('1.1',8) %>> 'Lorem ipsum dolor sit amet';
ERROR:  canceling statement due to statement timeout
Time: 1000.768 ms (00:01.001)

Please find the patch attached. The patch does not show any performance regressions when run against the above use-case. Thanks to SQLSmith for indirectly leading me to this scenario.

- Robins Tharakan
Amazon Web Services

=====================================================
Patch applied to commit - 80d690721973f6a031143a24a34b78a0225101a2
=====================================================
SQL repro script:

CREATE EXTENSION IF NOT EXISTS pg_trgm;
set statement_timeout = '1s';
show statement_timeout;
\timing on
select 1 where repeat('1.1',8) %>> 'Lorem ipsum dolor sit amet';
-- Check whether this change brought in any performance regressions
set statement_timeout='0';
show statement_timeout;
select COUNT(*) from generate_series(1,1) q(e) where repeat('1.1',1) %>> ('Lorem ipsum dolor sit amet'||e::text);
select COUNT(*) from generate_series(1,10) q(e) where repeat('1.1',1) %>> ('Lorem ipsum dolor sit amet'||e::text);
select COUNT(*) from generate_series(1,100) q(e) where repeat('1.1',1) %>> ('Lorem ipsum dolor sit amet'||e::text);

SQL script output:

CREATE EXTENSION
SET
 statement_timeout
-------------------
 1s
(1 row)

Timing is on.
psql:/home/ubuntu/proj/sqlsmithdata/repro1.sql:11: ERROR:  canceling statement due to statement timeout
Time: 1000.792 ms (00:01.001)
SET
Time: 0.093 ms
 statement_timeout
-------------------
 0
(1 row)

Time: 0.077 ms
 count
-------
     0
(1 row)

Time: 473.487 ms
 count
-------
     0
(1 row)

Time: 4726.628 ms (00:04.727)
 count
-------
     0
(1 row)

Time: 47231.271 ms (00:47.231)

=====================================================
commit - 80d690721973f6a031143a24a34b78a0225101a2
=====================================================
SQL repro script:

CREATE EXTENSION IF NOT EXISTS pg_trgm;
set statement_timeout = '1s';
show statement_timeout;
\timing on
select 1 where repeat('1.1',8) %>> 'Lorem ipsum dolor sit amet';
SELECT 1;
-- Check whether this change brought in any performance regressions
set statement_timeout='0';
show statement_timeout;
select COUNT(*) from generate_series(1,1) q(e) where repeat('1.1',1) %>> ('Lorem ipsum dolor sit amet'||e::text);
select COUNT(*) from generate_series(1,10) q(e) where repeat('1.1',1) %>> ('Lorem ipsum dolor sit amet'||e::text);
select COUNT(*) from generate_series(1,100) q(e) where repeat('1.1',1) %>> ('Lorem ipsum dolor sit amet'||e::text);

SQL script output:

CREATE EXTENSION
SET
 statement_timeout
-------------------
 1s
(1 row)

Timing is on.
 ?column?
----------
(0 rows)

Time: 29620.933 ms (00:29.621)
psql:/home/ubuntu/proj/sqlsmithdata/repro1.sql:11: ERROR:  canceling statement due to statement timeout
Time: 0.073 ms
SET
Time: 0.159 ms
 statement_timeout
-------------------
 0
(1 row)

Time: 0.100 ms
 count
-------
     0
(1 row)

Time: 473.449 ms
 count
-------
     0
(1 row)

Time: 4725.483 ms (00:04.725)
 count
-------
     0
(1 row)

Time: 47222.223 ms (00:47.222)

v1_cfi_iterate_word_similarity.patch
Description: Binary data
autoprewarm worker failing to load
Hi,

089480c077056 seems to have broken pg_prewarm. When pg_prewarm is added to shared_preload_libraries, each new connection results in thousands of errors such as this:

2022-07-27 04:25:14.325 UTC [2903955] LOG:  background worker "autoprewarm leader" (PID 2904146) exited with exit code 1
2022-07-27 04:25:14.325 UTC [2904148] ERROR:  could not find function "autoprewarm_main" in file "/home/ubuntu/proj/tempdel/lib/postgresql/pg_prewarm.so"

Checking pg_prewarm.so, the visibility of the function 'autoprewarm_main' switched from GLOBAL to LOCAL. Per [1], using PGDLLEXPORT makes it GLOBAL again, which appears to fix the issue:

Before commit (089480c077056):

ubuntu:~/proj/tempdel$ readelf -sW lib/postgresql/pg_prewarm.so | grep main
   103: 3d79   609 FUNC GLOBAL DEFAULT 14 autoprewarm_main
   109: 45ad   873 FUNC GLOBAL DEFAULT 14 autoprewarm_database_main
   128: 3d79   609 FUNC GLOBAL DEFAULT 14 autoprewarm_main
   187: 45ad   873 FUNC GLOBAL DEFAULT 14 autoprewarm_database_main

After commit (089480c077056):

    78: 2d79   609 FUNC LOCAL  DEFAULT 14 autoprewarm_main
    85: 35ad   873 FUNC LOCAL  DEFAULT 14 autoprewarm_database_main

After applying the attached fix:

   103: 3d79   609 FUNC GLOBAL DEFAULT 14 autoprewarm_main
    84: 45ad   873 FUNC LOCAL  DEFAULT 14 autoprewarm_database_main
   129: 3d79   609 FUNC GLOBAL DEFAULT 14 autoprewarm_main

Please let me know your thoughts on this approach.

[1] https://www.postgresql.org/message-id/A737B7A37273E048B164557ADEF4A58B5393038C%40ntex2010a.host.magwien.gv.at

diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c
index b2d6026093..ec619be9f2 100644
--- a/contrib/pg_prewarm/autoprewarm.c
+++ b/contrib/pg_prewarm/autoprewarm.c
@@ -82,7 +82,7 @@ typedef struct AutoPrewarmSharedState
 	int			prewarmed_blocks;
 } AutoPrewarmSharedState;
 
-void		autoprewarm_main(Datum main_arg);
+PGDLLEXPORT void autoprewarm_main(Datum main_arg);
 void		autoprewarm_database_main(Datum main_arg);
 
 PG_FUNCTION_INFO_V1(autoprewarm_start_worker);

- Robins Tharakan
Amazon Web Services
Re: 13dev failed assert: comparetup_index_btree(): ItemPointer values should never be equal
_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x55bfa18f0e38, qc=0x7ffcfa60ad20) at utility.c:526
#20 0x55bf9fc3180e in PortalRunUtility (portal=0x55bfa197d020, pstmt=0x55bfa18f0d48, isTopLevel=true, setHoldSnapshot=false, dest=0x55bfa18f0e38, qc=0x7ffcfa60ad20) at pquery.c:1158
#21 0x55bf9fc31a84 in PortalRunMulti (portal=0x55bfa197d020, isTopLevel=true, setHoldSnapshot=false, dest=0x55bfa18f0e38, altdest=0x55bfa18f0e38, qc=0x7ffcfa60ad20) at pquery.c:1315
#22 0x55bf9fc30ef1 in PortalRun (portal=0x55bfa197d020, count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x55bfa18f0e38, altdest=0x55bfa18f0e38, qc=0x7ffcfa60ad20) at pquery.c:791
#23 0x55bf9fc2a14f in exec_simple_query (query_string=0x55bfa18eff30 "REINDEX INDEX pg_class_tblspc_relfilenode_index;") at postgres.c:1250
#24 0x55bf9fc2ecdf in PostgresMain (dbname=0x55bfa1923be0 "postgres", username=0x55bfa18eb8f8 "ubuntu") at postgres.c:4544
#25 0x55bf9fb52e93 in BackendRun (port=0x55bfa19218a0) at postmaster.c:4504
#26 0x55bf9fb52778 in BackendStartup (port=0x55bfa19218a0) at postmaster.c:4232
#27 0x55bf9fb4ea5e in ServerLoop () at postmaster.c:1806
#28 0x55bf9fb4e1f7 in PostmasterMain (argc=3, argv=0x55bfa18e9830) at postmaster.c:1478
#29 0x55bf9fa3f864 in main (argc=3, argv=0x55bfa18e9830) at main.c:202

- Robins Tharakan
Amazon Web Services
Re: buildfarm instance bichir stuck
On Fri, 9 Apr 2021 at 16:12, Thomas Munro wrote:
> From your description it sounds like signals are not arriving at all,
> rather than some more complicated race. Let's go back to basics...
> what does the attached program print for you? I see:
>
> tmunro@x1:~/junk$ cc test-signalfd.c
> tmunro@x1:~/junk$ ./a.out
> blocking SIGURG...
> creating a signalfd to receive SIGURG...
> creating an epoll set...
> adding signalfd to epoll set...
> polling the epoll set... 0
> sending a signal...
> polling the epoll set... 1

I get pretty much the same. Some additional info below, although I'm not sure if it'd be of any help here.

robins@WSLv1:~/proj/hackers$ cc test-signalfd.c
robins@WSLv1:~/proj/hackers$ ./a.out
blocking SIGURG...
creating a signalfd to receive SIGURG...
creating an epoll set...
adding signalfd to epoll set...
polling the epoll set... 0
sending a signal...
polling the epoll set... 1

robins@WSLv1:~/proj/hackers$ cat /proc/cpuinfo | egrep 'flags|model' | sort -u
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave osxsave avx f16c rdrand lahf_lm abm 3dnowprefetch fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt ibrs ibpb stibp ssbd
model           : 142
model name      : Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz

robins@WSLv1:~/proj/hackers$ uname -a
Linux WSLv1 4.4.0-19041-Microsoft #488-Microsoft Mon Sep 01 13:43:00 PST 2020 x86_64 x86_64 x86_64 GNU/Linux

C:\>wsl -l -v
  NAME            STATE    VERSION
* Ubuntu-18.04    Running  1

- robins
Re: buildfarm instance bichir stuck
Thanks Andrew. The build's still running, but the CPPFLAGS hint does seem to have helped (see below). Unless advised otherwise, I intend to leave that option in place, so as to get bichir back online. If a future commit 'fixes' things, I could roll back this flag to test things out (or try other options if required).

On Wed, 7 Apr 2021 at 21:49, Andrew Dunstan wrote:
> On 4/7/21 2:16 AM, Thomas Munro wrote:
> > On Wed, Apr 7, 2021 at 5:44 PM Robins Tharakan wrote:
> >> Bichir's been stuck for the past month and is unable to run regression tests since 6a2a70a02018d6362f9841cc2f499cc45405e86b.
> > ... If it is indeed
> > something like that and not a bug in my code, then I was thinking that
> > the main tool available to deal with it would be to set WAIT_USE_POLL
> > in the relevant template file, so that we don't use the combination of
> > epoll + signalfd on illumos, but then WSL1 throws a spanner in the
> > works because AFAIK it's masquerading as Ubuntu, running PostgreSQL
> > from an Ubuntu package with a freaky kernel. Hmm.
>
> To test this the OP could just add
>   CPPFLAGS => '-DWAIT_USE_POLL',
> to his animal's config's config_env stanza.

This did help in getting past the previous hurdle.

postgres@WSLv1:/opt/postgres/bf/v11/buildroot/HEAD/bichir.lastrun-logs$ grep CPPFLAGS configure.log | grep using
configure: using CPPFLAGS=-DWAIT_USE_POLL -D_GNU_SOURCE -I/usr/include/libxml2
configure:19511: using CPPFLAGS=-DWAIT_USE_POLL -D_GNU_SOURCE -I/usr/include/libxml2

postgres@WSLv1:/opt/postgres/bf/v11/buildroot/HEAD/bichir.lastrun-logs$ grep -A2 "creating database" lastcommand.log
== creating database "regression" ==
CREATE DATABASE
ALTER DATABASE

- thanks
robins
Re: buildfarm instance bichir stuck
Hi Thomas,

Thanks for taking a look at this promptly.

On Wed, 7 Apr 2021 at 16:17, Thomas Munro wrote:
> On Wed, Apr 7, 2021 at 5:44 PM Robins Tharakan wrote:
> > It is interesting that that commit's a month old and probably no other client has complained since, but diving in, I can see that it's been unable to even start regression tests after that commit went in.
>
> Oh, well at least it's easily reproducible then, that's something!

Correct. This is easily reproducible on this test-instance, so let me know if you want me to test a patch.

> That's actually the client. I guess there is also a backend process
> stuck somewhere in epoll_wait()?

You're right (and yes, my bad, I was looking at the client). The server process is stuck in epoll_wait(). Let me know if you need me to give any other info that may be helpful.

root@WSLv1:~# gdb -batch -ex bt -p 29887
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x7fa087741a07 in epoll_wait (epfd=10, events=0x7fffcbcc5748, maxevents=maxevents@entry=1, timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
30      ../sysdeps/unix/sysv/linux/epoll_wait.c: No such file or directory.
#0  0x7fa087741a07 in epoll_wait (epfd=10, events=0x7fffcbcc5748, maxevents=maxevents@entry=1, timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x7fa088c355dc in WaitEventSetWaitBlock (nevents=1, occurred_events=0x7fffd2d4c090, cur_timeout=-1, set=0x7fffcbcc56e8) at latch.c:1428
#2  WaitEventSetWait (set=0x7fffcbcc56e8, timeout=timeout@entry=-1, occurred_events=occurred_events@entry=0x7fffd2d4c090, nevents=nevents@entry=1, wait_event_info=wait_ev
#3  0x7fa088c35a14 in WaitLatch (latch=, wakeEvents=wakeEvents@entry=33, timeout=timeout@entry=-1, wait_event_info=wait_event_info@entry=134217733) at
#4  0x7fa088c43ed8 in ConditionVariableTimedSleep (cv=0x7fa0873cc498, timeout=-1, wait_event_info=134217733) at condition_variable.c:163
#5  0x7fa088bba8bc in RequestCheckpoint (flags=flags@entry=44) at checkpointer.c:1017
#6  0x7fa088a46315 in createdb (pstate=pstate@entry=0x7fffcbcebbc0, stmt=stmt@entry=0x7fffcbcca558) at dbcommands.c:711
. . .

- robins
buildfarm instance bichir stuck
Hi,

Bichir's been stuck for the past month and has been unable to run regression tests since 6a2a70a02018d6362f9841cc2f499cc45405e86b. It is interesting that that commit's a month old and probably no other animal has complained since, but diving in, I can see that bichir has been unable to even start regression tests after that commit went in. Note that bichir is running on WSL1 (not WSL2) - i.e. Windows Subsystem for Linux inside Windows 10 - and so isn't really a production use-case.

The only run that actually got submitted to the buildfarm was from a few days back, when I killed it after a long wait - see [1]. Since yesterday, I have another run that's again stuck on CREATE DATABASE (see outputs below), and although pstack not working may be a limitation of the architecture / installation (unsure), a backtrace shows it is stuck in poll().

Tracing commits, it seems that commit 6a2a70a02018d6362f9841cc2f499cc45405e86b broke things, and I can confirm that 'make check' works if I roll back to the preceding commit (83709a0d5a46559db016c50ded1a95fd3b0d3be6).

Not sure if many would agree, but two things stood out here:

1) The buildfarm never got the message that a commit broke an instance. Ideally I'd have expected the buildfarm to have an optimistic timeout that could have helped - e.g. right now, the CREATE DATABASE has been stuck for 18 hrs.
2) bichir is clearly not a production use-case (it takes 5 hrs to complete a HEAD run!), so let me know if this change is intentional (I guess I'll stop maintaining it if so), but I thought I'd still put this out in case it interests someone.

- thanks
robins

Reference:
1) Last run that I had to kill - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bichir&dt=2021-03-31%2012%3A00%3A05

# The current run has been running since yesterday.
postgres@WSLv1:/opt/postgres/bf/v11/buildroot/HEAD/bichir.lastrun-logs$ tail -2 lastcommand.log
running on port 5678 with PID 8715
== creating database "regression" ==

postgres@WSLv1:/opt/postgres/bf/v11/buildroot/HEAD/bichir.lastrun-logs$ date
Wed Apr 7 12:48:26 AEST 2021

postgres@WSLv1:/opt/postgres/bf/v11/buildroot/HEAD/bichir.lastrun-logs$ ls -la
total 840
drwxrwxr-x 1 postgres postgres   4096 Apr  6 09:00 .
drwxrwxr-x 1 postgres postgres   4096 Apr  6 08:55 ..
-rw-rw-r-- 1 postgres postgres   1358 Apr  6 08:55 SCM-checkout.log
-rw-rw-r-- 1 postgres postgres  91546 Apr  6 08:56 configure.log
-rw-rw-r-- 1 postgres postgres     40 Apr  6 08:55 githead.log
-rw-rw-r-- 1 postgres postgres   2890 Apr  6 09:01 lastcommand.log
-rw-rw-r-- 1 postgres postgres 712306 Apr  6 09:00 make.log

root@WSLv1:~# pstack 8729
8729: psql -X -c CREATE DATABASE "regression" TEMPLATE=template0 LC_COLLATE='C' LC_CTYPE='C' postgres
pstack: Bad address
failed to read target.

root@WSLv1:~# gdb -batch -ex bt -p 8729
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x7f41a8ea4c84 in __GI___poll (fds=fds@entry=0x7fffe13d7be8, nfds=nfds@entry=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
29      ../sysdeps/unix/sysv/linux/poll.c: No such file or directory.
#0  0x7f41a8ea4c84 in __GI___poll (fds=fds@entry=0x7fffe13d7be8, nfds=nfds@entry=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x7f41a9bc8eb1 in poll (__timeout=, __nfds=1, __fds=0x7fffe13d7be8) at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
#2  pqSocketPoll (end_time=-1, forWrite=0, forRead=1, sock=) at fe-misc.c:1133
#3  pqSocketCheck (conn=0x7fffd979a0b0, forRead=1, forWrite=0, end_time=-1) at fe-misc.c:1075
#4  0x7f41a9bc8ff0 in pqWaitTimed (forRead=, forWrite=, conn=0x7fffd979a0b0, finish_time=) at fe-misc.c:1007
#5  0x7f41a9bc5ac9 in PQgetResult (conn=0x7fffd979a0b0) at fe-exec.c:1963
#6  0x7f41a9bc5ea3 in PQexecFinish (conn=0x7fffd979a0b0) at fe-exec.c:2306
#7  0x7f41a9bc5ef2 in PQexec (conn=, query=query@entry=0x7fffd9799f70 "CREATE DATABASE \"regression\" TEMPLATE=template0 LC_COLLATE='C' LC_CTYPE='C'") at fe-exec.c:2148
#8  0x7f41aa21e7a0 in SendQuery (query=0x7fffd9799f70 "CREATE DATABASE \"regression\" TEMPLATE=template0 LC_COLLATE='C' LC_CTYPE='C'") at common.c:1303
#9  0x7f41aa2160a6 in main (argc=, argv=) at startup.c:369

# Here we can see that 83709a0d5a46559db016c50ded1a95fd3b0d3be6 goes past 'CREATE DATABASE'
===
robins@WSLv1:~/proj/postgres/postgres$ git checkout 83709a0d5a46559db016c50ded1a95fd3b0d3be6
Previous HEAD position was 6a2a70a020 Use signalfd(2) for epoll latches.
HEAD is now at 83709a0d5a Use SIGURG rather than SIGUSR1 for latches.
robins@WSLv1:~/proj/postgres/postgres$ cd src/test/regress/
robins@WSLv1:~/proj/postgres/postgres/src/test/regress$ make -j4 NO_LOCALE=1 check
make -C ../../../src/backend generated-headers
rm -rf ./testtablespace
make[1]: Entering directory
Re: pg_upgrade failing for 200+ million Large Objects
Hi Magnus,

On Mon, 8 Mar 2021 at 23:34, Magnus Hagander wrote:
> AFAICT at a quick check, pg_dump in binary upgrade mode emits one
> lo_create() and one ALTER ... OWNER TO for each large object - so with
> 500M large objects that would be a billion statements, and thus a
> billion xids. And without checking, I'm fairly sure it doesn't load in
> a single transaction...

Your assumptions are pretty much correct. The issue isn't with pg_upgrade itself. During pg_restore, each large object (and, separately, each ALTER LARGE OBJECT ... OWNER TO) consumes an XID. For background, that's the reason the v9.5 production instance I was reviewing was unable to process more than 73 million large objects, since each object required a CREATE + an ALTER. (To clarify, 73 million = (2^31 - 2 billion magic constant - 1 million wraparound protection) / 2.)

> Without looking, I would guess it's the schema reload using
> pg_dump/pg_restore and not actually pg_upgrade itself. This is a known
> issue in pg_dump/pg_restore. And if that is the case -- perhaps just
> running all of those in a single transaction would be a better choice?
> One could argue it's still not a proper fix, because we'd still have a
> huge memory usage etc, but it would then only burn 1 xid instead of
> 500M...

(I hope I am not missing something, but) when I tried to force pg_restore to use a single transaction (by hacking pg_upgrade's pg_restore call to use --single-transaction), it too failed, owing to being unable to lock so many objects in a single transaction.

> This still seems to just fix the symptoms and not the actual problem.

I agree that the patch doesn't address the root cause, but it did get the upgrade to complete on a test setup. Do you think that (instead of all objects) batching multiple large objects in a single transaction (and allowing the caller to size that batch via the command line) would be a good / acceptable idea here?
> Please take a look at your email configuration -- all your emails are
> lacking both References and In-reply-to headers.

Thanks for highlighting the cause here. Hopefully switching mail clients would help.

- Robins Tharakan
Re: Brazil disables DST - 2019b update
On Fri, 12 Jul 2019 at 14:04, Michael Paquier wrote:
> On Fri, Jul 12, 2019 at 01:42:59PM +1000, Robins Tharakan wrote:
> So 2019b has been released on the 1st of July. Usually tzdata updates
> happen just before a minor release, so this would get pulled in at the
> beginning of August (https://www.postgresql.org/developer/roadmap/).
> Tom, I guess that would be again the intention here?
> --
> Michael

An August release does give a little more comfort. (I was expecting that the August date would get pushed out, since 11.4 was an emergency release at the end of June.)

- robins
Brazil disables DST - 2019b update
Hi,

The 2019b DST update [1] disables DST for Brazil, taking effect starting November 2019. The last tzdata update in Postgres was 2019a, in v11.3 (since the 2019b update came in after the most recent Postgres release). Since a ~3-month release cycle may be too close for some users, are there any plans for an early 11.5 (or are such occurrences not a candidate for an early release)?

Reference:
1) https://mm.icann.org/pipermail/tz-announce/2019-July/56.html

- robins
Re: Typo in recent commit
On 9 December 2017 at 16:11, Magnus Hagander wrote:
> Thanks, fixed for the report.

Thanks Magnus. However, although it was backpatched correctly, it looks like the fix on master missed the identity.out-related change.

Ref: https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/test/regress/expected/identity.out;h=ddc69505937811059aef5c41bc096bc7459cb41e;hb=d8f632caec3fcc5eece9d53d7510322f11489fe4#l359

- robins
Typo in recent commit
Hi,

Looks like a minor typo in the recent commit: s/identify/identity/

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=a2c6cf36608e10aa223fef49323b5feba344bfcf

- robins | mobile