Re: Why is parula failing?

2024-05-16 Thread Robins Tharakan
On Tue, 14 May 2024 at 08:55, David Rowley  wrote:

> I've not seen any recent failures from Parula that relate to this
> issue.  The last one seems to have been about 4 weeks ago.
>
> I'm now wondering if it's time to revert the debugging code added in
> 1db689715.  Does anyone think differently?
>

Thanks for keeping an eye on this. Sadly, the older machine was decommissioned,
and thus parula hasn't been sending results to the buildfarm for the past few
days.

I'll try to build a similar machine (but with a newer gcc etc.) and reopen this
thread if I hit something similar.
-
robins


Re: Why is parula failing?

2024-04-16 Thread Robins Tharakan
On Mon, 15 Apr 2024 at 16:02, Tom Lane  wrote:
> David Rowley  writes:
> > If GetNowFloat() somehow was returning a negative number then we could
> > end up with a large delay.  But if gettimeofday() was so badly broken
> > then wouldn't there be some evidence of this in the log timestamps on
> > failing runs?
>
> And indeed that too.  I'm finding the "compiler bug" theory
> palatable.  Robins mentioned having built the compiler from
> source, which theoretically should work, but maybe something
> went wrong?  Or it's missing some important bug fix?
>
> It might be interesting to back the animal's CFLAGS down
> to -O0 and see if things get more stable.

The last 25 consecutive runs have passed [1] after switching
REL_12_STABLE to -O0! So I am wondering whether that confirms that
the compiler version is to blame - and while we're still here,
is there anything else I could try?

If not, by Sunday, I am considering switching parula to gcc v12 (or even
v14 experimental - given that massasauga [2] has been pretty stable since
its upgrade a few days back).

Reference:
1.
https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=parula&br=REL_12_STABLE
2.
https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=massasauga&br=REL_12_STABLE
-
robins


Re: Why is parula failing?

2024-04-14 Thread Robins Tharakan
On Mon, 15 Apr 2024 at 14:55, David Rowley  wrote:
> If GetNowFloat() somehow was returning a negative number then we could
> end up with a large delay.  But if gettimeofday() was so badly broken
> then wouldn't there be some evidence of this in the log timestamps on
> failing runs?

3 things stand out for me here; unsure if they're related somehow:

1. The reltuples=48 issue (in essence, runs complete but a few tests fail)
2. SIGABRT - most of which are DDLs (runs complete, but the server crashes and
many tests fail)
3. pg_sleep() stuck (runs never complete, and IIUC these never get reported
to the buildfarm)

For #3, one thing I had done earlier (and then reverted) was to set
'wait_timeout' from its current undef to 2 hours. I'll set it to 2 hours
again, in the hope that #3 starts getting reported to the buildfarm too.

> I'm not that familiar with the buildfarm config, but I do see some
> Valgrind related setting in there. Is PostgreSQL running under
> Valgrind on these runs?

Not yet. I was tempted, but valgrind has not been enabled on
this member. IIUC it is disabled by default:

   'use_valgrind' => undef,

-
robins


Re: Why is parula failing?

2024-04-14 Thread Robins Tharakan
On Sun, 14 Apr 2024 at 00:12, Tom Lane  wrote:
> If we were only supposed to sleep 0.1 seconds, how is it waiting
> for 60 ms (and, presumably, repeating that)?  The logic in
> pg_sleep is pretty simple, and it's hard to think of anything except
> the system clock jumping (far) backwards that would make this
> happen.  Any chance of extracting the local variables from the
> pg_sleep stack frame?
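
For context, the loop being referred to - a simplified sketch paraphrasing
pg_sleep() in src/backend/utils/adt/misc.c, not the exact source - looks
roughly like this; if GetNowFloat() jumps backwards or returns garbage, the
computed delay stays positive and the loop just keeps waiting:

    float8  endtime = GetNowFloat() + secs;    /* GetNowFloat() ~ gettimeofday() as seconds */

    for (;;)
    {
        float8  delay;
        long    delay_ms;

        CHECK_FOR_INTERRUPTS();

        delay = endtime - GetNowFloat();
        if (delay >= 600.0)
            delay_ms = 600000;                  /* wait in chunks of at most 10 minutes */
        else if (delay > 0.0)
            delay_ms = (long) ceil(delay * 1000.0);
        else
            break;                              /* slept long enough */

        (void) WaitLatch(MyLatch,
                         WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                         delay_ms,
                         WAIT_EVENT_PG_SLEEP);
        ResetLatch(MyLatch);
    }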

- I now have 2 separate runs stuck on pg_sleep() - HEAD / REL_16_STABLE
- I'll keep them (stuck) for this week, in case there's more we can get
from them (and to see how long they take)
- Attached are 'bt full' outputs for both (b.txt - HEAD / a.txt -
REL_16_STABLE)

A few things to add:
- To reiterate, this instance has gcc v13.2 compiled without any special
flags (my first time ever, TBH). IIRC 'make -k check' came out okay,
so at this point I don't think I did anything obviously wrong when
building gcc from git.
- I installed gcc v14.0.1 experimental on massasauga (also an aarch64,
also built from git), and despite multiple runs it seems to be doing okay
[1].
- Next week (if I'm still scratching my head - and unless someone advises
otherwise), I'll upgrade parula to gcc 14 experimental to see if this is
about gcc maturity on Graviton (for some reason). I don't expect much to
come out of it though (given Tomas's testing on rpi5), but it doesn't hurt.

Ref:
1.
https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=massasauga&br=REL_12_STABLE

-
robins
[postgres@ip-172-31-18-25 ~]$ pstack 26147
#0  0xadeda954 in epoll_pwait () from /lib64/libc.so.6
#1  0x00842888 in WaitEventSetWaitBlock (nevents=1, 
occurred_events=, cur_timeout=60, set=0x3148fac0) at 
latch.c:1570
#2  WaitEventSetWait (set=0x3148fac0, timeout=timeout@entry=60, 
occurred_events=occurred_events@entry=0xd1194748, nevents=nevents@entry=1, 
wait_event_info=wait_event_info@entry=150994946) at latch.c:1516
#3  0x00842c44 in WaitLatch (latch=, 
wakeEvents=wakeEvents@entry=41, timeout=60, 
wait_event_info=wait_event_info@entry=150994946) at latch.c:538
#4  0x0090b7b4 in pg_sleep (fcinfo=) at misc.c:406
#5  0x00698430 in ExecInterpExpr (state=0x316a6040, 
econtext=0x316a5e38, isnull=) at execExprInterp.c:764
#6  0x006d0898 in ExecEvalExprSwitchContext (isNull=0xd11948bf, 
econtext=0x316a5e38, state=) at 
../../../src/include/executor/executor.h:356
#7  ExecProject (projInfo=) at 
../../../src/include/executor/executor.h:390
#8  ExecResult (pstate=) at nodeResult.c:135
#9  0x006b92ec in ExecProcNode (node=0x316a5d28) at 
../../../src/include/executor/executor.h:274
#10 gather_getnext (gatherstate=0x316a5b38) at nodeGather.c:287
#11 ExecGather (pstate=0x316a5b38) at nodeGather.c:222
#12 0x0069c36c in ExecProcNode (node=0x316a5b38) at 
../../../src/include/executor/executor.h:274
#13 ExecutePlan (execute_once=, dest=0x31641e90, 
direction=, numberTuples=0, sendTuples=, 
operation=CMD_SELECT, use_parallel_mode=, planstate=0x316a5b38, 
estate=0x316a5910) at execMain.c:1646
#14 standard_ExecutorRun (queryDesc=0x316459c0, direction=, 
count=0, execute_once=) at execMain.c:363
#15 0x00871564 in PortalRunSelect (portal=portal@entry=0x31512fb0, 
forward=forward@entry=true, count=0, count@entry=9223372036854775807, 
dest=dest@entry=0x31641e90) at pquery.c:924
#16 0x00872d80 in PortalRun (portal=portal@entry=0x31512fb0, 
count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, 
run_once=run_once@entry=true, dest=dest@entry=0x31641e90, 
altdest=altdest@entry=0x31641e90, qc=qc@entry=0xd1194c70) at pquery.c:768
#17 0x0086ea54 in exec_simple_query 
(query_string=query_string@entry=0x31493c90 "SELECT pg_sleep(0.1);") at 
postgres.c:1274
#18 0x0086f590 in PostgresMain (dbname=, 
username=) at postgres.c:4680
#19 0x0086ab20 in BackendMain (startup_data=, 
startup_data_len=) at backend_startup.c:105
#20 0x007c54d8 in postmaster_child_launch 
(child_type=child_type@entry=B_BACKEND, 
startup_data=startup_data@entry=0xd1195138 "", 
startup_data_len=startup_data_len@entry=4, 
client_sock=client_sock@entry=0xd1195140) at launch_backend.c:265
#21 0x007c8ec0 in BackendStartup (client_sock=0xd1195140) at 
postmaster.c:3593
#22 ServerLoop () at postmaster.c:1674
#23 0x007cab68 in PostmasterMain (argc=argc@entry=8, 
argv=argv@entry=0x3148f320) at postmaster.c:1372
#24 0x00496cb8 in main (argc=8, argv=0x3148f320) at main.c:197
[postgres@ip-172-31-18-25 ~]$


 2072 root   20   0  117M  4376  3192 S  0.0  0.0  0:00.00 │  └─ 
/usr/sbin/CROND -n
 2087 postgres   20   0 20988  6496  5504 S  0.0  0.0  0:00.00 │ ├─ 
/usr/sbin/sendmail -FCronDaemon -i -odi -oem -oi -t -f postgres
 2092 postgres   20   0 20960  6328  5336 S  0.0  0.0  0:00.00 │ │  
└─ /usr/sbin/postdrop -r
 2074 postgres   20   0  111M  2660  2488 S  0.0  0.0  0:00.00 │ └─ 
/bin/sh -c cd 

Re: Why is parula failing?

2024-04-13 Thread Robins Tharakan
On Mon, 8 Apr 2024 at 21:25, Robins Tharakan  wrote:
>
>
> I'll keep an eye on this instance more often for the next few days.
> (Let me know if I could capture more if a run gets stuck again)


HEAD is stuck again on pg_sleep(), no CPU for the past hour or so.
Stack trace seems to be similar to last time.


$ pstack 24930
#0  0xb8280954 in epoll_pwait () from /lib64/libc.so.6
#1  0x00843408 in WaitEventSetWaitBlock (nevents=1,
occurred_events=, cur_timeout=60, set=0x3b38dac0) at
latch.c:1570
#2  WaitEventSetWait (set=0x3b38dac0, timeout=timeout@entry=60,
occurred_events=occurred_events@entry=0xfd1d66c8, nevents=nevents@entry=1,
wait_event_info=wait_event_info@entry=150994946) at latch.c:1516
#3  0x008437c4 in WaitLatch (latch=,
wakeEvents=wakeEvents@entry=41, timeout=60,
wait_event_info=wait_event_info@entry=150994946) at latch.c:538
#4  0x0090c384 in pg_sleep (fcinfo=) at misc.c:406
#5  0x00699350 in ExecInterpExpr (state=0x3b5a41a0,
econtext=0x3b5a3f98, isnull=) at execExprInterp.c:764
#6  0x006d1668 in ExecEvalExprSwitchContext (isNull=0xfd1d683f,
econtext=0x3b5a3f98, state=) at
../../../src/include/executor/executor.h:356
#7  ExecProject (projInfo=) at
../../../src/include/executor/executor.h:390
#8  ExecResult (pstate=) at nodeResult.c:135
#9  0x006ba26c in ExecProcNode (node=0x3b5a3e88) at
../../../src/include/executor/executor.h:274
#10 gather_getnext (gatherstate=0x3b5a3c98) at nodeGather.c:287
#11 ExecGather (pstate=0x3b5a3c98) at nodeGather.c:222
#12 0x0069d28c in ExecProcNode (node=0x3b5a3c98) at
../../../src/include/executor/executor.h:274
#13 ExecutePlan (execute_once=, dest=0x3b5ae8e0,
direction=, numberTuples=0, sendTuples=,
operation=CMD_SELECT, use_parallel_mode=,
planstate=0x3b5a3c98, estate=0x3b5a3a70) at execMain.c:1646
#14 standard_ExecutorRun (queryDesc=0x3b59c250, direction=,
count=0, execute_once=) at execMain.c:363
#15 0x008720e4 in PortalRunSelect (portal=portal@entry=0x3b410fb0,
forward=forward@entry=true, count=0, count@entry=9223372036854775807,
dest=dest@entry=0x3b5ae8e0) at pquery.c:924
#16 0x00873900 in PortalRun (portal=portal@entry=0x3b410fb0,
count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true,
run_once=run_once@entry=true, dest=dest@entry=0x3b5ae8e0,
altdest=altdest@entry=0x3b5ae8e0, qc=qc@entry=0xfd1d6bf0) at
pquery.c:768
#17 0x0086f5d4 in exec_simple_query
(query_string=query_string@entry=0x3b391c90
"SELECT pg_sleep(0.1);") at postgres.c:1274
#18 0x00870110 in PostgresMain (dbname=,
username=) at postgres.c:4680
#19 0x0086b6a0 in BackendMain (startup_data=,
startup_data_len=) at backend_startup.c:105
#20 0x007c6268 in postmaster_child_launch
(child_type=child_type@entry=B_BACKEND,
startup_data=startup_data@entry=0xfd1d70b8
"", startup_data_len=startup_data_len@entry=4,
client_sock=client_sock@entry=0xfd1d70c0)
at launch_backend.c:265
#21 0x007c9c50 in BackendStartup (client_sock=0xfd1d70c0) at
postmaster.c:3593
#22 ServerLoop () at postmaster.c:1674
#23 0x007cb8f8 in PostmasterMain (argc=argc@entry=8,
argv=argv@entry=0x3b38d320)
at postmaster.c:1372
#24 0x00496e18 in main (argc=8, argv=0x3b38d320) at main.c:197



 CPU% MEM%   TIME+  Command
.
.
  0.0  0.0  0:00.00 │ └─ /bin/sh -c cd /opt/postgres/build-farm-14 &&
PATH=/opt/gcc/home/ec2-user/proj/gcc/target/bin/
  0.0  0.1  0:00.07 │└─ /usr/bin/perl ./run_build.pl
--config=build-farm.conf HEAD --verbose
  0.0  0.0  0:00.00 │   └─ sh -c { cd pgsql.build/src/test/regress
&& make NO_LOCALE=1 check; echo $? > /opt/postg
  0.0  0.0  0:00.00 │  └─ make NO_LOCALE=1 check
  0.0  0.0  0:00.00 │ └─ /bin/sh -c echo "# +++ regress
check in src/test/regress +++" && PATH="/opt/postg
  0.0  0.0  0:00.10 │└─
../../../src/test/regress/pg_regress --temp-instance=./tmp_check
--inputdir=.
  0.0  0.0  0:00.01 │   ├─ psql -X -a -q -d regression
-v HIDE_TABLEAM=on -v HIDE_TOAST_COMPRESSION=on
  0.0  0.1  0:02.64 │   └─ postgres -D
/opt/postgres/build-farm-14/buildroot/HEAD/pgsql.build/src/test
  0.0  0.2  0:00.05 │  ├─ postgres: postgres
regression [local] SELECT
  0.0  0.0  0:00.06 │  ├─ postgres: logical
replication launcher
  0.0  0.1  0:00.36 │  ├─ postgres: autovacuum
launcher
  0.0  0.1  0:00.34 │  ├─ postgres: walwriter
  0.0  0.0  0:00.32 │  ├─ postgres: background
writer
  0.0  0.3  0:00.05 │  └─ postgres: checkpointer

-
robins

>


Re: Why is parula failing?

2024-04-13 Thread Robins Tharakan
On Wed, 10 Apr 2024 at 10:24, David Rowley  wrote:
>
> Master failed today for the first time since the compiler upgrade.
> Again reltuples == 48.

Here's what I can add from the past few days:
- Almost all failures are either reltuples=48 or SIGABRTs
- Almost all SIGABRTs are DDLs - CREATE INDEX / CREATE AGGREGATE / CTAS
  - A little too coincidental? The recent crashes have stack traces, if
anyone's interested.

Barring the initial failures (during the move to gcc 13.2), in the past week:
- v15 somehow hasn't had a failure yet
- v14 / v16 have had only 1 failure each
- but v12 / v13 are lit up - they have failed multiple times.

-
robins


Re: Why is parula failing?

2024-04-09 Thread Robins Tharakan
On Wed, 10 Apr 2024 at 10:24, David Rowley  wrote:
> Master failed today for the first time since the compiler upgrade.
> Again reltuples == 48.

From the buildfarm members page, parula seems to be the only aarch64 + gcc 13.2
combination today, so I suspect this may be about gcc v13.2
maturity on aarch64.

I'll try to upgrade one of the other aarch64s I have (massasauga or
snakefly) and see if this is more about gcc 13.2 maturity on this
architecture.
-
robins


Re: Why is parula failing?

2024-04-08 Thread Robins Tharakan
On Tue, 2 Apr 2024 at 15:01, Tom Lane  wrote:
> "Tharakan, Robins"  writes:
> > So although HEAD ran fine, I saw multiple failures (v12, v13, v16),
> > all of which passed on subsequent tries,
> > some of which were even "signal 6: Aborted".
>
> Ugh...


parula didn't send any reports to the buildfarm for the past 44 hours. I logged
in to find that postgres was stuck on pg_sleep(), which was quite odd! I
captured the backtrace and triggered another run on HEAD, which came out okay.

I'll keep an eye on this instance more often for the next few days.
(Let me know if I could capture more if a run gets stuck again)


(gdb) bt
#0  0x952ae954 in epoll_pwait () from /lib64/libc.so.6
#1  0x0083e9c8 in WaitEventSetWaitBlock (nevents=1,
occurred_events=, cur_timeout=297992, set=0x2816dac0) at
latch.c:1570
#2  WaitEventSetWait (set=0x2816dac0, timeout=timeout@entry=60,
occurred_events=occurred_events@entry=0xc395ed28, nevents=nevents@entry=1,
wait_event_info=wait_event_info@entry=150994946) at latch.c:1516
#3  0x0083ed84 in WaitLatch (latch=,
wakeEvents=wakeEvents@entry=41, timeout=60,
wait_event_info=wait_event_info@entry=150994946) at latch.c:538
#4  0x00907404 in pg_sleep (fcinfo=) at misc.c:406
#5  0x00696b10 in ExecInterpExpr (state=0x28384040,
econtext=0x28383e38, isnull=) at execExprInterp.c:764
#6  0x006ceef8 in ExecEvalExprSwitchContext (isNull=0xc395ee9f,
econtext=0x28383e38, state=) at
../../../src/include/executor/executor.h:356
#7  ExecProject (projInfo=) at
../../../src/include/executor/executor.h:390
#8  ExecResult (pstate=) at nodeResult.c:135
#9  0x006b7aec in ExecProcNode (node=0x28383d28) at
../../../src/include/executor/executor.h:274
#10 gather_getnext (gatherstate=0x28383b38) at nodeGather.c:287
#11 ExecGather (pstate=0x28383b38) at nodeGather.c:222
#12 0x0069aa4c in ExecProcNode (node=0x28383b38) at
../../../src/include/executor/executor.h:274
#13 ExecutePlan (execute_once=, dest=0x2831ffb0,
direction=, numberTuples=0, sendTuples=,
operation=CMD_SELECT, use_parallel_mode=,
planstate=0x28383b38, estate=0x28383910) at execMain.c:1646
#14 standard_ExecutorRun (queryDesc=0x283239c0, direction=,
count=0, execute_once=) at execMain.c:363
#15 0x0086d454 in PortalRunSelect (portal=portal@entry=0x281f0fb0,
forward=forward@entry=true, count=0, count@entry=9223372036854775807,
dest=dest@entry=0x2831ffb0) at pquery.c:924
#16 0x0086ec70 in PortalRun (portal=portal@entry=0x281f0fb0,
count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true,
run_once=run_once@entry=true, dest=dest@entry=0x2831ffb0,
altdest=altdest@entry=0x2831ffb0, qc=qc@entry=0xc395f250) at
pquery.c:768
#17 0x0086a944 in exec_simple_query
(query_string=query_string@entry=0x28171c90
"SELECT pg_sleep(0.1);") at postgres.c:1274
#18 0x0086b480 in PostgresMain (dbname=,
username=) at postgres.c:4680
#19 0x00866a0c in BackendMain (startup_data=,
startup_data_len=) at backend_startup.c:101
#20 0x007c1738 in postmaster_child_launch
(child_type=child_type@entry=B_BACKEND,
startup_data=startup_data@entry=0xc395f718
"", startup_data_len=startup_data_len@entry=4,
client_sock=client_sock@entry=0xc395f720)
at launch_backend.c:265
#21 0x007c5120 in BackendStartup (client_sock=0xc395f720) at
postmaster.c:3593
#22 ServerLoop () at postmaster.c:1674
#23 0x007c6dc8 in PostmasterMain (argc=argc@entry=8,
argv=argv@entry=0x2816d320)
at postmaster.c:1372
#24 0x00496bb8 in main (argc=8, argv=0x2816d320) at main.c:197


>
> The update_personality.pl script in the buildfarm client distro
> is what to use to adjust OS version or compiler version data.
>
Thanks. Fixed that.

-
robins


Re: pg_upgrade failing for 200+ million Large Objects

2023-12-28 Thread Robins Tharakan
On Thu, 28 Dec 2023 at 01:48, Tom Lane  wrote:

> Robins Tharakan  writes:
> > Applying all 4 patches, I also see good performance improvement.
> > With more Large Objects, although pg_dump improved significantly,
> > pg_restore is now comfortably an order of magnitude faster.
>
> Yeah.  The key thing here is that pg_dump can only parallelize
> the data transfer, while (with 0004) pg_restore can parallelize
> large object creation and owner-setting as well as data transfer.
> I don't see any simple way to improve that on the dump side,
> but I'm not sure we need to.  Zillions of empty objects is not
> really the use case to worry about.  I suspect that a more realistic
> case with moderate amounts of data in the blobs would make pg_dump
> look better.
>


Thanks for elaborating, and yes pg_dump times do reflect that
expectation.

The first test involved a fixed number (32k) of
Large Objects (LOs) with varying sizes - I chose that number
intentionally since this was being tested on a 32vCPU instance
and the patch employs 1k batches.


We again see that pg_restore is an order of magnitude faster.

 LO Size (bytes)   restore-HEAD   restore-patched   improvement (Nx)
               1         24.182               1.4                17x
              10         24.741               1.5                17x
             100         24.574               1.6                15x
           1,000         25.314               1.7                15x
          10,000         25.644               1.7                15x
         100,000         50.046               4.3                12x
       1,000,000        281.549              30.0                 9x


pg_dump also sees improvements. Really small sized LOs
see a decent ~20% improvement which grows considerably as LOs
get bigger (beyond ~10-100kb).


 LO Size (bytes)   dump-HEAD   dump-patched   improvement (%)
               1        12.9           10.7               18%
              10        12.9           10.4               19%
             100        12.8           10.3               20%
           1,000        13.0           10.3               21%
          10,000        14.2           10.3               27%
         100,000        32.8           11.5               65%
       1,000,000       211.8           23.6               89%


To test pg_restore scaling, 1 million LOs (100kb each)
were created and pg_restore times tested at increasing
concurrency (on a 192vCPU instance). We see a major speedup
up to -j64, and the best time was at -j96, after which
performance decreases slowly - see attached image.

Concurrency   pg_restore-patched
384  75.87
352  75.63
320  72.11
288  70.05
256  70.98
224  66.98
192  63.04
160  61.37
128  58.82
 96  58.55
 64  60.46
 32  77.29
 16 115.51
  8 203.48
  4 366.33



Test details:
- Command used to generate SQL - create 1k LOs of 1kb each
  - echo "SELECT lo_from_bytea(0, '\x`  printf 'ff%.0s' {1..1000}`') FROM
generate_series(1,1000);" > /tmp/tempdel
- Verify the LO size: select pg_column_size(lo_get(oid));
- Only GUC changed: max_connections=1000 (for the last test)

-
Robins Tharakan
Amazon Web Services


Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier

2023-01-30 Thread Robins Tharakan
x55d919752fb0)
at fe-connect.c:4112
#5  0x7f96da543d55 in PQfinish (conn=0x55d919752fb0) at fe-connect.c:4134
#6  0x7f96d9ebd42b in libpqsrv_disconnect (conn=0x55d919752fb0)
at ../../src/include/libpq/libpq-be-fe-helpers.h:117
#7  0x7f96d9ebddf1 in dblink_disconnect (fcinfo=0x55d91f2692a8)
at dblink.c:357





Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x7f5f6b632859 in __GI_abort () at abort.c:79
#2  0x7f5f6b69d26e in __libc_message (action=action@entry=do_abort,
fmt=fmt@entry=0x7f5f6b7c7298 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#3  0x7f5f6b6a52fc in malloc_printerr (
str=str@entry=0x7f5f6b7c91e0 "munmap_chunk(): invalid pointer")
at malloc.c:5347
#4  0x7f5f6b6a554c in munmap_chunk (p=) at malloc.c:2830
#5  0x7f5f50085efd in pqDropConnection (conn=0x55d12ebcd100,
flushInput=true) at fe-connect.c:495
#6  0x7f5f5008bcb3 in closePGconn (conn=0x55d12ebcd100)
at fe-connect.c:4112
#7  0x7f5f5008bd55 in PQfinish (conn=0x55d12ebcd100) at fe-connect.c:4134
#8  0x7f5f5006c42b in libpqsrv_disconnect (conn=0x55d12ebcd100)
at ../../src/include/libpq/libpq-be-fe-helpers.h:117




Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x7f5f6b632859 in __GI_abort () at abort.c:79
#2  0x7f5f6b69d26e in __libc_message (action=action@entry=do_abort,
fmt=fmt@entry=0x7f5f6b7c7298 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#3  0x7f5f6b6a52fc in malloc_printerr (
str=str@entry=0x7f5f6b7c54c1 "free(): invalid pointer") at malloc.c:5347
#4  0x7f5f6b6a6b2c in _int_free (av=, p=,
have_lock=0) at malloc.c:4173
#5  0x7f5f500fe6ed in freePGconn (conn=0x55d142273000)
at fe-connect.c:3977
#6  0x7f5f500fed61 in PQfinish (conn=0x55d142273000) at fe-connect.c:4135
#7  0x7f5f501de42b in libpqsrv_disconnect (conn=0x55d142273000)
at ../../src/include/libpq/libpq-be-fe-helpers.h:117
#8  0x7f5f501dedf1 in dblink_disconnect (fcinfo=0x55d1527998f8)
at dblink.c:357





Core was generated by `postgres: e4602483e9@(HEAD detached at
e4602483e9)@sqith: u73 postgres 127.0.0.'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  __GI___libc_realloc (oldmem=0x7f7f7f7f7f7f7f7f, bytes=2139070335)
at malloc.c:3154
#0  __GI___libc_realloc (oldmem=0x7f7f7f7f7f7f7f7f, bytes=2139070335)
at malloc.c:3154
#1  0x7fb7bc0a580a in pqCheckOutBufferSpace (bytes_needed=2139062148,
conn=0x55b191aa9380) at fe-misc.c:329
#2  0x7fb7bc0a5b1c in pqPutMsgStart (msg_type=88 'X', conn=0x55b191aa9380)
at fe-misc.c:476
#3  0x7fb7bc097c60 in sendTerminateConn (conn=0x55b191aa9380)
at fe-connect.c:4076
#4  0x7fb7bc097c97 in closePGconn (conn=0x55b191aa9380)
at fe-connect.c:4096
#5  0x7fb7bc097d55 in PQfinish (conn=0x55b191aa9380) at fe-connect.c:4134
#6  0x7fb7bc14a42b in libpqsrv_disconnect (conn=0x55b191aa9380)
at ../../src/include/libpq/libpq-be-fe-helpers.h:117
#7  0x00007fb7bc14adf1 in dblink_disconnect (fcinfo=0x55b193894f00)
at dblink.c:357



Thanks to SQLSmith for helping with this find.

-
Robins Tharakan
Amazon Web Services




Missing CFI in iterate_word_similarity()

2022-08-01 Thread Robins Tharakan
Hi,

For long strings, iterate_word_similarity() can run into long-running
tight loops without honouring interrupts or statement_timeout. For
example:

postgres=# set statement_timeout='1s';
SET
postgres=# select 1 where repeat('1.1',8) %>> 'Lorem ipsum dolor sit amet';
?column?
--
(0 rows)
Time: 29615.842 ms (00:29.616)

The associated perf report:

+ 99.98% 0.00% postgres postgres [.] ExecQual
+ 99.98% 0.00% postgres postgres [.] ExecEvalExprSwitchContext
+ 99.98% 0.00% postgres pg_trgm.so [.] strict_word_similarity_commutator_op
+ 99.98% 0.00% postgres pg_trgm.so [.] calc_word_similarity
+ 99.68% 99.47% postgres pg_trgm.so [.] iterate_word_similarity
0.21% 0.03% postgres postgres [.] pg_qsort
0.16% 0.00% postgres [kernel.kallsyms] [k] asm_sysvec_apic_timer_interrupt
0.16% 0.00% postgres [kernel.kallsyms] [k] sysvec_apic_timer_interrupt
0.16% 0.11% postgres [kernel.kallsyms] [k] __softirqentry_text_start
0.16% 0.00% postgres [kernel.kallsyms] [k] irq_exit_rcu

Adding CHECK_FOR_INTERRUPTS() ensures that such queries respond to
statement_timeout and Ctrl-C. With the patch applied, the
above query is interrupted promptly:

postgres=# select 1 where repeat('1.1',8) %>> 'Lorem ipsum dolor sit amet';
ERROR: canceling statement due to statement timeout
Time: 1000.768 ms (00:01.001)
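
The change itself is conceptually tiny - a CHECK_FOR_INTERRUPTS() inside the
similarity loop. A rough sketch of the idea (the exact loop and placement in
the attached patch may differ):

    #include "miscadmin.h"              /* for CHECK_FOR_INTERRUPTS() */

    /* inside iterate_word_similarity() in contrib/pg_trgm/trgm_op.c */
    for (i = 0; i < len2; i++)
    {
        /* newly added: lets long similarity computations notice
         * statement_timeout and query-cancel requests */
        CHECK_FOR_INTERRUPTS();

        /* ... existing per-position similarity bookkeeping ... */
    }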

Please find the patch attached. The patch does not show any performance
regressions when run against the above use-case. Thanks to SQLSmith
for indirectly leading me to this scenario.

-
Robins Tharakan
Amazon Web Services
Patch applied to commit - 80d690721973f6a031143a24a34b78a0225101a2

SQL repro script


CREATE EXTENSION IF NOT EXISTS pg_trgm;
set statement_timeout = '1s';
show statement_timeout;
\timing on

select 1 where repeat('1.1',8) %>> 'Lorem ipsum dolor sit amet';


-- Check whether this change brought in any performance regressions
set statement_timeout='0';
show statement_timeout;

select COUNT(*) from generate_series(1,1) q(e) where repeat('1.1',1) %>> 
('Lorem ipsum dolor sit amet'||e::text);
select COUNT(*) from generate_series(1,10) q(e) where repeat('1.1',1) %>> 
('Lorem ipsum dolor sit amet'||e::text);
select COUNT(*) from generate_series(1,100) q(e) where repeat('1.1',1) %>> 
('Lorem ipsum dolor sit amet'||e::text);



SQL script output
=
CREATE EXTENSION
SET
 statement_timeout 
---
 1s
(1 row)
Timing is on.

psql:/home/ubuntu/proj/sqlsmithdata/repro1.sql:11: ERROR:  canceling statement 
due to statement timeout
Time: 1000.792 ms (00:01.001)



SET
Time: 0.093 ms
 statement_timeout 
---
 0
(1 row)

Time: 0.077 ms
 count 
---
 0
(1 row)

Time: 473.487 ms
 count 
---
 0
(1 row)

Time: 4726.628 ms (00:04.727)
 count 
---
 0
(1 row)

Time: 47231.271 ms (00:47.231)
commit - 80d690721973f6a031143a24a34b78a0225101a2

SQL repro script


CREATE EXTENSION IF NOT EXISTS pg_trgm;
set statement_timeout = '1s';
show statement_timeout;
\timing on

select 1 where repeat('1.1',8) %>> 'Lorem ipsum dolor sit amet';SELECT 1;


-- Check whether this change brought in any performance regressions
set statement_timeout='0';
show statement_timeout;

select COUNT(*) from generate_series(1,1) q(e) where repeat('1.1',1) %>> 
('Lorem ipsum dolor sit amet'||e::text);
select COUNT(*) from generate_series(1,10) q(e) where repeat('1.1',1) %>> 
('Lorem ipsum dolor sit amet'||e::text);
select COUNT(*) from generate_series(1,100) q(e) where repeat('1.1',1) %>> 
('Lorem ipsum dolor sit amet'||e::text);



SQL script output
=
CREATE EXTENSION
SET
 statement_timeout 
---
 1s
(1 row)

Timing is on.

 ?column? 
--
(0 rows)

Time: 29620.933 ms (00:29.621)
psql:/home/ubuntu/proj/sqlsmithdata/repro1.sql:11: ERROR:  canceling statement 
due to statement timeout
Time: 0.073 ms


SET
Time: 0.159 ms
 statement_timeout 
---
 0
(1 row)

Time: 0.100 ms
 count 
---
 0
(1 row)

Time: 473.449 ms
 count 
---
 0
(1 row)

Time: 4725.483 ms (00:04.725)
 count 
---
 0
(1 row)

Time: 47222.223 ms (00:47.222)


v1_cfi_iterate_word_similarity.patch
Description: Binary data


autoprewarm worker failing to load

2022-07-27 Thread Robins Tharakan
Hi,

089480c077056 seems to have broken pg_prewarm. When pg_prewarm
is added to shared_preload_libraries, each new connection results in
thousands of errors such as this:


2022-07-27 04:25:14.325 UTC [2903955] LOG: background worker
"autoprewarm leader" (PID 2904146) exited with exit code 1
2022-07-27 04:25:14.325 UTC [2904148] ERROR: could not find function
"autoprewarm_main" in file
"/home/ubuntu/proj/tempdel/lib/postgresql/pg_prewarm.so"

Checking pg_prewarm.so, the visibility of the function 'autoprewarm_main'
has switched from GLOBAL to LOCAL. Per [1], using PGDLLEXPORT
makes it GLOBAL again, which appears to fix the issue:

Before commit (089480c077056) -
ubuntu:~/proj/tempdel$ readelf -sW lib/postgresql/pg_prewarm.so | grep main
103: 3d79 609 FUNC GLOBAL DEFAULT 14 autoprewarm_main
109: 45ad 873 FUNC GLOBAL DEFAULT 14 autoprewarm_database_main
128: 3d79 609 FUNC GLOBAL DEFAULT 14 autoprewarm_main
187: 45ad 873 FUNC GLOBAL DEFAULT 14 autoprewarm_database_main

After commit (089480c077056) -
78: 2d79 609 FUNC LOCAL DEFAULT 14 autoprewarm_main
85: 35ad 873 FUNC LOCAL DEFAULT 14 autoprewarm_database_main

After applying the attached fix:
103: 3d79 609 FUNC GLOBAL DEFAULT 14 autoprewarm_main
84: 45ad 873 FUNC LOCAL DEFAULT 14 autoprewarm_database_main
129: 3d79 609 FUNC GLOBAL DEFAULT 14 autoprewarm_main
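
For context on why the visibility matters: the autoprewarm worker is
registered by name, and the postmaster later dlopen()s pg_prewarm.so and
resolves "autoprewarm_main" via dlsym() - which only finds exported (GLOBAL)
symbols. A simplified sketch of that registration (illustrative only, not
the actual pg_prewarm source):

    #include "postgres.h"
    #include "postmaster/bgworker.h"

    void
    _PG_init(void)
    {
        BackgroundWorker worker;

        memset(&worker, 0, sizeof(worker));
        worker.bgw_flags = BGWORKER_SHMEM_ACCESS;
        worker.bgw_start_time = BgWorkerStart_ConsistentState;

        /* looked up later via dlopen()/dlsym(), hence the need for
         * GLOBAL visibility on autoprewarm_main */
        snprintf(worker.bgw_library_name, BGW_MAXLEN, "pg_prewarm");
        snprintf(worker.bgw_function_name, BGW_MAXLEN, "autoprewarm_main");
        snprintf(worker.bgw_name, BGW_MAXLEN, "autoprewarm leader");

        RegisterBackgroundWorker(&worker);
    }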


Please let me know your thoughts on this approach.

[1] 
https://www.postgresql.org/message-id/A737B7A37273E048B164557ADEF4A58B5393038C%40ntex2010a.host.magwien.gv.at

diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c
index b2d6026093..ec619be9f2 100644
--- a/contrib/pg_prewarm/autoprewarm.c
+++ b/contrib/pg_prewarm/autoprewarm.c
@@ -82,7 +82,7 @@ typedef struct AutoPrewarmSharedState
int prewarmed_blocks;
} AutoPrewarmSharedState;

-void autoprewarm_main(Datum main_arg);
+PGDLLEXPORT void autoprewarm_main(Datum main_arg);
void autoprewarm_database_main(Datum main_arg);

PG_FUNCTION_INFO_V1(autoprewarm_start_worker);

-
Robins Tharakan
Amazon Web Services




Re: 13dev failed assert: comparetup_index_btree(): ItemPointer values should never be equal

2022-06-29 Thread Robins Tharakan
_TOPLEVEL, params=0x0, queryEnv=0x0,
dest=0x55bfa18f0e38, qc=0x7ffcfa60ad20) at utility.c:526
#20 0x55bf9fc3180e in PortalRunUtility (portal=0x55bfa197d020,
pstmt=0x55bfa18f0d48, isTopLevel=true, setHoldSnapshot=false,
dest=0x55bfa18f0e38, qc=0x7ffcfa60ad20) at pquery.c:1158
#21 0x55bf9fc31a84 in PortalRunMulti (portal=0x55bfa197d020,
isTopLevel=true, setHoldSnapshot=false, dest=0x55bfa18f0e38,
altdest=0x55bfa18f0e38, qc=0x7ffcfa60ad20) at pquery.c:1315
#22 0x55bf9fc30ef1 in PortalRun (portal=0x55bfa197d020,
count=9223372036854775807, isTopLevel=true, run_once=true,
dest=0x55bfa18f0e38, altdest=0x55bfa18f0e38, qc=0x7ffcfa60ad20)
at pquery.c:791
#23 0x55bf9fc2a14f in exec_simple_query
(query_string=0x55bfa18eff30 "REINDEX INDEX
pg_class_tblspc_relfilenode_index;") at postgres.c:1250
#24 0x55bf9fc2ecdf in PostgresMain (dbname=0x55bfa1923be0
"postgres", username=0x55bfa18eb8f8 "ubuntu") at postgres.c:4544
#25 0x55bf9fb52e93 in BackendRun (port=0x55bfa19218a0) at postmaster.c:4504
#26 0x55bf9fb52778 in BackendStartup (port=0x55bfa19218a0) at
postmaster.c:4232
#27 0x55bf9fb4ea5e in ServerLoop () at postmaster.c:1806
#28 0x55bf9fb4e1f7 in PostmasterMain (argc=3, argv=0x55bfa18e9830)
at postmaster.c:1478
#29 0x55bf9fa3f864 in main (argc=3, argv=0x55bfa18e9830) at main.c:202

-
Robins Tharakan
Amazon Web Services




Re: buildfarm instance bichir stuck

2021-04-09 Thread Robins Tharakan
On Fri, 9 Apr 2021 at 16:12, Thomas Munro  wrote:
> From your description it sounds like signals are not arriving at all,
> rather than some more complicated race.  Let's go back to basics...
> what does the attached program print for you?  I see:
>
> tmunro@x1:~/junk$ cc test-signalfd.c
> tmunro@x1:~/junk$ ./a.out
> blocking SIGURG...
> creating a signalfd to receive SIGURG...
> creating an epoll set...
> adding signalfd to epoll set...
> polling the epoll set... 0
> sending a signal...
> polling the epoll set... 1


I get pretty much the same. Some additional info below, although not sure
if it'd be of any help here.

robins@WSLv1:~/proj/hackers$ cc test-signalfd.c

robins@WSLv1:~/proj/hackers$ ./a.out
blocking SIGURG...
creating a signalfd to receive SIGURG...
creating an epoll set...
adding signalfd to epoll set...
polling the epoll set... 0
sending a signal...
polling the epoll set... 1
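
For reference, a minimal sketch along the lines of the test described above
(the same sequence of steps, though not Thomas's actual test-signalfd.c, and
with error checking omitted):

    #include <stdio.h>
    #include <signal.h>
    #include <unistd.h>
    #include <sys/epoll.h>
    #include <sys/signalfd.h>

    int
    main(void)
    {
        sigset_t    mask;
        int         sfd, epfd, n;
        struct epoll_event ev = {0}, out;

        printf("blocking SIGURG...\n");
        sigemptyset(&mask);
        sigaddset(&mask, SIGURG);
        sigprocmask(SIG_BLOCK, &mask, NULL);

        printf("creating a signalfd to receive SIGURG...\n");
        sfd = signalfd(-1, &mask, 0);

        printf("creating an epoll set...\n");
        epfd = epoll_create1(0);

        printf("adding signalfd to epoll set...\n");
        ev.events = EPOLLIN;
        ev.data.fd = sfd;
        epoll_ctl(epfd, EPOLL_CTL_ADD, sfd, &ev);

        /* nothing pending yet, so this should report 0 */
        n = epoll_wait(epfd, &out, 1, 0);
        printf("polling the epoll set... %d\n", n);

        printf("sending a signal...\n");
        kill(getpid(), SIGURG);

        /* the blocked-but-pending SIGURG should now show up via the signalfd */
        n = epoll_wait(epfd, &out, 1, 0);
        printf("polling the epoll set... %d\n", n);

        return 0;
    }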

robins@WSLv1:~/proj/hackers$ cat /proc/cpuinfo | egrep 'flags|model' | sort
-u
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
pdpe1gb rdtscp lm pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3
fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt
tsc_deadline_timer aes xsave osxsave avx f16c rdrand lahf_lm abm
3dnowprefetch fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm
mpx rdseed adx smap clflushopt intel_pt ibrs ibpb stibp ssbd
model   : 142
model name  : Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz

robins@WSLv1:~/proj/hackers$ uname -a
Linux WSLv1 4.4.0-19041-Microsoft #488-Microsoft Mon Sep 01 13:43:00 PST
2020 x86_64 x86_64 x86_64 GNU/Linux

C:>wsl -l -v
  NAMESTATE   VERSION
* Ubuntu-18.04Running 1

-
robins


Re: buildfarm instance bichir stuck

2021-04-07 Thread Robins Tharakan
Thanks Andrew.

The build's still running, but the CPPFLAGS hint does seem to have helped
(see below).

Unless advised otherwise, I intend to leave that option in place, so as to get
bichir back online. If a future commit 'fixes' things, I could roll back
this flag to test things out (or try out other options if required).


On Wed, 7 Apr 2021 at 21:49, Andrew Dunstan  wrote:
> On 4/7/21 2:16 AM, Thomas Munro wrote:
> > On Wed, Apr 7, 2021 at 5:44 PM Robins Tharakan 
wrote:
> >> Bichir's been stuck for the past month and is unable to run regression
tests since 6a2a70a02018d6362f9841cc2f499cc45405e86b.
> > ...If it is indeed
> > something like that and not a bug in my code, then I was thinking that
> > the main tool available to deal with it would be to set WAIT_USE_POLL
> > in the relevant template file, so that we don't use the combination of
> > epoll + signalfd on illlumos, but then WSL1 thows a spanner in the
> > works because AFAIK it's masquerading as Ubuntu, running PostgreSQL
> > from an Ubuntu package with a freaky kernel.  Hmm.
> To test this the OP could just add
> CPPFLAGS => '-DWAIT_USE_POLL',
> to his animal's config's config_env stanza.

This did help in getting past the previous hurdle.

postgres@WSLv1:/opt/postgres/bf/v11/buildroot/HEAD/bichir.lastrun-logs$
grep CPPFLAGS configure.log| grep using
configure: using CPPFLAGS=-DWAIT_USE_POLL -D_GNU_SOURCE
-I/usr/include/libxml2
configure:19511: using CPPFLAGS=-DWAIT_USE_POLL -D_GNU_SOURCE
-I/usr/include/libxml2

postgres@WSLv1:/opt/postgres/bf/v11/buildroot/HEAD/bichir.lastrun-logs$
grep -A2 "creating database" lastcommand.log
== creating database "regression" ==
CREATE DATABASE
ALTER DATABASE

-
thanks
robins


Re: buildfarm instance bichir stuck

2021-04-07 Thread Robins Tharakan
Hi Thomas,

Thanks for taking a look at this promptly.


On Wed, 7 Apr 2021 at 16:17, Thomas Munro  wrote:
> On Wed, Apr 7, 2021 at 5:44 PM Robins Tharakan  wrote:
> > It is interesting that that commit's a month old and probably no other
client has complained since, but diving in, I can see that it's been unable
to even start regression tests after that commit went in.
>
> Oh, well at least it's easily reproducible then, that's something!

Correct. This is easily reproducible on this test-instance, so let me know
if you want me to test a patch.


>
> That's actually the client.  I guess there is also a backend process
> stuck somewhere in epoll_wait()?

You're right (and yes my bad, I was looking at the client). The server
process is stuck in epoll_wait(). Let me know if you need me to give any
other info that may be helpful.


root@WSLv1:~# gdb -batch -ex bt -p 29887
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x7fa087741a07 in epoll_wait (epfd=10, events=0x7fffcbcc5748,
maxevents=maxevents@entry=1, timeout=timeout@entry=-1) at
../sysdeps/unix/sysv/linux/epoll_wait.c:30
30  ../sysdeps/unix/sysv/linux/epoll_wait.c: No such file or directory.
#0  0x7fa087741a07 in epoll_wait (epfd=10, events=0x7fffcbcc5748,
maxevents=maxevents@entry=1, timeout=timeout@entry=-1) at
../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x7fa088c355dc in WaitEventSetWaitBlock (nevents=1,
occurred_events=0x7fffd2d4c090, cur_timeout=-1, set=0x7fffcbcc56e8) at
latch.c:1428
#2  WaitEventSetWait (set=0x7fffcbcc56e8, timeout=timeout@entry=-1,
occurred_events=occurred_events@entry=0x7fffd2d4c090, nevents=nevents@entry=1,
wait_event_info=wait_ev
#3  0x7fa088c35a14 in WaitLatch (latch=,
wakeEvents=wakeEvents@entry=33, timeout=timeout@entry=-1,
wait_event_info=wait_event_info@entry=134217733) at
#4  0x7fa088c43ed8 in ConditionVariableTimedSleep (cv=0x7fa0873cc498,
timeout=-1, wait_event_info=134217733) at condition_variable.c:163
#5  0x7fa088bba8bc in RequestCheckpoint (flags=flags@entry=44) at
checkpointer.c:1017
#6  0x7fa088a46315 in createdb (pstate=pstate@entry=0x7fffcbcebbc0,
stmt=stmt@entry=0x7fffcbcca558) at dbcommands.c:711
.
.
.

-
robins


buildfarm instance bichir stuck

2021-04-06 Thread Robins Tharakan
Hi,

Bichir's been stuck for the past month and is unable to run regression
tests since 6a2a70a02018d6362f9841cc2f499cc45405e86b.

It is interesting that that commit is a month old and probably no other
client has complained since, but diving in, I can see that bichir has been
unable to even start regression tests after that commit went in.

Note that Bichir is running on WSL1 (not WSL2) - i.e. Windows Subsystem for
Linux inside Windows 10 - and so isn't really production use-case. The only
run that actually got submitted to Buildfarm was from a few days back when
I killed it after a long wait - see [1].

Since yesterday, I have another run that's again stuck on CREATE DATABASE
(see outputs below). Although pstack not working may be a limitation of
the architecture / installation (unsure), a gdb backtrace shows it is
stuck in poll().

Tracing commits, it seems that commit
6a2a70a02018d6362f9841cc2f499cc45405e86b broke things, and I can confirm
that 'make check' works if I roll back to the preceding commit
(83709a0d5a46559db016c50ded1a95fd3b0d3be6).

Not sure if many agree, but 2 things stood out here:
1) Buildfarm never got the message that a commit broke an instance. Ideally
I'd have expected buildfarm to have an optimistic timeout that could have
helped - e.g. right now, the CREATE DATABASE has been stuck for 18 hrs.

2) bichir is clearly not a production use-case (it takes 5 hrs to complete
a HEAD run!), so let me know if this change is intentional (I guess I'll
stop maintaining it if so), but I thought I'd still put this out in case
it interests someone.

-
thanks
robins

Reference:
1) Last run that I had to kill -
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bichir&dt=2021-03-31%2012%3A00%3A05

#
The current run is running since yesterday.


postgres@WSLv1:/opt/postgres/bf/v11/buildroot/HEAD/bichir.lastrun-logs$
tail -2 lastcommand.log
running on port 5678 with PID 8715
== creating database "regression" ==


postgres@WSLv1:/opt/postgres/bf/v11/buildroot/HEAD/bichir.lastrun-logs$ date
Wed Apr  7 12:48:26 AEST 2021


postgres@WSLv1:/opt/postgres/bf/v11/buildroot/HEAD/bichir.lastrun-logs$ ls
-la
total 840
drwxrwxr-x 1 postgres postgres   4096 Apr  6 09:00 .
drwxrwxr-x 1 postgres postgres   4096 Apr  6 08:55 ..
-rw-rw-r-- 1 postgres postgres   1358 Apr  6 08:55 SCM-checkout.log
-rw-rw-r-- 1 postgres postgres  91546 Apr  6 08:56 configure.log
-rw-rw-r-- 1 postgres postgres 40 Apr  6 08:55 githead.log
-rw-rw-r-- 1 postgres postgres   2890 Apr  6 09:01 lastcommand.log
-rw-rw-r-- 1 postgres postgres 712306 Apr  6 09:00 make.log


root@WSLv1:~# pstack 8729
8729: psql -X -c CREATE DATABASE "regression" TEMPLATE=template0
LC_COLLATE='C' LC_CTYPE='C' postgres
pstack: Bad address
failed to read target.


root@WSLv1:~# gdb -batch -ex bt -p 8729
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x7f41a8ea4c84 in __GI___poll (fds=fds@entry=0x7fffe13d7be8,
nfds=nfds@entry=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
29  ../sysdeps/unix/sysv/linux/poll.c: No such file or directory.
#0  0x7f41a8ea4c84 in __GI___poll (fds=fds@entry=0x7fffe13d7be8,
nfds=nfds@entry=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x7f41a9bc8eb1 in poll (__timeout=, __nfds=1,
__fds=0x7fffe13d7be8) at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
#2  pqSocketPoll (end_time=-1, forWrite=0, forRead=1, sock=)
at fe-misc.c:1133
#3  pqSocketCheck (conn=0x7fffd979a0b0, forRead=1, forWrite=0, end_time=-1)
at fe-misc.c:1075
#4  0x7f41a9bc8ff0 in pqWaitTimed (forRead=,
forWrite=, conn=0x7fffd979a0b0, finish_time=)
at fe-misc.c:1007
#5  0x7f41a9bc5ac9 in PQgetResult (conn=0x7fffd979a0b0) at
fe-exec.c:1963
#6  0x7f41a9bc5ea3 in PQexecFinish (conn=0x7fffd979a0b0) at
fe-exec.c:2306
#7  0x7f41a9bc5ef2 in PQexec (conn=,
query=query@entry=0x7fffd9799f70
"CREATE DATABASE \"regression\" TEMPLATE=template0 LC_COLLATE='C'
LC_CTYPE='C'") at fe-exec.c:2148
#8  0x7f41aa21e7a0 in SendQuery (query=0x7fffd9799f70 "CREATE DATABASE
\"regression\" TEMPLATE=template0 LC_COLLATE='C' LC_CTYPE='C'") at
common.c:1303
#9  0x7f41aa2160a6 in main (argc=, argv=)
at startup.c:369



#



Here we can see that 83709a0d5a46559db016c50ded1a95fd3b0d3be6 goes past
'CREATE DATABASE'
===
robins@WSLv1:~/proj/postgres/postgres$ git checkout
83709a0d5a46559db016c50ded1a95fd3b0d3be6
Previous HEAD position was 6a2a70a020 Use signalfd(2) for epoll latches.
HEAD is now at 83709a0d5a Use SIGURG rather than SIGUSR1 for latches.

robins@WSLv1:~/proj/postgres/postgres$ cd src/test/regress/

robins@WSLv1:~/proj/postgres/postgres/src/test/regress$ make -j4
NO_LOCALE=1 check
make -C ../../../src/backend generated-headers
rm -rf ./testtablespace
make[1]: Entering directory

Re: pg_upgrade failing for 200+ million Large Objects

2021-03-08 Thread Robins Tharakan
Hi Magnus,

On Mon, 8 Mar 2021 at 23:34, Magnus Hagander  wrote:

> AFAICT at a quick check, pg_dump in binary upgrade mode emits one

lo_create() and one ALTER ... OWNER TO for each large object - so with
> 500M large objects that would be a billion statements, and thus a
> billion xids. And without checking, I'm fairly sure it doesn't load in
> a single transaction...
>

Your assumptions are pretty much correct.

The issue isn't with pg_upgrade itself. During pg_restore, each Large
Object (and, separately, each ALTER LARGE OBJECT OWNER TO) consumes an XID.
For background, that's the reason the v9.5 production instance I was
reviewing was unable to process more than 73 million large objects, since
each object required a CREATE + ALTER. (To clarify, 73 million = (2^31 - 2
billion magic constant - 1 million wraparound protection) / 2.)


Without looking, I would guess it's the schema reload using
> pg_dump/pg_restore and not actually pg_upgrade itself. This is a known
> issue in pg_dump/pg_restore. And if that is the case -- perhaps just
> running all of those in a single transaction would be a better choice?
> One could argue it's still not a proper fix, because we'd still have a
> huge memory usage etc, but it would then only burn 1 xid instead of
> 500M...
>
(I hope I am not missing something but) When I tried to force pg_restore to
use a single transaction (by hacking pg_upgrade's pg_restore call to use
--single-transaction), it too failed owing to being unable to lock so many
objects in a single transaction.


This still seems to just fix the symptoms and not the actual problem.
>

I agree that the patch doesn't address the root-cause, but it did get the
upgrade to complete on a test-setup. Do you think that (instead of all
objects) batching multiple Large Objects in a single transaction (and
allowing the caller to size that batch via command line) would be a good /
acceptable idea here?

Please take a look at your email configuration -- all your emails are
> lacking both References and In-reply-to headers.
>

Thanks for highlighting the cause here. Hopefully switching mail clients
would help.
-
Robins Tharakan


Re: Brazil disables DST - 2019b update

2019-07-11 Thread Robins Tharakan
On Fri, 12 Jul 2019 at 14:04, Michael Paquier  wrote:

> On Fri, Jul 12, 2019 at 01:42:59PM +1000, Robins Tharakan wrote:
> So 2019b has been released on the 1st of July.  Usually tzdata updates
> happen just before a minor release, so this would get pulled in at the
> beginning of August (https://www.postgresql.org/developer/roadmap/).
> Tom, I guess that would be again the intention here?
> --
> Michael
>

An August release does give a little more comfort. (I was expecting that
the August
date would get pushed out since 11.4 was an emergency release at the end of
June).

-
robins


Brazil disables DST - 2019b update

2019-07-11 Thread Robins Tharakan
Hi,

The 2019b DST update [1] disables DST for Brazil. This would take effect
starting November 2019. The last DST update in Postgres was 2019a, in v11.3
(since this new update came in after the most recent Postgres release).

Since a ~3 month release cycle may be too close for some users, are there
any plans for an early 11.5 (or are such occurrences not a candidate for an
early release)?

Reference:
a) https://mm.icann.org/pipermail/tz-announce/2019-July/56.html
-
robins


Re: Typo in recent commit

2017-12-09 Thread Robins Tharakan
On 9 December 2017 at 16:11, Magnus Hagander  wrote:

>
>
> Thanks, fixed for the report.
>
>
Thanks Magnus.

However, although it was backpatched correctly, it looks like the fix on
master missed the corresponding identity.out fix.

Ref:
https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/test/regress/expected/identity.out;h=ddc69505937811059aef5c41bc096bc7459cb41e;hb=d8f632caec3fcc5eece9d53d7510322f11489fe4#l359

-
robins


Typo in recent commit

2017-12-08 Thread Robins Tharakan
Hi,

Looks like a minor typo in the recent commit.

s/identify/identity/

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=a2c6cf36608e10aa223fef49323b5feba344bfcf

-
robins | mobile