Re: 10.5 but not 10.4: backend startup during reindex system: could not read block 0 in file "base/16400/..": read only 0 of 8192 bytes

2018-08-31 Thread Justin Pryzby
On Thu, Aug 30, 2018 at 01:25:00PM -0700, Andres Freund wrote:
> On August 30, 2018 1:24:12 PM PDT, Justin Pryzby  wrote:
> >On Thu, Aug 30, 2018 at 01:18:59PM -0700, Andres Freund wrote:
> >> Could you check if both of the proposed attempts at fixing the issue
> >> also solves your problem?
> >
> >Just checking that you're referring to:
> >
> >1)  relcache-rebuild.diff
> >https://www.postgresql.org/message-id/20180829083730.n645apqhb2gyih3g%40alap3.anarazel.de
> >2)  fix-missed-inval-msg-accepts-1.patch
> >https://www.postgresql.org/message-id/25024.1535579899%40sss.pgh.pa.us
> 
> Right. Either should fix the issue.

Confirmed that relcache-rebuild.diff seems to fix/avoid the issue under 10.5,
and fix-missed-inval-msg-accepts-1.patch does so for HEAD (but doesn't apply
cleanly to 10.5 so not tested).  That's for both the the new easy-to-hit
symptom and the pre-existing hard-to-hit issue affecting pg10.4 and earlier.

Thanks,
Justin



Re: 10.5 but not 10.4: backend startup during reindex system: could not read block 0 in file "base/16400/..": read only 0 of 8192 bytes

2018-08-30 Thread Tom Lane
Justin Pryzby  writes:
> Just curious, is there really any difficulty in reproducing this?

Once you have the right test case, it's not hard.  But it took us two
months to find one ...

regards, tom lane



Re: 10.5 but not 10.4: backend startup during reindex system: could not read block 0 in file "base/16400/..": read only 0 of 8192 bytes

2018-08-30 Thread Justin Pryzby
On Thu, Aug 30, 2018 at 05:30:30PM -0400, Tom Lane wrote:
> Justin Pryzby  writes:
> > On Wed, Aug 29, 2018 at 11:35:51AM -0400, Tom Lane wrote:
> >> As far as we can tell, that bug is a dozen years old, so it's not clear
> >> why you find that you can reproduce it only in 10.5.  But there might be
> >> some subtle timing change accounting for that.
> 
> > It seems to me there's one root problem occurring in (at least) two slightly
> > different ways.  The issue/symptom that I've been seeing occurs in 10.5 but 
> > not
> > 10.4, and specifically at commit 2ce64ca, but not before. 
> 
> Yeah, as you probably saw in the other thread, we later realized that
> 2ce64ca created an additional pathway for ScanPgRelation to recurse;
> a pathway that's evidently easier to hit than the pre-existing ones.
> I note that both of your stack traces display ScanPgRelation recursion,
> so I'm feeling pretty confident that what you're seeing is the same
> thing.
> 
> But, as Andres says, it'd be great if you could confirm whether the
> draft patches fix it for you.

I tested with relcache-rebuild.diff which hasn't broken in 15min, so I'm
confident that doesn't hit the additional recusive pathway, but have to wait
awhile and see if autovacuum survives, too.

I tried to apply fix-missed-inval-msg-accepts-1.patch on top of PG10.5 but
patch didn't apply, so I can test HEAD after the first patch soaks awhile.

Just curious, is there really any difficulty in reproducing this?  Once I
realized this was a continuing issue and started to suspect pg10.5, it takes
just about nothing to reproduce anywhere I've tried.  I just tested 5 servers,
and only one took more than a handful of seconds to fail.  I gave up waiting
for a 6th server, because I found it was waiting on a pre-existing lock.

[pryzbyj@database ~]$ while :; do for a in pg_class_oid_index 
pg_class_relname_nsp_index pg_class_tblspc_relfilenode_index; do psql ts -qc 
"REINDEX INDEX $a"; done; done&
[pryzbyj@database ~]$ a=0; time while psql ts -qc ''; do a=$((1+a)); done ; 
echo "$a"
psql: FATAL:  could not read block 0 in file "base/16400/313581263": read only 
0 of 8192 bytes

real0m1.772s
user0m0.076s
sys 0m0.116s
47

Justin



Re: 10.5 but not 10.4: backend startup during reindex system: could not read block 0 in file "base/16400/..": read only 0 of 8192 bytes

2018-08-30 Thread Tom Lane
Justin Pryzby  writes:
> On Wed, Aug 29, 2018 at 11:35:51AM -0400, Tom Lane wrote:
>> As far as we can tell, that bug is a dozen years old, so it's not clear
>> why you find that you can reproduce it only in 10.5.  But there might be
>> some subtle timing change accounting for that.

> It seems to me there's one root problem occurring in (at least) two slightly
> different ways.  The issue/symptom that I've been seeing occurs in 10.5 but 
> not
> 10.4, and specifically at commit 2ce64ca, but not before. 

Yeah, as you probably saw in the other thread, we later realized that
2ce64ca created an additional pathway for ScanPgRelation to recurse;
a pathway that's evidently easier to hit than the pre-existing ones.
I note that both of your stack traces display ScanPgRelation recursion,
so I'm feeling pretty confident that what you're seeing is the same
thing.

But, as Andres says, it'd be great if you could confirm whether the
draft patches fix it for you.

regards, tom lane



Re: 10.5 but not 10.4: backend startup during reindex system: could not read block 0 in file "base/16400/..": read only 0 of 8192 bytes

2018-08-30 Thread Andres Freund
Hi,

On 2018-08-30 15:00:28 -0500, Justin Pryzby wrote:
> On Wed, Aug 29, 2018 at 11:35:51AM -0400, Tom Lane wrote:
> > This looks suspiciously like the issue under discussion in
> > 
> > https://www.postgresql.org/message-id/12259.1532117714%40sss.pgh.pa.us
> > 
> > As far as we can tell, that bug is a dozen years old, so it's not clear
> > why you find that you can reproduce it only in 10.5.  But there might be
> > some subtle timing change accounting for that.
> 
> I confirmed the stack appears to be same as here, modulo differences between
> 9.6 to 10.

Could you check if both of the proposed attempts at fixing the issue
also solves your problem?

Thanks,

Andres Freund



Re: 10.5 but not 10.4: backend startup during reindex system: could not read block 0 in file "base/16400/..": read only 0 of 8192 bytes

2018-08-30 Thread Justin Pryzby
On Wed, Aug 29, 2018 at 11:35:51AM -0400, Tom Lane wrote:
> Justin Pryzby  writes:
> > I've seen this message now a handful of times recently.  It seems to happen
> > overnight, during a maintenance job which reindex things, including system
> > catalog indices.
> > It's easy to reproduce error under 10.5, but not under 10.3 nor 10.4.
> > while :; do for a in pg_class_oid_index pg_class_relname_nsp_index 
> > pg_class_tblspc_relfilenode_index; do psql ts -qc "REINDEX INDEX $a"; done; 
> > done&
> > time while psql ts -qc ''; do :; done
> 
> This looks suspiciously like the issue under discussion in
> 
> https://www.postgresql.org/message-id/12259.1532117714%40sss.pgh.pa.us
> 
> As far as we can tell, that bug is a dozen years old, so it's not clear
> why you find that you can reproduce it only in 10.5.  But there might be
> some subtle timing change accounting for that.

I confirmed the stack appears to be same as here, modulo differences between
9.6 to 10.
https://www.postgresql.org/message-id/fcab55ec-660a-bc43-7fe1-09af3665e89a%402ndquadrant.com

It seems to me there's one root problem occurring in (at least) two slightly
different ways.  The issue/symptom that I've been seeing occurs in 10.5 but not
10.4, and specifically at commit 2ce64ca, but not before. 

|commit 2ce64caaf793615d0f7a7ce4380e7255077d130c
|Author: Andres Freund 
|Date:   Tue Jun 12 11:13:21 2018 -0700
|
|Fix bugs in vacuum of shared rels, by keeping their relcache entries current.

However, the same message CAN happen under 10.4, though less easily, due to
autovacuum but (afaict) not user queries.  Maybe that's a coincidence, but my
experience mirrors the build logs: the problem since 10.5 is easily tickled in
a few seconds rather than an hour.

The issue I see frequently under 10.5 has this stack:

Core was generated by `postgres: pryzbyj postgres [local] startup   
 '.
#2  0x00709425 in mdread ()
#3  0x006e2bc8 in ReadBuffer_common ()
#4  0x006e3511 in ReadBufferExtended ()
#5  0x004c2da9 in _bt_getbuf ()
#6  0x004c32b2 in _bt_getroot ()
#7  0x004c799b in _bt_search ()
#8  0x004c8568 in _bt_first ()
#9  0x004c52c7 in btgettuple ()
#10 0x004bf936 in index_getnext_tid ()
#11 0x004bfb43 in index_getnext ()
#12 0x004beeae in systable_getnext ()
#13 0x007fa074 in ScanPgRelation ()
#14 0x007fe371 in RelationClearRelation ()
#15 0x007fe84b in RelationCacheInvalidateEntry ()
#16 0x007f7e15 in LocalExecuteInvalidationMessage ()
#17 0x006f3b38 in ReceiveSharedInvalidMessages ()
#18 0x006f6eb5 in LockRelationOid ()
#19 0x004ae495 in relation_open ()
#20 0x004bf336 in index_open ()
#21 0x004bee3c in systable_beginscan ()
#22 0x007fa069 in ScanPgRelation ()
#23 0x007fce84 in RelationBuildDesc ()
#24 0x007fe6f8 in RelationIdGetRelation ()
#25 0x004ae463 in relation_open ()
#26 0x004ae6a6 in heap_open ()
#27 0x00819237 in InitPostgres ()
#28 0x0070e2e9 in PostgresMain ()
#29 0x00477f36 in ServerLoop ()
#30 0x006a5e3e in PostmasterMain ()
#31 0x00478a08 in main ()

>From autovacum running REL_10_4 I've now seen (since trying to tickle the
problem yesterday):

#1  0x7fbfb5b9d01a in __GI_abort () at abort.c:89
#2  0x00708489 in mdread ()
#3  0x006e1c48 in ReadBuffer_common ()
#4  0x006e2591 in ReadBufferExtended ()
#5  0x004c2369 in _bt_getbuf ()
#6  0x004c2810 in _bt_getroot ()
#7  0x004c6f5b in _bt_search ()
#8  0x004c7b28 in _bt_first ()
#9  0x004c4887 in btgettuple ()
#10 0x004bf00e in index_getnext_tid ()
#11 0x004bf1f3 in index_getnext ()
#12 0x004be679 in systable_getnext ()
#13 0x007f9084 in ScanPgRelation ()
#14 0x007fbe54 in RelationBuildDesc ()
#15 0x007fcc86 in RelationClearRelation ()
#16 0x007fd7bb in RelationCacheInvalidateEntry ()
#17 0x007f6e25 in LocalExecuteInvalidationMessage ()
#18 0x006f2bb8 in ReceiveSharedInvalidMessages ()
#19 0x006f5f35 in LockRelationOid ()
#20 0x004ade05 in relation_open ()
#21 0x004bea96 in index_open ()
#22 0x004be606 in systable_beginscan ()
#23 0x007f9079 in ScanPgRelation ()
#24 0x007fbe54 in RelationBuildDesc ()
#25 0x007fd668 in RelationIdGetRelation ()
#26 0x004addd3 in relation_open ()
#27 0x004bea96 in index_open ()
#28 0x004be606 in systable_beginscan ()
#29 0x007fef60 in RelationGetIndexList ()
#30 0x005dba6c in vac_open_indexes ()
#31 0x005dc652 in lazy_vacuum_rel ()
#32 0x005da7b4 in vacuum_rel ()
#33 0x005db661 in vacuum ()
#34 0x004770db in do_autovacuum ()
#35 0x00696242 in AutoVacWorkerMain.isra.6 ()
#36 0x00696a49 in StartAutoVacWorker ()

Re: 10.5 but not 10.4: backend startup during reindex system: could not read block 0 in file "base/16400/..": read only 0 of 8192 bytes

2018-08-29 Thread Tom Lane
Justin Pryzby  writes:
> I've seen this message now a handful of times recently.  It seems to happen
> overnight, during a maintenance job which reindex things, including system
> catalog indices.
> It's easy to reproduce error under 10.5, but not under 10.3 nor 10.4.
> while :; do for a in pg_class_oid_index pg_class_relname_nsp_index 
> pg_class_tblspc_relfilenode_index; do psql ts -qc "REINDEX INDEX $a"; done; 
> done&

This looks suspiciously like the issue under discussion in

https://www.postgresql.org/message-id/12259.1532117714%40sss.pgh.pa.us

As far as we can tell, that bug is a dozen years old, so it's not clear
why you find that you can reproduce it only in 10.5.  But there might be
some subtle timing change accounting for that.

regards, tom lane



10.5 but not 10.4: backend startup during reindex system: could not read block 0 in file "base/16400/..": read only 0 of 8192 bytes

2018-08-29 Thread Justin Pryzby
I've seen this message now a handful of times recently.  It seems to happen
overnight, during a maintenance job which reindex things, including system
catalog indices.

It's easy to reproduce error under 10.5, but not under 10.3 nor 10.4.

while :; do for a in pg_class_oid_index pg_class_relname_nsp_index 
pg_class_tblspc_relfilenode_index; do psql ts -qc "REINDEX INDEX $a"; done; 
done&

[pryzbyj@database ~]$ while :; do psql ts -qc ''; done
psql: FATAL:  could not read block 1 in file "base/16400/313430687": read only 
0 of 8192 bytes
psql: FATAL:  could not read block 0 in file "base/16400/313430708": read only 
0 of 8192 bytes
psql: FATAL:  could not read block 0 in file "base/16400/313430711": read only 
0 of 8192 bytes
^C

postgres=# SELECT * FROM postgres_log WHERE error_severity ='FATAL';
log_time   | 2018-08-28 22:19:58.822-05
user_name  | telsasoft
database   | ts
pid| 22445
connection_from| 192.168.122.12:58318
session_id | 5b8610de.57ad
session_line   | 1
command_tag| startup
session_start_time | 2018-08-28 22:19:58-05
virtual_transaction_id | 26/280967
transaction_id | 0
error_severity | FATAL
sql_state_code | XX001
message| could not read block 0 in file "base/16400/313316300": 
read only 0 of 8192 bytes

Justin