Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-09-07 Thread Tom Lane
I went ahead and pushed this, since the window for getting buildfarm testing done before Monday's wrap is closing fast. We can always improve on it later, but I think beta3 ought to carry some fix for the problem. regards, tom lane

Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-09-06 Thread Tom Lane
Andres Freund writes: > Could you attach the current version of the patch, or were there no > meaningful changes? No changes. >> So I took that as license to proceed, but while doing a final round of >> testing I found out that a CLOBBER_CACHE_RECURSIVELY build fails, >> because now that's an

Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-09-06 Thread Andres Freund
On 2018-09-06 17:38:38 -0400, Tom Lane wrote: > I wrote: > > So where are we on this? Should I proceed with my patch, or are we > > going to do further investigation? Does anyone want to do an actual > > patch review? > > [ crickets... ] Sorry, bit busy with postgres open, and a few people

Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-09-06 Thread Tom Lane
I wrote: > So where are we on this? Should I proceed with my patch, or are we > going to do further investigation? Does anyone want to do an actual > patch review? [ crickets... ] So I took that as license to proceed, but while doing a final round of testing I found out that a

Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-09-05 Thread Tom Lane
I wrote: > Andres Freund writes: >> One concern I have with your approach is that it isn't particularly >> bullet-proof for cases where the rebuild is triggered by something that >> doesn't hold a conflicting lock. > Wouldn't that be a bug in the something-else? So where are we on this? Should

Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-09-01 Thread Tom Lane
Andres Freund writes: > One concern I have with your approach is that it isn't particularly > bullet-proof for cases where the rebuild is triggered by something that > doesn't hold a conflicting lock. Wouldn't that be a bug in the something-else? The entire relation cache system is based on the

Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-31 Thread Andres Freund
Hi, On 2018-08-31 19:53:43 -0400, Tom Lane wrote: > My thought is to do (and back-patch) my change, and then work on yours > as a performance improvement for HEAD only. That does make sense. > I don't believe that yours would make mine redundant, either --- they > are good complementary changes

Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-31 Thread Tom Lane
Andres Freund writes: > Leaving that aside, I think there's one architectural aspect of my > approach that I prefer over yours: Deduplicating eager cache rebuilds > like my approach does seems quite advantageous. That is attractive, for sure, but the other side of the coin is that getting there

Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-31 Thread Andres Freund
On 2018-08-29 17:58:19 -0400, Tom Lane wrote: > I wrote: > > We could perhaps fix this with a less invasive change than what you > > suggest here, by attacking the missed-call-due-to-recursion aspect > > rather than monkeying with how relcache rebuild itself works. > > Seeing that rearranging the

Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-29 Thread Tom Lane
I wrote: > We could perhaps fix this with a less invasive change than what you > suggest here, by attacking the missed-call-due-to-recursion aspect > rather than monkeying with how relcache rebuild itself works. Seeing that rearranging the relcache rebuild logic is looking less than trivial, I

Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-29 Thread Tom Lane
Andres Freund writes: > On 2018-08-29 14:00:12 -0400, Tom Lane wrote: >> 2. I think we may need to address the same order-of-operations hazards >> as RelationCacheInvalidate() worries about. Alternatively, maybe we >> could simplify that function by making it use the same >> delayed-revalidation

Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-29 Thread Tom Lane
Andres Freund writes: > On 2018-08-29 12:56:07 -0400, Tom Lane wrote: >> BTW, I now have a theory for why we suddenly started seeing this problem >> in mid-June: commits a54e1f158 et al added a ScanPgRelation call where >> there had been none before (in RelationReloadNailed, for non-index rels).

Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-29 Thread Andres Freund
Hi, On 2018-08-29 14:00:12 -0400, Tom Lane wrote: > A couple thoughts after reading and reflecting for awhile: Thanks. This definitely is too complicated for a single brain :( > 1. I don't much like the pending_rebuilds list, mainly because of this > consideration: what happens if we hit an

Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-29 Thread Andres Freund
Hi, On 2018-08-29 12:56:07 -0400, Tom Lane wrote: > I wrote: > > * We now recursively enter ScanPgRelation, which (again) needs to do a > > search using pg_class_oid_index, so it (again) opens and locks that. > > BUT: LockRelationOid sees that *this process already has share lock on > >

Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-29 Thread Tom Lane
Andres Freund writes: > A bit of food, a coke and a talk later, here's a first draft *prototype* > of how this could be solved. ... > Obviously this is far from clean enough, but what do you think about the > basic approach? It does, in my limited testing, indeed solve the "could > not read

Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-29 Thread Tom Lane
I wrote: > * We now recursively enter ScanPgRelation, which (again) needs to do a > search using pg_class_oid_index, so it (again) opens and locks that. > BUT: LockRelationOid sees that *this process already has share lock on > pg_class_oid_index*, so it figures it can skip

Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-29 Thread Tom Lane
Andres Freund writes: > It's not OK to rebuild relcache entries in the middle of > ReceiveSharedInvalidMessages() - a later entry in the invalidation queue > might be relmapper invalidation, and thus immediately processing a > relcache invalidation might attempt to scan a relation that does not >

Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-29 Thread Andres Freund
On 2018-08-28 20:29:08 -0700, Andres Freund wrote: > On 2018-08-28 20:27:14 -0700, Andres Freund wrote: > > Locally that triggers the problem within usually a few seconds. > > FWIW, it does so including versions as old as 9.2. > > Now I need to look for power for my laptop and some for me ;) A

Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-28 Thread Andres Freund
On 2018-08-28 23:32:51 -0400, Tom Lane wrote: > Andres Freund writes: > > On 2018-08-28 20:27:14 -0700, Andres Freund wrote: > >> Locally that triggers the problem within usually a few seconds. > > > FWIW, it does so including versions as old as 9.2. 9.0 as well, so it's definitely not some

Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-28 Thread Tom Lane
Andres Freund writes: > On 2018-08-28 20:27:14 -0700, Andres Freund wrote: >> Locally that triggers the problem within usually a few seconds. > FWIW, it does so including versions as old as 9.2. Interesting. One thing I'd like to know is why this only started showing up in the buildfarm a few

Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-28 Thread Andres Freund
On 2018-08-28 20:27:14 -0700, Andres Freund wrote: > Locally that triggers the problem within usually a few seconds. FWIW, it does so including versions as old as 9.2. Now I need to look for power for my laptop and some for me ;)

Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-28 Thread Andres Freund
On 2018-08-28 23:18:25 -0400, Tom Lane wrote: > Andres Freund writes: > > Tom, I think this could use your eyes. > > I've had no luck reproducing it locally ... do you have a recipe > for that? It can reproduce reliably with the three scripts attached: psql -c' drop table if exists t; create

Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-28 Thread Tom Lane
Andres Freund writes: > Tom, I think this could use your eyes. I've had no luck reproducing it locally ... do you have a recipe for that? regards, tom lane

Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-28 Thread Andres Freund
On 2018-08-28 19:56:58 -0700, Andres Freund wrote: > Hi Everyone, > > > Tom, I think this could use your eyes. > > > On 2018-08-28 00:52:13 -0700, Andres Freund wrote: > > I've a few leads that I'm currently testing out. One observation I think > > might be crucial is that the problem, in

Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-28 Thread Andres Freund
Hi Everyone, Tom, I think this could use your eyes. On 2018-08-28 00:52:13 -0700, Andres Freund wrote: > I've a few leads that I'm currently testing out. One observation I think > might be crucial is that the problem, in Tomas' testcase with just > VACUUM FULL of pg_class and INSERTs into

Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-28 Thread Andres Freund
Hi, Tomas provided me with a machine where the problem was reproducible (Thanks again!). I first started to make sure a54e1f158 is unrelated - and indeed, the problem appears independently. I've a few leads that I'm currently testing out. One observation I think might be crucial is that the

Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-14 Thread Peter Geoghegan
On Tue, Aug 14, 2018 at 2:07 PM, Todd A. Cook wrote: > Sorry, I just noticed this. Mantid is my animal, so I can set > "min_parallel_table_scan_size = 0" > on it if that would be helpful. (Please reply directly if so; I'm not able > to keep up with pgsql-hackers > right now.) We've already

Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-14 Thread Todd A. Cook
On 8/9/18, 12:56 AM, "Peter Geoghegan" wrote: On Wed, Aug 8, 2018 at 7:40 PM, Tom Lane wrote: >> Anyway, "VACUUM FULL pg_class" should be expected to corrupt >> pg_class_oid_index when we happen to get a parallel build, since >> pg_class is a mapped relation, and I've identified

Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-11 Thread Tomas Vondra
On 08/11/2018 04:08 PM, Andres Freund wrote: > Hi, > > On 2018-08-11 15:40:19 +0200, Tomas Vondra wrote: >> For the record, I can actually reproduce this on 9.6 (haven't tried >> older releases, but I suspect it's there too). Instead of using the >> failing subscription, I've used another pgbench

Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-11 Thread Andres Freund
Hi, On 2018-08-11 15:40:19 +0200, Tomas Vondra wrote: > For the record, I can actually reproduce this on 9.6 (haven't tried > older releases, but I suspect it's there too). Instead of using the > failing subscription, I've used another pgbench script doing this: > SET statement_timeout = 5; >

Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-11 Thread Tomas Vondra
On 08/11/2018 03:16 PM, Tomas Vondra wrote: > On 08/11/2018 05:02 AM, Tom Lane wrote: >> Peter Geoghegan writes: >>> I'm concerned that this item has the potential to delay the release, >>> since, as you said, we're back to the drawing board. >> >> Me too. I will absolutely not vote to release

Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-11 Thread Tomas Vondra
On 08/11/2018 05:02 AM, Tom Lane wrote: > Peter Geoghegan writes: >> I'm concerned that this item has the potential to delay the release, >> since, as you said, we're back to the drawing board. > > Me too. I will absolutely not vote to release 11.0 before we've > solved this ... > Not sure. I

Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-10 Thread Peter Geoghegan
On Fri, Aug 10, 2018 at 8:02 PM, Tom Lane wrote: > Me too. I will absolutely not vote to release 11.0 before we've > solved this ... I believe that that's the right call, assuming things don't change. This is spooky in a way that creates a lot of doubts in my mind. I don't think it's at all

Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-10 Thread Tom Lane
Peter Geoghegan writes: > I'm concerned that this item has the potential to delay the release, > since, as you said, we're back to the drawing board. Me too. I will absolutely not vote to release 11.0 before we've solved this ... regards, tom lane

Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-10 Thread Peter Geoghegan
On Fri, Aug 10, 2018 at 7:45 PM, Tom Lane wrote: > Didn't take long to show that the relmapper issue wasn't it: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=coypu=2018-08-10%2021%3A21%3A40 > > So we're back to square one. Although Tomas' recent report might > give us something new

Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-10 Thread Tom Lane
Peter Geoghegan writes: > On Wed, Aug 8, 2018 at 7:40 PM, Tom Lane wrote: >> Oooh ... but pg_class wouldn't be big enough to get a parallel >> index rebuild during that test, would it? > Typically not, but I don't think that we can rule it out right away. Didn't take long to show that the

Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-09 Thread Peter Geoghegan
On Wed, Aug 8, 2018 at 10:08 PM, Peter Geoghegan wrote: >> Hmmm ... maybe we should temporarily stick in an elog(LOG) showing whether >> a parallel build happened or not, so that we can check the buildfarm logs >> next time we see that failure? > > I can do that tomorrow. Of course, it might be

Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-09 Thread Andrew Dunstan
On 08/09/2018 01:03 AM, Tom Lane wrote: Peter Geoghegan writes: On Wed, Aug 8, 2018 at 7:40 PM, Tom Lane wrote: Oooh ... but pg_class wouldn't be big enough to get a parallel index rebuild during that test, would it? Typically not, but I don't think that we can rule it out right away.

Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-08 Thread Peter Geoghegan
On Wed, Aug 8, 2018 at 10:03 PM, Tom Lane wrote: >> Typically not, but I don't think that we can rule it out right away. > > Hmmm ... maybe we should temporarily stick in an elog(LOG) showing whether > a parallel build happened or not, so that we can check the buildfarm logs > next time we see

Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-08 Thread Tom Lane
Peter Geoghegan writes: > On Wed, Aug 8, 2018 at 7:40 PM, Tom Lane wrote: >> Oooh ... but pg_class wouldn't be big enough to get a parallel >> index rebuild during that test, would it? > Typically not, but I don't think that we can rule it out right away. Hmmm ... maybe we should temporarily

Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-08 Thread Peter Geoghegan
On Wed, Aug 8, 2018 at 7:40 PM, Tom Lane wrote: >> Anyway, "VACUUM FULL pg_class" should be expected to corrupt >> pg_class_oid_index when we happen to get a parallel build, since >> pg_class is a mapped relation, and I've identified that as a problem >> for parallel CREATE INDEX [2]. If that was

Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-08 Thread Tom Lane
Peter Geoghegan writes: > On Wed, Jul 25, 2018 at 4:07 PM, Andres Freund wrote: >> I don't immediately see it being responsible, but I wonder if there's a >> chance it actually is: Note that it happens in a parallel group that >> includes vacuum.sql, which does a VACUUM FULL pg_class - but I

Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-08 Thread Peter Geoghegan
On Wed, Jul 25, 2018 at 4:07 PM, Andres Freund wrote: >> HEAD/REL_11_STABLE apparently solely being affected points elsewhere, >> but I don't immediately know where. > > Hm, there was: > http://archives.postgresql.org/message-id/20180628150209.n2qch5jtn3vt2xaa%40alap3.anarazel.de > > > I don't

Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-07-25 Thread Andres Freund
Hi, On 2018-07-20 13:24:50 -0700, Andres Freund wrote: > On 2018-07-20 16:15:14 -0400, Tom Lane wrote: > > We've seen several occurrences of $subject in the buildfarm in the past > > month or so. Scraping the logs, I find > > > > coypu| 2018-06-14 21:17:49 | HEAD | Check |

Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-07-20 Thread Andres Freund
On 2018-07-20 16:15:14 -0400, Tom Lane wrote: > We've seen several occurrences of $subject in the buildfarm in the past > month or so. Scraping the logs, I find > > coypu| 2018-06-14 21:17:49 | HEAD | Check | 2018-06-14 > 23:31:44.505 CEST [5b22deb8.30e1:124] ERROR: could not

buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-07-20 Thread Tom Lane
We've seen several occurrences of $subject in the buildfarm in the past month or so. Scraping the logs, I find coypu| 2018-06-14 21:17:49 | HEAD | Check | 2018-06-14 23:31:44.505 CEST [5b22deb8.30e1:124] ERROR: could not read block 3 in file "base/16384/2662": read only 0 of