I went ahead and pushed this, since the window for getting buildfarm
testing done before Monday's wrap is closing fast. We can always
improve on it later, but I think beta3 ought to carry some fix
for the problem.
regards, tom lane
Andres Freund writes:
> Could you attach the current version of the patch, or were there no
> meaningful changes?
No changes.
>> So I took that as license to proceed, but while doing a final round of
>> testing I found out that a CLOBBER_CACHE_RECURSIVELY build fails,
>> because now that's an
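(For anyone unfamiliar: CLOBBER_CACHE_RECURSIVELY is a compile-time
debugging option that recursively flushes the caches at every plausible
opportunity; a sketch of how one might enable it for a test build --
very slow, cassert builds only:)

  # Sketch: build with recursive cache-clobbering enabled.
  ./configure CPPFLAGS='-DCLOBBER_CACHE_RECURSIVELY' --enable-cassert
  make && make check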
On 2018-09-06 17:38:38 -0400, Tom Lane wrote:
> I wrote:
> > So where are we on this? Should I proceed with my patch, or are we
> > going to do further investigation? Does anyone want to do an actual
> > patch review?
>
> [ crickets... ]
Sorry, a bit busy with Postgres Open, and a few people
I wrote:
> So where are we on this? Should I proceed with my patch, or are we
> going to do further investigation? Does anyone want to do an actual
> patch review?
[ crickets... ]
So I took that as license to proceed, but while doing a final round of
testing I found out that a
I wrote:
> Andres Freund writes:
>> One concern I have with your approach is that it isn't particularly
>> bullet-proof for cases where the rebuild is triggered by something that
>> doesn't hold a conflicting lock.
> Wouldn't that be a bug in the something-else?
So where are we on this? Should
Andres Freund writes:
> One concern I have with your approach is that it isn't particularly
> bullet-proof for cases where the rebuild is triggered by something that
> doesn't hold a conflicting lock.
Wouldn't that be a bug in the something-else? The entire relation cache
system is based on the
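(The cut-off sentence refers to a locking invariant; as a loose
illustration of the kind of rule meant -- the helper name and the
assertion here are assumptions for illustration, not actual source:)

  /* Hypothetical illustration only: any code path that forces a
   * relcache rebuild should hold at least a lock that conflicts with
   * concurrent schema changes on the relation. */
  static void
  rebuild_locked_entry(Relation rel)
  {
      Assert(CheckRelationLockedByMe(rel, AccessShareLock, true));
      RelationClearRelation(rel, true);   /* rebuild the entry */
  }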
Hi,
On 2018-08-31 19:53:43 -0400, Tom Lane wrote:
> My thought is to do (and back-patch) my change, and then work on yours
> as a performance improvement for HEAD only.
That does make sense.
> I don't believe that yours would make mine redundant, either --- they
> are good complementary changes
Andres Freund writes:
> Leaving that aside, I think there's one architectural aspect of my
> approach that I prefer over yours: Deduplicating eager cache rebuilds
> like my approach does seems quite advantageous.
That is attractive, for sure, but the other side of the coin is that
getting there
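(The attraction, concretely: if the same relation is invalidated many
times while a batch of messages is processed, a deduplicating pending
list rebuilds it once instead of N times. A hedged sketch of that idea,
with illustrative names -- this is not the actual patch:)

  #include "postgres.h"
  #include "nodes/pg_list.h"

  static List *pending_rebuilds = NIL;

  /* Queue a relcache rebuild at most once per relation per batch. */
  static void
  remember_rebuild(Oid relid)
  {
      if (!list_member_oid(pending_rebuilds, relid))
          pending_rebuilds = lappend_oid(pending_rebuilds, relid);
  }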
On 2018-08-29 17:58:19 -0400, Tom Lane wrote:
> I wrote:
> > We could perhaps fix this with a less invasive change than what you
> > suggest here, by attacking the missed-call-due-to-recursion aspect
> > rather than monkeying with how relcache rebuild itself works.
>
> Seeing that rearranging the
I wrote:
> We could perhaps fix this with a less invasive change than what you
> suggest here, by attacking the missed-call-due-to-recursion aspect
> rather than monkeying with how relcache rebuild itself works.
Seeing that rearranging the relcache rebuild logic is looking less than
trivial, I
Andres Freund writes:
> On 2018-08-29 14:00:12 -0400, Tom Lane wrote:
>> 2. I think we may need to address the same order-of-operations hazards
>> as RelationCacheInvalidate() worries about. Alternatively, maybe we
>> could simplify that function by making it use the same
>> delayed-revalidation
Andres Freund writes:
> On 2018-08-29 12:56:07 -0400, Tom Lane wrote:
>> BTW, I now have a theory for why we suddenly started seeing this problem
>> in mid-June: commits a54e1f158 et al added a ScanPgRelation call where
>> there had been none before (in RelationReloadNailed, for non-index rels).
Hi,
On 2018-08-29 14:00:12 -0400, Tom Lane wrote:
> A couple thoughts after reading and reflecting for awhile:
Thanks. This definitely is too complicated for a single brain :(
> 1. I don't much like the pending_rebuilds list, mainly because of this
> consideration: what happens if we hit an
Hi,
On 2018-08-29 12:56:07 -0400, Tom Lane wrote:
> I wrote:
> > * We now recursively enter ScanPgRelation, which (again) needs to do a
> > search using pg_class_oid_index, so it (again) opens and locks that.
> > BUT: LockRelationOid sees that *this process already has share lock on
> >
Andres Freund writes:
> A bit of food, a coke and a talk later, here's a first draft *prototype*
> of how this could be solved. ...
> Obviously this is far from clean enough, but what do you think about the
> basic approach? It does, in my limited testing, indeed solve the "could
> not read
I wrote:
> * We now recursively enter ScanPgRelation, which (again) needs to do a
> search using pg_class_oid_index, so it (again) opens and locks that.
> BUT: LockRelationOid sees that *this process already has share lock on
> pg_class_oid_index*, so it figures it can skip
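(The skipped step is the invalidation-message processing that normally
happens on a fresh lock acquisition. A paraphrased sketch of the
relevant logic in LockRelationOid -- simplified, not verbatim source:)

  #include "postgres.h"
  #include "miscadmin.h"          /* MyDatabaseId */
  #include "storage/lock.h"
  #include "utils/inval.h"        /* AcceptInvalidationMessages */

  void
  LockRelationOid(Oid relid, LOCKMODE lockmode)
  {
      LOCKTAG     tag;
      LockAcquireResult res;

      SET_LOCKTAG_RELATION(tag, MyDatabaseId, relid);
      res = LockAcquire(&tag, lockmode, false, false);

      /*
       * Invalidation messages are processed only when the lock is newly
       * acquired.  In the recursive ScanPgRelation sequence described
       * above we already hold share lock on pg_class_oid_index, so this
       * step is skipped and a stale relcache entry can survive.
       */
      if (res != LOCKACQUIRE_ALREADY_HELD)
          AcceptInvalidationMessages();
  }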
Andres Freund writes:
> It's not OK to rebuild relcache entries in the middle of
> ReceiveSharedInvalidMessages() - a later entry in the invalidation queue
> might be relmapper invalidation, and thus immediately processing a
> relcache invalidation might attempt to scan a relation that does not
>
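(A minimal sketch of the constraint just described: while draining the
sinval queue, only *queue* relcache rebuilds, and replay them after all
messages -- including any relmapper updates -- have been applied, so
the rebuilds see the right relfilenode. Names are illustrative, not the
actual prototype:)

  #include "postgres.h"
  #include "nodes/pg_list.h"
  #include "utils/relcache.h"     /* RelationCacheInvalidateEntry */

  static List *deferred_relcache_rebuilds = NIL;
  static bool  in_sinval_drain = false;

  static void
  local_relcache_inval(Oid relid)
  {
      if (in_sinval_drain)
          deferred_relcache_rebuilds =
              lappend_oid(deferred_relcache_rebuilds, relid);
      else
          RelationCacheInvalidateEntry(relid);
  }

  static void
  drain_sinval_queue(void)
  {
      ListCell   *lc;

      in_sinval_drain = true;
      /* ... apply every queued message, relmapper updates included ... */
      in_sinval_drain = false;

      /* Relmapper is now current; rebuilds see the right relfilenode. */
      foreach(lc, deferred_relcache_rebuilds)
          RelationCacheInvalidateEntry(lfirst_oid(lc));
      list_free(deferred_relcache_rebuilds);
      deferred_relcache_rebuilds = NIL;
  }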
On 2018-08-28 20:29:08 -0700, Andres Freund wrote:
> On 2018-08-28 20:27:14 -0700, Andres Freund wrote:
> > Locally that triggers the problem, usually within a few seconds.
>
> FWIW, it does so including versions as old as 9.2.
>
> Now I need to look for power for my laptop and some for me ;)
A
On 2018-08-28 23:32:51 -0400, Tom Lane wrote:
> Andres Freund writes:
> > On 2018-08-28 20:27:14 -0700, Andres Freund wrote:
> >> Locally that triggers the problem, usually within a few seconds.
>
> > FWIW, it does so including versions as old as 9.2.
9.0 as well, so it's definitely not some
Andres Freund writes:
> On 2018-08-28 20:27:14 -0700, Andres Freund wrote:
>> Locally that triggers the problem, usually within a few seconds.
> FWIW, it does so including versions as old as 9.2.
Interesting. One thing I'd like to know is why this only started
showing up in the buildfarm a few
On 2018-08-28 20:27:14 -0700, Andres Freund wrote:
> Locally that triggers the problem, usually within a few seconds.
FWIW, it does so including versions as old as 9.2.
Now I need to look for power for my laptop and some for me ;)
On 2018-08-28 23:18:25 -0400, Tom Lane wrote:
> Andres Freund writes:
> > Tom, I think this could use your eyes.
>
> I've had no luck reproducing it locally ... do you have a recipe
> for that?
I can reproduce it reliably with the three scripts attached:
psql -c 'drop table if exists t; create
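(The attached scripts themselves are not reproduced here; as a rough
sketch of the shape of such a reproducer, based on the VACUUM FULL
pg_class plus concurrent INSERT workload discussed elsewhere in this
thread -- table name, query mix, and timings are illustrative only:)

  # Rough sketch only -- the real attachments are not shown here.
  psql -c 'drop table if exists t; create table t (a int)'
  while true; do psql -qc 'vacuum full pg_class'; done &
  while true; do psql -qc 'insert into t values (1)'; done &
  while true; do psql -qc 'select count(*) from t' >/dev/null; done &
  sleep 60; kill $(jobs -p)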
Andres Freund writes:
> Tom, I think this could use your eyes.
I've had no luck reproducing it locally ... do you have a recipe
for that?
regards, tom lane
On 2018-08-28 19:56:58 -0700, Andres Freund wrote:
> Hi Everyone,
>
>
> Tom, I think this could use your eyes.
>
>
> On 2018-08-28 00:52:13 -0700, Andres Freund wrote:
> > I've a few leads that I'm currently testing out. One observation I think
> > might be crucial is that the problem, in
Hi Everyone,
Tom, I think this could use your eyes.
On 2018-08-28 00:52:13 -0700, Andres Freund wrote:
> I've a few leads that I'm currently testing out. One observation I think
> might be crucial is that the problem, in Tomas' testcase with just
> VACUUM FULL of pg_class and INSERTs into
Hi,
Tomas provided me with a machine where the problem was reproducible
(Thanks again!). I first started to make sure a54e1f158 is unrelated -
and indeed, the problem appears independently.
I've a few leads that I'm currently testing out. One observation I think
might be crucial is that the
On Tue, Aug 14, 2018 at 2:07 PM, Todd A. Cook wrote:
> Sorry, I just noticed this. Mantid is my animal, so I can set
> "min_parallel_table_scan_size = 0"
> on it if that would be helpful. (Please reply directly if so; I'm not able
> to keep up with pgsql-hackers
> right now.)
We've already
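(For context: lowering that GUC removes the size threshold for parallel
scans, so even a table as small as pg_class can get a parallel index
rebuild. A hand-run sketch of the same idea, session-level rather than
in the animal's postgresql.conf:)

  -- Illustrative: coax a parallel index rebuild on a tiny catalog.
  SET min_parallel_table_scan_size = 0;
  SET max_parallel_maintenance_workers = 2;
  VACUUM FULL pg_class;   -- index rebuild may now run in parallel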
On 8/9/18, 12:56 AM, "Peter Geoghegan" wrote:
On Wed, Aug 8, 2018 at 7:40 PM, Tom Lane wrote:
>> Anyway, "VACUUM FULL pg_class" should be expected to corrupt
>> pg_class_oid_index when we happen to get a parallel build, since
>> pg_class is a mapped relation, and I've identified
On 08/11/2018 04:08 PM, Andres Freund wrote:
> Hi,
>
> On 2018-08-11 15:40:19 +0200, Tomas Vondra wrote:
>> For the record, I can actually reproduce this on 9.6 (haven't tried
>> older releases, but I suspect it's there too). Instead of using the
>> failing subscription, I've used another pgbench
Hi,
On 2018-08-11 15:40:19 +0200, Tomas Vondra wrote:
> For the record, I can actually reproduce this on 9.6 (haven't tried
> older releases, but I suspect it's there too). Instead of using the
> failing subscription, I've used another pgbench script doing this:
> SET statement_timeout = 5;
>
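(The script is cut off after its first line; a plausible shape for such
a pgbench file -- everything past the SET is guesswork, meant only to
show the idea of forcing frequent statement cancellations:)

  -- hypothetical pgbench script; only the SET line is from the message
  SET statement_timeout = 5;
  INSERT INTO t SELECT i FROM generate_series(1, 10000) i;
  RESET statement_timeout;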
On 08/11/2018 03:16 PM, Tomas Vondra wrote:
> On 08/11/2018 05:02 AM, Tom Lane wrote:
>> Peter Geoghegan writes:
>>> I'm concerned that this item has the potential to delay the release,
>>> since, as you said, we're back to the drawing board.
>>
>> Me too. I will absolutely not vote to release
On 08/11/2018 05:02 AM, Tom Lane wrote:
> Peter Geoghegan writes:
>> I'm concerned that this item has the potential to delay the release,
>> since, as you said, we're back to the drawing board.
>
> Me too. I will absolutely not vote to release 11.0 before we've
> solved this ...
>
Not sure. I
On Fri, Aug 10, 2018 at 8:02 PM, Tom Lane wrote:
> Me too. I will absolutely not vote to release 11.0 before we've
> solved this ...
I believe that that's the right call, assuming things don't change.
This is spooky in a way that creates a lot of doubts in my mind. I
don't think it's at all
Peter Geoghegan writes:
> I'm concerned that this item has the potential to delay the release,
> since, as you said, we're back to the drawing board.
Me too. I will absolutely not vote to release 11.0 before we've
solved this ...
regards, tom lane
On Fri, Aug 10, 2018 at 7:45 PM, Tom Lane wrote:
> Didn't take long to show that the relmapper issue wasn't it:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=coypu&dt=2018-08-10%2021%3A21%3A40
>
> So we're back to square one. Although Tomas' recent report might
> give us something new
Peter Geoghegan writes:
> On Wed, Aug 8, 2018 at 7:40 PM, Tom Lane wrote:
>> Oooh ... but pg_class wouldn't be big enough to get a parallel
>> index rebuild during that test, would it?
> Typically not, but I don't think that we can rule it out right away.
Didn't take long to show that the
On Wed, Aug 8, 2018 at 10:08 PM, Peter Geoghegan wrote:
>> Hmmm ... maybe we should temporarily stick in an elog(LOG) showing whether
>> a parallel build happened or not, so that we can check the buildfarm logs
>> next time we see that failure?
>
> I can do that tomorrow. Of course, it might be
On 08/09/2018 01:03 AM, Tom Lane wrote:
> Peter Geoghegan writes:
>> On Wed, Aug 8, 2018 at 7:40 PM, Tom Lane wrote:
>>> Oooh ... but pg_class wouldn't be big enough to get a parallel
>>> index rebuild during that test, would it?
>> Typically not, but I don't think that we can rule it out right away.
On Wed, Aug 8, 2018 at 10:03 PM, Tom Lane wrote:
>> Typically not, but I don't think that we can rule it out right away.
>
> Hmmm ... maybe we should temporarily stick in an elog(LOG) showing whether
> a parallel build happened or not, so that we can check the buildfarm logs
> next time we see
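(Something along these lines in the btree build path would do;
placement and variable names here are assumptions, not the patch that
was actually used:)

  /* Hypothetical temporary instrumentation, dropped into the nbtree
   * build code so the buildfarm logs record parallelism. */
  elog(LOG, "btree build on \"%s\": %d parallel worker(s) planned",
       RelationGetRelationName(index), nworkers);   /* names assumed */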
Peter Geoghegan writes:
> On Wed, Aug 8, 2018 at 7:40 PM, Tom Lane wrote:
>> Oooh ... but pg_class wouldn't be big enough to get a parallel
>> index rebuild during that test, would it?
> Typically not, but I don't think that we can rule it out right away.
Hmmm ... maybe we should temporarily
On Wed, Aug 8, 2018 at 7:40 PM, Tom Lane wrote:
>> Anyway, "VACUUM FULL pg_class" should be expected to corrupt
>> pg_class_oid_index when we happen to get a parallel build, since
>> pg_class is a mapped relation, and I've identified that as a problem
>> for parallel CREATE INDEX [2]. If that was
Peter Geoghegan writes:
> On Wed, Jul 25, 2018 at 4:07 PM, Andres Freund wrote:
>> I don't immediately see it being responsible, but I wonder if there's a
>> chance it actually is: Note that it happens in a parallel group that
>> includes vacuum.sql, which does a VACUUM FULL pg_class - but I
On Wed, Jul 25, 2018 at 4:07 PM, Andres Freund wrote:
>> HEAD/REL_11_STABLE apparently solely being affected points elsewhere,
>> but I don't immediately know where.
>
> Hm, there was:
> http://archives.postgresql.org/message-id/20180628150209.n2qch5jtn3vt2xaa%40alap3.anarazel.de
>
>
> I don't
Hi,
On 2018-07-20 13:24:50 -0700, Andres Freund wrote:
> On 2018-07-20 16:15:14 -0400, Tom Lane wrote:
> > We've seen several occurrences of $subject in the buildfarm in the past
> > month or so. Scraping the logs, I find
> >
> > coypu | 2018-06-14 21:17:49 | HEAD | Check |
On 2018-07-20 16:15:14 -0400, Tom Lane wrote:
> We've seen several occurrences of $subject in the buildfarm in the past
> month or so. Scraping the logs, I find
>
> coypu | 2018-06-14 21:17:49 | HEAD | Check | 2018-06-14
> 23:31:44.505 CEST [5b22deb8.30e1:124] ERROR: could not
We've seen several occurrences of $subject in the buildfarm in the past
month or so. Scraping the logs, I find
coypu | 2018-06-14 21:17:49 | HEAD | Check | 2018-06-14
23:31:44.505 CEST [5b22deb8.30e1:124] ERROR: could not read block 3 in file
"base/16384/2662": read only 0 of 8192 bytes