Re: [HACKERS] Some other odd buildfarm failures
Alvaro Herrera alvhe...@2ndquadrant.com writes: Tom Lane wrote: Still, I don't think this is a reasonable test design. We have absolutely no idea what behaviors are being triggered in the other tests, except that they are unrelated to what those tests think they are testing. I can of course move it to a separate parallel test, but I don't think that should be really necessary. I've not proven this rigorously, but it seems obvious in hindsight: what's happening is that when the object_address test drops everything with DROP CASCADE, other processes are sometimes just starting to execute the event trigger when the DROP commits. When they go to look up the trigger function, they don't find it, leading to cache lookup failed for function. The fact that the complained-of OID is slightly variable, but always in the range of OIDs that would be getting assigned around this point in a make check run, buttresses the theory. I thought about changing the object_address test so that it explicitly drops the event trigger first. But that would not be a fix, it would just make the timing harder to hit (ie, a victim process would need to lose control for longer at the critical point). Since I remain of the opinion that a test called object_address has no damn business causing global side-effects, I think there are two reasonable fixes: 1. Remove the event trigger. This would slightly reduce the test's coverage. 2. Run that whole test as a single transaction, so that the event trigger is created and dropped in one transaction and is never seen as valid by any concurrent test. A long-term idea is to try to fix things so that there's sufficient locking to make dropping an event trigger and immediately dropping its trigger function safe. But I'm not sure that's either possible or a good idea (the lock obtained by DROP would bring the entire database to a standstill ...). regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Some other odd buildfarm failures
Tom Lane wrote: Alvaro Herrera alvhe...@2ndquadrant.com writes: Tom Lane wrote: Still, I don't think this is a reasonable test design. We have absolutely no idea what behaviors are being triggered in the other tests, except that they are unrelated to what those tests think they are testing. I can of course move it to a separate parallel test, but I don't think that should be really necessary. I've not proven this rigorously, but it seems obvious in hindsight: what's happening is that when the object_address test drops everything with DROP CASCADE, other processes are sometimes just starting to execute the event trigger when the DROP commits. When they go to look up the trigger function, they don't find it, leading to cache lookup failed for function. Hm, maybe we can drop the event trigger explicitely first, then wait a little bit, then drop the remaining objects with DROP CASCADE? -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Some other odd buildfarm failures
Alvaro Herrera alvhe...@2ndquadrant.com writes: Tom Lane wrote: I've not proven this rigorously, but it seems obvious in hindsight: what's happening is that when the object_address test drops everything with DROP CASCADE, other processes are sometimes just starting to execute the event trigger when the DROP commits. When they go to look up the trigger function, they don't find it, leading to cache lookup failed for function. Hm, maybe we can drop the event trigger explicitely first, then wait a little bit, then drop the remaining objects with DROP CASCADE? As I said, that's no fix; it just makes the timing harder to hit. Another process could be paused at the critical point for longer than whatever a little bit is. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Some other odd buildfarm failures
Tom Lane wrote: Alvaro Herrera alvhe...@2ndquadrant.com writes: Tom Lane wrote: I've not proven this rigorously, but it seems obvious in hindsight: what's happening is that when the object_address test drops everything with DROP CASCADE, other processes are sometimes just starting to execute the event trigger when the DROP commits. When they go to look up the trigger function, they don't find it, leading to cache lookup failed for function. Hm, maybe we can drop the event trigger explicitely first, then wait a little bit, then drop the remaining objects with DROP CASCADE? As I said, that's no fix; it just makes the timing harder to hit. Another process could be paused at the critical point for longer than whatever a little bit is. Yeah, I was thinking we could play some games with the currently running XIDs from a txid_snapshot or some such, with a reasonable upper limit on the waiting time (for the rare cases with a server doing other stuff with long-running transactions.) -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Some other odd buildfarm failures
Alvaro Herrera alvhe...@2ndquadrant.com writes: Tom Lane wrote: Alvaro Herrera alvhe...@2ndquadrant.com writes: Hm, maybe we can drop the event trigger explicitely first, then wait a little bit, then drop the remaining objects with DROP CASCADE? As I said, that's no fix; it just makes the timing harder to hit. Another process could be paused at the critical point for longer than whatever a little bit is. Yeah, I was thinking we could play some games with the currently running XIDs from a txid_snapshot or some such, with a reasonable upper limit on the waiting time (for the rare cases with a server doing other stuff with long-running transactions.) Whether that's sane or not, the whole problem is so far out-of-scope for a test of pg_get_object_address() that it's not even funny. I think we should adopt one of the two fixes I recommended and call it good. If you want to work on making DROP EVENT TRIGGER safer in the long run, that can be a separate activity. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Some other odd buildfarm failures
Tom Lane wrote: Alvaro Herrera alvhe...@2ndquadrant.com writes: Tom Lane wrote: Alvaro Herrera alvhe...@2ndquadrant.com writes: Hm, maybe we can drop the event trigger explicitely first, then wait a little bit, then drop the remaining objects with DROP CASCADE? As I said, that's no fix; it just makes the timing harder to hit. Another process could be paused at the critical point for longer than whatever a little bit is. Yeah, I was thinking we could play some games with the currently running XIDs from a txid_snapshot or some such, with a reasonable upper limit on the waiting time (for the rare cases with a server doing other stuff with long-running transactions.) Whether that's sane or not, the whole problem is so far out-of-scope for a test of pg_get_object_address() that it's not even funny. I think we should adopt one of the two fixes I recommended and call it good. I think dropping the part involving an event trigger from the test is reasonable. I will go do that. If you want to work on making DROP EVENT TRIGGER safer in the long run, that can be a separate activity. This sounds like a huge project -- it's not like event triggers are the only objects in the system where this is an issue, is it? I'm sure there is value in fixing it, but I have enough other projects. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Some other odd buildfarm failures
On December 26, 2014 6:10:51 PM CET, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Tom Lane wrote: Alvaro Herrera alvhe...@2ndquadrant.com writes: Tom Lane wrote: Alvaro Herrera alvhe...@2ndquadrant.com writes: Hm, maybe we can drop the event trigger explicitely first, then wait a little bit, then drop the remaining objects with DROP CASCADE? As I said, that's no fix; it just makes the timing harder to hit. Another process could be paused at the critical point for longer than whatever a little bit is. Yeah, I was thinking we could play some games with the currently running XIDs from a txid_snapshot or some such, with a reasonable upper limit on the waiting time (for the rare cases with a server doing other stuff with long-running transactions.) Whether that's sane or not, the whole problem is so far out-of-scope for a test of pg_get_object_address() that it's not even funny. I think we should adopt one of the two fixes I recommended and call it good. I think dropping the part involving an event trigger from the test is reasonable. I will go do that. If you want to work on making DROP EVENT TRIGGER safer in the long run, that can be a separate activity. This sounds like a huge project -- it's not like event triggers are the only objects in the system where this is an issue, is it? I'm sure there is value in fixing it, but I have enough other projects. Can't we just move the test to run without parallelism? Its quite quick, so I don't it'd have noticeable consequences timewise. -- Please excuse brevity and formatting - I am writing this on my mobile phone. Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Some other odd buildfarm failures
Andres Freund and...@2ndquadrant.com writes: Can't we just move the test to run without parallelism? Its quite quick, so I don't it'd have noticeable consequences timewise. That just leaves the door open for somebody to add more tests parallel to it in future. TBH, I think we could have done without this test altogether; but if we're going to have it, a minimum expectation is that it not be hazardous to other tests around it. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Some other odd buildfarm failures
Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: Can't we just move the test to run without parallelism? Its quite quick, so I don't it'd have noticeable consequences timewise. That just leaves the door open for somebody to add more tests parallel to it in future. I've been long wanted to add declarative dependencies to tests: each test file would declare what other tests it depends on, and we would have a special clause to state this one must not be run in concurrence with anything else. Of course, this is just wishful thinking at this point. TBH, I think we could have done without this test altogether; but if we're going to have it, a minimum expectation is that it not be hazardous to other tests around it. The number of assertion failures in get_object_address without all the sanity checks I added in pg_get_object_address was a bit surprising. That's the whole reason I decided to add the test. I don't want to blindly assume that all cases will work nicely in the future, particularly as other object types are added. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Some other odd buildfarm failures
Andres Freund wrote: This sounds like a huge project -- it's not like event triggers are the only objects in the system where this is an issue, is it? I'm sure there is value in fixing it, but I have enough other projects. Can't we just move the test to run without parallelism? Its quite quick, so I don't it'd have noticeable consequences timewise. (I got this a minute too late anyhow.) -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Some other odd buildfarm failures
Alvaro Herrera alvhe...@2ndquadrant.com writes: Tom Lane wrote: TBH, I think we could have done without this test altogether; but if we're going to have it, a minimum expectation is that it not be hazardous to other tests around it. The number of assertion failures in get_object_address without all the sanity checks I added in pg_get_object_address was a bit surprising. That's the whole reason I decided to add the test. I don't want to blindly assume that all cases will work nicely in the future, particularly as other object types are added. I'm surprised then that you didn't prefer the other solution (wrap the whole test in a single transaction). But we've probably spent more time on this than it deserves. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Some other odd buildfarm failures
These two recent failures look suspiciously similar: http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jaguarundidt=2014-12-24%2021%3A03%3A05 http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=koupreydt=2014-12-25%2018%3A43%3A17 to wit: *** /usr/home/pgbf/buildenv/HEAD/pgsql.build/src/test/regress/expected/brin.out Wed Dec 24 22:03:07 2014 --- /usr/home/pgbf/buildenv/HEAD/pgsql.build/src/test/regress/results/brin.out Wed Dec 24 22:54:26 2014 *** *** 146,151 --- 146,154 end loop; end; $x$; + ERROR: cache lookup failed for function 30281 + CONTEXT: SQL statement create temp table qry_2_ss (tid tid) ON COMMIT DROP + PL/pgSQL function inline_code_block line 26 at EXECUTE statement INSERT INTO brintest SELECT repeat(stringu1, 42)::bytea, substr(stringu1, 1, 1)::char, *** /home/markwkm/buildroot/HEAD/pgsql.24814/src/test/regress/expected/matview.out Thu Dec 25 18:43:31 2014 --- /home/markwkm/buildroot/HEAD/pgsql.24814/src/test/regress/results/matview.out Thu Dec 25 18:45:25 2014 *** *** 90,97 --- 90,102 CREATE MATERIALIZED VIEW tvvm AS SELECT * FROM tvv; CREATE VIEW tvvmv AS SELECT * FROM tvvm; + ERROR: cache lookup failed for function 30311 CREATE MATERIALIZED VIEW bb AS SELECT * FROM tvvmv; When I saw jagarundi's failure yesterday, I figured it was something to do with CLOBBER_CACHE_ALWAYS ... but kouprey isn't using that option AFAICS, so that idea fails to hold water. I don't believe that the referenced function OIDs would ever actually exist in the database in the current regression tests; and the two failing statements have no reason to be accessing any user-defined functions anyway. So those OIDs are probably bogus. It seems likely that something is clobbering storage that is later expected to hold an OID. Whatever's going on, it's likely that this is a recently-introduced bug, because I don't recall seeing reports like these before. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Some other odd buildfarm failures
I wrote: These two recent failures look suspiciously similar: http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jaguarundidt=2014-12-24%2021%3A03%3A05 http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=koupreydt=2014-12-25%2018%3A43%3A17 And I'd barely finished posting that before this one arrived: http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=shearwaterdt=2014-12-25%2020%3A24%3A32 *** /home/buildfarm/build_root/HEAD/pgsql.32096/src/test/regress/expected/privileges.out Fri Dec 26 00:24:38 2014 --- /home/buildfarm/build_root/HEAD/pgsql.32096/src/test/regress/results/privileges.out Fri Dec 26 00:25:54 2014 *** *** 197,202 --- 197,203 CREATE VIEW atestv1 AS SELECT * FROM atest1; -- ok /* The next *should* fail, but it's not implemented that way yet. */ CREATE VIEW atestv2 AS SELECT * FROM atest2; + ERROR: cache lookup failed for function 30274 CREATE VIEW atestv3 AS SELECT * FROM atest3; -- ok /* Empty view is a corner case that failed in 9.2. */ CREATE VIEW atestv0 AS SELECT 0 as x WHERE false; -- ok I find it suspicious that all three examples are in the same group of parallel tests. A possible theory is that one of these tests: test: brin gin gist spgist privileges security_label collate matview lock replica_identity rowsecurity object_address is doing something that has bad side-effects on concurrent sessions. In any case, it now seems dead certain that this is a recently introduced bug. Andres is fortunate that the first instance occurred before his recent batch of commits, or I'd be on him to revert them. As is, though, I'm wondering if 37de8de9e33606a043e98fee64b5595aedaa8254 could possibly be related. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Some other odd buildfarm failures
Hi, On 2014-12-25 16:01:47 -0500, Tom Lane wrote: In any case, it now seems dead certain that this is a recently introduced bug. Andres is fortunate that the first instance occurred before his recent batch of commits, or I'd be on him to revert them. Yes, Phew. These look rather odd. As is, though, I'm wondering if 37de8de9e33606a043e98fee64b5595aedaa8254 could possibly be related. I really can't imagine how. If a additional barrier in that place can cause such problems we'd surely have more reports by an accidental wait in the right place than these. My guess is that it's related to d7ee82e50f. It seems realistic that the event trigger added by it to the object_address test can cause errors at varying times. I wonder if it'd not be a good idea if the event trigger code installed a error context callback? Since they can be called in situations we don't routinely expect that'd make diagnosis in many cases easier. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Some other odd buildfarm failures
Andres Freund and...@2ndquadrant.com writes: I wonder if it'd not be a good idea if the event trigger code installed a error context callback? Since they can be called in situations we don't routinely expect that'd make diagnosis in many cases easier. +1 ... even if that's not related to the immediate issue, it seems like a good idea. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Some other odd buildfarm failures
Andres Freund and...@2ndquadrant.com writes: My guess is that it's related to d7ee82e50f. It seems realistic that the event trigger added by it to the object_address test can cause errors at varying times. [ squint... ] Event triggers are global across the whole database, aren't they? Isn't it frickin insane to run a test like this in parallel with others? Not but what it seems to be exposing some bugs. Still, I don't think this is a reasonable test design. We have absolutely no idea what behaviors are being triggered in the other tests, except that they are unrelated to what those tests think they are testing. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Some other odd buildfarm failures
Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: My guess is that it's related to d7ee82e50f. It seems realistic that the event trigger added by it to the object_address test can cause errors at varying times. [ squint... ] Event triggers are global across the whole database, aren't they? Isn't it frickin insane to run a test like this in parallel with others? Well, the event trigger function is BEGIN END; so I don't think it should affect anything. Not but what it seems to be exposing some bugs. That seems to me a good thing ... a bit inconvenient of course, but it beats having users get strange behavior which they can't possibly debug. Still, I don't think this is a reasonable test design. We have absolutely no idea what behaviors are being triggered in the other tests, except that they are unrelated to what those tests think they are testing. I can of course move it to a separate parallel test, but I don't think that should be really necessary. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers