Re: [HACKERS] Some other odd buildfarm failures

2014-12-26 Tom Lane
Alvaro Herrera <alvhe...@2ndquadrant.com> writes:
> Tom Lane wrote:
>> Still, I don't think this is a reasonable test design.  We have
>> absolutely no idea what behaviors are being triggered in the other
>> tests, except that they are unrelated to what those tests think they
>> are testing.

> I can of course move it to a separate parallel test, but I don't think
> that should be really necessary.

I've not proven this rigorously, but it seems obvious in hindsight:
what's happening is that when the object_address test drops everything
with DROP CASCADE, other processes are sometimes just starting to execute
the event trigger when the DROP commits.  When they go to look up the
trigger function, they don't find it, leading to "cache lookup failed for
function".  The fact that the complained-of OID is slightly variable, but
always in the range of OIDs that would be getting assigned around this
point in a "make check" run, buttresses the theory.
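A minimal, single-threaded sketch of the interleaving described above, assuming a toy in-memory catalog. It is not PostgreSQL code: the class and function names and the function name "test_event_trigger" are hypothetical, and OID 30281 is simply borrowed from one of the buildfarm logs. It only shows why a session that has already decided to fire the event trigger fails when the trigger function vanishes before it is looked up.

```python
# Toy model of the race; NOT PostgreSQL source. All names are hypothetical.


class CacheLookupError(Exception):
    """Stands in for the 'cache lookup failed' error."""


class Catalog:
    def __init__(self):
        self.functions = {}       # function OID -> function name
        self.event_triggers = {}  # event trigger name -> function OID

    def lookup_function(self, oid):
        if oid not in self.functions:
            raise CacheLookupError(f"cache lookup failed for function {oid}")
        return self.functions[oid]


def run_interleaving(drop_commits_in_between):
    """The victim session fires the event trigger in two steps; the
    object_address session's DROP ... CASCADE may commit between them."""
    cat = Catalog()
    cat.functions[30281] = "test_event_trigger"
    cat.event_triggers["evt"] = 30281

    # Victim, step 1: collect the function OIDs of triggers that should fire.
    pending = list(cat.event_triggers.values())

    # object_address session: DROP ... CASCADE removes trigger and function.
    if drop_commits_in_between:
        cat.event_triggers.clear()
        cat.functions.clear()

    # Victim, step 2: resolve each OID before calling the function.
    try:
        return [cat.lookup_function(oid) for oid in pending]
    except CacheLookupError as err:
        return str(err)


print(run_interleaving(False))  # ['test_event_trigger']
print(run_interleaving(True))   # cache lookup failed for function 30281
```

In this model, dropping the trigger in a separate, earlier step changes nothing for a victim that has already finished step 1; it only makes it less likely that a victim is caught there.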

I thought about changing the object_address test so that it explicitly
drops the event trigger first.  But that would not be a fix, it would
just make the timing harder to hit (ie, a victim process would need to
lose control for longer at the critical point).

Since I remain of the opinion that a test called object_address has no
damn business causing global side-effects, I think there are two
reasonable fixes:

1. Remove the event trigger.  This would slightly reduce the test's
coverage.

2. Run that whole test as a single transaction, so that the event trigger
is created and dropped in one transaction and is never seen as valid by
any concurrent test.
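Why option 2 closes the window can be sketched with a deliberately simplified visibility rule, assuming (hypothetically) that a catalog row is visible to a snapshot only when its creating transaction had committed and its deleting transaction had not. This is not PostgreSQL's actual MVCC code, and it further assumes the transactions commit in XID order; it just enumerates every snapshot a concurrent test could take.

```python
# Simplified MVCC visibility, NOT PostgreSQL's real rules. Names are hypothetical.
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class CatalogRow:
    name: str
    xmin: int                   # XID that created the row
    xmax: Optional[int] = None  # XID that deleted the row, if any


def visible(row: CatalogRow, committed: Set[int]) -> bool:
    """Visible iff the creator committed and the deleter (if any) did not,
    as of the snapshot described by the set of committed XIDs."""
    if row.xmin not in committed:
        return False
    return row.xmax is None or row.xmax not in committed


def snapshots_that_see_trigger(create_xid: int, drop_xid: int) -> List[List[int]]:
    row = CatalogRow("event_trigger", xmin=create_xid, xmax=drop_xid)
    xids = sorted({create_xid, drop_xid})
    # Snapshots taken before, between, and after the commits (in XID order).
    candidates = [set(xids[:i]) for i in range(len(xids) + 1)]
    return [sorted(s) for s in candidates if visible(row, s)]


# CREATE and DROP in separate transactions: one window where the trigger is live.
print(snapshots_that_see_trigger(100, 200))  # [[100]]
# Whole test in one transaction: no concurrent snapshot ever sees the trigger.
print(snapshots_that_see_trigger(100, 100))  # []
```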

A long-term idea is to try to fix things so that there's sufficient
locking to make dropping an event trigger and immediately dropping its
trigger function safe.  But I'm not sure that's either possible or a good
idea (the lock obtained by DROP would bring the entire database to a
standstill ...).

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some other odd buildfarm failures

2014-12-26 Alvaro Herrera
Tom Lane wrote:
> Alvaro Herrera <alvhe...@2ndquadrant.com> writes:
>> Tom Lane wrote:
>>> Still, I don't think this is a reasonable test design.  We have
>>> absolutely no idea what behaviors are being triggered in the other
>>> tests, except that they are unrelated to what those tests think they
>>> are testing.
>>
>> I can of course move it to a separate parallel test, but I don't think
>> that should be really necessary.
>
> I've not proven this rigorously, but it seems obvious in hindsight:
> what's happening is that when the object_address test drops everything
> with DROP CASCADE, other processes are sometimes just starting to execute
> the event trigger when the DROP commits.  When they go to look up the
> trigger function, they don't find it, leading to "cache lookup failed for
> function".

Hm, maybe we can drop the event trigger explicitly first, then wait a
little bit, then drop the remaining objects with DROP CASCADE?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Some other odd buildfarm failures

2014-12-26 Tom Lane
Alvaro Herrera <alvhe...@2ndquadrant.com> writes:
> Tom Lane wrote:
>> I've not proven this rigorously, but it seems obvious in hindsight:
>> what's happening is that when the object_address test drops everything
>> with DROP CASCADE, other processes are sometimes just starting to execute
>> the event trigger when the DROP commits.  When they go to look up the
>> trigger function, they don't find it, leading to "cache lookup failed for
>> function".
>
> Hm, maybe we can drop the event trigger explicitly first, then wait a
> little bit, then drop the remaining objects with DROP CASCADE?

As I said, that's no fix; it just makes the timing harder to hit.  Another
process could be paused at the critical point for longer than whatever
"a little bit" is.

regards, tom lane




Re: [HACKERS] Some other odd buildfarm failures

2014-12-26 Alvaro Herrera
Tom Lane wrote:
> Alvaro Herrera <alvhe...@2ndquadrant.com> writes:
>> Tom Lane wrote:
>>> I've not proven this rigorously, but it seems obvious in hindsight:
>>> what's happening is that when the object_address test drops everything
>>> with DROP CASCADE, other processes are sometimes just starting to execute
>>> the event trigger when the DROP commits.  When they go to look up the
>>> trigger function, they don't find it, leading to "cache lookup failed for
>>> function".
>>
>> Hm, maybe we can drop the event trigger explicitly first, then wait a
>> little bit, then drop the remaining objects with DROP CASCADE?
>
> As I said, that's no fix; it just makes the timing harder to hit.  Another
> process could be paused at the critical point for longer than whatever
> "a little bit" is.

Yeah, I was thinking we could play some games with the currently running
XIDs from a txid_snapshot or some such, with a reasonable upper limit on
the waiting time (for the rare cases with a server doing other stuff
with long-running transactions.)



Re: [HACKERS] Some other odd buildfarm failures

2014-12-26 Tom Lane
Alvaro Herrera <alvhe...@2ndquadrant.com> writes:
> Tom Lane wrote:
>> Alvaro Herrera <alvhe...@2ndquadrant.com> writes:
>>> Hm, maybe we can drop the event trigger explicitly first, then wait a
>>> little bit, then drop the remaining objects with DROP CASCADE?
>>
>> As I said, that's no fix; it just makes the timing harder to hit.  Another
>> process could be paused at the critical point for longer than whatever
>> "a little bit" is.
>
> Yeah, I was thinking we could play some games with the currently running
> XIDs from a txid_snapshot or some such, with a reasonable upper limit on
> the waiting time (for the rare cases with a server doing other stuff
> with long-running transactions.)

Whether that's sane or not, the whole problem is so far out-of-scope for
a test of pg_get_object_address() that it's not even funny.  I think
we should adopt one of the two fixes I recommended and call it good.
If you want to work on making DROP EVENT TRIGGER safer in the long run,
that can be a separate activity.

regards, tom lane




Re: [HACKERS] Some other odd buildfarm failures

2014-12-26 Alvaro Herrera
Tom Lane wrote:
> Alvaro Herrera <alvhe...@2ndquadrant.com> writes:
>> Tom Lane wrote:
>>> Alvaro Herrera <alvhe...@2ndquadrant.com> writes:
>>>> Hm, maybe we can drop the event trigger explicitly first, then wait a
>>>> little bit, then drop the remaining objects with DROP CASCADE?
>>>
>>> As I said, that's no fix; it just makes the timing harder to hit.  Another
>>> process could be paused at the critical point for longer than whatever
>>> "a little bit" is.
>>
>> Yeah, I was thinking we could play some games with the currently running
>> XIDs from a txid_snapshot or some such, with a reasonable upper limit on
>> the waiting time (for the rare cases with a server doing other stuff
>> with long-running transactions.)
>
> Whether that's sane or not, the whole problem is so far out-of-scope for
> a test of pg_get_object_address() that it's not even funny.  I think
> we should adopt one of the two fixes I recommended and call it good.

I think dropping the part involving an event trigger from the test is
reasonable.  I will go do that.

> If you want to work on making DROP EVENT TRIGGER safer in the long run,
> that can be a separate activity.

This sounds like a huge project -- it's not like event triggers are the
only objects in the system where this is an issue, is it?  I'm sure
there is value in fixing it, but I have enough other projects.



Re: [HACKERS] Some other odd buildfarm failures

2014-12-26 Andres Freund
On December 26, 2014 6:10:51 PM CET, Alvaro Herrera <alvhe...@2ndquadrant.com> wrote:
> Tom Lane wrote:
>> Alvaro Herrera <alvhe...@2ndquadrant.com> writes:
>>> Tom Lane wrote:
>>>> Alvaro Herrera <alvhe...@2ndquadrant.com> writes:
>>>>> Hm, maybe we can drop the event trigger explicitly first, then wait a
>>>>> little bit, then drop the remaining objects with DROP CASCADE?
>>>>
>>>> As I said, that's no fix; it just makes the timing harder to hit.  Another
>>>> process could be paused at the critical point for longer than whatever
>>>> "a little bit" is.
>>>
>>> Yeah, I was thinking we could play some games with the currently running
>>> XIDs from a txid_snapshot or some such, with a reasonable upper limit on
>>> the waiting time (for the rare cases with a server doing other stuff
>>> with long-running transactions.)
>>
>> Whether that's sane or not, the whole problem is so far out-of-scope for
>> a test of pg_get_object_address() that it's not even funny.  I think
>> we should adopt one of the two fixes I recommended and call it good.
>
> I think dropping the part involving an event trigger from the test is
> reasonable.  I will go do that.
>
>> If you want to work on making DROP EVENT TRIGGER safer in the long run,
>> that can be a separate activity.
>
> This sounds like a huge project -- it's not like event triggers are the
> only objects in the system where this is an issue, is it?  I'm sure
> there is value in fixing it, but I have enough other projects.

Can't we just move the test to run without parallelism? It's quite quick, so I
don't think it'd have noticeable consequences timewise.


-- 
Please excuse brevity and formatting - I am writing this on my mobile phone.

Andres Freund  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Some other odd buildfarm failures

2014-12-26 Tom Lane
Andres Freund <and...@2ndquadrant.com> writes:
> Can't we just move the test to run without parallelism? It's quite quick, so I
> don't think it'd have noticeable consequences timewise.

That just leaves the door open for somebody to add more tests parallel to
it in future.

TBH, I think we could have done without this test altogether; but if we're
going to have it, a minimum expectation is that it not be hazardous to
other tests around it.

regards, tom lane




Re: [HACKERS] Some other odd buildfarm failures

2014-12-26 Alvaro Herrera
Tom Lane wrote:
> Andres Freund <and...@2ndquadrant.com> writes:
>> Can't we just move the test to run without parallelism? It's quite quick, so
>> I don't think it'd have noticeable consequences timewise.
>
> That just leaves the door open for somebody to add more tests parallel to
> it in future.

I've long wanted to add declarative dependencies to tests: each
test file would declare what other tests it depends on, and we would
have a special clause to state "this one must not be run concurrently
with anything else".  Of course, this is just wishful thinking at this
point.
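Purely as an illustration of that wish, and not anything pg_regress supports, a scheduler could derive parallel groups from such declarations. The sketch below assumes a hypothetical spec format (each test maps to its dependencies plus an "exclusive" flag) and uses Python's graphlib.TopologicalSorter (Python 3.9+). The dependency on a "create_table" test is made up for the example; brin, matview and object_address are names from the parallel group discussed in this thread.

```python
# Hypothetical dependency-driven test scheduler; pg_regress has no such feature.
from graphlib import TopologicalSorter
from typing import Dict, List, Set, Tuple

# test name -> (tests it depends on, must it run alone?)
Spec = Dict[str, Tuple[Set[str], bool]]


def schedule(tests: Spec) -> List[List[str]]:
    """Return groups in run order; tests in the same group may run in parallel."""
    sorter = TopologicalSorter({name: deps for name, (deps, _) in tests.items()})
    sorter.prepare()
    groups: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        shared = [t for t in ready if not tests[t][1]]
        if shared:
            groups.append(shared)
        # Each exclusive test gets a group to itself.
        groups.extend([t] for t in ready if tests[t][1])
        sorter.done(*ready)
    return groups


example: Spec = {
    "create_table": (set(), False),
    "brin": ({"create_table"}, False),
    "matview": ({"create_table"}, False),
    "object_address": ({"create_table"}, True),
}
print(schedule(example))
# [['create_table'], ['brin', 'matview'], ['object_address']]
```

Running each exclusive test after its level's shared group is an arbitrary choice; it is safe because tests that become ready together do not depend on one another.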

> TBH, I think we could have done without this test altogether; but if we're
> going to have it, a minimum expectation is that it not be hazardous to
> other tests around it.

The number of assertion failures in get_object_address without all the
sanity checks I added in pg_get_object_address was a bit surprising.
That's the whole reason I decided to add the test.  I don't want to
blindly assume that all cases will work nicely in the future,
particularly as other object types are added.



Re: [HACKERS] Some other odd buildfarm failures

2014-12-26 Alvaro Herrera
Andres Freund wrote:

>> This sounds like a huge project -- it's not like event triggers are the
>> only objects in the system where this is an issue, is it?  I'm sure
>> there is value in fixing it, but I have enough other projects.
>
> Can't we just move the test to run without parallelism? It's quite
> quick, so I don't think it'd have noticeable consequences timewise.

(I got this a minute too late anyhow.)



Re: [HACKERS] Some other odd buildfarm failures

2014-12-26 Tom Lane
Alvaro Herrera <alvhe...@2ndquadrant.com> writes:
> Tom Lane wrote:
>> TBH, I think we could have done without this test altogether; but if we're
>> going to have it, a minimum expectation is that it not be hazardous to
>> other tests around it.
>
> The number of assertion failures in get_object_address without all the
> sanity checks I added in pg_get_object_address was a bit surprising.
> That's the whole reason I decided to add the test.  I don't want to
> blindly assume that all cases will work nicely in the future,
> particularly as other object types are added.

I'm surprised then that you didn't prefer the other solution (wrap the
whole test in a single transaction).  But we've probably spent more
time on this than it deserves.

regards, tom lane




[HACKERS] Some other odd buildfarm failures

2014-12-25 Tom Lane
These two recent failures look suspiciously similar:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jaguarundi&dt=2014-12-24%2021%3A03%3A05
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kouprey&dt=2014-12-25%2018%3A43%3A17

to wit:

*** /usr/home/pgbf/buildenv/HEAD/pgsql.build/src/test/regress/expected/brin.out	Wed Dec 24 22:03:07 2014
--- /usr/home/pgbf/buildenv/HEAD/pgsql.build/src/test/regress/results/brin.out	Wed Dec 24 22:54:26 2014
***************
*** 146,151 ****
--- 146,154 ----
  end loop;
  end;
  $x$;
+ ERROR:  cache lookup failed for function 30281
+ CONTEXT:  SQL statement "create temp table qry_2_ss (tid tid) ON COMMIT DROP"
+ PL/pgSQL function inline_code_block line 26 at EXECUTE statement
  INSERT INTO brintest SELECT
repeat(stringu1, 42)::bytea,
substr(stringu1, 1, 1)::char,


*** /home/markwkm/buildroot/HEAD/pgsql.24814/src/test/regress/expected/matview.out	Thu Dec 25 18:43:31 2014
--- /home/markwkm/buildroot/HEAD/pgsql.24814/src/test/regress/results/matview.out	Thu Dec 25 18:45:25 2014
***************
*** 90,97 ****
--- 90,102 ----
  
  CREATE MATERIALIZED VIEW tvvm AS SELECT * FROM tvv;
  CREATE VIEW tvvmv AS SELECT * FROM tvvm;
+ ERROR:  cache lookup failed for function 30311
  CREATE MATERIALIZED VIEW bb AS SELECT * FROM tvvmv;


When I saw jaguarundi's failure yesterday, I figured it was something to do
with CLOBBER_CACHE_ALWAYS ... but kouprey isn't using that option AFAICS,
so that idea fails to hold water.

I don't believe that the referenced function OIDs would ever actually
exist in the database in the current regression tests; and the two failing
statements have no reason to be accessing any user-defined functions
anyway.  So those OIDs are probably bogus.  It seems likely that something
is clobbering storage that is later expected to hold an OID.  Whatever's
going on, it's likely that this is a recently-introduced bug, because
I don't recall seeing reports like these before.

regards, tom lane




Re: [HACKERS] Some other odd buildfarm failures

2014-12-25 Tom Lane
I wrote:
> These two recent failures look suspiciously similar:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jaguarundi&dt=2014-12-24%2021%3A03%3A05
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kouprey&dt=2014-12-25%2018%3A43%3A17

And I'd barely finished posting that before this one arrived:

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=shearwater&dt=2014-12-25%2020%3A24%3A32

*** /home/buildfarm/build_root/HEAD/pgsql.32096/src/test/regress/expected/privileges.out	Fri Dec 26 00:24:38 2014
--- /home/buildfarm/build_root/HEAD/pgsql.32096/src/test/regress/results/privileges.out	Fri Dec 26 00:25:54 2014
***************
*** 197,202 ****
--- 197,203 ----
  CREATE VIEW atestv1 AS SELECT * FROM atest1; -- ok
  /* The next *should* fail, but it's not implemented that way yet. */
  CREATE VIEW atestv2 AS SELECT * FROM atest2;
+ ERROR:  cache lookup failed for function 30274
  CREATE VIEW atestv3 AS SELECT * FROM atest3; -- ok
  /* Empty view is a corner case that failed in 9.2. */
  CREATE VIEW atestv0 AS SELECT 0 as x WHERE false; -- ok

I find it suspicious that all three examples are in the same group of
parallel tests.  A possible theory is that one of these tests:

test: brin gin gist spgist privileges security_label collate matview lock replica_identity rowsecurity object_address

is doing something that has bad side-effects on concurrent sessions.

In any case, it now seems dead certain that this is a recently introduced
bug.  Andres is fortunate that the first instance occurred before his
recent batch of commits, or I'd be on him to revert them.  As is, though,
I'm wondering if 37de8de9e33606a043e98fee64b5595aedaa8254 could possibly
be related.

regards, tom lane




Re: [HACKERS] Some other odd buildfarm failures

2014-12-25 Andres Freund
Hi,

On 2014-12-25 16:01:47 -0500, Tom Lane wrote:
> In any case, it now seems dead certain that this is a recently introduced
> bug.  Andres is fortunate that the first instance occurred before his
> recent batch of commits, or I'd be on him to revert them.

Yes, phew. These look rather odd.

> As is, though,
> I'm wondering if 37de8de9e33606a043e98fee64b5595aedaa8254 could possibly
> be related.

I really can't imagine how. If an additional barrier in that place could
cause such problems, we'd surely have seen more reports than these, caused
by an accidental wait at the right place.

My guess is that it's related to d7ee82e50f. It seems realistic that the
event trigger added by it to the object_address test can cause errors at
varying times.

I wonder if it'd not be a good idea if the event trigger code installed
an error context callback? Since they can be called in situations we
don't routinely expect, that'd make diagnosis easier in many cases.

Greetings,

Andres Freund



Re: [HACKERS] Some other odd buildfarm failures

2014-12-25 Tom Lane
Andres Freund <and...@2ndquadrant.com> writes:
> I wonder if it'd not be a good idea if the event trigger code installed
> an error context callback? Since they can be called in situations we
> don't routinely expect, that'd make diagnosis easier in many cases.

+1 ... even if that's not related to the immediate issue, it seems like
a good idea.

regards, tom lane




Re: [HACKERS] Some other odd buildfarm failures

2014-12-25 Tom Lane
Andres Freund <and...@2ndquadrant.com> writes:
> My guess is that it's related to d7ee82e50f. It seems realistic that the
> event trigger added by it to the object_address test can cause errors at
> varying times.

[ squint... ]  Event triggers are global across the whole database, aren't
they?  Isn't it frickin insane to run a test like this in parallel with
others?

Not but what it seems to be exposing some bugs.  Still, I don't think this
is a reasonable test design.  We have absolutely no idea what behaviors
are being triggered in the other tests, except that they are unrelated to
what those tests think they are testing.

regards, tom lane




Re: [HACKERS] Some other odd buildfarm failures

2014-12-25 Alvaro Herrera
Tom Lane wrote:
> Andres Freund <and...@2ndquadrant.com> writes:
>> My guess is that it's related to d7ee82e50f. It seems realistic that the
>> event trigger added by it to the object_address test can cause errors at
>> varying times.
>
> [ squint... ]  Event triggers are global across the whole database, aren't
> they?  Isn't it frickin insane to run a test like this in parallel with
> others?

Well, the event trigger function is "BEGIN END;", so I don't think it
should affect anything.

> Not but what it seems to be exposing some bugs.

That seems to me a good thing ... a bit inconvenient of course, but it
beats having users get strange behavior which they can't possibly debug.

> Still, I don't think this is a reasonable test design.  We have
> absolutely no idea what behaviors are being triggered in the other
> tests, except that they are unrelated to what those tests think they
> are testing.

I can of course move it to a separate parallel test, but I don't think
that should be really necessary.
