Re: [HACKERS] strange buildfarm failure on lionfish

2007-07-25 Thread Tom Lane
I wrote:
> Alvaro Herrera <[EMAIL PROTECTED]> writes:
>> Maybe what we could do is set higher thresholds for the regression
>> database with ALTER DATABASE.

> That seems to make sense at least as a short-term response.

I tried this, and it crashed and burned:

ERROR:  parameter "autovacuum_analyze_threshold" cannot be changed now

because the autovac parameters are marked PGC_SIGHUP so that they
can only be adjusted from postgresql.conf.
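
The command tried was something like this (a sketch; the specific value shown
here is just illustrative):

ALTER DATABASE regression SET autovacuum_analyze_threshold = 1000;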

Someday we might like to allow this, but it seems to mean inventing a
new GUC context type, which I don't think I want to get into right now.

For the moment I guess the answer is to change the horology test case.
So far we have seen no failures other than that one, so my worry about
autovac impacting other test cases may be overblown.

regards, tom lane



Re: [HACKERS] strange buildfarm failure on lionfish

2007-07-25 Thread Gregory Stark
Tom Lane <[EMAIL PROTECTED]> writes:

> Someday we might like to allow this, but it seems to mean inventing a
> new GUC context type, which I don't think I want to get into right now.

Another option would be adding an option to initdb to override default config
settings in the postgresql.conf. Then pg_regress could just specify options
like this for the regression database if necessary.

It might make sense to tweak other settings in there as well to improve test
coverage of things which probably never happen within the short time the
regression test suite takes to run.

Also, it might let us set up a standby database in the regression suite, which
would give us some xlog coverage as well; that's a critical portion of the
system the regression suite doesn't test at all.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com




Re: [HACKERS] strange buildfarm failure on lionfish

2007-07-25 Thread Tom Lane
Simon Riggs <[EMAIL PROTECTED]> writes:
> On Wed, 2007-07-25 at 22:20 +0100, Gregory Stark wrote:
>> Also, it might let us set up a standby database in the regression suite,
>> which would give us some xlog coverage as well; that's a critical portion
>> of the system the regression suite doesn't test at all.

> I like that general thought and will help where I can.

One pretty crude test I use a lot is to kill -9 the bgwriter or some
other subprocess immediately after the regression tests finish.
Assuming you have checkpoints far enough apart, this forces a replay
from WAL of the entire regression sequence.  Doesn't exercise archiving
though.

regards, tom lane



Re: [HACKERS] strange buildfarm failure on lionfish

2007-07-25 Thread Simon Riggs
On Wed, 2007-07-25 at 22:20 +0100, Gregory Stark wrote:
> Also, it might let us set up a standby database in the regression suite,
> which would give us some xlog coverage as well; that's a critical portion
> of the system the regression suite doesn't test at all.

I like that general thought and will help where I can.

-- 
  Simon Riggs
  EnterpriseDB  http://www.enterprisedb.com




Re: [HACKERS] strange buildfarm failure on lionfish

2007-07-25 Thread Andrew Dunstan

Gregory Stark wrote:
> Tom Lane <[EMAIL PROTECTED]> writes:
>> Someday we might like to allow this, but it seems to mean inventing a
>> new GUC context type, which I don't think I want to get into right now.
>
> Another option would be adding an option to initdb to override default config
> settings in the postgresql.conf. Then pg_regress could just specify options
> like this for the regression database if necessary.

I thought about that. But it makes things harder for the installcheck
case. I think the regression set needs to be able to run against a
pretty vanilla install (e.g. the only thing buildfarm specifies for
initdb is --no-locale).


cheers

andrew





[HACKERS] strange buildfarm failure on lionfish

2007-07-24 Thread Stefan Kaltenbrunner

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lionfish&dt=2007-07-24%2005:30:13


any ideas ?


Stefan



Re: [HACKERS] strange buildfarm failure on lionfish

2007-07-24 Thread Stefan Kaltenbrunner

Stefan Kaltenbrunner wrote:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lionfish&dt=2007-07-24%2005:30:13


clownfish just hit the same problem:

http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=clownfish&dt=2007-07-24%2013:08:29


Stefan



Re: [HACKERS] strange buildfarm failure on lionfish

2007-07-24 Thread Zdenek Kotala

Stefan Kaltenbrunner wrote:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lionfish&dt=2007-07-24%2005:30:13
>
> any ideas ?



This test is very sensitive to floating-point behavior. Any gcc or libc
update on this machine?


Zdenek



Re: [HACKERS] strange buildfarm failure on lionfish

2007-07-24 Thread Stefan Kaltenbrunner

Zdenek Kotala wrote:
> Stefan Kaltenbrunner wrote:
>> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lionfish&dt=2007-07-24%2005:30:13
>>
>> any ideas ?
>
> This test is very sensitive to floating-point behavior. Any gcc or libc
> update on this machine?


Nope - in fact nobody has even logged in on the box for over two weeks.
And clownfish failed too, which is a completely different hardware/software
combination, so this seems in fact to be a result of the libpq changes?!



Stefan



Re: [HACKERS] strange buildfarm failure on lionfish

2007-07-24 Thread Gregory Stark
Stefan Kaltenbrunner <[EMAIL PROTECTED]> writes:

> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lionfish&dt=2007-07-24%2005:30:13

That's just a faulty test:

SELECT t.d1 + i.f1 AS "102" FROM TIMESTAMP_TBL t, INTERVAL_TBL i
  WHERE t.d1 BETWEEN '1990-01-01' AND '2001-01-01'
    AND i.f1 BETWEEN '00:00' AND '23:00';

Note that there's no ORDER BY on the test. I bet the planner came up with an
entirely different plan than usual which generated the records in a different
order.
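
One way to make the output order deterministic would be to tack an ORDER BY
onto it, along these lines (just a sketch, not necessarily what should be
committed):

SELECT t.d1 + i.f1 AS "102" FROM TIMESTAMP_TBL t, INTERVAL_TBL i
  WHERE t.d1 BETWEEN '1990-01-01' AND '2001-01-01'
    AND i.f1 BETWEEN '00:00' AND '23:00'
  ORDER BY 1;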

Offhand I can only think of one kind of plan myself but I'm sure the planner
is more inventive than me :)

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com




Re: [HACKERS] strange buildfarm failure on lionfish

2007-07-24 Thread Tom Lane
Stefan Kaltenbrunner <[EMAIL PROTECTED]> writes:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lionfish&dt=2007-07-24%2005:30:13
> any ideas ?

I saw what I think was the identical failure last night on my own
machine, but it wasn't repeatable.  Evidently the planner is changing to
a different plan for those queries, but why has this only started
recently?  Maybe the recent changes to autovacuum defaults are causing
autovac to hit these tables when it never did before?

regards, tom lane



Re: [HACKERS] strange buildfarm failure on lionfish

2007-07-24 Thread Gregory Stark
Tom Lane <[EMAIL PROTECTED]> writes:

> Stefan Kaltenbrunner <[EMAIL PROTECTED]> writes:
>> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lionfish&dt=2007-07-24%2005:30:13
>> any ideas ?

> I saw what I think was the identical failure last night on my own
> machine, but it wasn't repeatable.  Evidently the planner is changing to
> a different plan for those queries, but why has this only started
> recently?  Maybe the recent changes to autovacuum defaults are causing
> autovac to hit these tables when it never did before?

Indeed the only alternate plan I can imagine for this is to do the join the
other way around. And given the large difference in sizes between the two
tables the only way I could get that to happen was by obliterating the
statistics entirely for one table but having stats for the other.

This does raise a possible issue with autovacuum. Treating ANALYZE like VACUUM
and running it on individual tables one at a time is probably the wrong thing
to be doing. What really has to happen is it should run analyze on all tables
together in a single transaction and commit all the new stats together.
Out-of-sync stats can be worse than out-of-date stats.
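
Concretely, that would be roughly equivalent to wrapping the whole thing in
one transaction, something like this sketch (showing just the two tables from
this test case; a real run would cover every table):

BEGIN;
ANALYZE timestamp_tbl;
ANALYZE interval_tbl;
-- ... and so on for all the other tables ...
COMMIT;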


With stats on timestamp_tbl but not on interval_tbl:

                                           QUERY PLAN
------------------------------------------------------------------------------------------------
 Nested Loop  (cost=2.02..48.29 rows=432 width=24) (actual time=0.112..1.515 rows=104 loops=1)
   ->  Seq Scan on interval_tbl i  (cost=0.00..36.55 rows=9 width=16) (actual time=0.036..0.070 rows=2 loops=1)
         Filter: ((f1 >= '00:00:00'::interval) AND (f1 <= '23:00:00'::interval))
   ->  Materialize  (cost=2.02..2.50 rows=48 width=8) (actual time=0.030..0.377 rows=52 loops=2)
         ->  Seq Scan on timestamp_tbl t  (cost=0.00..1.97 rows=48 width=8) (actual time=0.048..0.333 rows=52 loops=1)
               Filter: ((d1 >= '1990-01-01 00:00:00'::timestamp without time zone) AND (d1 <= '2001-01-01 00:00:00'::timestamp without time zone))
 Total runtime: 1.904 ms
(7 rows)



All other combinations perform the join the other way around:

With both analyzed:

                                           QUERY PLAN
------------------------------------------------------------------------------------------------
 Nested Loop  (cost=1.15..6.37 rows=144 width=24) (actual time=0.109..1.653 rows=104 loops=1)
   ->  Seq Scan on timestamp_tbl t  (cost=0.00..1.97 rows=48 width=8) (actual time=0.063..0.356 rows=52 loops=1)
         Filter: ((d1 >= '1990-01-01 00:00:00'::timestamp without time zone) AND (d1 <= '2001-01-01 00:00:00'::timestamp without time zone))
   ->  Materialize  (cost=1.15..1.18 rows=3 width=16) (actual time=0.003..0.008 rows=2 loops=52)
         ->  Seq Scan on interval_tbl i  (cost=0.00..1.15 rows=3 width=16) (actual time=0.017..0.052 rows=2 loops=1)
               Filter: ((f1 >= '00:00:00'::interval) AND (f1 <= '23:00:00'::interval))
 Total runtime: 2.025 ms
(7 rows)


With no stats at all:

                                           QUERY PLAN
------------------------------------------------------------------------------------------------
 Nested Loop  (cost=36.56..80.89 rows=99 width=24) (actual time=0.147..1.698 rows=104 loops=1)
   ->  Seq Scan on timestamp_tbl t  (cost=0.00..42.10 rows=11 width=8) (actual time=0.101..0.388 rows=52 loops=1)
         Filter: ((d1 >= '1990-01-01 00:00:00'::timestamp without time zone) AND (d1 <= '2001-01-01 00:00:00'::timestamp without time zone))
   ->  Materialize  (cost=36.56..36.65 rows=9 width=16) (actual time=0.003..0.008 rows=2 loops=52)
         ->  Seq Scan on interval_tbl i  (cost=0.00..36.55 rows=9 width=16) (actual time=0.017..0.053 rows=2 loops=1)
               Filter: ((f1 >= '00:00:00'::interval) AND (f1 <= '23:00:00'::interval))
 Total runtime: 2.063 ms
(7 rows)

With stats on interval_tbl but not timestamp_tbl:

                                           QUERY PLAN
------------------------------------------------------------------------------------------------
 Nested Loop  (cost=1.15..44.00 rows=33 width=24) (actual time=0.100..1.725 rows=104 loops=1)
   ->  Seq Scan on timestamp_tbl t  (cost=0.00..42.10 rows=11 width=8) (actual time=0.055..0.351 rows=52 loops=1)
         Filter: ((d1 >= '1990-01-01 00:00:00'::timestamp without time zone) AND (d1 <= '2001-01-01 00:00:00'::timestamp

Re: [HACKERS] strange buildfarm failure on lionfish

2007-07-24 Thread Alvaro Herrera
Gregory Stark wrote:
> Tom Lane <[EMAIL PROTECTED]> writes:
>
>> Stefan Kaltenbrunner <[EMAIL PROTECTED]> writes:
>>> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lionfish&dt=2007-07-24%2005:30:13
>>> any ideas ?
>
>> I saw what I think was the identical failure last night on my own
>> machine, but it wasn't repeatable.  Evidently the planner is changing to
>> a different plan for those queries, but why has this only started
>> recently?  Maybe the recent changes to autovacuum defaults are causing
>> autovac to hit these tables when it never did before?

That's quite possible, because the change in threshold means it will hit
small tables earlier than it used to do.

> This does raise a possible issue with autovacuum. Treating ANALYZE like VACUUM
> and running it on individual tables one at a time is probably the wrong thing
> to be doing. What really has to happen is it should run analyze on all tables
> together in a single transaction and commit all the new stats together.
> Out-of-sync stats can be worse than out-of-date stats.

One problem with that is that it will keep the locks on each table until
the end of all analyzes.

What I don't understand is what you mean by "obliterating" the stats
for a table.  I mean, when analyze runs, it _updates_ the stats for the
table, so there's never a time when the table does not have any stats
(unless, of course, analyze has never been run on the table).
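
A quick way to see whether the two tables in question have pg_statistic
entries at any given moment would be something like this sketch (run in the
regression database):

SELECT starelid::regclass AS rel, count(*) AS stats_rows
  FROM pg_statistic
 WHERE starelid IN ('timestamp_tbl'::regclass, 'interval_tbl'::regclass)
 GROUP BY 1;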

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: [HACKERS] strange buildfarm failure on lionfish

2007-07-24 Thread Tom Lane
Alvaro Herrera <[EMAIL PROTECTED]> writes:
> Gregory Stark wrote:
>> What really has to happen is it should run analyze on all tables
>> together in a single transaction and commit all the new stats together.
>> Out-of-sync stats can be worse than out-of-date stats.

> One problem with that is that it will keep the locks on each table until
> the end of all analyzes.

Yeah, that seems entirely infeasible, even if I agreed with the premise
which I don't think I do.

regards, tom lane



Re: [HACKERS] strange buildfarm failure on lionfish

2007-07-24 Thread Tom Lane
Gregory Stark <[EMAIL PROTECTED]> writes:
> Tom Lane <[EMAIL PROTECTED]> writes:
>> I saw what I think was the identical failure last night on my own
>> machine, but it wasn't repeatable.  Evidently the planner is changing to
>> a different plan for those queries, but why has this only started
>> recently?  Maybe the recent changes to autovacuum defaults are causing
>> autovac to hit these tables when it never did before?

> Indeed the only alternate plan I can imagine for this is to do the join the
> other way around. And given the large difference in sizes between the two
> tables the only way I could get that to happen was by obliterating the
> statistics entirely for one table but having stats for the other.

Yeah, I turned off autovac and let the tests run through 'horology',
then stopped and looked at the plans for these queries.  For the first
one, what you get with no ANALYZE having been done is

explain SELECT t.d1 + i.f1 AS "102" FROM TIMESTAMP_TBL t, INTERVAL_TBL i
  WHERE t.d1 BETWEEN '1990-01-01' AND '2001-01-01'
    AND i.f1 BETWEEN '00:00' AND '23:00';

                              QUERY PLAN
--------------------------------------------------------------------------
 Nested Loop  (cost=36.56..80.89 rows=99 width=24)
   ->  Seq Scan on timestamp_tbl t  (cost=0.00..42.10 rows=11 width=8)
         Filter: ((d1 >= '1990-01-01 00:00:00'::timestamp without time zone) AND (d1 <= '2001-01-01 00:00:00'::timestamp without time zone))
   ->  Materialize  (cost=36.56..36.65 rows=9 width=16)
         ->  Seq Scan on interval_tbl i  (cost=0.00..36.55 rows=9 width=16)
               Filter: ((f1 >= '00:00:00'::interval) AND (f1 <= '23:00:00'::interval))

If timestamp_tbl is then ANALYZEd, the plan flips around to put i on the
outside of the nestloop (because the estimate of the number of matching
rows rises to 49, which is pretty good because the actual is 52).

OTOH, if interval_tbl is ANALYZEd, the estimate for it drops to 2 rows
(again a much better match to reality) and we go back to preferring
i on the inside, with or without timestamp_tbl having been analyzed.
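
To reproduce the flip by hand in a fresh regression database (with autovacuum
off), something along these lines should work as a sketch:

ANALYZE timestamp_tbl;   -- its estimate rises to ~49 rows; i flips to the outside
EXPLAIN SELECT t.d1 + i.f1 AS "102" FROM TIMESTAMP_TBL t, INTERVAL_TBL i
  WHERE t.d1 BETWEEN '1990-01-01' AND '2001-01-01'
    AND i.f1 BETWEEN '00:00' AND '23:00';

ANALYZE interval_tbl;    -- its estimate drops to ~2 rows; i goes back to the inside
-- (re-running the same EXPLAIN shows the join order change back)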

And, at least in the serial-schedule case, the stats at this point
look like

 relid  | schemaname |    relname    | seq_scan | seq_tup_read | n_tup_ins | n_tup_upd | n_tup_del | n_live_tup | n_dead_tup
--------+------------+---------------+----------+--------------+-----------+-----------+-----------+------------+------------
 132885 | public     | interval_tbl  |       22 |          210 |        10 |         0 |         0 |         10 |          0
 132879 | public     | timestamp_tbl |       45 |         2444 |        74 |         0 |         8 |         66 |          8

(idx_scan, idx_tup_fetch, last_vacuum, last_autovacuum, last_analyze, and
last_autoanalyze are all empty for both tables.)

So yesterday's change to reduce the analyze threshold to 50 means that
timestamp_tbl is now vulnerable to being asynchronously analyzed while
the tests run.
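
For reference, autovacuum's analyze trigger is roughly "changed tuples >
analyze_threshold + analyze_scale_factor * reltuples".  With the new default
threshold of 50 and the default scale factor of 0.1, timestamp_tbl's
74 inserts + 8 deletes = 82 changed tuples easily exceed 50 + 0.1 * 66 = ~57,
while interval_tbl's 10 inserts stay under 50 + 0.1 * 10 = 51 (numbers taken
from the stats shown above; the formula is only approximate).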

While I don't have any very strong objection to putting an ORDER BY
on these particular queries, I'm worried about how many other regression
tests will now start showing random failures.  We have an awful lot
of small tables in the tests ...

regards, tom lane



Re: [HACKERS] strange buildfarm failure on lionfish

2007-07-24 Thread Alvaro Herrera
Tom Lane wrote:
> While I don't have any very strong objection to putting an ORDER BY
> on these particular queries, I'm worried about how many other regression
> tests will now start showing random failures.  We have an awful lot
> of small tables in the tests ...

Maybe what we could do is set higher thresholds for the regression
database with ALTER DATABASE.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: [HACKERS] strange buildfarm failure on lionfish

2007-07-24 Thread Tom Lane
Alvaro Herrera <[EMAIL PROTECTED]> writes:
> What I don't understand is what you mean by "obliterating" the stats
> for a table.

The point seems to be that if there is no pg_statistic data for these
tables, we fall back to default estimates of the selectivity of the
WHERE clauses, and those produce rowcount estimates that are pretty
far away from the truth.  As soon as some pg_statistic entries
exist, we obtain better rowcount estimates, and that changes the join
plan.

regards, tom lane



Re: [HACKERS] strange buildfarm failure on lionfish

2007-07-24 Thread Gregory Stark
Tom Lane <[EMAIL PROTECTED]> writes:

> Alvaro Herrera <[EMAIL PROTECTED]> writes:
>> Gregory Stark wrote:
>>> What really has to happen is it should run analyze on all tables
>>> together in a single transaction and commit all the new stats together.
>>> Out-of-sync stats can be worse than out-of-date stats.

>> One problem with that is that it will keep the locks on each table until
>> the end of all analyzes.

> Yeah, that seems entirely infeasible, even if I agreed with the premise
> which I don't think I do.

Well, that's just what ANALYZE with no arguments at all does. It also takes
only a ShareUpdateExclusiveLock, which blocks other vacuums and DDL but not
ordinary DML. And ANALYZE goes by pretty quickly even on large tables.

Another idea is that perhaps it should only do this for all never-analyzed
tables together. That's where out-of-sync stats are most likely to hurt.
But I'm not sure where to go with that, since there's no guarantee that all
the never-analyzed tables will be the small ones.

Really it seems like having autovacuum touch never-analyzed tables and having
reasonable default stats for never-analyzed tables don't mix well together.

The reasonable default stats are there because if you analyze a fresh empty
table you'll get some degenerate plans which will behave terribly when you
start loading even a few records into it. The reasonable default stats give
you something akin to a rule-based plan until you have some reasonable data
loaded to analyze.

In a perfect world I would say autovacuum shouldn't analyze never-analyzed
tables, just print a warning for the DBA. But we get questions all the time
about bad plans that show tables which have never been analyzed so that
doesn't sound palatable either.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com




Re: [HACKERS] strange buildfarm failure on lionfish

2007-07-24 Thread Tom Lane
Gregory Stark <[EMAIL PROTECTED]> writes:
> Tom Lane <[EMAIL PROTECTED]> writes:
>> Alvaro Herrera <[EMAIL PROTECTED]> writes:
>>> Gregory Stark wrote:
>>>> What really has to happen is it should run analyze on all tables
>>>> together in a single transaction and commit all the new stats together.
>>>> Out-of-sync stats can be worse than out-of-date stats.
>>>
>>> One problem with that is that it will keep the locks on each table until
>>> the end of all analyzes.
>>
>> Yeah, that seems entirely infeasible, even if I agreed with the premise
>> which I don't think I do.

> Well, that's just what ANALYZE with no arguments at all does.

Not unless you wrap it in a transaction block --- otherwise it does a
transaction per table.  If you try wrapping it in a transaction block
on a production system, you'll soon find you don't like it.

regards, tom lane



Re: [HACKERS] strange buildfarm failure on lionfish

2007-07-24 Thread Tom Lane
Alvaro Herrera <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> While I don't have any very strong objection to putting an ORDER BY
>> on these particular queries, I'm worried about how many other regression
>> tests will now start showing random failures.  We have an awful lot
>> of small tables in the tests ...

> Maybe what we could do is set higher thresholds for the regression
> database with ALTER DATABASE.

That seems to make sense at least as a short-term response.  We weren't
seeing buildfarm failures with the previous defaults, so setting those
values with ALTER oughta do it.

regards, tom lane
