[PERFORM] partial index regarded more expensive

2005-08-10 Thread Tobias Brox
So, I have a table game with a timestamp attribute 'game_end', ranging from
jan-2005 to present.  The game table also has an attribute state, with live
games being in state 2, and ended games being in state 4 (so,
game_end+delta < now() usually means state=4).  There are also an insignificant
number of games in states 1 and 3.

This query puzzles me:

  select * from game where game_end > '2005-07-30' and state in (3,4);
  
Now, one (at least me) should believe that the best index would be a partial
index,

resolved_game_by_date btree (game_end) WHERE ((state = 3) OR (state = 4))

NBET= explain analyze select * from game where game_end > '2005-07-30' and state 
in (3,4);
   QUERY PLAN   


 Index Scan using resolved_game_by_date on game  (cost=0.00..7002.87 rows=7147 
width=555) (actual time=0.220..86.234 rows=3852 loops=1)
   Index Cond: (game_end > '2005-07-30 00:00:00'::timestamp without time zone)
   Filter: ((state = 3) OR (state = 4))
 Total runtime: 90.568 ms
(4 rows)
  
Since state has only two significant values, I wouldn't expect this index
to be any good:

game_by_state btree (state)


...and it seems like I'm right:

NBET= explain analyze select * from game where game_end > '2005-07-30' and
state in (3,4);
   QUERY PLAN   

 Index Scan using game_by_state, game_by_state on game  (cost=0.00..4413.78 
rows=7147 width=555) (actual time=0.074..451.771 rows=3851 loops=1)
   Index Cond: ((state = 3) OR (state = 4))
   Filter: (game_end > '2005-07-30 00:00:00'::timestamp without time zone)
 Total runtime: 457.132 ms
(4 rows)

Now, how can the planner believe the game_by_state-index to be better?

('vacuum analyze game' did not significantly impact the numbers, and I've
tried running the queries a few times with and without the
game_by_state index to rule out caching effects)

-- 
Tobias Brox



Re: [PERFORM] partial index regarded more expensive

2005-08-10 Thread PFC


why not simply create an index on (game_end, state) ?
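
For reference, a minimal sketch of that suggestion against the schema from the
original post (the index name is illustrative):

  CREATE INDEX game_end_state ON game (game_end, state);
  ANALYZE game;
  EXPLAIN ANALYZE
    SELECT * FROM game WHERE game_end > '2005-07-30' AND state IN (3,4);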



Re: [PERFORM] partial index regarded more expensive

2005-08-10 Thread Tobias Brox
[PFC - Wed at 08:15:13PM +0200]
   why not simply create an index on (game_end, state) ?

No, the planner prefers to use the partial index (I dropped the index on
game(state)).

-- 
Tobias Brox, Nordicbet IT dept



Re: [PERFORM] Planner doesn't look at LIMIT?

2005-08-10 Thread Ian Westmacott
I have a case that I thought was an example of this issue,
and that this patch would correct.  I applied this patch
to an 8.0.3 source distribution, but it didn't seem to
solve my problem.

In a nutshell, I have a LIMIT query where the planner
seems to favor a merge join over a nested loop.  I've
simplified the query as much as possible:


itvtrackdata3= \d tableA
Table public.tableA
 Column |   Type   | Modifiers 
+--+---
 foo| bigint   | not null
 bar| smallint | not null
 bap| bigint   | not null
 bip| bigint   | not null
 bom| bigint   | not null
Indexes:
idx_tableA_bip btree (bip) WHERE (bip = 900::bigint)
idx_tableA_foo btree (foo)

itvtrackdata3= \d tableB
Table tableB
 Column  |   Type   | Modifiers 
-+--+---
 bim | bigint   | not null
 bif | smallint | not null
 baf | smallint | not null
 bof | smallint | not null
 buf | smallint | not null
 foo | bigint   | not null
Indexes:
idx_tableB_bim btree (bim, foo)

itvtrackdata3= set default_statistics_target to 1000;
SET
Time: 0.448 ms
itvtrackdata3= analyze tableA;
ANALYZE
Time: 4237.151 ms
itvtrackdata3= analyze tableB;
ANALYZE
Time: 46672.939 ms
itvtrackdata3= explain analyze SELECT * FROM tableB NATURAL JOIN tableA
WHERE bim>=72555896091359 AND bim<72555935412959 AND bim=bap ORDER BY
bim ASC LIMIT 1;

  QUERY PLAN
  
--
 Limit  (cost=149626.57..252987.71 rows=1 width=50) (actual
time=5684.013..5684.013 rows=1 loops=1)
   ->  Merge Join  (cost=149626.57..252987.71 rows=1 width=50) (actual
time=5684.012..5684.012 rows=1 loops=1)
         Merge Cond: ((outer.bim = inner.bap) AND (outer.foo =
inner.foo))
         ->  Index Scan using idx_tableB_bim on tableB 
(cost=0.00..97391.22 rows=55672 width=24) (actual time=0.017..0.059
rows=29 loops=1)
               Index Cond: ((bim >= 72555896091359::bigint) AND (bim
< 72555935412959::bigint))
         ->  Sort  (cost=149626.57..151523.94 rows=758948 width=34)
(actual time=5099.300..5442.825 rows=560856 loops=1)
               Sort Key: tableA.bap, tableA.foo
               ->  Seq Scan on tableA  (cost=0.00..47351.48 rows=758948
width=34) (actual time=0.021..1645.204 rows=758948 loops=1)
 Total runtime: 5706.655 ms
(9 rows)

Time: 5729.984 ms
itvtrackdata3= set enable_mergejoin to false;
SET
Time: 0.373 ms
itvtrackdata3= explain analyze SELECT * FROM tableB NATURAL JOIN tableA
WHERE bim>=72555896091359 AND bim<72555935412959 AND bim=bap ORDER BY
bim ASC LIMIT 1;

  QUERY PLAN
  
--
 Limit  (cost=0.00..432619.68 rows=1 width=50) (actual
time=11.149..11.150 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..432619.68 rows=1 width=50) (actual
time=11.148..11.148 rows=1 loops=1)
         Join Filter: (outer.bim = inner.bap)
         ->  Index Scan using idx_tableB_bim on tableB 
(cost=0.00..97391.22 rows=55672 width=24) (actual time=0.017..0.062
rows=29 loops=1)
               Index Cond: ((bim >= 72555896091359::bigint) AND (bim
< 72555935412959::bigint))
         ->  Index Scan using idx_tableA_foo on tableA  (cost=0.00..6.01
rows=1 width=34) (actual time=0.007..0.379 rows=1 loops=29)
               Index Cond: (outer.foo = tableA.foo)
 Total runtime: 11.215 ms
(8 rows)

Time: 32.007 ms


Have I just flubbed the patch, or is there something else
going on here?

Thanks,

--Ian


On Fri, 2005-07-22 at 12:20, Tom Lane wrote:
 I wrote:
  Dawid Kuroczko [EMAIL PROTECTED] writes:
  qnex=# EXPLAIN SELECT * FROM log NATURAL JOIN useragents LIMIT 1;
 
  Limit  (cost=15912.20..15912.31 rows=1 width=272)
  ->  Hash Join  (cost=15912.20..5328368.96 rows=47044336 width=272)
 
  If I set enable_hashjoin=false:
 
  qnex=# EXPLAIN ANALYZE SELECT * FROM log NATURAL LEFT JOIN useragents 
  LIMIT 1;
 
  Limit  (cost=0.00..3.07 rows=1 width=272) (actual time=74.214..74.216
  rows=1 loops=1)
  ->  Nested Loop Left Join  (cost=0.00..144295895.01 rows=47044336
  width=272) (actual time=74.204..74.204 rows=1 loops=1)
 
  This is quite strange.  The nestloop plan definitely should be preferred
  in the context of the LIMIT, considering that it has far lower estimated
  cost.  And it is preferred in simple tests for me.
 
 After a suitable period of contemplating my navel, I figured out
 what is going on here: the total costs involved are large 

Re: [PERFORM] Planner doesn't look at LIMIT?

2005-08-10 Thread Tom Lane
Ian Westmacott [EMAIL PROTECTED] writes:
 In a nutshell, I have a LIMIT query where the planner
 seems to favor a merge join over a nested loop.

The planner is already estimating only one row out of the join, and so
the LIMIT doesn't affect its cost estimates at all.

It appears to me that the reason the nestloop plan is fast is just
chance: a suitable matching row is found very early in the scan of
tableB, so that the indexscan on it can stop after 29 rows, instead
of having to go through all 55000 rows in the given range of bim.
If it'd have had to go through, say, half of the rows to find a match,
the sort/merge plan would show up a lot better.

If this wasn't chance, but was expected because there are many matching
rows and not only one, then there's a statistical problem.

regards, tom lane



[PERFORM] Speedier count(*)

2005-08-10 Thread Dan Harris
I have a web page for my customers that shows them count of records  
and some min/max date ranges in each table of a database, as this is  
how we bill them for service.  They can log in and check the counts  
at any time.  I'd like for the counts to be as fresh as possible by  
keeping this dynamic, but I will use a periodic 'snapshot'/cron job  
if that is the only option to speed this up.   I have thought about  
using the table statistics, but the estimate error is probably  
unacceptable because of the billing purposes.


For some reason, the SQL Server we migrated the app from can return  
count(*) in a split second on multi-million row tables, even though  
it is a MUCH slower box hardware-wise, but it's now taking many  
seconds to run. I have read in the archives the problems MVCC brings  
into the count(*) dilemma forcing Pg to run a seq scan to get  
counts.  Does SQLServer not use MVCC or have they found another  
approach for arriving at this number?  Compounding all the min/max  
and counts from other tables and all those queries take about a  
minute to run. The tables will contain anywhere from 1 million to 40  
million rows.


Also, I am using select ... group by ... order by .. limit 1 to get  
the min/max since I have already been bit by the issue of min() max()  
being slower.
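
The workaround being referred to is roughly the usual pre-8.1 trick of letting
an ordinary btree index satisfy an ORDER BY ... LIMIT 1 instead of calling the
aggregate; the table and column names below are hypothetical, since the schema
wasn't posted:

  -- assumes an index on event_time exists
  SELECT event_time FROM billing_events ORDER BY event_time ASC  LIMIT 1;  -- min
  SELECT event_time FROM billing_events ORDER BY event_time DESC LIMIT 1;  -- max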



-Dan





Re: [PERFORM] Speedier count(*)

2005-08-10 Thread Joshua D. Drake


Also, I am using select ... group by ... order by .. limit 1 to get  
the min/max since I have already been bit by the issue of min() max()  
being slower.


This specific instance is fixed in 8.1

Sincerely,

Joshua D. Drake





-Dan





Re: [PERFORM] Speedier count(*)

2005-08-10 Thread Michael Fuhr
On Wed, Aug 10, 2005 at 05:37:49PM -0600, Dan Harris wrote:
 Also, I am using select ... group by ... order by .. limit 1 to get  
 the min/max since I have already been bit by the issue of min() max()  
 being slower.

PostgreSQL 8.1 will have optimizations for certain MIN and MAX
queries.

http://archives.postgresql.org/pgsql-committers/2005-04/msg00163.php
http://archives.postgresql.org/pgsql-committers/2005-04/msg00168.php

-- 
Michael Fuhr



Re: [PERFORM] Speedier count(*)

2005-08-10 Thread John A Meinel
Dan Harris wrote:
 I have a web page for my customers that shows them count of records  and
 some min/max date ranges in each table of a database, as this is  how we
 bill them for service.  They can log in and check the counts  at any
 time.  I'd like for the counts to be as fresh as possible by  keeping
 this dynamic, but I will use a periodic 'snapshot'/cron job  if that is
 the only option to speed this up.   I have thought about  using the
 table statistics, but the estimate error is probably  unacceptable
 because of the billing purposes.

 For some reason, the SQL Server we migrated the app from can return
 count(*) in a split second on multi-million row tables, even though  it
 is a MUCH slower box hardware-wise, but it's now taking many  seconds to
 run. I have read in the archives the problems MVCC brings  into the
 count(*) dilemma forcing Pg to run a seq scan to get  counts.  Does
 SQLServer not use MVCC or have they found another  approach for arriving
 at this number?  Compounding all the min/max  and counts from other
 tables and all those queries take about a  minute to run. The tables
 will contain anywhere from 1 million to 40  million rows.

I believe SQL Server doesn't use MVCC in the same way. At the very
least, it stores some row information in the index, so it can get some
info from just an index, without having to go to the actual page (MVCC
requires a main page visit to determine visibility.)

Depending on how much it impacts performance, you can create an
INSERT/UPDATE trigger so that whenever a new entry is added, it
automatically updates a statistics table. It would be maintained as you
go, rather than periodically like a cron job.

I would go Cron if things can be slightly out of date (like 1 hour at
least), and you need updates and inserts to not be slowed down.
Otherwise I think the trigger is nicer, since it doesn't do redundant
work, and means everything stays up-to-date.
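
A minimal sketch of the cron/'snapshot' variant (all names are hypothetical):
a small summary table is rewritten by a scheduled job, and the web page reads
only that table:

  CREATE TABLE billing_summary (
      tablename  text PRIMARY KEY,
      row_count  bigint,
      min_date   timestamp,
      max_date   timestamp,
      refreshed  timestamp
  );

  -- run from cron, e.g. hourly
  BEGIN;
  DELETE FROM billing_summary WHERE tablename = 'events';
  INSERT INTO billing_summary
      SELECT 'events', count(*), min(event_date), max(event_date), now()
      FROM events;
  COMMIT;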



 Also, I am using select ... group by ... order by .. limit 1 to get
 the min/max since I have already been bit by the issue of min() max()
 being slower.


 -Dan

John
=:->




Re: [PERFORM] partial index regarded more expensive

2005-08-10 Thread Tom Lane
Tobias Brox [EMAIL PROTECTED] writes:
 This query puzzles me:
   select * from game where game_end > '2005-07-30' and state in (3,4);
 ...
 Now, how can the planner believe the game_by_state-index to be better?

I suspect the problem has to do with lack of cross-column statistics.
The planner does not know that state=4 is correlated with game_end,
and it's probably coming up with some bogus guesses about the numbers
of index rows visited in each case.  You haven't given enough info to
quantify this, though.

regards, tom lane
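
A quick way to see the per-column statistics those guesses are based on (this
only shows what the planner knows about each column separately; it does not
add the missing cross-column correlation):

  SELECT attname, n_distinct, most_common_vals, most_common_freqs
  FROM pg_stats
  WHERE tablename = 'game' AND attname IN ('state', 'game_end');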



[PERFORM] it is always delete temp table will slow down the postmaster?

2005-08-10 Thread Chun Yit(Chronos)




Hi, I have a situation here: I created a pl/pgsql function that uses a temp
table to store temporary data.  Every time I execute the function I need to
delete all the data in the temp table, but a plain DELETE slows the search
function down over time, because the old tuples are not really cleared away.
So instead I drop the table and recreate it.  My question is: will this slow
down the postmaster if I keep running this search function more than 300
times per day?  The execution time of the search function gradually increases
after a few days of testing.  Is there any way to check whether the slowdown
is caused by the drop temp table / create temp table commands?

regards
ivan
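
The pattern being described looks roughly like this (function, table and
column names are made up); EXECUTE is used so that the plan cached inside the
function does not point at a dropped temp table:

  CREATE OR REPLACE FUNCTION run_search() RETURNS void AS $$
  BEGIN
      -- recreate the scratch table on every call instead of DELETEing from it
      EXECUTE 'CREATE TEMP TABLE search_scratch (id bigint, score numeric)';
      EXECUTE 'INSERT INTO search_scratch SELECT id, score FROM some_source';
      -- ... work with search_scratch via EXECUTE as well ...
      EXECUTE 'DROP TABLE search_scratch';
  END;
  $$ LANGUAGE plpgsql;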


Re: [PERFORM] Speedier count(*)

2005-08-10 Thread Gavin Sherry
Hi Dan,

On Wed, 10 Aug 2005, Dan Harris wrote:

 I have a web page for my customers that shows them count of records
 and some min/max date ranges in each table of a database, as this is
 how we bill them for service.  They can log in and check the counts
 at any time.  I'd like for the counts to be as fresh as possible by
 keeping this dynamic, but I will use a periodic 'snapshot'/cron job
 if that is the only option to speed this up.   I have thought about
 using the table statistics, but the estimate error is probably
 unacceptable because of the billing purposes.

 For some reason, the SQL Server we migrated the app from can return
 count(*) in a split second on multi-million row tables, even though
 it is a MUCH slower box hardware-wise, but it's now taking many
 seconds to run. I have read in the archives the problems MVCC brings
 into the count(*) dilemma forcing Pg to run a seq scan to get
 counts.  Does SQLServer not use MVCC or have they found another

SQL Server probably jumps through a lot of hoops to do fast count(*)s. I'm
sure we could do something similar -- it's just a question of complexity,
resources, desirability, etc. There are other solutions, which makes the
idea of doing it less attractive still.

 approach for arriving at this number?  Compounding all the min/max
 and counts from other tables and all those queries take about a
 minute to run. The tables will contain anywhere from 1 million to 40
 million rows.

 Also, I am using select ... group by ... order by .. limit 1 to get
 the min/max since I have already been bit by the issue of min() max()
 being slower.

I generally pre generate the results. There are two ways to do this: the
'snapshot'/cronjon you mentioned or using rules and triggers to maintain
'count' tables. The idea is that if data is added, modified or removed
from your table, you modify counters in these other tables.

Alternatively, feel free to post your schema and sample queries with
explain analyze results to this list. Alternatively, jump on irc at
irc.freenode.net #postgresql and someone will be more than happy to look
through the problem in more detail.

Thanks,

Gavin



Re: [PERFORM] Speedier count(*)

2005-08-10 Thread Mark Cotner
Here's a trigger I wrote to perform essentially the same purpose.  The nice
thing about this is it keeps the number up to date for you, but you do incur
slight overhead.

CREATE TABLE test (id serial primary key, name varchar(20));

CREATE TABLE rowcount (tablename varchar(50), rowcount bigint default 0);
CREATE INDEX rowcount_tablename ON rowcount(tablename);

CREATE OR REPLACE FUNCTION del_rowcount() RETURNS trigger AS $$
BEGIN
 UPDATE rowcount SET rowcount = rowcount-1 WHERE tablename = TG_RELNAME;
 RETURN OLD;
END;
$$ LANGUAGE PLPGSQL;

CREATE OR REPLACE FUNCTION add_rowcount() RETURNS trigger AS $$
BEGIN
 UPDATE rowcount SET rowcount = rowcount+1 WHERE tablename = TG_RELNAME;
 RETURN NEW;
END;
$$ LANGUAGE PLPGSQL;

CREATE TRIGGER del_rowcount_tr BEFORE DELETE ON test FOR EACH ROW EXECUTE
   PROCEDURE del_rowcount();
CREATE TRIGGER add_rowcount_tr BEFORE INSERT ON test FOR EACH ROW EXECUTE
   PROCEDURE add_rowcount();

INSERT INTO rowcount (tablename) VALUES ('test');

root=# select * from test;
 id | name 
+--
(0 rows)

Time: 0.934 ms
root=# select * from rowcount;
 tablename | rowcount
---+--
 test  |0
(1 row)

Time: 0.630 ms
root=# insert into test (name) values ('blah');
INSERT 1190671626 1
Time: 3.278 ms
root=# select * from test;
 id | name 
+--
  5 | blah
(1 row)

Time: 0.612 ms
root=# select * from rowcount;
 tablename | rowcount
---+--
 test  |1
(1 row)

Time: 0.640 ms
root=# insert into test (name) values ('blah');
INSERT 1190671627 1
Time: 1.677 ms
root=# select * from test;
 id | name 
+--
  5 | blah
  6 | blah
(2 rows)

Time: 0.653 ms
root=# select * from rowcount;
 tablename | rowcount
---+--
 test  |2
(1 row)

Time: 0.660 ms
root=# delete from test where id = 6;
DELETE 1
Time: 2.412 ms
root=# select * from test;
 id | name 
+--
  5 | blah
(1 row)

Time: 0.631 ms
root=# select * from rowcount;
 tablename | rowcount
---+--
 test  |1
(1 row)

Time: 0.609 ms

One thing to be mindful of . . . Truncate is NOT accounted for with this,
and unfortunately the rule system doesn't allow truncate operations so you
can't work around it that way.
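
The manual workaround, if a truncate ever does happen, is simply to reset the
counter in the same transaction:

  BEGIN;
  TRUNCATE test;
  UPDATE rowcount SET rowcount = 0 WHERE tablename = 'test';
  COMMIT;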

'njoy,
Mark


On 8/10/05 11:52 PM, Gavin Sherry [EMAIL PROTECTED] wrote:

 Hi Dan,
 
 On Wed, 10 Aug 2005, Dan Harris wrote:
 
 I have a web page for my customers that shows them count of records
 and some min/max date ranges in each table of a database, as this is
 how we bill them for service.  They can log in and check the counts
 at any time.  I'd like for the counts to be as fresh as possible by
 keeping this dynamic, but I will use a periodic 'snapshot'/cron job
 if that is the only option to speed this up.   I have thought about
 using the table statistics, but the estimate error is probably
 unacceptable because of the billing purposes.
 
 For some reason, the SQL Server we migrated the app from can return
 count(*) in a split second on multi-million row tables, even though
 it is a MUCH slower box hardware-wise, but it's now taking many
 seconds to run. I have read in the archives the problems MVCC brings
 into the count(*) dilemma forcing Pg to run a seq scan to get
 counts.  Does SQLServer not use MVCC or have they found another
 
 SQL Server probably jumps through a lot of hoops to do fast count(*)s. I'm
 sure we could do something similar -- it's just a question of complexity,
 resources, desirability, etc. The are other solutions, which makes the
 idea of doing it less attractive still.
 
 approach for arriving at this number?  Compounding all the min/max
 and counts from other tables and all those queries take about a
 minute to run. The tables will contain anywhere from 1 million to 40
 million rows.
 
 Also, I am using select ... group by ... order by .. limit 1 to get
 the min/max since I have already been bit by the issue of min() max()
 being slower.
 
 I generally pre generate the results. There are two ways to do this: the
 'snapshot'/cronjon you mentioned or using rules and triggers to maintain
 'count' tables. The idea is that if data is added, modified or removed
 from your table, you modify counters in these other tables.
 
 Alternatively, feel free to post your schema and sample queries with
 explain analyze results to this list. Alternatively, jump on irc at
 irc.freenode.net #postgresql and someone will be more than happy to look
 through the problem in more detail.
 
 Thanks,
 
 Gavin
 


