[PERFORM] Optimizer differences between 7.2 and 7.3

2003-07-07 Thread Jeff Boes
Our production database is running under 7.2.4; our test database
running almost the same data is at 7.3.3.  One table has about 400,000
rows in each schema. A query against an indexed column uses an index
scan under 7.2.4, but a sequential scan under 7.3.3. A count of the
table in question shows that they have comparable numbers of matching
rows.

On 7.2.4:

select count(*) from articles;
 count

 420213

select count(*) from articles
 where path_base like 'http://news.findlaw.com/hdocs%';
 count
---
38

(and it returns this nearly instantaneously)

explain select count(*) from articles
   where path_base like 'http://news.findlaw.com/hdocs%'
 
Aggregate  (cost=6.02..6.02 rows=1 width=0)
  ->  Index Scan using ix_articles_3 on articles  (cost=0.00..6.01
rows=1 width=0)
 
On 7.3.3:

select count(*) from articles;
 count

 406319

select count(*) from articles
   where path_base like 'http://news.findlaw.com/hdocs%'
 count
---
23

(and it takes many seconds to return)

explain select count(*) from articles
   where path_base like 'http://news.findlaw.com/hdocs%'

 Aggregate  (cost=205946.65..205946.65 rows=1 width=0)
   ->  Seq Scan on articles  (cost=0.00..205946.65 rows=1 width=0)
 Filter: (path_base ~~ 'http://news.findlaw.com/hdocs%'::text)

I can't find any differences between the indexes (ix_articles_3 exists
in both schemas); the column statistics are set up the same (the
default); and the optimizer settings (costs in postgresql.conf) are the
same.

-- 
Jeff Boes  vox 269.226.9550 ext 24
Database Engineer fax 269.349.9076
Nexcerpt, Inc. http://www.nexcerpt.com
   ...Nexcerpt... Extend your Expertise


---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])


[PERFORM] Optimizer difference using function index between 7.3 and 7.4

2004-02-18 Thread Jeff Boes
We have a large (several million row) table with a field containing 
URLs. Now, funny thing about URLs: they mostly start with a common 
substring ("http://www.";). But not all the rows start with this, so we 
can't just lop off the first N characters. However, we noticed some time 
ago that an index on this field wasn't as effective as an index on the 
REVERSE of the field. So ...

CREATE OR REPLACE FUNCTION fn_urlrev(text) returns text as '
return reverse(lc($_[0]))
' language 'plperl' with (iscachable,isstrict);
and then

CREATE UNIQUE INDEX ix_links_3 ON links
(fn_urlrev(path_base));
seemed to be much faster. When we have to look up a single entry in 
"links", we do so by something like --

SELECT * FROM links WHERE fn_urlrev(path_base) = ?;

and it's rather fast. When we have a bunch of them to do, under 7.3 we 
found it useful to create a temporary table, fill it with reversed URLs, 
and join:

INSERT INTO temp_link_urls VALUES (fn_urlrev(?));

SELECT l.path_base,l.link_id
   FROM links l
   JOIN temp_link_urls t
   ON (fn_urlrev(l.path_base) = t.rev_path_base);
Here are query plans from the two versions (using a temp table with 200 
rows, after ANALYZE on the temp table):

7.3:

# explain select link_id from links l join clm_tmp_links t on 
(fn_urlrev(l.path_base) = t.rev_path_base);
  QUERY PLAN
-
Nested Loop  (cost=0.00..3936411.13 rows=2000937 width=152)
  ->  Seq Scan on clm_tmp_links t  (cost=0.00..5.00 rows=200 width=74)
  ->  Index Scan using ix_links_3 on links l  (cost=0.00..19531.96 
rows=10005 width=78)
Index Cond: (fn_urlrev(l.path_base) = "outer".rev_path_base)
(4 rows)

7.4:

# explain select link_id from links l join clm_tmp_links t on 
(fn_urlrev(l.path_base) = t.rev_path_base);
 QUERY PLAN
--
Hash Join  (cost=5.50..88832.88 rows=1705551 width=4)
  Hash Cond: (fn_urlrev("outer".path_base) = "inner".rev_path_base)
  ->  Seq Scan on links l  (cost=0.00..50452.50 rows=1705550 width=78)
  ->  Hash  (cost=5.00..5.00 rows=200 width=74)
->  Seq Scan on clm_tmp_links t  (cost=0.00..5.00 rows=200 
width=74)
(5 rows)

Although the cost for the 7.4 query is lower, the 7.3 plan executes in 
about 3 seconds, while the 7.4 plan executes in 59.8 seconds!

Now the odd part: if I change the query to this:

# explain analyze select link_id from links l join clm_tmp_links t on 
(fn_urlrev(l.path_base) = fn_urlrev(t.rev_path_base));
 QUERY 
PLAN   
--
Merge Join  (cost=12.64..219974.16 rows=1705551 width=4) (actual 
time=17.928..17.928 rows=0 loops=1)
  Merge Cond: (fn_urlrev("outer".path_base) = "inner"."?column2?")
  ->  Index Scan using ix_links_3 on links l  (cost=0.00..173058.87 
rows=1705550 width=78) (actual time=0.229..0.285 rows=7 loops=1)
  ->  Sort  (cost=12.64..13.14 rows=200 width=74) (actual 
time=9.652..9.871 rows=200 loops=1)
Sort Key: fn_urlrev(t.rev_path_base)
->  Seq Scan on clm_tmp_links t  (cost=0.00..5.00 rows=200 
width=74) (actual time=0.166..5.753 rows=200 loops=1)
Total runtime: 18.125 ms

(i.e., apply the function to the data in the temp table), it runs a 
whole lot faster! Is this a bug in the optimizer? Or did something 
change about the way functional indexes are used?

--
Jeff Boes  vox 269.226.9550 ext 24
Database Engineer fax 269.349.9076
Nexcerpt, Inc. http://www.nexcerpt.com
  ...Nexcerpt... Extend your Expertise
---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?
  http://www.postgresql.org/docs/faqs/FAQ.html