Re: [PERFORM] Planner should use index on a LIKE 'foo%' query

2008-06-30 Thread Moritz Onken



Anfang der weitergeleiteten E-Mail:


Von: Moritz Onken [EMAIL PROTECTED]
Datum: 30. Juni 2008 09:16:06 MESZ
An: Steinar H. Gunderson [EMAIL PROTECTED]
Betreff: Re: [PERFORM] Planner should use index on a LIKE 'foo%' query


Am 28.06.2008 um 21:19 schrieb Steinar H. Gunderson:


On Sat, Jun 28, 2008 at 06:24:42PM +0200, Moritz Onken wrote:
SELECT distinct url from item where url like 'http://www.micro%'  
limit

10;


Here, the planner knows the pattern beforehand, and can see that  
it's a

simple prefix.

select *
from result
where exists
 (select * from item where item.url LIKE result.url || '%' limit 1)
limit 10;


Here it cannot (what if result.url was '%foo%'?).


That's right. Thanks for that hint. Is there a Postgres function  
which returns a constant (possibly an escape function)?



Try using something like (item.url = result.url  item.url =  
result.url ||

'z'), substituting an appropriately high character for 'z'.

The only explaination is that I don't use a constant when  
comparing the

values. But actually it is a constant...


I created a new column in item where I store the shortened url  
which makes = comparisons possible.


the result table has 20.000.000 records and the item table 5.000.000.
The query

select count(1) from result where url in (select shorturl from item  
where shorturl = result.url);


will take about 8 hours (still running, just guessing). Is this  
reasonable on a system with 1 GB of RAM and a AMD Athlon 64 3200+  
processor? (1 SATA HDD)


regards,

moritz




--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Planner should use index on a LIKE 'foo%' query

2008-06-30 Thread Moritz Onken


Am 28.06.2008 um 21:19 schrieb Steinar H. Gunderson:


On Sat, Jun 28, 2008 at 06:24:42PM +0200, Moritz Onken wrote:
SELECT distinct url from item where url like 'http://www.micro%'  
limit

10;


Here, the planner knows the pattern beforehand, and can see that  
it's a

simple prefix.

select *
from result
where exists
 (select * from item where item.url LIKE result.url || '%' limit 1)
limit 10;


Here it cannot (what if result.url was '%foo%'?).


That's right. Thanks for that hint. Is there a Postgres function which  
returns a constant (possibly an escape function)?



Try using something like (item.url = result.url  item.url =  
result.url ||

'z'), substituting an appropriately high character for 'z'.

The only explaination is that I don't use a constant when comparing  
the

values. But actually it is a constant...


I created a new column in item where I store the shortened url which  
makes = comparisons possible.


the result table has 20.000.000 records and the item table 5.000.000.
The query

select count(1) from result where url in (select shorturl from item  
where shorturl = result.url);


will take about 8 hours (still running, just guessing). Is this  
reasonable on a system with 1 GB of RAM and a AMD Athlon 64 3200+  
processor? (1 SATA HDD)


regards,

moritz



--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Planner should use index on a LIKE 'foo%' query

2008-06-30 Thread Matthew Wakeling

On Mon, 30 Jun 2008, Moritz Onken wrote:
I created a new column in item where I store the shortened url which makes 
= comparisons possible.


Good idea. Now create an index on that column.

select count(1) from result where url in (select shorturl from item where 
shorturl = result.url);


What on earth is wrong with writing it like this?

SELECT COUNT(*) FROM (SELECT DISTINCT result.url FROM result, item WHERE
   item.shorturl = result.url) AS a

That should do a fairly sensible join plan. There's no point in using 
fancy IN or EXISTS syntax when a normal join will do.


Matthew

--
I have an inferiority complex. But it's not a very good one.

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Planner should use index on a LIKE 'foo%' query

2008-06-30 Thread Dimitri Fontaine
Hi,

Le samedi 28 juin 2008, Moritz Onken a écrit :
 select count(*)
   from result
   where exists
 (select * from item where item.url LIKE result.url || '%' limit 1);

 which basically returns the number of items which exist in table
 result and match a URL in table item by its prefix.

It seems you could benefit from the prefix project, which support indexing 
your case of prefix searches. Your query would then be:
  SELECT count(*) FROM result r JOIN item i ON r.url @ i.url;

The result.url column would have to made of type prefix_range, which casts 
automatically to text when needed.

Find out more about the prefix projects at those urls:
  http://pgfoundry.org/projects/prefix
  http://prefix.projects.postgresql.org/README.html

Regards,
-- 
dim


signature.asc
Description: This is a digitally signed message part.


Re: [PERFORM] Planner should use index on a LIKE 'foo%' query

2008-06-30 Thread Moritz Onken


Am 30.06.2008 um 12:19 schrieb Matthew Wakeling:


select count(1) from result where url in (select shorturl from item  
where shorturl = result.url);


What on earth is wrong with writing it like this?

SELECT COUNT(*) FROM (SELECT DISTINCT result.url FROM result, item  
WHERE

  item.shorturl = result.url) AS a


I tried the this approach but it's slower than WHERE IN in my case.



It seems you could benefit from the prefix project, which support  
indexing

your case of prefix searches. Your query would then be:
 SELECT count(*) FROM result r JOIN item i ON r.url @ i.url;

The result.url column would have to made of type prefix_range, which  
casts

automatically to text when needed.

Find out more about the prefix projects at those urls:
 http://pgfoundry.org/projects/prefix
 http://prefix.projects.postgresql.org/README.html

Regards,
--
dim


Thanks for that! looks interesting.

regards

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Planner should use index on a LIKE 'foo%' query

2008-06-30 Thread Matthew Wakeling

On Mon, 30 Jun 2008, Moritz Onken wrote:

SELECT COUNT(*) FROM (SELECT DISTINCT result.url FROM result, item WHERE
 item.shorturl = result.url) AS a


I tried the this approach but it's slower than WHERE IN in my case.


However there's a lot more scope for improving a query along these lines, 
like adding indexes, or CLUSTERing on an index. It depends what other 
queries you are wanting to run.


I don't know how much update/insert activity there will be on your 
database. However, if you were to add an index on the URL on both tables, 
then CLUSTER both tables on those indexes, and ANALYSE, then this query 
should run as a merge join, and be pretty quick.


However, this is always going to be a long-running query, because it 
accesses at least one whole table scan of a large table.


Matthew

--
Finger to spiritual emptiness underlying everything.
   -- How a foreign C manual referred to a pointer to void.

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Planner should use index on a LIKE 'foo%' query

2008-06-30 Thread Moritz Onken


Am 30.06.2008 um 16:59 schrieb Steinar H. Gunderson:


On Mon, Jun 30, 2008 at 09:16:06AM +0200, Moritz Onken wrote:

the result table has 20.000.000 records and the item table 5.000.000.
The query

select count(1) from result where url in (select shorturl from item
where shorturl = result.url);

will take about 8 hours (still running, just guessing). Is this
reasonable on a system with 1 GB of RAM and a AMD Athlon 64 3200+
processor? (1 SATA HDD)


I really don't see what your query tries to accomplish. Why would  
you want

url IN (... where .. = url)? Wouldn't you want a different qualifier
somehow?


well, it counts the number of rows with urls which already exist in  
another

table.
How would you describe the query?
If the (select shorturl from item where shorturl = result.url)
clause is empty the row is not counted, that's what I want...

greetings,

moritz

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Planner should use index on a LIKE 'foo%' query

2008-06-30 Thread Matthew Wakeling

On Mon, 30 Jun 2008, Moritz Onken wrote:

select count(1) from result where url in (select shorturl from item
where shorturl = result.url);


I really don't see what your query tries to accomplish. Why would you want
url IN (... where .. = url)? Wouldn't you want a different qualifier
somehow?


well, it counts the number of rows with urls which already exist in another
table.
How would you describe the query?
If the (select shorturl from item where shorturl = result.url)
clause is empty the row is not counted, that's what I want...


The thing here is that you are effectively causing Postgres to run a 
sub-select for each row of the result table, each time generating either 
an empty list or a list with one or more identical URLs. This is 
effectively forcing a nested loop. In a way, you have two constraints 
where you only need one.


You can safely take out the constraint in the subquery, so it is like 
this:


SELECT COUNT(*) FROM result WHERE url IN (SELECT shorturl FROM item);

This will generate equivalent results, because those rows that didn't 
match the constraint wouldn't have affected the IN anyway. However, it 
will alter the performance, because the subquery will contain more 
results, but it will only be run once, rather than multiple times. This is 
effectively forcing a hash join (kind of).


Whereas if you rewrite the query as I demonstrated earlier, then you allow 
Postgres to make its own choice about which join algorithm will work best.


Matthew

--
Anyone who goes to a psychiatrist ought to have his head examined.

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Planner should use index on a LIKE 'foo%' query

2008-06-30 Thread Moritz Onken


The thing here is that you are effectively causing Postgres to run a  
sub-select for each row of the result table, each time generating  
either an empty list or a list with one or more identical URLs. This  
is effectively forcing a nested loop. In a way, you have two  
constraints where you only need one.


You can safely take out the constraint in the subquery, so it is  
like this:


SELECT COUNT(*) FROM result WHERE url IN (SELECT shorturl FROM item);

This will generate equivalent results, because those rows that  
didn't match the constraint wouldn't have affected the IN anyway.  
However, it will alter the performance, because the subquery will  
contain more results, but it will only be run once, rather than  
multiple times. This is effectively forcing a hash join (kind of).


Whereas if you rewrite the query as I demonstrated earlier, then you  
allow Postgres to make its own choice about which join algorithm  
will work best.


Matthew


Thank you! I learned a lot today :-)
I thought the subquery will be run on every row thus I tried to make  
it as fast as possible by using a where clause. I didn't try your  
first query on the hole table so it could be faster than mine approach.


greetings,

moritz

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] Planner should use index on a LIKE 'foo%' query

2008-06-28 Thread Moritz Onken

Hi,

I have a query

select count(*)
 from result
 where exists
   (select * from item where item.url LIKE result.url || '%' limit 1);

which basically returns the number of items which exist in table  
result and match a URL in table item by its prefix.
I read all about idexes (http://www.postgresql.org/docs/8.3/static/indexes-types.html 
) and especially this part:
The optimizer can also use a B-tree index for queries involving the  
pattern matching operators LIKE and ~ if the pattern is a constant and  
is anchored to the beginning of the string — for example, col LIKE 'foo 
%' or col ~ '^foo', but not col LIKE '%bar'.


Since my server does not use the C locale I created the index with

CREATE INDEX test_index
  ON item
  USING btree
  (url varchar_pattern_ops);

which works fine for queries like

 SELECT distinct url from item where url like 'http://www.micro%'  
limit 10;


explain analyze shows:
Limit  (cost=9.53..9.54 rows=1 width=34) (actual time=80.809..80.856  
rows=10 loops=1)
  -  Unique  (cost=9.53..9.54 rows=1 width=34) (actual  
time=80.806..80.835 rows=10 loops=1)
-  Sort  (cost=9.53..9.53 rows=1 width=34) (actual  
time=80.802..80.812 rows=11 loops=1)

  Sort Key: url
  Sort Method:  quicksort  Memory: 306kB
  -  Index Scan using test_index on item   
(cost=0.00..9.52 rows=1 width=34) (actual time=0.030..6.165 rows=2254  
loops=1)
Index Cond: (((url)::text ~=~ 'http:// 
www.micro'::text) AND ((url)::text ~~ 'http://www.micrp'::text))

Filter: ((url)::text ~~ 'http://www.micro%'::text)
Total runtime: 80.908 ms

which is great but if I run the query with the subselect it uses a  
sequence scan:


select *
 from result
 where exists
   (select * from item where item.url LIKE result.url || '%' limit 1)  
limit 10;


Limit  (cost=0.00..96.58 rows=10 width=36) (actual  
time=12.660..35295.928 rows=10 loops=1)
  -  Seq Scan on result  (cost=0.00..93886121.77 rows=9721314  
width=36) (actual time=12.657..35295.906 rows=10 loops=1)

Filter: (subplan)
SubPlan
  -  Limit  (cost=0.00..4.81 rows=1 width=42) (actual  
time=2715.061..2715.061 rows=1 loops=13)
-  Seq Scan on item  (cost=0.00..109589.49  
rows=22781 width=42) (actual time=2715.055..2715.055 rows=1 loops=13)
  Filter: ((url)::text ~~ (($0)::text ||  
'%'::text))

Total runtime: 35295.994 ms


The only explaination is that I don't use a constant when comparing  
the values. But actually it is a constant...



any help?

using postgres 8.3.3 on ubuntu.



Cheers,

moritz


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Planner should use index on a LIKE 'foo%' query

2008-06-28 Thread Steinar H. Gunderson
On Sat, Jun 28, 2008 at 06:24:42PM +0200, Moritz Onken wrote:
  SELECT distinct url from item where url like 'http://www.micro%' limit 
 10;

Here, the planner knows the pattern beforehand, and can see that it's a
simple prefix.
 select *
  from result
  where exists
(select * from item where item.url LIKE result.url || '%' limit 1)  
 limit 10;

Here it cannot (what if result.url was '%foo%'?).

Try using something like (item.url = result.url  item.url = result.url ||
'z'), substituting an appropriately high character for 'z'.

 The only explaination is that I don't use a constant when comparing the 
 values. But actually it is a constant...

It's not a constant at planning time.

Also note that you'd usually want to use IN instead of a WHERE EXISTS.

/* Steinar */
-- 
Homepage: http://www.sesse.net/

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance