Re: [PERFORM] LIKE search and performance

2007-06-06 Thread James Mansion

[EMAIL PROTECTED] wrote:

What is a real life example where an intelligent and researched
database application would issue a like or ilike query as their
primary condition in a situation where they expected very high
selectivity?
  
In my case the canonical example is to search against textual keys where 
the search is performed automatically once the user has typed enough data 
and paused.  In almost all cases the '%' trails, so in effect I'm looking 
for 'starts with'.  Usually the search will have a specified upper number 
of returned rows, if that facility is available.  I realise in this case 
that matching against the index does not allow the match count unless we 
check MVCC as we go, but I don't see why another thread can't be doing 
that.
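
A concrete sketch of the query shape described above (table and column 
names are purely illustrative, not from any real schema):

```sql
-- Autocomplete lookup: trailing wildcard plus a capped result count.
SELECT id, name
FROM customer            -- hypothetical table
WHERE name LIKE 'smi%'   -- the user has typed "smi" and paused
ORDER BY name
LIMIT 20;                -- the "specified upper number of returned rows"
```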

James


---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [PERFORM] LIKE search and performance

2007-05-25 Thread Richard Huxton

[EMAIL PROTECTED] wrote:

And since it's basically impossible to know the selectivity of this kind
of where condition, I doubt the planner would ever realistically want to
choose that plan anyway because of its poor worst-case behavior.


What is a real life example where an intelligent and researched
database application would issue a like or ilike query as their
primary condition in a situation where they expected very high
selectivity?

Avoiding a poor worst-case behaviour for a worst-case behaviour that
won't happen doesn't seem practical.


But if you are also filtering on e.g. date, and that has an index with 
good selectivity, you're never going to use the text index anyway are 
you? If you've only got a dozen rows to check against, might as well 
just read them in.


The only time it's worth considering the behaviour at all is *if* the 
worst-case is possible.


--
  Richard Huxton
  Archonet Ltd



Re: [PERFORM] LIKE search and performance

2007-05-25 Thread mark
On Fri, May 25, 2007 at 09:13:25AM +0100, Richard Huxton wrote:
 [EMAIL PROTECTED] wrote:
 And since it's basically impossible to know the selectivity of this kind
 of where condition, I doubt the planner would ever realistically want to
 choose that plan anyway because of its poor worst-case behavior.
 What is a real life example where an intelligent and researched
 database application would issue a like or ilike query as their
 primary condition in a situation where they expected very high
 selectivity?
 Avoiding a poor worst-case behaviour for a worst-case behaviour that
 won't happen doesn't seem practical.
 But if you are also filtering on e.g. date, and that has an index with 
 good selectivity, you're never going to use the text index anyway are 
 you? If you've only got a dozen rows to check against, might as well 
 just read them in.
 The only time it's worth considering the behaviour at all is *if* the 
 worst-case is possible.

I notice you did not provide a real life example as requested. :-)

This seems like an ivory tower restriction. Not allowing best performance
in a common situation vs not allowing worst performance in a not-so-common
situation.

Cheers,
mark

-- 
[EMAIL PROTECTED] / [EMAIL PROTECTED] / [EMAIL PROTECTED] 
__
.  .  _  ._  . .   .__.  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/|_ |\/|  |  |_  |   |/  |_   | 
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada

  One ring to rule them all, one ring to find them, one ring to bring them all
   and in the darkness bind them...

   http://mark.mielke.cc/




Re: [PERFORM] LIKE search and performance

2007-05-25 Thread Richard Huxton

[EMAIL PROTECTED] wrote:

On Fri, May 25, 2007 at 09:13:25AM +0100, Richard Huxton wrote:

[EMAIL PROTECTED] wrote:

And since it's basically impossible to know the selectivity of this kind
of where condition, I doubt the planner would ever realistically want to
choose that plan anyway because of its poor worst-case behavior.

What is a real life example where an intelligent and researched
database application would issue a like or ilike query as their
primary condition in a situation where they expected very high
selectivity?
Avoiding a poor worst-case behaviour for a worst-case behaviour that
won't happen doesn't seem practical.
But if you are also filtering on e.g. date, and that has an index with 
good selectivity, you're never going to use the text index anyway are 
you? If you've only got a dozen rows to check against, might as well 
just read them in.
The only time it's worth considering the behaviour at all is *if* the 
worst-case is possible.


I notice you did not provide a real life example as requested. :-)


OK - any application that allows user-built queries: choose column: 
"foo", choose filter: "contains", choose target: "bar".


Want another? Any application that has a "search by name" box - users 
can (and do) put one letter in and hit enter.


Unfortunately you don't always have control over the selectivity of 
queries issued.



This seems like an ivory tower restriction. Not allowing best performance
in a common situation vs not allowing worst performance in a not-so-common
situation.


What best performance plan are you thinking of? I'm assuming we're 
talking about trailing-wildcard matches here, rather than contains 
style matches.


--
  Richard Huxton
  Archonet Ltd



Re: [PERFORM] LIKE search and performance

2007-05-25 Thread PFC



OK - any application that allows user-built queries: choose column:  
"foo", choose filter: "contains", choose target: "bar".


Want another? Any application that has a "search by name" box - users  
can (and do) put one letter in and hit enter.


Unfortunately you don't always have control over the selectivity of  
queries issued.


-*- HOW TO MAKE A SEARCH FORM -*-

Imagine you have to code the search on IMDB.

This is what a smart developer would do:

First, he uses AJAX autocompletion, so the thing is reactive.
	Then, he does not bother the user with a many-fields form. Instead of  
forcing the user to think (users HATE that), he writes smart code.
	Does Google Maps have separate fields for country, city, street,  
zipcode? No. Because Google is about as smart as it gets.


So, you parse the user query.

	If the user types, for instance, less than 3 letters (say, "spi"), he  
probably wants stuff that *begins* with those letters. There is no point  
in searching for the letter "a" in a million-movie-title database.
	So, if the user types "spi", you display name LIKE 'spi%', which is  
indexed and very fast. And since you're smart, you use AJAX. And you display  
only the most popular results (ie. most clicked on).


http://imdb.com/find?s=all&q=spi

	Since 99% of the time the user wanted "spiderman" or "spielberg", you're  
done and he's happy. Users like being happy.
	If the user just types "a", you display the first 10 things that start  
with "a"; this is useless, but the user will marvel at your AJAX skillz.  
Then he will probably type in a few other letters.


	Then, if the user uses his space bar and types "spi 1980", you'll  
recognize a year and display Spielberg's movies from 1980.
	Converting your strings to phonetics is also a good idea, since only  
about 0.7% of the l33T teenagers can spell stuff, especially "spiElberg".


	Only the guy who wants to know who had sex with Marilyn Monroe on the  
17th day of the shooting of Basic Instinct will need to use the Advanced  
search.


	If you detect several words, then switch to a prefix-based fulltext  
search like Xapian, which utterly rocks.
	Example: the user types "savin priv", you search for "savin* NEAR  
priv*", and you display "saving private ryan" before he has even finished  
typing the second word of his query. Users love that; they feel  
understood, they will click on your ads and buy your products.


	In all cases, search results should be limited to less than 100 to be  
easy on the database. The user doesn't care about a search returning more  
than 10-20 results; he will just rephrase the query, and the time taken to  
fetch those thousands of records with name LIKE '%a%' will have been  
utterly lost. Who goes to page 2 in Google results?


BOTTOM LINE : databases don't think, you do.



Re: [PERFORM] LIKE search and performance

2007-05-25 Thread Richard Huxton

PFC wrote:


OK - any application that allows user-built queries: choose column: 
"foo", choose filter: "contains", choose target: "bar".


Want another? Any application that has a "search by name" box - users 
can (and do) put one letter in and hit enter.


Unfortunately you don't always have control over the selectivity of 
queries issued.


-*- HOW TO MAKE A SEARCH FORM -*-

Imagine you have to code the search on IMDB.

This is what a smart developer would do


All good domain-specific tips to provide users with a satisfying 
search-experience.


None of which address the question of what plan PG should produce for: 
SELECT * FROM bigtable WHERE foo LIKE 's%'


--
  Richard Huxton
  Archonet Ltd



Re: [PERFORM] LIKE search and performance

2007-05-25 Thread Joshua D. Drake

[EMAIL PROTECTED] wrote:


I am speaking of contains, as contains is the one that was said to
require a seqscan. I am questioning why it requires a seqscan. The
claim was made that with MVCC, the index is insufficient to check
for visibility and that the table would need to be accessed anyways,
therefore a seqscan is required. I question whether a like '%bar%'
should be considered a high selectivity query in the general case.
I question whether a worst case should be assumed.


If you are doing %bar% you should be using pg_trgm or tsearch2.
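
A sketch of the trigram approach (assumes the pg_trgm contrib module is 
loaded; table, column and index names are illustrative):

```sql
-- Build a trigram index so substring-style matching can avoid a full seqscan.
CREATE INDEX bigtable_foo_trgm ON bigtable USING gist (foo gist_trgm_ops);

-- Fuzzy matching via the trigram similarity operator can use the index:
SELECT * FROM bigtable WHERE foo % 'bar';

-- (On later PostgreSQL releases, LIKE '%bar%' itself can also be served
--  by a trigram index.)
```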

J




Perhaps I question too much? :-)

Cheers,
mark






Re: [PERFORM] LIKE search and performance

2007-05-25 Thread Richard Huxton

[EMAIL PROTECTED] wrote:

On Fri, May 25, 2007 at 04:35:22PM +0100, Richard Huxton wrote:

I notice you did not provide a real life example as requested. :-)
OK - any application that allows user-built queries: choose column: 
"foo", choose filter: "contains", choose target: "bar".
Want another? Any application that has a "search by name" box - users 
can (and do) put one letter in and hit enter.
Unfortunately you don't always have control over the selectivity of 
queries issued.


The database has 10 million records. The user enters "bar" and it
translates to '%bar%'. You are suggesting that we expect "bar" to match
1 million+ records? :-)


I was saying that you don't know. At least, I don't know of any cheap 
way of gathering full substring stats or doing a full substring 
indexing. Even tsearch2 can't do that.



I hope not. I would define this as bad process. I would also use LIMIT
to something like 100.


Yes, but that's not the query we're talking about is it? If possible you 
don't do '%bar%' searches at all. If you do, you try to restrict it 
further or LIMIT the results. There's nothing to discuss in these cases.



This seems like an ivory tower restriction. Not allowing best performance
in a common situation vs not allowing worst performance in a not-so-common
situation.
What best performance plan are you thinking of? I'm assuming we're 
talking about trailing-wildcard matches here, rather than contains 
style matches.


Trailing-wildcard already uses B-Tree index, does it not?


Yes, it searches the btree and then checks the data for visibility. I 
thought that was what you felt could be worked around. It appears I was 
wrong.



I am speaking of contains, as contains is the one that was said to
require a seqscan. I am questioning why it requires a seqscan. 


Well, you seemed to be suggesting you had something better in mind. At 
least, that was my reading of your original post.


 The

claim was made that with MVCC, the index is insufficient to check
for visibility 


True, for PG's implementation of MVCC. You *could* have visibility in 
each index, but that obviously would take more space. For a table with 
many indexes, that could be a *lot* more space. You also have to update 
all that visibility information too.


 and that the table would need to be accessed anyways,

therefore a seqscan is required. I question whether a like '%bar%'
should be considered a high selectivity query in the general case.
I question whether a worst case should be assumed.


Well, the general rule-of-thumb is only about 10% for the changeover 
between index & seq-scan. That is, once you are reading 10% of the rows 
on disk (to check visibility) you might as well read them all (since 
you'll be reading most of the blocks anyway if the rows are randomly 
distributed). If you are doing SELECT * from that table then you'll want 
all that data you read. If you are doing SELECT count(*) then you only 
wanted the visibility :-(


Now you and I can look at a substring and probably make a good guess how 
common it is (assuming we know the targets are British surnames or 
Japanese towns). PG needs one number - or rather, it picks one number 
for each length of search-string (afaik).



Perhaps I question too much? :-)


Not sure it's possible to question too much :-)
However, you need to provide answers occasionally too - what numbers 
would you pick? :-)


--
  Richard Huxton
  Archonet Ltd



Re: [PERFORM] LIKE search and performance

2007-05-25 Thread Richard Huxton

PFC wrote:
None of which address the question of what plan PG should produce for: 
SELECT * FROM bigtable WHERE foo LIKE 's%'


Ah, this one already uses the btree since the '%' is at the end.
My point is that a search like this will yield too many results to 
be useful to the user anyway, so optimizing its performance is a kind of 
red herring.


At the *application level* yes.
At the *query planner* level no.

At the query planner level I just want it to come up with the best plan 
it can. The original argument was that PG's estimate of the number of 
matching rows was too optimistic (or pessimistic) in the case where we 
are doing a contains substring-search.


--
  Richard Huxton
  Archonet Ltd



Re: [PERFORM] LIKE search and performance

2007-05-25 Thread PFC
None of which address the question of what plan PG should produce for:  
SELECT * FROM bigtable WHERE foo LIKE 's%'


Ah, this one already uses the btree since the '%' is at the end.
	My point is that a search like this will yield too many results to be  
useful to the user anyway, so optimizing its performance is a kind of red  
herring.





Re: [PERFORM] LIKE search and performance

2007-05-25 Thread Gregory Stark
Richard Huxton [EMAIL PROTECTED] writes:

 Now you and I can look at a substring and probably make a good guess how 
 common
 it is (assuming we know the targets are British surnames or Japanese towns). 
 PG
 needs one number - or rather, it picks one number for each length of
 search-string (afaik).

I don't think that's true. Postgres calculates the lower and upper bound
implied by the search pattern and then uses the histogram to estimate how
selective that range is. It's sometimes surprisingly good but obviously it's
not perfect.
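
In other words, a trailing-wildcard pattern is treated roughly as a range 
predicate (a sketch, with an illustrative table and column name):

```sql
-- The pattern:
SELECT * FROM t WHERE name LIKE 'abc%';
-- is estimated (and, with a usable index, planned) much like the range:
SELECT * FROM t WHERE name >= 'abc' AND name < 'abd';
```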

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com




Re: [PERFORM] LIKE search and performance

2007-05-25 Thread Richard Huxton

Gregory Stark wrote:

Richard Huxton [EMAIL PROTECTED] writes:


Now you and I can look at a substring and probably make a good guess how common
it is (assuming we know the targets are British surnames or Japanese towns). PG
needs one number - or rather, it picks one number for each length of
search-string (afaik).


I don't think that's true. Postgres calculates the lower and upper bound
implied by the search pattern and then uses the histogram to estimate how
selective that range is. It's sometimes surprisingly good but obviously it's
not perfect.


Sorry - I'm obviously picking my words badly today.

I meant for the contains substring match. It gives different (goes 
away and checks...yes) predictions based on string length. So it guesses 
that LIKE '%aaa%' will match more than LIKE '%aaaa%'. Of course, if we 
were matching surnames you and I could say that this is very unlikely, 
but without some big statistics table I guess there's not much more PG 
can do.


For a trailing wildcard LIKE 'aaa%' it can and does, as you say, convert 
this into something along the lines of (>= 'aaa' AND < 'aab'). Although 
IIRC that depends if your locale allows such (not sure, I don't really 
use non-C/non-English locales enough).
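
For reference, the usual workaround in a non-C locale is to declare the 
index with the pattern operator class (sketch; table, column and index 
names are illustrative):

```sql
-- A plain btree index in a non-C locale generally cannot serve LIKE 'aaa%'.
-- An index built with text_pattern_ops can:
CREATE INDEX t_name_pattern_idx ON t (name text_pattern_ops);

SELECT * FROM t WHERE name LIKE 'aaa%';  -- can now use the index
```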


--
  Richard Huxton
  Archonet Ltd



Re: [PERFORM] LIKE search and performance

2007-05-24 Thread Andy
Thank you all for the answers. 
I will try your suggestions and see what that brings in terms of
performance. 

Andy. 

 -Original Message-
 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED] On Behalf Of 
 Rigmor Ukuhe
 Sent: Wednesday, May 23, 2007 6:52 PM
 Cc: pgsql-performance@postgresql.org
 Subject: Re: [PERFORM] LIKE search and performance
 
 Andy wrote:
  Hi,
   
  I have a table with varchar and text columns, and I have to search 
  through these text in the whole table.
   
  An example would be:
  SELECT * FROM table
   WHERE name like '%john%' or 
 street like '%srt%'
   
  Anyway, the query planner always does seq scan on the whole 
 table and 
  that takes some time. How can this be optimized or made in 
 another way 
  to be faster?
 
 Use tsearch2 
 (http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/) for 
 full text indexing.
 
 Rigmor
 
   
  I tried to make indexes on the columns but no success.
   
  PG 8.2
   
  Regards,
  Andy.
 
 
 
 
 
 




Re: [PERFORM] LIKE search and performance

2007-05-24 Thread James Mansion

Alexander Staubo wrote:

On 5/23/07, Andy [EMAIL PROTECTED] wrote:

An example would be:
SELECT * FROM table
 WHERE name like '%john%' or street like 
'%srt%'


Anyway, the query planner always does seq scan on the whole table and 
that

takes some time. How can this be optimized or made in another way to be
faster?


There's no algorithm in existence that can index arbitrary
substrings the way you think. The only rational way to accomplish this
is to first break the text into substrings using some algorithm (eg.,
words delimited by whitespace and punctuation), and index the
substrings individually.
That seems rather harsh.  If I'd put an index on each of these columns 
I'd certainly expect it to use the indices - and I'm pretty sure that 
Sybase would.  I'd expect it to scan the index leaf pages instead of the 
table itself - they should be much more compact and also likely to be 
hot in cache.

Why *wouldn't* the planner do this?

James




Re: [PERFORM] LIKE search and performance

2007-05-24 Thread Magnus Hagander
James Mansion wrote:
 Alexander Staubo wrote:
 On 5/23/07, Andy [EMAIL PROTECTED] wrote:
 An example would be:
 SELECT * FROM table
  WHERE name like '%john%' or street like
 '%srt%'

 Anyway, the query planner always does seq scan on the whole table and
 that
 takes some time. How can this be optimized or made in another way to be
 faster?

 There's no algorithm in existence that can index arbitrary
 substrings the way you think. The only rational way to accomplish this
 is to first break the text into substrings using some algorithm (eg.,
 words delimited by whitespace and punctuation), and index the
 substrings individually.
 That seems rather harsh.  If I'd put an index on each of these columns
 I'd certainly
 expect it to use the indices - and I'm pretty sure that Sybase would. 
 I'd expect
 it to scan the index leaf pages instead of the table itself - they
 should be much
 more compact and also likely to be hot in cache.

If Sybase is still like SQL Server (or the other way around), it *may*
end up scanning the index *IFF* the index is a clustered index. If it's
a normal index, it will do a sequential scan on the table.

Yes, the leaf page of the index is more compact, but you also have to
scan the intermediate pages to get to the leaf pages. But again, it can
be a win. On such a system.

It's not a win on PostgreSQL, because of our MVCC implementation. We
need to scan *both* index *and* data pages if we go down that route, in
which case it's a lot faster to just scan the data pages alone.

I don't really know how MSSQL deals with this now that they have
MVCC-ish behavior, and I have no idea at all if sybase has anything like
MVCC.


 Why *wouldn't* the planner do this?

The question should be why the optimizer doesn't consider it, and the
executor uses it. The planner really doesn't decide that part :-)
Hopefully, the answer can be found above.

//Magnus



Re: [PERFORM] LIKE search and performance

2007-05-24 Thread James Mansion



If Sybase is still like SQL Server (or the other way around), it *may*
end up scanning the index *IFF* the index is a clustered index. If it's
a normal index, it will do a sequential scan on the table.

  
Are you sure it's not covered?  Have to check at work - but I'm off next 
week so it'll have to wait.



It's not a win on PostgreSQL, because of our MVCC implementation. We
need to scan *both* index *and* data pages if we go down that route, in
which case it's a lot faster to just scan the data pages alone.

  
Why do you need to go to all the data pages - doesn't the index 
structure contain all the keys, so you can prefilter and then check to 
see if the *matched* items are still in view?  I'll be the first to 
admit I know zip about Postgres, but it seems odd - doesn't the index 
contain copies of the key values?


I suspect that I mis-spoke with 'leaf'.  I really just mean 'all index 
pages with data', since the scan
does not even need to be in index order, just a good way to get at the 
data in a compact way.






Re: [PERFORM] LIKE search and performance

2007-05-24 Thread Mark Lewis
On Thu, 2007-05-24 at 21:54 +0100, James Mansion wrote:
  If Sybase is still like SQL Server (or the other way around), it *may*
  end up scanning the index *IFF* the index is a clustered index. If it's
  a normal index, it will do a sequential scan on the table.
 

 Are you sure its not covered?  Have to check at work - but I'm off next 
 week so it'll have to wait.
 
  It's not a win on PostgreSQL, because of our MVCC implementation. We
  need to scan *both* index *and* data pages if we go down that route, in
  which case it's a lot faster to just scan the data pages alone.
 

 Why do you need to go to all the data pages - doesn't the index 
 structure contain all the keys so
 you prefilter and then check to see if the *matched* items are still in 
 view?  I'll be first to admit I
 know zip about Postgres, but it seems odd - doesn't the index contain 
 copies of the key values?.
 
 I suspect that I mis-spoke with 'leaf'.  I really just mean 'all index 
 pages with data', since the scan
 does not even need to be in index order, just a good way to get at the 
 data in a compact way.

PG could scan the index looking for matches first and only load the
actual rows if it found a match, but that could only be a possible win
if there were very few matches, because the difference in cost between a
full index scan and a sequential scan would need to be greater than the
cost of randomly fetching all of the matching data rows from the table
to look up the visibility information.  

So yes it would be possible, but the odds of it being faster than a
sequential scan are small enough to make it not very useful.

And since it's basically impossible to know the selectivity of this kind
of where condition, I doubt the planner would ever realistically want to
choose that plan anyway because of its poor worst-case behavior.

-- Mark



Re: [PERFORM] LIKE search and performance

2007-05-24 Thread Craig James

Mark Lewis wrote:


PG could scan the index looking for matches first and only load the
actual rows if it found a match, but that could only be a possible win
if there were very few matches, because the difference in cost between a
full index scan and a sequential scan would need to be greater than the
cost of randomly fetching all of the matching data rows from the table
to look up the visibility information.  


Just out of curiosity: Does Postgres store a duplicate of the data in the index, even for long strings?  I thought indexes only had to 
store the string up to the point where there was no ambiguity; for example, if I have "missing", "mississippi" and 
"misty", the index only needs "missin", "missis" and "mist" in the actual index.  This would make 
it impossible to use a full index scan for a LIKE query.

Craig



Re: [PERFORM] LIKE search and performance

2007-05-24 Thread Alvaro Herrera
Craig James wrote:
 Mark Lewis wrote:
 
 PG could scan the index looking for matches first and only load the
 actual rows if it found a match, but that could only be a possible win
 if there were very few matches, because the difference in cost between a
 full index scan and a sequential scan would need to be greater than the
 cost of randomly fetching all of the matching data rows from the table
 to look up the visibility information.  
 
 Just out of curiosity: Does Postgres store a duplicate of the data in the 
 index, even for long strings?  I thought indexes only had to store the 
 string up to the point where there was no ambiguity, for example, if I have 
 "missing", "mississippi" and "misty", the index only needs "missin", 
 "missis" and "mist" in the actual index.

What would happen when you inserted a new tuple with just "miss"?  You
would need to expand all the other tuples in the index.

-- 
Alvaro Herrera http://www.flickr.com/photos/alvherre/
"You can live only once, but if you do it well, once is enough"



Re: [PERFORM] LIKE search and performance

2007-05-24 Thread mark
On Thu, May 24, 2007 at 02:02:40PM -0700, Mark Lewis wrote:
 PG could scan the index looking for matches first and only load the
 actual rows if it found a match, but that could only be a possible win
 if there were very few matches, because the difference in cost between a
 full index scan and a sequential scan would need to be greater than the
 cost of randomly fetching all of the matching data rows from the table
 to look up the visibility information.  

 So yes it would be possible, but the odds of it being faster than a
 sequential scan are small enough to make it not very useful.

 And since it's basically impossible to know the selectivity of this kind
 of where condition, I doubt the planner would ever realistically want to
 choose that plan anyway because of its poor worst-case behavior.

What is a real life example where an intelligent and researched
database application would issue a like or ilike query as their
primary condition in a situation where they expected very high
selectivity?

Avoiding a poor worst-case behaviour for a worst-case behaviour that
won't happen doesn't seem practical.

What real life examples, that could not be implemented a better way,
would behave poorly if like/ilike looked at the index first to filter?

I don't understand... :-)

Cheers,
mark





Re: [PERFORM] LIKE search and performance

2007-05-24 Thread Craig James

Alvaro Herrera wrote:
 Just out of curiosity: Does Postgres store a duplicate of the data in the 
index, even for long strings?  I thought indexes only had to store the 
string up to the point where there was no ambiguity, for example, if I have 
"missing", "mississippi" and "misty", the index only needs "missin", 
"missis" and "mist" in the actual index.


What would happen when you inserted a new tuple with just "miss"?  You
would need to expand all the other tuples in the index.


That's right.  This technique, used by some index implementations, is a tradeoff between 
size and update speed.  Most words in most natural languages can be distinguished by the 
first few characters.  The chance of having to modify more than a few surrounding nodes 
when you insert "miss" is small, so some implementations choose this method.  
Other implementations choose to store the full string.  I was just curious which method 
Postgres uses.

Craig




Re: [PERFORM] LIKE search and performance

2007-05-24 Thread PFC



PG could scan the index looking for matches first and only load the
actual rows if it found a match, but that could only be a possible win
if there were very few matches, because the difference in cost between a
full index scan and a sequential scan would need to be greater than the
cost of randomly fetching all of the matching data rows from the table
to look up the visibility information.


	If you need to do that kind of thing, ie. seq scanning a table checking  
only one column among a large table of many columns, then don't use an  
index. An index, being a btree, needs to be traversed in order (or else, a  
lot of locking problems come up) which means some random accesses.


	So, you could make a table with 2 columns, updated via triggers: your  
text field and the primary key of your main table. Scanning that would be  
faster.


Still, a better solution for searching in text is:

- tsearch2 if you need whole words
- trigrams for any substring match
- xapian for full text search with wildcards (i.e. John* matches Johnny)

	Speed-wise those three will beat any seq scan on a large table by a huge  
margin.
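The trigram approach can be illustrated with a toy version (a hypothetical simplification of what the pg_trgm extension does; the real extension also pads strings and backs the lookup with GiST/GIN indexes):

```python
def trigrams(s):
    """All 3-character substrings of s (a simplified take on pg_trgm)."""
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

def build_index(rows):
    """Map each trigram to the set of row ids whose text contains it."""
    index = {}
    for rowid, text in rows.items():
        for t in trigrams(text):
            index.setdefault(t, set()).add(rowid)
    return index

def search_substring(index, rows, needle):
    """Find rows whose text contains needle: intersect the trigram
    postings as a pre-filter, then recheck to weed out false positives."""
    needle = needle.lower()
    if len(needle) < 3:
        # too short for trigram filtering: fall back to a full scan
        return {r for r, text in rows.items() if needle in text.lower()}
    candidates = None
    for t in trigrams(needle):
        hits = index.get(t, set())
        candidates = hits if candidates is None else candidates & hits
    return {r for r in (candidates or set()) if needle in rows[r].lower()}
```

The pre-filter only touches the (small) trigram postings, which is where the win over a seq scan of wide rows comes from.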


---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

   http://www.postgresql.org/about/donate


[PERFORM] LIKE search and performance

2007-05-23 Thread Andy
Hi, 
 
I have a table with varchar and text columns, and I have to search through
this text across the whole table. 
 
An example would be:
SELECT * FROM table
 WHERE name like '%john%' or street like '%srt%'
 
Anyway, the query planner always does a seq scan on the whole table, and that
takes some time. How can this be optimized or done in another way to be
faster?
 
I tried to make indexes on the columns but no success. 
 
PG 8.2
 
Regards,
Andy.


Re: [PERFORM] LIKE search and performance

2007-05-23 Thread Richard Huxton

Andy wrote:

SELECT * FROM table
 WHERE name like '%john%' or street like '%srt%'
 
Anyway, the query planner always does seq scan on the whole table and that
takes some time. How can this be optimized or made in another way to be
faster?
 
I tried to make indexes on the columns but no success. 


None of the normal indexes will work for finding text in the middle of a 
string. If you do think of a simple way of solving this, drop a short 
letter explaining your idea to your local patent office followed by the 
Nobel prize committee.


However, one of the contrib packages is tsearch2 which is designed to 
do keyword searches on text for you. It'll also handle stemming (e.g. 
"search" will match "searching" etc.) with the right choice of 
dictionary. Loads of other clever stuff in it too.


It's one of the optional packages with most Linux packaging systems and 
on the Windows one too. If you install from source see the contrib/ 
directory for details.


--
  Richard Huxton
  Archonet Ltd

---(end of broadcast)---


Re: [PERFORM] LIKE search and performance

2007-05-23 Thread Guido Neitzer

On 23.05.2007 at 09:08, Andy wrote:

I have a table with varchar and text columns, and I have to search  
through these text in the whole table.


An example would be:
SELECT * FROM table
 WHERE name like '%john%' or street  
like '%srt%'


Anyway, the query planner always does seq scan on the whole table  
and that takes some time. How can this be optimized or made in  
another way to be faster?


The problem is that normal indexes cannot be used for "contains"  
queries.


If you need fulltext search capabilities you have to take a look at  
tsearch2 or an external search engine like Lucene.


cug

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] LIKE search and performance

2007-05-23 Thread Alexander Staubo

On 5/23/07, Andy [EMAIL PROTECTED] wrote:

An example would be:
SELECT * FROM table
 WHERE name like '%john%' or street like '%srt%'

Anyway, the query planner always does seq scan on the whole table and that
takes some time. How can this be optimized or made in another way to be
faster?


There's no algorithm in existence that can index arbitrary
substrings the way you think. The only rational way to accomplish this
is to first break the text into substrings using some algorithm (eg.,
words delimited by whitespace and punctuation), and index the
substrings individually.

You can do this using vanilla PostgreSQL, and you can use Tsearch2
and/or its GIN indexes to help speed up the searches. The simplest
solution would be to put all the substrings in a table, one row per
substring, along with an attribute referencing the source table/row.

At this point you have effectively reduced your search space: you can
use a query to isolate the words in your dictionary that contain the
substrings. So a query might be:

 select ... from persons where id in (
   select person_id from person_words
   where word like '%john%'
 )

The like search, even though it will use a sequential scan, is bound
to be much faster on these small words than searching for substrings
through large gobs of text in the persons table.
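A toy model of that side-table approach (the person_words name is the poster's; the Python here is only an illustration of the reduced search space, not Postgres internals):

```python
import re

def words_of(text):
    """Split a text field into lowercase words, delimited by whitespace
    and punctuation -- one simple choice of splitting algorithm."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def build_word_table(persons):
    """Emulate the person_words table: one (word, person_id) row per word.
    In the database this would be maintained by triggers."""
    table = []
    for pid, fields in persons.items():
        for field in fields:
            for w in words_of(field):
                table.append((w, pid))
    return table

def ids_matching(word_table, needle):
    """The inner subquery: ids of persons having a word containing needle."""
    return {pid for word, pid in word_table if needle in word}
```

Scanning the word table still means checking every word, but each check is against a short token rather than a whole row of text columns.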

Note that PostgreSQL *can* exploit the index for *prefix* matching, if
you tell it to use the right operator class:

 create index persons_name_index on persons (name text_pattern_ops);

or, if you're using varchars (though there is rarely any reason to):

 create index persons_name_index on persons (name varchar_pattern_ops);

(These two may be identical internally. Check the docs.)

Now you can do:

 select ... from persons where name like 'john%';

which will yield a query plan such as this:

Index Scan using persons_name_index on persons  (cost=0.00..8.27
rows=1 width=29) (actual time=0.184..0.373 rows=51 loops=1)
  Index Cond: ((name ~>=~ 'john'::text) AND (name ~<~ 'joho'::text))
  Filter: (name ~~ 'john%'::text)
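The rewrite the planner performs for a trailing wildcard can be mimicked like this (a simplified sketch: the real planner accounts for collations and multi-byte encodings when computing the upper bound):

```python
import bisect

def prefix_range(prefix):
    """Rewrite LIKE 'prefix%' as a half-open range [prefix, upper):
    bump the last character of the prefix by one code point."""
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return prefix, upper

def prefix_scan(sorted_names, prefix):
    """A btree-style range scan over an already-sorted list of keys."""
    lo, hi = prefix_range(prefix)
    start = bisect.bisect_left(sorted_names, lo)
    end = bisect.bisect_left(sorted_names, hi)
    return sorted_names[start:end]
```

This is why the index condition above shows the pair of range operators: every key matching 'john%' sorts at or after 'john' and strictly before 'joho'.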

Alexander.

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] LIKE search and performance

2007-05-23 Thread Rigmor Ukuhe

Andy wrote:

Hi,
 
I have a table with varchar and text columns, and I have to search 
through these text in the whole table.
 
An example would be:

SELECT * FROM table
 WHERE name like '%john%' or street like '%srt%'
 
Anyway, the query planner always does seq scan on the whole table and 
that takes some time. How can this be optimized or made in another way 
to be faster?


Use tsearch2 (http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/) for full 
text indexing.

Rigmor

 
I tried to make indexes on the columns but no success.
 
PG 8.2
 
Regards,

Andy.

---(end of broadcast)---