Re: [HACKERS] [PERFORM] Estimation problem with a LIKE clause containing a /

2007-11-09 Thread Gregory Stark
Tom Lane [EMAIL PROTECTED] writes:

 This rule works for all the locales I have installed ... but I don't
 have any Far Eastern locales installed.  Also, my test cases are only
 covering ASCII characters, and I believe many locales have some non-ASCII
 letters that sort after 'Z'.  I'm not sure how hard we need to try to
 cover those corner cases, though.  It is ultimately only an estimate...

If I understand correctly what we're talking about it's generating estimates
for LIKE 'foo%' using the algorithm which makes sense for C locale which means
generating the next range of values which start with 'foo%'.

It seems to me the problematic situations is when the most-frequent-values
come into play. Being off slightly in the histogram isn't going to generate
very inaccurate estimates but including or not a most-frequent-value could
throw off the estimate severely.

Could we not use the bogus range to calculate the histogram estimate but apply
the LIKE pattern directly to the most-frequent-values instead of applying the
bogus range? Or would that be too much code re-organization for now?

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com
  Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL 
training!

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] [PERFORM] Estimation problem with a LIKE clause containing a /

2007-11-09 Thread Tom Lane
Gregory Stark [EMAIL PROTECTED] writes:
 Could we not use the bogus range to calculate the histogram estimate
 but apply the LIKE pattern directly to the most-frequent-values
 instead of applying the bogus range? Or would that be too much code
 re-organization for now?

We have already done that for quite some time.  It won't help
Guillaume's case anyhow: he's got no MCVs, presumably because the field
is unique.

regards, tom lane

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] [PERFORM] Estimation problem with a LIKE clause containing a /

2007-11-09 Thread Guillaume Smet
On Nov 9, 2007 5:33 PM, Tom Lane [EMAIL PROTECTED] wrote:
 he's got no MCVs, presumably because the field
 is unique.

It is. The ancestors field contains the current folder itself so the
id of the folder (which is the primary key) is in it.

--
Guillaume

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] [PERFORM] Estimation problem with a LIKE clause containing a /

2007-11-09 Thread Guillaume Smet
Tom,

Just to confirm you that your last commit fixed the problem:

lbo=# explain analyze select * from cms_items where ancestors LIKE '1062/%';
  QUERY PLAN
---
 Seq Scan on cms_items  (cost=0.00..688.26 rows=*9097* width=103)
(actual time=0.011..22.605 rows=11326 loops=1)
   Filter: ((ancestors)::text ~~ '1062/%'::text)
 Total runtime: 30.022 ms
(3 rows)

Thanks for your time.

--
Guillaume

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] Estimation problem with a LIKE clause containing a /

2007-11-08 Thread Guillaume Smet
Tom,

On Nov 8, 2007 12:14 AM, Tom Lane [EMAIL PROTECTED] wrote:
 I've applied a patch that might help you:
 http://archives.postgresql.org/pgsql-committers/2007-11/msg00104.php

AFAICS, it doesn't seem to fix the problem. I just compiled
REL8_1_STABLE branch and I still has the following behaviour:
lbo=# ANALYZE cms_items;
ANALYZE
lbo=# explain analyze select * from cms_items where ancestors LIKE '1062/%';
 QUERY PLAN

 Seq Scan on cms_items  (cost=0.00..688.26 rows=1 width=103) (actual
time=0.009..22.258 rows=11326 loops=1)
   Filter: ((ancestors)::text ~~ '1062/%'::text)
 Total runtime: 29.835 ms
(3 rows)

lbo=# show lc_collate;
 lc_collate
-
 fr_FR.UTF-8
(1 row)

Do you see any reason why your patch doesn't change anything in this case?

Thanks.

--
Guillaume

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [PERFORM] Estimation problem with a LIKE clause containing a /

2007-11-08 Thread Tom Lane
Guillaume Smet [EMAIL PROTECTED] writes:
 On Nov 8, 2007 12:14 AM, Tom Lane [EMAIL PROTECTED] wrote:
 I've applied a patch that might help you:
 http://archives.postgresql.org/pgsql-committers/2007-11/msg00104.php

 AFAICS, it doesn't seem to fix the problem.

Hmm, can we see the pg_stats row for the ancestors column?

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PERFORM] Estimation problem with a LIKE clause containing a /

2007-11-08 Thread Tom Lane
Guillaume Smet [EMAIL PROTECTED] writes:
 On Nov 8, 2007 12:14 AM, Tom Lane [EMAIL PROTECTED] wrote:
 I've applied a patch that might help you:
 http://archives.postgresql.org/pgsql-committers/2007-11/msg00104.php

 AFAICS, it doesn't seem to fix the problem. I just compiled
 REL8_1_STABLE branch and I still has the following behaviour:

OK, I tried it in fr_FR locale and what I find is that

regression=# select '123/'  '1230'::text;
 ?column? 
--
 t
(1 row)

so make_greater_string() will still think that its first try at
generating an upper-bound string is good enough.  However

regression=# select '123/1'  '1230'::text;
 ?column? 
--
 f
(1 row)

so the data starting with '123/' is still outside the generated range,
leading to a wrong estimate.  I didn't see this behavior yesterday but
I was experimenting with en_US which I guess has different rules.

What I am tempted to do about this is have make_greater_string tack zz
onto the supplied prefix, so that it would have to find a string that
compares greater than 123/zz before reporting success.  This is
getting pretty klugy though, so cc'ing to pgsql-hackers to see if anyone
has a better idea.

regards, tom lane

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] Estimation problem with a LIKE clause containing a /

2007-11-08 Thread Guillaume Smet
On Nov 8, 2007 4:01 PM, Tom Lane [EMAIL PROTECTED] wrote:
 Hmm, can we see the pg_stats row for the ancestors column?

Sure:
 public | cms_items | ancestors | 0 |32 |
-1 |  |   |
{10011/10010/10009/10018/2554055/,10011/10010/84022/23372040/,10011/2233043/2233042/2233041/,10011/3985097/5020039/,10011/872018/13335056/1051/,1062/22304709/22304714/,1062/2489/2492/27861901/,1062/2527/2530/29658392/,1062/2698/2705/6014040/,1062/52354/52355/255038/255037/,9846852/}
|   -0.151713

I can provide the data if needed, there's nothing confidential in them.

--
Guillaume

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [PERFORM] Estimation problem with a LIKE clause containing a /

2007-11-08 Thread Tom Lane
Gregory Stark [EMAIL PROTECTED] writes:
 Doesn't really strike at the core reason that this is so klugy though. Surely
 the right thing is to push the concept of open versus closed end-points
 through deeper into the estimation logic?

No, the right thing is to take the folk who defined dictionary sort
order out behind the barn and shoot 'em ;-).  This has got nothing to
do with open/closed endpoints and everything to do with the bizarre
sorting rules used by some locales.  In particular the reason I want to
append a letter is that some locales discriminate against non-letter
characters in the first pass of sorting.

I did do some experimentation and found that among the ASCII characters
(ie, codes 32-126), nearly all the non-C locales on my Fedora machine
sort Z last and z next-to-last or vice versa.  Most of the remainder
sort digits last and z or Z as the last non-digit character.  Since Z is
not that close to the end of the sort order in C locale, however, z
seems the best bet.

regards, tom lane

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] Estimation problem with a LIKE clause containing a /

2007-11-08 Thread Gregory Stark
Tom Lane [EMAIL PROTECTED] writes:

 What I am tempted to do about this is have make_greater_string tack zz
 onto the supplied prefix, so that it would have to find a string that
 compares greater than 123/zz before reporting success.  This is
 getting pretty klugy though, so cc'ing to pgsql-hackers to see if anyone
 has a better idea.

Hm, instead of zz is there a convenient way to find out what actual
character sorts last amongst all the single characters in the locale's
encoding?

Doesn't really strike at the core reason that this is so klugy though. Surely
the right thing is to push the concept of open versus closed end-points
through deeper into the estimation logic?

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com
  Ask me about EnterpriseDB's RemoteDBA services!

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] [PERFORM] Estimation problem with a LIKE clause containing a /

2007-11-08 Thread Tom Lane
I wrote:
 I did do some experimentation and found that among the ASCII characters
 (ie, codes 32-126), nearly all the non-C locales on my Fedora machine
 sort Z last and z next-to-last or vice versa.  Most of the remainder
 sort digits last and z or Z as the last non-digit character.  Since Z is
 not that close to the end of the sort order in C locale, however, z
 seems the best bet.

With still further experimentation, it seems that doesn't work very
well, because the locales that sort digits last also seem not to
discriminate against digits in their first pass.  What did seem to work
was:

* Determine which of the strings Z, z, y, 9 is seen as largest
by strcoll().

* Append this string to the given input.

* Search (using the CVS-HEAD make_greater_string logic) for a string
greater than that.

This rule works for all the locales I have installed ... but I don't
have any Far Eastern locales installed.  Also, my test cases are only
covering ASCII characters, and I believe many locales have some non-ASCII
letters that sort after 'Z'.  I'm not sure how hard we need to try to
cover those corner cases, though.  It is ultimately only an estimate...

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] [PERFORM] Estimation problem with a LIKE clause containing a /

2007-11-08 Thread Guillaume Smet
On Nov 9, 2007 3:08 AM, Tom Lane [EMAIL PROTECTED] wrote:
 This rule works for all the locales I have installed ... but I don't
 have any Far Eastern locales installed.  Also, my test cases are only
 covering ASCII characters, and I believe many locales have some non-ASCII
 letters that sort after 'Z'.  I'm not sure how hard we need to try to
 cover those corner cases, though.  It is ultimately only an estimate...

My opinion is that it's acceptable to fix the problem for most cases
in most locales because, as you said, it's only an estimate. We didn't
have any report of this problem for years so it seems that it's not a
common case or at least it's not common that the bad estimate leads to
noticeably bad plans.

As far as I understand what you plan to do, it doesn't seem to be
something that prevents us to fix the problem afterwards if someone
comes with an example which doesn't fit in the schema you're proposing
and has a real performance problem with it.

--
Guillaume

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] Estimation problem with a LIKE clause containing a /

2007-11-07 Thread Guillaume Smet
On 11/7/07, Tom Lane [EMAIL PROTECTED] wrote:
 Hmmm ... what locale are you working in?  I notice that the range
 estimator for this pattern would be ancestors = '1062/' AND
 ancestors  '10620', which will do the right thing in C locale
 but maybe not so much elsewhere.

Sorry for not having mentioned it before. Locale is UTF-8.

  Version is PostgreSQL 8.1.8 on i686-redhat-linux-gnu,

 You'd probably get better results with 8.2, which has a noticeably
 smarter LIKE-estimator, at least for histogram sizes of 100 or more.

It's not really possible to upgrade this application to 8.2 for now.
It's a very old app based on the thing formerly called as Red Hat WAF
and now known as APLAWS and validating WAF and this application with
8.2 will take quite some time. Moreover the db is big and we can't
afford the downtime of a migration.

I suppose my best bet is to remove the pg_statistic line and to set
the statistics to 0 for this column so that the stats are never
generated again for this column?

Thanks,

--
Guillaume

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] Estimation problem with a LIKE clause containing a /

2007-11-07 Thread Tom Lane
Guillaume Smet [EMAIL PROTECTED] writes:
 [ bad estimate for LIKE ]

Hmmm ... what locale are you working in?  I notice that the range
estimator for this pattern would be ancestors = '1062/' AND
ancestors  '10620', which will do the right thing in C locale
but maybe not so much elsewhere.

 Version is PostgreSQL 8.1.8 on i686-redhat-linux-gnu,

You'd probably get better results with 8.2, which has a noticeably
smarter LIKE-estimator, at least for histogram sizes of 100 or more.

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


[PERFORM] Estimation problem with a LIKE clause containing a /

2007-11-07 Thread Guillaume Smet
Hi all,

While studying a query taking forever after an ANALYZE on a never
analyzed database (a bad estimate causes a nested loop on a lot of
tuples), I found the following problem:
- without any stats (I removed the line from pg_statistic):
ccm_prod_20071106=# explain analyze select * from cms_items where
ancestors LIKE '1062/%';
  QUERY PLAN
--
 Seq Scan on cms_items  (cost=0.00..689.26 rows=114 width=587) (actual
time=0.008..21.692 rows=11326 loops=1)
   Filter: ((ancestors)::text ~~ '1062/%'::text)
 Total runtime: 31.097 ms
- the estimate is bad (it's expected) but it's sufficient to prevent
the nested loop so it's my current workaround

- after analyzing the cms_items table (statistics is set to 10 but
it's exactly the same for 100):
ccm_prod_20071106=# explain analyze select * from cms_items where
ancestors LIKE '1062/%';
 QUERY PLAN

 Seq Scan on cms_items  (cost=0.00..689.26 rows=*1* width=103) (actual
time=0.010..22.024 rows=11326 loops=1)
   Filter: ((ancestors)::text ~~ '1062/%'::text)
 Total runtime: 31.341 ms
- this estimate leads PostgreSQL to choose a nested loop which is
executed more than 11k times and causes the query to take forever.

- if I remove the / from the LIKE clause (which I can't as ancestors
is more or less a path):
ccm_prod_20071106=# explain analyze select * from cms_items where
ancestors LIKE '1062%';
  QUERY PLAN
---
 Seq Scan on cms_items  (cost=0.00..689.26 rows=*9097* width=103)
(actual time=0.043..25.251 rows=11326 loops=1)
   Filter: ((ancestors)::text ~~ '1062%'::text)
 Total runtime: 34.778 ms

Which is a really good estimate.

Is it something expected?

The histogram does contain values beginning with '1062/' (5 out of 10)
and the cms_items table has ~ 22k rows.

Version is PostgreSQL 8.1.8 on i686-redhat-linux-gnu, compiled by GCC
gcc (GCC) 3.4.6 20060404 (Red Hat 3.4.6-3). I checked the release
notes between 8.1.8 and 8.1.10 and I didn't find anything relevant to
fix this problem.

Thanks for any help.

Regards,

--
Guillaume

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] Estimation problem with a LIKE clause containing a /

2007-11-07 Thread Tom Lane
Guillaume Smet [EMAIL PROTECTED] writes:
 On 11/7/07, Tom Lane [EMAIL PROTECTED] wrote:
 Hmmm ... what locale are you working in?  I notice that the range
 estimator for this pattern would be ancestors = '1062/' AND
 ancestors  '10620', which will do the right thing in C locale
 but maybe not so much elsewhere.

 Sorry for not having mentioned it before. Locale is UTF-8.

I wanted the locale (lc_collate), not the encoding.

 I suppose my best bet is to remove the pg_statistic line and to set
 the statistics to 0 for this column so that the stats are never
 generated again for this column?

That would optimize this particular query and probably pessimize
a lot of others.  I have another LIKE-estimation bug to go look at
today too; let me see if this one is fixable or not.

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] Estimation problem with a LIKE clause containing a /

2007-11-07 Thread Alexander Staubo
On 11/7/07, Guillaume Smet [EMAIL PROTECTED] wrote:
 While studying a query taking forever after an ANALYZE on a never
 analyzed database (a bad estimate causes a nested loop on a lot of
 tuples), I found the following problem:
[snip]
  Total runtime: 31.097 ms
[snip]
  Total runtime: 31.341 ms
[snip]
  Total runtime: 34.778 ms

 Which is a really good estimate.

That's a difference of less than *three milliseconds* -- a difference
probably way within the expected overhead of running explain
analyze. Furthermore, all three queries use the same basic plan: a
sequential scan with a filter. At any rate you're microbenchmarking in
a way that is not useful to real-world queries. In what way are these
timings a problem?

Have you tried using an index which supports prefix searches? The
text_pattern_ops operator class lets yo do this with a plain B-tree
index:

  create index cms_items_ancestors_index on cms_items (ancestors
text_pattern_ops);
  analyze cms_items;

Now all like 'prefix%' queries should use the index.

Alexander.

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PERFORM] Estimation problem with a LIKE clause containing a /

2007-11-07 Thread Guillaume Smet
Alexander,

On 11/7/07, Alexander Staubo [EMAIL PROTECTED] wrote:
 That's a difference of less than *three milliseconds* -- a difference
 probably way within the expected overhead of running explain
 analyze. Furthermore, all three queries use the same basic plan: a
 sequential scan with a filter. At any rate you're microbenchmarking in
 a way that is not useful to real-world queries. In what way are these
 timings a problem?

If you read my previous email carefully, you'll see they aren't a
problem: the problem is the estimation, not the timing. This is a self
contained test case of a far more complex query which uses a bad plan
containing a nested loop due to the bad estimate.

 Now all like 'prefix%' queries should use the index.

Not when you retrieve 50% of this table of 22k rows but that's not my
problem anyway. A seqscan is perfectly fine in this case.

Thanks anyway.

--
Guillaume

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] Estimation problem with a LIKE clause containing a /

2007-11-07 Thread Tom Lane
I wrote:
 Guillaume Smet [EMAIL PROTECTED] writes:
 [ bad estimate for LIKE ]

 Hmmm ... what locale are you working in?  I notice that the range
 estimator for this pattern would be ancestors = '1062/' AND
 ancestors  '10620', which will do the right thing in C locale
 but maybe not so much elsewhere.

I've applied a patch that might help you:
http://archives.postgresql.org/pgsql-committers/2007-11/msg00104.php

regards, tom lane

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] Estimation problem with a LIKE clause containing a /

2007-11-07 Thread Guillaume Smet
On 11/7/07, Tom Lane [EMAIL PROTECTED] wrote:
 I wanted the locale (lc_collate), not the encoding.

fr_FR.UTF-8

 That would optimize this particular query and probably pessimize
 a lot of others.

Sure but there aren't a lot of queries based on the ancestors field
and if they are a bit slower, it's not a problem. However having a
query taking forever is not acceptable as the content management app
is unaccessible.
So it can be an acceptable solution in this case, even if not perfect.

--
Guillaume

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] Estimation problem with a LIKE clause containing a /

2007-11-07 Thread Guillaume Smet
On 11/8/07, Tom Lane [EMAIL PROTECTED] wrote:
 I've applied a patch that might help you:
 http://archives.postgresql.org/pgsql-committers/2007-11/msg00104.php

Thanks. I'll build a RPM package tomorrow with this patch and let you
know if it fixes the problem.

--
Guillaume

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate