Re: [PERFORM] Estimation problem with a LIKE clause containing a /

2007-11-07 Thread Guillaume Smet
On 11/7/07, Tom Lane [EMAIL PROTECTED] wrote:
 Hmmm ... what locale are you working in?  I notice that the range
 estimator for this pattern would be ancestors = '1062/' AND
 ancestors  '10620', which will do the right thing in C locale
 but maybe not so much elsewhere.

Sorry for not having mentioned it before. Locale is UTF-8.

  Version is PostgreSQL 8.1.8 on i686-redhat-linux-gnu,

 You'd probably get better results with 8.2, which has a noticeably
 smarter LIKE-estimator, at least for histogram sizes of 100 or more.

It's not really possible to upgrade this application to 8.2 for now.
It's a very old app based on the thing formerly called as Red Hat WAF
and now known as APLAWS and validating WAF and this application with
8.2 will take quite some time. Moreover the db is big and we can't
afford the downtime of a migration.

I suppose my best bet is to remove the pg_statistic line and to set
the statistics to 0 for this column so that the stats are never
generated again for this column?

Thanks,

--
Guillaume

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] Estimation problem with a LIKE clause containing a /

2007-11-07 Thread Tom Lane
Guillaume Smet [EMAIL PROTECTED] writes:
 [ bad estimate for LIKE ]

Hmmm ... what locale are you working in?  I notice that the range
estimator for this pattern would be ancestors = '1062/' AND
ancestors  '10620', which will do the right thing in C locale
but maybe not so much elsewhere.

 Version is PostgreSQL 8.1.8 on i686-redhat-linux-gnu,

You'd probably get better results with 8.2, which has a noticeably
smarter LIKE-estimator, at least for histogram sizes of 100 or more.

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


[PERFORM] Estimation problem with a LIKE clause containing a /

2007-11-07 Thread Guillaume Smet
Hi all,

While studying a query taking forever after an ANALYZE on a never
analyzed database (a bad estimate causes a nested loop on a lot of
tuples), I found the following problem:
- without any stats (I removed the line from pg_statistic):
ccm_prod_20071106=# explain analyze select * from cms_items where
ancestors LIKE '1062/%';
  QUERY PLAN
--
 Seq Scan on cms_items  (cost=0.00..689.26 rows=114 width=587) (actual
time=0.008..21.692 rows=11326 loops=1)
   Filter: ((ancestors)::text ~~ '1062/%'::text)
 Total runtime: 31.097 ms
- the estimate is bad (it's expected) but it's sufficient to prevent
the nested loop so it's my current workaround

- after analyzing the cms_items table (statistics is set to 10 but
it's exactly the same for 100):
ccm_prod_20071106=# explain analyze select * from cms_items where
ancestors LIKE '1062/%';
 QUERY PLAN

 Seq Scan on cms_items  (cost=0.00..689.26 rows=*1* width=103) (actual
time=0.010..22.024 rows=11326 loops=1)
   Filter: ((ancestors)::text ~~ '1062/%'::text)
 Total runtime: 31.341 ms
- this estimate leads PostgreSQL to choose a nested loop which is
executed more than 11k times and causes the query to take forever.

- if I remove the / from the LIKE clause (which I can't as ancestors
is more or less a path):
ccm_prod_20071106=# explain analyze select * from cms_items where
ancestors LIKE '1062%';
  QUERY PLAN
---
 Seq Scan on cms_items  (cost=0.00..689.26 rows=*9097* width=103)
(actual time=0.043..25.251 rows=11326 loops=1)
   Filter: ((ancestors)::text ~~ '1062%'::text)
 Total runtime: 34.778 ms

Which is a really good estimate.

Is it something expected?

The histogram does contain values beginning with '1062/' (5 out of 10)
and the cms_items table has ~ 22k rows.

Version is PostgreSQL 8.1.8 on i686-redhat-linux-gnu, compiled by GCC
gcc (GCC) 3.4.6 20060404 (Red Hat 3.4.6-3). I checked the release
notes between 8.1.8 and 8.1.10 and I didn't find anything relevant to
fix this problem.

Thanks for any help.

Regards,

--
Guillaume

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] Estimation problem with a LIKE clause containing a /

2007-11-07 Thread Tom Lane
Guillaume Smet [EMAIL PROTECTED] writes:
 On 11/7/07, Tom Lane [EMAIL PROTECTED] wrote:
 Hmmm ... what locale are you working in?  I notice that the range
 estimator for this pattern would be ancestors = '1062/' AND
 ancestors  '10620', which will do the right thing in C locale
 but maybe not so much elsewhere.

 Sorry for not having mentioned it before. Locale is UTF-8.

I wanted the locale (lc_collate), not the encoding.

 I suppose my best bet is to remove the pg_statistic line and to set
 the statistics to 0 for this column so that the stats are never
 generated again for this column?

That would optimize this particular query and probably pessimize
a lot of others.  I have another LIKE-estimation bug to go look at
today too; let me see if this one is fixable or not.

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] Estimation problem with a LIKE clause containing a /

2007-11-07 Thread Alexander Staubo
On 11/7/07, Guillaume Smet [EMAIL PROTECTED] wrote:
 While studying a query taking forever after an ANALYZE on a never
 analyzed database (a bad estimate causes a nested loop on a lot of
 tuples), I found the following problem:
[snip]
  Total runtime: 31.097 ms
[snip]
  Total runtime: 31.341 ms
[snip]
  Total runtime: 34.778 ms

 Which is a really good estimate.

That's a difference of less than *three milliseconds* -- a difference
probably way within the expected overhead of running explain
analyze. Furthermore, all three queries use the same basic plan: a
sequential scan with a filter. At any rate you're microbenchmarking in
a way that is not useful to real-world queries. In what way are these
timings a problem?

Have you tried using an index which supports prefix searches? The
text_pattern_ops operator class lets yo do this with a plain B-tree
index:

  create index cms_items_ancestors_index on cms_items (ancestors
text_pattern_ops);
  analyze cms_items;

Now all like 'prefix%' queries should use the index.

Alexander.

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PERFORM] Estimation problem with a LIKE clause containing a /

2007-11-07 Thread Guillaume Smet
Alexander,

On 11/7/07, Alexander Staubo [EMAIL PROTECTED] wrote:
 That's a difference of less than *three milliseconds* -- a difference
 probably way within the expected overhead of running explain
 analyze. Furthermore, all three queries use the same basic plan: a
 sequential scan with a filter. At any rate you're microbenchmarking in
 a way that is not useful to real-world queries. In what way are these
 timings a problem?

If you read my previous email carefully, you'll see they aren't a
problem: the problem is the estimation, not the timing. This is a self
contained test case of a far more complex query which uses a bad plan
containing a nested loop due to the bad estimate.

 Now all like 'prefix%' queries should use the index.

Not when you retrieve 50% of this table of 22k rows but that's not my
problem anyway. A seqscan is perfectly fine in this case.

Thanks anyway.

--
Guillaume

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] index stat

2007-11-07 Thread Kevin Grittner
 On Mon, Nov 5, 2007 at 10:42 AM, in message
[EMAIL PROTECTED], Campbell,
Lance [EMAIL PROTECTED] wrote: 

 How can I [. . .] get rid of some unnecessary indexes
 
Here's what I periodically run to look for unused indexes:
 
select relname, indexrelname
  from pg_stat_user_indexes
  where indexrelname not like '%_pkey'
and idx_scan = 0
  order by relname, indexrelname
;
 
We omit the primary keys from the list (based on our naming
convention) because they are needed to ensure integrity.
 
-Kevin
 



---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] Estimation problem with a LIKE clause containing a /

2007-11-07 Thread Tom Lane
I wrote:
 Guillaume Smet [EMAIL PROTECTED] writes:
 [ bad estimate for LIKE ]

 Hmmm ... what locale are you working in?  I notice that the range
 estimator for this pattern would be ancestors = '1062/' AND
 ancestors  '10620', which will do the right thing in C locale
 but maybe not so much elsewhere.

I've applied a patch that might help you:
http://archives.postgresql.org/pgsql-committers/2007-11/msg00104.php

regards, tom lane

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] Estimation problem with a LIKE clause containing a /

2007-11-07 Thread Guillaume Smet
On 11/7/07, Tom Lane [EMAIL PROTECTED] wrote:
 I wanted the locale (lc_collate), not the encoding.

fr_FR.UTF-8

 That would optimize this particular query and probably pessimize
 a lot of others.

Sure but there aren't a lot of queries based on the ancestors field
and if they are a bit slower, it's not a problem. However having a
query taking forever is not acceptable as the content management app
is unaccessible.
So it can be an acceptable solution in this case, even if not perfect.

--
Guillaume

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] Estimation problem with a LIKE clause containing a /

2007-11-07 Thread Guillaume Smet
On 11/8/07, Tom Lane [EMAIL PROTECTED] wrote:
 I've applied a patch that might help you:
 http://archives.postgresql.org/pgsql-committers/2007-11/msg00104.php

Thanks. I'll build a RPM package tomorrow with this patch and let you
know if it fixes the problem.

--
Guillaume

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [PERFORM] partitioned table and ORDER BY indexed_field DESC LIMIT 1

2007-11-07 Thread Mark Kirkwood

Gregory Stark wrote:

Mark Kirkwood [EMAIL PROTECTED] writes:

  

I spent today looking at getting this patch into a self contained state.
Working against HEAD I'm getting bogged down in the PathKeyItem to
PathKey/EquivalenceClass/EquivalenceMember(s) change. So I figured I'd divide
and conquer to some extent, and initially provide a patch:

- against 8.2.(5)
- self contained  (i.e no mystery functions)



That would be helpful for me. It would include the bits I'm looking for.

  

The next step would be to update to to HEAD. That would hopefully provide some
useful material for others working on this.



If that's not too much work then that would be great but if it's a lot of work
then it may not be worth it if I'm planning to only take certain bits. On the
other hand if it's good then we might just want to take it wholesale and then
add to it.

  


Here is a (somewhat hurried) self-contained version of the patch under 
discussion. It applies to 8.2.5 and the resultant code compiles and 
runs. I've left in some unneeded parallel stuff (PathLocus struct), 
which I can weed out in a subsequent version if desired. I also removed 
the 'cdb ' from  most of the function names and (I  hope) any Greenplum 
copyrights.


I discovered that the patch solves a slightly different problem... it 
pulls up index scans as a viable path choice, (but not for the DESC 
case) but does not push down the LIMIT to the child tables ... so the 
actual performance improvement is zero - however hopefully the patch 
provides useful raw material to help.


e.g - using the examine schema from the OP email - but removing the DESC 
from the query:


part=# set enable_seqscan=off;
SET
part=# explain SELECT * FROM n_traf ORDER BY date_time LIMIT 1;
  QUERY 
PLAN   
-

Limit  (cost=198367.14..198367.15 rows=1 width=20)
  -  Sort  (cost=198367.14..200870.92 rows=1001510 width=20)
Sort Key: public.n_traf.date_time
-  Result  (cost=0.00..57464.92 rows=1001510 width=20)
  -  Append  (cost=0.00..57464.92 rows=1001510 width=20)
-  Index Scan using n_traf_date_time_login_id on 
n_traf  (cost=0.00..66.90 rows=1510 width=20)
-  Index Scan using 
n_traf_y2007m01_date_time_login_id on n_traf_y2007m01 n_traf  
(cost=0.00..4748.38 rows=83043 width=20)
-  Index Scan using 
n_traf_y2007m02_date_time_login_id on n_traf_y2007m02 n_traf  
(cost=0.00..4772.60 rows=83274 width=20)
-  Index Scan using 
n_traf_y2007m03_date_time_login_id on n_traf_y2007m03 n_traf  
(cost=0.00..4782.12 rows=83330 width=20)
-  Index Scan using 
n_traf_y2007m04_date_time_login_id on n_traf_y2007m04 n_traf  
(cost=0.00..4818.29 rows=83609 width=20)
-  Index Scan using 
n_traf_y2007m05_date_time_login_id on n_traf_y2007m05 n_traf  
(cost=0.00..4721.85 rows=82830 width=20)
-  Index Scan using 
n_traf_y2007m06_date_time_login_id on n_traf_y2007m06 n_traf  
(cost=0.00..4766.56 rows=83357 width=20)
-  Index Scan using 
n_traf_y2007m07_date_time_login_id on n_traf_y2007m07 n_traf  
(cost=0.00..4800.44 rows=83548 width=20)
-  Index Scan using 
n_traf_y2007m08_date_time_login_id on n_traf_y2007m08 n_traf  
(cost=0.00..4787.55 rows=83248 width=20)
-  Index Scan using 
n_traf_y2007m09_date_time_login_id on n_traf_y2007m09 n_traf  
(cost=0.00..4830.67 rows=83389 width=20)
-  Index Scan using 
n_traf_y2007m10_date_time_login_id on n_traf_y2007m10 n_traf  
(cost=0.00..4795.78 rows=82993 width=20)
-  Index Scan using 
n_traf_y2007m11_date_time_login_id on n_traf_y2007m11 n_traf  
(cost=0.00..4754.26 rows=83351 width=20)
-  Index Scan using 
n_traf_y2007m12_date_time_login_id on n_traf_y2007m12 n_traf  
(cost=0.00..4819.51 rows=84028 width=20)

(18 rows)








index-order-by-wip.patch.gz
Description: GNU Zip compressed data

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [PERFORM] partitioned table and ORDER BY indexed_field DESC LIMIT 1

2007-11-07 Thread Mark Kirkwood

Luke Lonergan wrote:

On 11/7/07 10:21 PM, Gregory Stark [EMAIL PROTECTED] wrote:

  

part=# explain SELECT * FROM n_traf ORDER BY date_time LIMIT 1;
  QUERY PLAN
-

Limit  (cost=198367.14..198367.15 rows=1 width=20)
  -  Sort  (cost=198367.14..200870.92 rows=1001510 width=20)
Sort Key: public.n_traf.date_time
-  Result  (cost=0.00..57464.92 rows=1001510 width=20)
  -  Append  (cost=0.00..57464.92 rows=1001510 width=20)
-  Index Scan using n_traf_date_time_login_id on n_traf
(cost=0.00..66.90 rows=1510 width=20)
  

That looks suspicious. There's likely no good reason to be using the index
scan unless it avoids the sort node above the Append node. That's what I hope
to do by having the Append executor code do what's necessary to maintain the
order.



Yah - the way it works in GPDB is that you get a non-sorting plan with an
index scan below the parent - that was the point of the fix. Hmm.

  


Unfortunately our plan in GPDB looks exactly the same in this case - so 
we have a bit of work to do as well! Initially I wondered if I have got 
something wrong in the patch... and checked on GPDB - only to see the 
same behaviour! (see prev comment about LIMIT).


Cheers

Mark

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] partitioned table and ORDER BY indexed_field DESC LIMIT 1

2007-11-07 Thread Luke Lonergan
On 11/7/07 10:21 PM, Gregory Stark [EMAIL PROTECTED] wrote:

 part=# explain SELECT * FROM n_traf ORDER BY date_time LIMIT 1;
   QUERY PLAN
 -
 
 Limit  (cost=198367.14..198367.15 rows=1 width=20)
   -  Sort  (cost=198367.14..200870.92 rows=1001510 width=20)
 Sort Key: public.n_traf.date_time
 -  Result  (cost=0.00..57464.92 rows=1001510 width=20)
   -  Append  (cost=0.00..57464.92 rows=1001510 width=20)
 -  Index Scan using n_traf_date_time_login_id on n_traf
 (cost=0.00..66.90 rows=1510 width=20)
 
 That looks suspicious. There's likely no good reason to be using the index
 scan unless it avoids the sort node above the Append node. That's what I hope
 to do by having the Append executor code do what's necessary to maintain the
 order.

Yah - the way it works in GPDB is that you get a non-sorting plan with an
index scan below the parent - that was the point of the fix. Hmm.

- Luke



---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PERFORM] partitioned table and ORDER BY indexed_field DESC LIMIT 1

2007-11-07 Thread Gregory Stark
Mark Kirkwood [EMAIL PROTECTED] writes:

 Here is a (somewhat hurried) self-contained version of the patch under
 discussion. It applies to 8.2.5 and the resultant code compiles and runs. I've
 left in some unneeded parallel stuff (PathLocus struct), which I can weed out
 in a subsequent version if desired. I also removed the 'cdb ' from  most of 
 the
 function names and (I  hope) any Greenplum copyrights.

Thanks, I'll take a look at it.

 I discovered that the patch solves a slightly different problem... it pulls up
 index scans as a viable path choice, (but not for the DESC case) but does not
 push down the LIMIT to the child tables ... so the actual performance
 improvement is zero - however hopefully the patch provides useful raw material
 to help.


 SET
 part=# explain SELECT * FROM n_traf ORDER BY date_time LIMIT 1;
   QUERY PLAN
 -
 Limit  (cost=198367.14..198367.15 rows=1 width=20)
   -  Sort  (cost=198367.14..200870.92 rows=1001510 width=20)
 Sort Key: public.n_traf.date_time
 -  Result  (cost=0.00..57464.92 rows=1001510 width=20)
   -  Append  (cost=0.00..57464.92 rows=1001510 width=20)
 -  Index Scan using n_traf_date_time_login_id on n_traf
 (cost=0.00..66.90 rows=1510 width=20)

That looks suspicious. There's likely no good reason to be using the index
scan unless it avoids the sort node above the Append node. That's what I hope
to do by having the Append executor code do what's necessary to maintain the
order.

From skimming your patch previously I thought the main point was when there
was only one subnode. In that case it was able to pull the subnode entirely
out of the append node and pull up the paths of the subnode. In Postgres that
would never happen because constraint exclusion will never be able to prune
down to a single partition because of the parent table problem but I expect
we'll change that.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com
  Ask me about EnterpriseDB's On-Demand Production Tuning

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match