[PERFORM] bad plan and LIMIT

2009-05-01 Thread James Nelson


Hi, I'm hoping you guys can help with improving this query I'm having  
a problem with. The main problem is that the query plan changes  
depending on the value of the LIMIT clause, with small values using a  
poor plan and running very slowly. The two times are roughly 5 minutes  
for the bad plan and 1.5 secs for the good plan.


I have read a little about how the query planner takes into account  
the limit clause, and I can see the effect this has on the costs shown  
by explain. The problem is that the estimated cost ends up being  
wildly inaccurate. I'm not sure if this is a problem with the planner
or something I am doing wrong on my end.


the query (without the limit clause):

SELECT ID FROM ps_image WHERE id IN (SELECT image_id FROM  
ps_gallery_image WHERE gallery_id='G7ejKGoWS_cY') ORDER BY  
LOWER(FILE_NAME) ASC


The ps_image table has about 24 million rows, ps_gallery_image has  
about 14 million. The query above produces roughly 50 thousand rows.


When looking at the explain with the limit, I can see the  
interpolation that the planner does for the limit node (arriving at a  
final cost of 458.32 for this example), but I'm not sure why it is so
inaccurate compared to the actual times.
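
(If I understand the interpolation correctly it is linear: the plan
estimates 38620 rows at a total cost of 17700128.78 for the join, so
the cost to fetch the first row is taken to be 17700128.78 / 38620,
which is approximately 458.3 -- the 458.32 shown on the limit node.)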


Thanks in advance for taking a look at this, let me know if there is  
additional information I should provide.


Some information about the tables and the explains follows below.

James Nelson

[ja...@db2 ~] psql --version
psql (PostgreSQL) 8.3.5
contains support for command-line editing

photoshelter=# \d ps_image
           Table "public.ps_image"
  Column   |          Type          | Modifiers
-----------+------------------------+-----------
 id        | character varying(16)  | not null
 user_id   | character varying(16)  |
 album_id  | character varying(16)  | not null
 parent_id | character varying(16)  |
 file_name | character varying(200) |
 file_size | bigint                 |
 <20 rows snipped>
Indexes:
    "ps_image_pkey" PRIMARY KEY, btree (id)
    "i_file_name_l" btree (lower(file_name::text))
<indexes, fk constraints and triggers snipped>

photoshelter=# \d ps_gallery_image
              Table "public.ps_gallery_image"
    Column     |           Type           |       Modifiers
---------------+--------------------------+------------------------
 gallery_id    | character varying(16)    | not null
 image_id      | character varying(16)    | not null
 display_order | integer                  | not null default 0
 caption       | character varying(2000)  |
 ctime         | timestamp with time zone | not null default now()
 mtime         | timestamp with time zone | not null default now()
 id            | character varying(16)    | not null
Indexes:
    "ps_gallery_image_pkey" PRIMARY KEY, btree (id)
    "gi_gallery_id" btree (gallery_id)
    "gi_image_id" btree (image_id)
Foreign-key constraints:
    "ps_gallery_image_gallery_id_fkey" FOREIGN KEY (gallery_id) REFERENCES ps_gallery(id) ON DELETE CASCADE
    "ps_gallery_image_image_id_fkey" FOREIGN KEY (image_id) REFERENCES ps_image(id) ON DELETE CASCADE
Triggers:
    ps_image_gi_sync AFTER INSERT OR DELETE OR UPDATE ON ps_gallery_image FOR EACH ROW EXECUTE PROCEDURE ps_image_sync()


======================================================================


explain analyze for bad plan

photoshelter=# explain analyze SELECT ID FROM ps_image WHERE id IN (SELECT image_id FROM ps_gallery_image WHERE gallery_id='G7ejKGoWS_cY') ORDER BY LOWER(FILE_NAME) ASC limit 1;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=0.00..458.32 rows=1 width=36) (actual time=709831.847..709831.847 rows=1 loops=1)
   ->  Nested Loop IN Join  (cost=0.00..17700128.78 rows=38620 width=36) (actual time=709831.845..709831.845 rows=1 loops=1)
         ->  Index Scan using i_file_name_l on ps_image  (cost=0.00..1023863.22 rows=24460418 width=36) (actual time=0.063..271167.293 rows=8876340 loops=1)
         ->  Index Scan using gi_image_id on ps_gallery_image  (cost=0.00..0.85 rows=1 width=17) (actual time=0.048..0.048 rows=0 loops=8876340)
               Index Cond: ((ps_gallery_image.image_id)::text = (ps_image.id)::text)
               Filter: ((ps_gallery_image.gallery_id)::text = 'G7ejKGoWS_cY'::text)
 Total runtime: 709831.932 ms

======================================================================


explain analyze for good plan

photoshelter=# explain  analyze SELECT ID FROM ps_image WHERE id IN  
(SELECT image_id FROM ps_gallery_image WHERE  
gallery_id='G7ejKGoWS_cY') ORDER BY LOWER(FILE_NAME) ASC limit 600;

Re: [PERFORM] bad plan and LIMIT

2009-05-01 Thread Adam Ruth
You could try changing the IN to an EXISTS; that may alter how the
optimizer weighs the limit.


SELECT ID FROM ps_image WHERE EXISTS (SELECT null FROM  
ps_gallery_image WHERE gallery_id ='G7ejKGoWS_cY' and image_id =  
ps_image.id) ORDER BY LOWER(FILE_NAME) ASC


On 30/04/2009, at 3:51 AM, James Nelson wrote:



<original message quoted in full -- snipped>

Re: [PERFORM] bad plan and LIMIT

2009-05-01 Thread Grzegorz Jaśkiewicz
EXISTS won't help much either; PostgreSQL is not too fast when it comes
to that sort of approach. A join is always going to be fast; it is about
time you learn joins and use them ;)
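
Untested, but the join form of the query would look something along
these lines (add a DISTINCT if an image can appear more than once in
the same gallery):

SELECT i.id
  FROM ps_image i
  JOIN ps_gallery_image gi ON gi.image_id = i.id
 WHERE gi.gallery_id = 'G7ejKGoWS_cY'
 ORDER BY LOWER(i.file_name) ASC
 LIMIT 1;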



Re: [PERFORM] bad plan and LIMIT

2009-05-01 Thread Grzegorz Jaśkiewicz
use join instead of where in();



[PERFORM] Many left outer joins with limit performance

2009-05-01 Thread Gerhard Wiesinger

Hello,

I want to use PostgreSQL for data entries (one every minute) from a
central heating system, where the timestamp is logged in a table called
log. For flexibility with future values and for implementing several
high-level types, I've modelled the values in a separate key/value table
called log_details.
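
Roughly, the relevant tables look like this (a simplified sketch
reconstructed from the query below; the exact column types are
assumptions on my part):

CREATE TABLE log (
    id       serial PRIMARY KEY,
    datetime timestamp with time zone NOT NULL,
    tdate    date NOT NULL,
    ttime    time NOT NULL
);

CREATE TABLE key_description (
    keyid       serial PRIMARY KEY,
    description text NOT NULL
);

CREATE TABLE log_details (
    fk_id    integer NOT NULL REFERENCES log(id),
    fk_keyid integer NOT NULL REFERENCES key_description(keyid),
    value    double precision
);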


A query for the last valid entry for today looks like the following (it
is also defined as a view, and is sometimes used without the limit):

SELECT
  l.id AS id,
  l.datetime AS datetime,
  l.tdate AS tdate,
  l.ttime AS ttime,
  d1.value  AS Raumsolltemperatur,
  d2.value  AS Raumtemperatur,
-- a lot more here, stripped for readability, see link
FROM
  log l
-- Order is relevant here
LEFT OUTER JOIN key_description k1  ON k1.description = 'Raumsolltemperatur'
LEFT OUTER JOIN log_details d1  ON l.id = d1.fk_id AND d1.fk_keyid = 
k1.keyid

-- Order is relevant here
LEFT OUTER JOIN key_description k2  ON k2.description = 'Raumtemperatur'
LEFT OUTER JOIN log_details d2  ON l.id = d2.fk_id AND d2.fk_keyid = 
k2.keyid

-- a lot more here, stripped for readability, see link
WHERE
  -- 86400 entries in that timeframe
  datetime >= '1970-01-01 00:00:00+02'
  AND datetime < '1970-01-02 00:00:00+02'
ORDER BY
  datetime DESC
LIMIT 1;

For me a perfect query plan would look like:
1.) Fetch the one and only id from table log (or fetch all necessary id
entries when no limit is specified)

2.) Make the left outer joins

Details (machine details, table definitions, query plans, etc.) can be
found, due to size limitations, at:

http://www.wiesinger.com/tmp/pg_perf.txt

Any ideas how to improve the performance of the left outer joins, and
how to get the planner to produce better results?


For this special case a better solution exists, but I think the planner
has to do the work.

-- ...
WHERE
-- Also slow: id IN
-- OK: id =
id = (
  SELECT
id
  FROM
log
  WHERE
    datetime >= '1970-01-01 00:00:00+02'
    AND datetime < '1970-01-02 00:00:00+02'
  ORDER BY
datetime DESC
  LIMIT 1
)
ORDER BY
datetime DESC LIMIT 1;

Any ideas?

Thnx.

Ciao,
Gerhard

--
http://www.wiesinger.com/



[PERFORM] Transparent table partitioning in future version of PG?

2009-05-01 Thread henk de wit

Hi,
I was looking at the support that PostgreSQL offers for table partitioning
at http://www.postgresql.org/docs/8.4/static/ddl-partitioning.html. The
concept looks promising, but it's maybe fair to say that PG itself doesn't
really support partitioning natively; one can simulate it using some of the
existing PG features (namely inheritance, triggers, rules and constraint
exclusion). This simulation does seem to work, but there are some
disadvantages and caveats.

A major disadvantage is obviously that you need to set up and maintain the
whole structure yourself (which is somewhat dangerous, or at least involves
a lot of maintenance overhead). Next to that, it seemingly becomes hard to
do simple queries like 'select * from foo where bar > 1000 and bar < 5000',
in case the answer to this query spans multiple partitions. Constraint
exclusion works to some degree, but the document I referred to above tells
me I can no longer use prepared statements then.
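
For concreteness, the manual setup the documentation describes looks
roughly like the following sketch (all names invented; the CHECK
constraints are what constraint exclusion works from):

CREATE TABLE foo (bar integer NOT NULL, payload text);

-- one child table per range
CREATE TABLE foo_p0 (CHECK (bar >= 0   AND bar < 100)) INHERITS (foo);
CREATE TABLE foo_p1 (CHECK (bar >= 100 AND bar < 200)) INHERITS (foo);

-- routing trigger so rows inserted into foo land in the right child
CREATE OR REPLACE FUNCTION foo_insert_trigger() RETURNS trigger AS $$
BEGIN
    IF NEW.bar >= 0 AND NEW.bar < 100 THEN
        INSERT INTO foo_p0 VALUES (NEW.*);
    ELSIF NEW.bar >= 100 AND NEW.bar < 200 THEN
        INSERT INTO foo_p1 VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'bar out of range';
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER foo_insert BEFORE INSERT ON foo
    FOR EACH ROW EXECUTE PROCEDURE foo_insert_trigger();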
I wonder if there are any plans to incorporate 'native' or 'transparent'
partitioning in some future version of PG? By this I mean that I would
basically be able to say something like (pseudo): alter table foo
partition on bar range 100, and PG would then simply start doing
internally what we now have to do manually.

Is something like this on the radar, or is it just wishful thinking on
my part?
Kind regards



Re: [PERFORM] bad plan and LIMIT

2009-05-01 Thread Tom Lane
James Nelson <ja...@photoshelter.com> writes:
 Hi, I'm hoping you guys can help with improving this query I'm having  
 a problem with. The main problem is that the query plan changes  
 depending on the value of the LIMIT clause, with small values using a  
 poor plan and running very slowly. The two times are roughly 5 minutes  
 for the bad plan and 1.5 secs for the good plan.

 photoshelter=# explain  analyze SELECT ID FROM ps_image WHERE id IN  
 (SELECT image_id FROM ps_gallery_image WHERE  
 gallery_id='G7ejKGoWS_cY') ORDER BY LOWER(FILE_NAME) ASC limit 1;

The problem here is an overoptimistic assessment of how long it will
take to find a match to gallery_id='G7ejKGoWS_cY' while searching
in file_name order.  You might be able to fix that by increasing the
statistics target for gallery_id.  However, if the issue is not so
much how many occurrences of 'G7ejKGoWS_cY' there are as that
they're all associated with high values of file_name, that won't
help.  In that case I think it would work to restructure the query
along the lines of

select * from (
  SELECT ID FROM ps_image WHERE id IN  
(SELECT image_id FROM ps_gallery_image WHERE  
 gallery_id='G7ejKGoWS_cY') ORDER BY LOWER(FILE_NAME) ASC
  offset 0
  ) ss
limit 1;

The OFFSET should act as an optimization fence to prevent the LIMIT
from being used in the planning of the subquery.
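
For reference, raising the statistics target would look something like
the below; 1000 is just an example value, and a fresh ANALYZE is needed
for it to take effect:

ALTER TABLE ps_gallery_image ALTER COLUMN gallery_id SET STATISTICS 1000;
ANALYZE ps_gallery_image;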

regards, tom lane



Re: [PERFORM] bad plan and LIMIT

2009-05-01 Thread James Nelson


I had tried using exists, but both forms of the query (with limit and
without) performed much worse.


   James

On May 1, 2009, at 4:22 AM, Adam Ruth wrote:

You could try changing the IN to an EXISTS; that may alter how the
optimizer weighs the limit.



SELECT ID FROM ps_image WHERE EXISTS (SELECT null FROM  
ps_gallery_image WHERE gallery_id ='G7ejKGoWS_cY' and image_id =  
ps_image.id) ORDER BY LOWER(FILE_NAME) ASC


On 30/04/2009, at 3:51 AM, James Nelson wrote:



<original message quoted in full -- snipped>

Re: [PERFORM] bad plan and LIMIT

2009-05-01 Thread James Nelson


The 'in' form and the 'join' form produce identical plans for both the
limit and non-limit versions of the query, which I actually think
reflects well on the query planner. I also tried a form of the query
with the subselect in the from clause, to try to force the order in
which the tables were evaluated, but the query planner saw through that
one too. Basically this query:


SELECT ps_image.id FROM
    (SELECT image_id FROM ps_gallery_image WHERE gallery_id='G7ejKGoWS_cY') as ids
INNER JOIN ps_image on ps_image.id = ids.image_id
ORDER BY LOWER(FILE_NAME) ASC limit 1;


produces the same plan as the 'in' or the 'join' form when the limit  
clause is present.


 James



On May 1, 2009, at 4:32 AM, Grzegorz Jaśkiewicz wrote:


use join instead of where in();





Re: [PERFORM] bad plan and LIMIT

2009-05-01 Thread James Nelson


I looked into the distribution of the filenames; in particular I ran a
query to see how far into the table the first filename would be found.


photoshelter=# select count(*) from ps_image where lower(file_name) < 'a-400-001.jpg';
  count
---------
 8915832


As you can see, the first row is almost 9 million rows into the table
(a-400-001.jpg is the first filename returned by the query), which
implies the distribution is heavily non-uniform. (For a uniform
distribution the first row should have been within the first 500 rows,
give or take.)


I tried the query you suggested below, but it did not perform well;
using it as inspiration, though, the following query does work:


photoshelter=# explain analyze select * from (
 SELECT ID, lower(file_name) as lfn FROM ps_image WHERE id IN
   (SELECT image_id FROM ps_gallery_image WHERE
gallery_id='G7ejKGoWS_cY')
 offset 0
 ) ss
ORDER BY lfn ASC
limit 1;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=158946.43..158946.43 rows=1 width=52) (actual time=1539.615..1539.615 rows=1 loops=1)
   ->  Sort  (cost=158946.43..159044.80 rows=39350 width=52) (actual time=1539.613..1539.613 rows=1 loops=1)
         Sort Key: (lower((ps_image.file_name)::text))
         Sort Method:  top-N heapsort  Memory: 17kB
         ->  Limit  (cost=43197.34..158356.18 rows=39350 width=36) (actual time=74.530..1499.328 rows=50237 loops=1)
               ->  Nested Loop  (cost=43197.34..158356.18 rows=39350 width=36) (actual time=74.529..1475.378 rows=50237 loops=1)
                     ->  HashAggregate  (cost=43197.34..43590.84 rows=39350 width=17) (actual time=74.468..110.638 rows=50237 loops=1)
                           ->  Index Scan using gi_gallery_id on ps_gallery_image  (cost=0.00..43072.80 rows=49816 width=17) (actual time=0.049..46.926 rows=50237 loops=1)
                                 Index Cond: ((gallery_id)::text = 'G7ejKGoWS_cY'::text)
                     ->  Index Scan using ps_image_pkey on ps_image  (cost=0.00..2.90 rows=1 width=36) (actual time=0.025..0.025 rows=1 loops=50237)
                           Index Cond: ((ps_image.id)::text = (ps_gallery_image.image_id)::text)
 Total runtime: 1540.032 ms
(12 rows)

Interestingly to me, while the 'offset 0' did not work as an
optimization fence in the query you provided, it works as one in the
query above. I had tried removing it from the above query, and the plan
reverted back to the bad form.


The non-uniform distribution leads me to another question: would it be
possible to use partial indexes or some other technique to help the
planner? Or would the fact that the relevant information, gallery ids
and filenames, is split across two tables foil any attempt?
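
Purely to illustrate what I mean, something like the statement below --
which would not actually work as written, since ps_image has no
gallery_id column, and a partial-index predicate can only reference
columns of the table being indexed:

CREATE INDEX i_file_name_g7 ON ps_image (lower(file_name))
    WHERE gallery_id = 'G7ejKGoWS_cY';  -- hypothetical column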


In any case, I'd like to thank everyone for their input. The query  
above will be a big help.


be well,

  James


On May 1, 2009, at 10:57 AM, Tom Lane wrote:


<quoted message snipped>





Re: [PERFORM] Many left outer joins with limit performance

2009-05-01 Thread Tom Lane
Gerhard Wiesinger <li...@wiesinger.com> writes:
 FROM
log l
 -- Order is relevant here
 LEFT OUTER JOIN key_description k1  ON k1.description = 'Raumsolltemperatur'
 LEFT OUTER JOIN log_details d1  ON l.id = d1.fk_id AND d1.fk_keyid = 
 k1.keyid

Surely this query is just plain broken?  You're forming a cross product
of the relevant log lines with the k1 rows having description =
'Raumsolltemperatur' (I assume this isn't unique, else it's not clear
what the point is) and then the subsequent left join cannot get rid of
anything.  I think probably you meant something different, like

FROM
   log l
LEFT OUTER JOIN log_details d1  ON l.id = d1.fk_id
LEFT OUTER JOIN key_description k1  ON k1.description = 'Raumsolltemperatur' 
AND d1.fk_keyid = k1.keyid

regards, tom lane



Re: [PERFORM] Transparent table partitioning in future version of PG?

2009-05-01 Thread Scott Carey
On 5/1/09 7:32 AM, henk de wit <henk53...@hotmail.com> wrote:

 Hi,

 I was looking at the support that PostgreSQL offers for table partitioning
 at http://www.postgresql.org/docs/8.4/static/ddl-partitioning.html. The
 concept looks promising, but it's maybe fair to say that PG itself doesn't
 really support partitioning natively; one can simulate it using some of
 the existing PG features (namely inheritance, triggers, rules and
 constraint exclusion). This simulation does seem to work, but there are
 some disadvantages and caveats.

 A major disadvantage is obviously that you need to set up and maintain the
 whole structure yourself (which is somewhat dangerous, or at least
 involves a lot of maintenance overhead). Next to that, it seemingly
 becomes hard to do simple queries like 'select * from foo where bar > 1000
 and bar < 5000', in case the answer to this query spans multiple
 partitions. Constraint exclusion works to some degree, but the document I
 referred to above tells me I can no longer use prepared statements then.

More caveats:  

Query plans go bad pretty quickly because the planner doesn't aggregate
statistics correctly when scanning more than one table.

Constraint exclusion code is completely and utterly broken if the table
count gets large on DELETE or UPDATE queries -- I can get the query
planner / constraint exclusion stuff to eat up 7GB of RAM trying to
figure out which table to access when the number of partitions is ~6000.
The same thing in select form doesn't consume that memory but still
takes over a second.

This is not a bug:
http://www.nabble.com/8.3.5:-Query-Planner-takes-15%2B-seconds-to-plan-Update-or-Delete-queries-on-partitioned-tables.-td21992054.html

It's pretty much faster to do merge joins or hash joins client side on
multiple tables -- basically doing partitioning client side -- after a
point, and for any more complicated aggregation or join.

There is a lot of talk about overly complicated partitioning or
auto-partitioning, but two much more simple things would go a long way to
making this fairly workable:

Make stat aggregation across tables better -- use a weighted average for
estimating row width, and aggregate distinct counts and correlations
better. Right now it mostly assumes the worst possible case and can end
up with very suboptimal plans.

Make a special case for unique child inheritance constraints that can be
checked much faster -- nobody wants to partition and have overlapping
constraint regions.  And whatever is going on for it on the update / delete
side that causes it to take so much longer and use so much more memory for
what should be the same constraint exclusion check as a select needs to be
attended to.

There would still be manual work for managing creating partitions, but at
this point, that is the _least_ of the problems.


 




Re: [PERFORM] performance for high-volume log insertion

2009-05-01 Thread PFC



Blocking round trips to another process on the same server should be
fairly cheap--that is, writing to a socket (or pipe, or localhost TCP
connection) where the other side is listening for it; and then
blocking in return for the response.  The act of writing to an FD that
another process is waiting for will make the kernel mark the process
as ready to wake up immediately, and the act of blocking for the
response will kick the scheduler to some waiting process, so as long
as there isn't something else to compete for CPU for, each write/read
will wake up the other process instantly.  There's a task switching
cost, but that's too small to be relevant here.

Doing 1,000,000 local round trips, over a pipe: 5.25s (5 *microseconds*
each), code attached.  The cost *should* be essentially identical for
any local transport (pipes, named pipes, local TCP connections), since
the underlying scheduler mechanisms are the same.


	Roundtrips can be quite fast but they have a hidden problem, which is  
that everything gets serialized.
	This means if you have a process that generates data to insert, and a  
postgres process, and 2 cores on your CPU, you will never use more than 1  
core, because both are waiting on each other.

Pipelining is a way to solve this...
	In the ideal case, if postgres is as fast as the data-generating process,  
each would use 1 core, yielding 2x speedup.
	Of course if one of the processes is like 10x faster than the other, it  
doesn't matter.





Re: [PERFORM] performance for high-volume log insertion

2009-05-01 Thread david

On Sat, 2 May 2009, PFC wrote:


<quoted message snipped>


in the case of rsyslog there are config options to allow multiple
threads to be working on the inserts, so it doesn't need to be
serialized as badly as you are fearing (there is locking involved, so it
doesn't scale perfectly).


David Lang



Re: [PERFORM] performance for high-volume log insertion

2009-05-01 Thread Glenn Maynard
On Fri, May 1, 2009 at 8:29 PM, PFC <li...@peufeu.com> wrote:
        Roundtrips can be quite fast but they have a hidden problem, which is
 that everything gets serialized.

The client and server will serialize, but what usually matters most is
avoiding serializing against disk I/O--and that's why write-back
caching exists.  There's still a benefit to pipelining (not everything
the db might need to read to complete the write will always be in
cache), but if everything was being serialized it'd be an order of
magnitude worse.  That's why running each insert in a separate
transaction is so much slower; in that case, it *will* serialize
against the disk (by default).
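
As a minimal sketch of that last point (the table name is invented for
the example), the difference is between

BEGIN;
INSERT INTO log_lines (msg) VALUES ('line 1');
INSERT INTO log_lines (msg) VALUES ('line 2');
-- ... thousands more rows ...
COMMIT;  -- one synchronous WAL flush for the whole batch

and running each INSERT in its own implicit transaction, which by
default pays that disk flush once per row.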

-- 
Glenn Maynard
