Re: [PERFORM] Planner choosing NestedLoop, although it is slower...

2011-07-14 Thread Clem Dickey

On 07/12/2011 11:11 AM, Mario Splivalo wrote:

Hi, all.

I have a query, looking like this:
SELECT
pub_date
FROM
tubesite_object
INNER JOIN tubesite_image
ON tubesite_image.object_ptr_id = tubesite_object.id
WHERE
tubesite_object.site_id = 8
AND tubesite_object.pub_date < E'2011-07-12 13:25:00'
ORDER BY
tubesite_object.pub_date ASC
LIMIT 21;




Why is planner using NestedLoops, that is, what can I do to make him NOT
to use NestedLoops (other than issuing SET enable_nestloop TO false;
before each query) ?


The planner is using a nested loops because the startup overhead is 
less, and it think that it will only have run a small 0.2% (21/9404) of 
the loops before reaching your limit of 21 results. In fact it has to 
run all the loops, because there are 0 results. (Is that what you expected?)


Try a using CTE to make the planner think you are going to use all the 
rows of the joined table. That may cause the planner to use a merge 
join, which has higher startup cost (sort) but less overall cost if it 
the join will not finish early.


WITH t AS (
  SELECT tubesite_object.site_id AS site_id,
tubesite_object.pub_date as pub_date
  FROM tubesite_object
  INNER JOIN tubesite_image
  ON tubesite_image.object_ptr_id = tubesite_object.id
)
SELECT pub_date
FROM t
WHERE t.site_id = 8 AND t.pub_date < E'2011-07-12 13:25:00'
ORDER BY t.pub_date ASC LIMIT 21;


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Planner choosing NestedLoop, although it is slower...

2011-07-12 Thread Mario Splivalo

On 07/13/2011 02:53 AM, Mario Splivalo wrote:

On 07/13/2011 12:39 AM, Tom Lane wrote:

Mario Splivalo writes:

On 07/12/2011 10:04 PM, Tom Lane wrote:

What you need to look into is why the estimated join size is 9400 rows
when the actual join size is zero. Are both tables ANALYZEd? Are you
intentionally selecting rows that have no join partners?



Yes, both tables have been ANALYZEd. What do you mean, intentilnaly
selecting rows taht have no join partners?


I'm wondering why the actual join size is zero. That seems like a
rather unexpected case for a query like this.


Yes, seems that planer gets confused by LIMIT. This query:

select * from tubesite_object join tubesite_image on id=object_ptr_id
where site_id = 8 and pub_date < '2011-07-12 13:25:00' order by pub_date
desc ;

Does not choose Nested Loop, and is done instantly (20 ms), and returns
no rows. However, if I add LIMIT at the end, it chooses NestedLoop and
it takes 500ms if I'm alone on the server, and 10+ seconds if there 50+
connections on the server.


As explained/suggested by RhodiumToad on IRC, adding composite index on 
(site_id, pub_date) made nestedloop query finish in around 100 seconds!


Thank you!

Mario

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Planner choosing NestedLoop, although it is slower...

2011-07-12 Thread Mario Splivalo

On 07/13/2011 12:39 AM, Tom Lane wrote:

Mario Splivalo  writes:

On 07/12/2011 10:04 PM, Tom Lane wrote:

What you need to look into is why the estimated join size is 9400 rows
when the actual join size is zero.  Are both tables ANALYZEd?  Are you
intentionally selecting rows that have no join partners?



Yes, both tables have been ANALYZEd. What do you mean, intentilnaly
selecting rows taht have no join partners?


I'm wondering why the actual join size is zero.  That seems like a
rather unexpected case for a query like this.


Yes, seems that planer gets confused by LIMIT. This query:

select * from tubesite_object join tubesite_image on id=object_ptr_id 
where site_id = 8 and pub_date < '2011-07-12 13:25:00' order by pub_date 
desc ;


Does not choose Nested Loop, and is done instantly (20 ms), and returns 
no rows. However, if I add LIMIT at the end, it chooses NestedLoop and 
it takes 500ms if I'm alone on the server, and 10+ seconds if there 50+ 
connections on the server.


Mario

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Planner choosing NestedLoop, although it is slower...

2011-07-12 Thread Mario Splivalo

On 07/13/2011 12:39 AM, Tom Lane wrote:

Mario Splivalo  writes:

On 07/12/2011 10:04 PM, Tom Lane wrote:

What you need to look into is why the estimated join size is 9400 rows
when the actual join size is zero.  Are both tables ANALYZEd?  Are you
intentionally selecting rows that have no join partners?



Yes, both tables have been ANALYZEd. What do you mean, intentilnaly
selecting rows taht have no join partners?


I'm wondering why the actual join size is zero.  That seems like a
rather unexpected case for a query like this.


It is true that this particular query returns 0 rows. But it's created 
by django, and I can't do much to alter it.


Mario

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Planner choosing NestedLoop, although it is slower...

2011-07-12 Thread Tom Lane
Mario Splivalo  writes:
> On 07/12/2011 10:04 PM, Tom Lane wrote:
>> What you need to look into is why the estimated join size is 9400 rows
>> when the actual join size is zero.  Are both tables ANALYZEd?  Are you
>> intentionally selecting rows that have no join partners?

> Yes, both tables have been ANALYZEd. What do you mean, intentilnaly 
> selecting rows taht have no join partners?

I'm wondering why the actual join size is zero.  That seems like a
rather unexpected case for a query like this.

regards, tom lane

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Planner choosing NestedLoop, although it is slower...

2011-07-12 Thread Mario Splivalo

On 07/12/2011 10:04 PM, Tom Lane wrote:

Mario Splivalo  writes:

   Limit  (cost=0.00..415.91 rows=21 width=8) (actual
time=11263.089..11263.089 rows=0 loops=1)
 ->   Nested Loop  (cost=0.00..186249.55 rows=9404 width=8) (actual
time=11263.087..11263.087 rows=0 loops=1)



Why is planner using NestedLoops,


Because it thinks the LIMIT will kick in and end the query when the join
is only 21/9404ths (ie, a fraction of a percent) complete.  A NestLoop
results in saving a lot of work in that situation, whereas hash-and-sort
has to do the whole join despite the LIMIT.

What you need to look into is why the estimated join size is 9400 rows
when the actual join size is zero.  Are both tables ANALYZEd?  Are you
intentionally selecting rows that have no join partners?


Hi, Tom.

Yes, both tables have been ANALYZEd. What do you mean, intentilnaly 
selecting rows taht have no join partners?


Mario

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Planner choosing NestedLoop, although it is slower...

2011-07-12 Thread Tom Lane
Mario Splivalo  writes:
>   Limit  (cost=0.00..415.91 rows=21 width=8) (actual 
> time=11263.089..11263.089 rows=0 loops=1)
> ->  Nested Loop  (cost=0.00..186249.55 rows=9404 width=8) (actual 
> time=11263.087..11263.087 rows=0 loops=1)

> Why is planner using NestedLoops,

Because it thinks the LIMIT will kick in and end the query when the join
is only 21/9404ths (ie, a fraction of a percent) complete.  A NestLoop
results in saving a lot of work in that situation, whereas hash-and-sort
has to do the whole join despite the LIMIT.

What you need to look into is why the estimated join size is 9400 rows
when the actual join size is zero.  Are both tables ANALYZEd?  Are you
intentionally selecting rows that have no join partners?

regards, tom lane

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] Planner choosing NestedLoop, although it is slower...

2011-07-12 Thread Mario Splivalo

Hi, all.

I have a query, looking like this:
SELECT
pub_date
FROM
tubesite_object
INNER JOIN tubesite_image
ON tubesite_image.object_ptr_id = tubesite_object.id
WHERE
tubesite_object.site_id = 8
AND tubesite_object.pub_date < E'2011-07-12 13:25:00'
ORDER BY
tubesite_object.pub_date ASC
LIMIT 21;


That query takes cca 10-15 seconds to run. Here is query plan:

 Limit  (cost=0.00..415.91 rows=21 width=8) (actual 
time=11263.089..11263.089 rows=0 loops=1)
   ->  Nested Loop  (cost=0.00..186249.55 rows=9404 width=8) (actual 
time=11263.087..11263.087 rows=0 loops=1)
 ->  Index Scan using tubesite_object_pub_date_idx on 
tubesite_object  (cost=0.00..183007.09 rows=9404 width=12) (actual 
time=0.024..11059.487 rows=9374 loops=1)
   Index Cond: (pub_date < '2011-07-12 
13:25:00-05'::timestamp with time zone)

   Filter: (site_id = 8)
 ->  Index Scan using tubesite_image_pkey on tubesite_image 
(cost=0.00..0.33 rows=1 width=4) (actual time=0.021..0.021 rows=0 
loops=9374)
   Index Cond: (tubesite_image.object_ptr_id = 
tubesite_object.id)

 Total runtime: 11263.141 ms


This query runs quickly (around second or two) when there is only few 
connections to the database. Once I have 50-80 connections (200 is the 
limit, although I never have more than 120-150 connections), that query 
takes around 10-15 seconds.


But, if I disable nestedloops, here is the query plan:

 Limit  (cost=22683.45..22683.51 rows=21 width=8) (actual 
time=136.009..136.009 rows=0 loops=1)
   ->  Sort  (cost=22683.45..22706.96 rows=9404 width=8) (actual 
time=136.007..136.007 rows=0 loops=1)

 Sort Key: tubesite_object.pub_date
 Sort Method:  quicksort  Memory: 25kB
 ->  Hash Join  (cost=946.51..22429.91 rows=9404 width=8) 
(actual time=135.934..135.934 rows=0 loops=1)
   Hash Cond: (tubesite_object.id = 
tubesite_image.object_ptr_id)
   ->  Bitmap Heap Scan on tubesite_object 
(cost=545.40..21828.97 rows=9404 width=12) (actual time=20.874..104.075 
rows=9374 loops=1)

 Recheck Cond: (site_id = 8)
 Filter: (pub_date < '2011-07-12 
13:25:00-05'::timestamp with time zone)
 ->  Bitmap Index Scan on tubesite_object_site_id 
(cost=0.00..543.05 rows=9404 width=0) (actual time=18.789..18.789 
rows=9374 loops=1)

   Index Cond: (site_id = 8)
   ->  Hash  (cost=215.49..215.49 rows=14849 width=4) 
(actual time=21.068..21.068 rows=14849 loops=1)
 ->  Seq Scan on tubesite_image  (cost=0.00..215.49 
rows=14849 width=4) (actual time=0.029..9.073 rows=14849 loops=1)

 Total runtime: 136.287 ms


Now, if I disable nested loops in postgres.conf, then my load average on 
the server goes skyhigh (i presume because a lot of other queries are 
now being planned incorrectly).


I have set up default_statistics_target to 2000, and have vacumed and 
analyzed the database.


Here are the other options I have set up in postgresql.conf (that differ 
from the default settings):


 version | PostgreSQL 8.4.8 on x86_64-pc-linux-gnu, 
compiled by GCC gcc-4.3.real (Debian 4.3.2-1.1) 4.3.2, 64-bit

 checkpoint_segments | 64
 default_statistics_target   | 2000
 effective_cache_size| 20GB
 external_pid_file   | /var/run/postgresql/8.4-main.pid
 lc_collate  | en_US.UTF-8
 lc_ctype| en_US.UTF-8
 listen_addresses| *
 log_autovacuum_min_duration | 0
 log_checkpoints | on
 log_line_prefix | %t [%p]: [%l-1]
 log_min_duration_statement  | 1s
 maintenance_work_mem| 256MB
 max_connections | 200
 max_stack_depth | 3MB
 port| 5432
 server_encoding | UTF8
 shared_buffers  | 2GB
 statement_timeout   | 30min
 temp_buffers| 4096
 TimeZone| localtime
 track_activity_query_size   | 2048
 unix_socket_directory   | /var/run/postgresql
 wal_buffers | 128MB
 work_mem| 64MB



Why is planner using NestedLoops, that is, what can I do to make him NOT 
to use NestedLoops (other than issuing SET enable_nestloop TO false; 
before each query) ?


Mario

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance