Re: [PERFORM] Why query takes soo much time

2011-05-16 Thread Craig Ringer

On 05/16/2011 01:39 PM, Adarsh Sharma wrote:

Dear all,
I have a query on 3 tables in a database as :-

Explain Analyze output:

explain analyze select c.clause, s.subject, s.object, s.verb,
s.subject_type, s.object_type, s.doc_id, s.svo_id from clause2 c,
svo2 s, page_content p where c.clause_id=s.clause_id and
s.doc_id=c.source_id and c.sentence_id=s.sentence_id and
s.doc_id=p.crawled_page_id order by s.svo_id limit 1000 offset 17929000




Using limit and offset can be horrifyingly slow for non-trivial queries. 
Are you trying to paginate results? If not, what are you trying to achieve?


In most (all?) cases, Pg will have to execute the query up to the point 
where it's found limit+offset rows, producing and discarding offset rows 
as it goes. Needless to say, that's horrifyingly inefficient.
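
If this is pagination, one common workaround is keyset pagination: remember the last svo_id returned and filter on it instead of using OFFSET, so each page only has to produce the next 1000 rows. A sketch only (the WHERE value is a placeholder for the last svo_id seen on the previous page):

SELECT c.clause, s.subject, s.object, s.verb, s.subject_type,
       s.object_type, s.doc_id, s.svo_id
FROM clause2 c
     INNER JOIN svo2 s ON (c.clause_id = s.clause_id AND
                           c.source_id = s.doc_id AND
                           c.sentence_id = s.sentence_id)
     INNER JOIN page_content p ON (s.doc_id = p.crawled_page_id)
WHERE s.svo_id > 12345678   -- placeholder: last svo_id from the previous page
ORDER BY s.svo_id
LIMIT 1000;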


Reformatting your query for readability (to me) as:

EXPLAIN ANALYZE
SELECT c.clause, s.subject, s.object, s.verb, s.subject_type,
       s.object_type, s.doc_id, s.svo_id
FROM clause2 c
     INNER JOIN svo2 s ON (c.clause_id = s.clause_id AND
                           c.source_id = s.doc_id AND
                           c.sentence_id = s.sentence_id)
     INNER JOIN page_content p ON (s.doc_id = p.crawled_page_id)
ORDER BY s.svo_id LIMIT 1000 OFFSET 17929000

... I can see that you're joining on 
(c.clause_id,c.source_id,c.sentence_id)=(s.clause_id,s.doc_id,s.sentence_id). 
You have matching indexes idx_clause2_id and idx_svo2_id_dummy with 
matching column ordering. Pg is using idx_clause2_id in the join of svo2 
and clause2, but instead of doing a bitmap index scan using it and 
idx_svo2_id_dummy it's doing a nested loop using idx_clause2_id and 
pk_svo_id.


First: make sure your stats are up to date by ANALYZE-ing your tables 
and probably increasing the stats collected on the join columns and/or 
increasing default_statistics_target. If that doesn't help, personally 
I'd play with the random_page_cost and seq_page_cost to see if they 
reflect your machine's actual performance, and to see if you get a more 
favourable plan. If I were experimenting with this I'd also see if 
giving the query lots of work_mem allowed it to try a different approach 
to the join.
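
Something along these lines, say (the statistics target and the SET values are only placeholders to experiment with, not tuned recommendations):

ALTER TABLE clause2 ALTER COLUMN source_id SET STATISTICS 500;
ALTER TABLE svo2    ALTER COLUMN doc_id    SET STATISTICS 500;
ANALYZE clause2;
ANALYZE svo2;
ANALYZE page_content;

-- per-session experiments with the cost model and memory
SET random_page_cost = 2.0;   -- placeholder; depends on how much of the data is cached
SET work_mem = '256MB';       -- placeholder; room for the planner to consider a hash join/sort
-- then re-run EXPLAIN ANALYZE on the query and compare the plans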



Limit  (cost=21685592.91..21686802.44 rows=1000 width=2624) (actual time=414601.802..414622.920 rows=1000 loops=1)
  ->  Nested Loop  (cost=59.77..320659013645.28 rows=265112018116 width=2624) (actual time=0.422..404902.314 rows=1793 loops=1)
        ->  Nested Loop  (cost=0.00..313889654.42 rows=109882338 width=2628) (actual time=0.242..174223.789 rows=17736897 loops=1)
              ->  Index Scan using pk_svo_id on svo2 s  (cost=0.00..33914955.13 rows=26840752 width=2600) (actual time=0.157..14691.039 rows=14238271 loops=1)
              ->  Index Scan using idx_clause2_id on clause2 c  (cost=0.00..10.36 rows=4 width=44) (actual time=0.007..0.008 rows=1 loops=14238271)
                    Index Cond: ((c.source_id = s.doc_id) AND (c.clause_id = s.clause_id) AND (c.sentence_id = s.sentence_id))
        ->  Bitmap Heap Scan on page_content p  (cost=59.77..2885.18 rows=2413 width=8) (actual time=0.007..0.008 rows=1 loops=17736897)
              Recheck Cond: (p.crawled_page_id = s.doc_id)
              ->  Bitmap Index Scan on idx_crawled_id  (cost=0.00..59.17 rows=2413 width=0) (actual time=0.005..0.005 rows=1 loops=17736897)
                    Index Cond: (p.crawled_page_id = s.doc_id)
Total runtime: 414623.634 ms

My table / index definitions are as under:

Estimated rows in the 3 tables are:

clause2 10341700
svo2 26008000
page_content 479785

CREATE TABLE clause2
(
  id bigint NOT NULL DEFAULT nextval('clause_id_seq'::regclass),
  source_id integer,
  sentence_id integer,
  clause_id integer,
  tense character varying(30),
  clause text,
  CONSTRAINT pk_clause_id PRIMARY KEY (id)
) WITH (OIDS=FALSE);
CREATE INDEX idx_clause2_id ON clause2 USING btree (source_id, clause_id, sentence_id);


CREATE TABLE svo2
(
  svo_id bigint NOT NULL DEFAULT nextval('svo_svo_id_seq'::regclass),
  doc_id integer,
  sentence_id integer,
  clause_id integer,
  negation integer,
  subject character varying(3000),
  verb character varying(3000),
  object character varying(3000),
  preposition character varying(3000),
  subject_type character varying(3000),
  object_type character varying(3000),
  subject_attribute character varying(3000),
  object_attribute character varying(3000),
  verb_attribute character varying(3000),
  subject_concept character varying(100),
  object_concept character varying(100),
  subject_sense character varying(100),
  object_sense character varying(100),
  subject_chain character varying(5000),
  object_chain character varying(5000),
  sub_type_id integer,
  obj_type_id integer,
  CONSTRAINT pk_svo_id PRIMARY KEY (svo_id)
) WITH (OIDS=FALSE);
CREATE INDEX idx_svo2_id_dummy ON svo2 USING btree (doc_id, clause_id, sentence_id);


CREATE TABLE page_content
(
  content_id integer NOT NULL DEFAULT 
nextval('page_content_ogc_fid_seq'::regclass),

  wkb_geometry 

Re: [PERFORM] Why query takes soo much time

2011-05-16 Thread Denis de Bernardy
[big nestloop with a huge number of rows]

You're in an edge case, and I doubt you'll get things to run much faster: you 
want the last 1k rows out of an 18M row result set... It will be slow no matter 
what you do.

What the plan currently does is step through those 18M rows one at a time in a
nested loop until it reaches the requested 1k rows. Without the offset, the
plan is absolutely correct (and quite fast, I take it). With the enormous
offset, it's a different story, as you've noted.

An alternative plan could have been to hash join the tables together, sort the
result set, and apply the limit/offset to the resulting set. You can probably
force the planner to do that by rewriting your statement using a WITH clause
(CTE), too:

EXPLAIN ANALYZE
WITH rows AS (
    SELECT c.clause, s.subject, s.object, s.verb, s.subject_type,
           s.object_type, s.doc_id, s.svo_id
    FROM clause2 c
         INNER JOIN svo2 s ON (c.clause_id = s.clause_id AND
                               c.source_id = s.doc_id AND
                               c.sentence_id = s.sentence_id)
         INNER JOIN page_content p ON (s.doc_id = p.crawled_page_id)
)
SELECT *
FROM rows
ORDER BY svo_id LIMIT 1000 OFFSET 17929000


I have my doubts that it'll make much of a difference, though: you'll still be
extracting the last 1k rows out of 18M.

D


Re: [PERFORM] Why query takes soo much time

2011-05-16 Thread Tom Lane
Denis de Bernardy ddeberna...@yahoo.com writes:
 An alternative plan could have been to hash join the tables together,
 to sort the result set, and to apply the limit/offset on the resulting
 set.

Indeed.  I rather wonder why the planner didn't do that to start with.
This plan looks to me like it might be suffering from insufficient
work_mem to allow use of a hash join.  Or possibly the OP changed some
of the cost_xxx or enable_xxx settings in a misguided attempt to force
it to use indexes instead.  As a rule of thumb, whole-table joins
probably ought not be using nestloop plans, and that frequently means
that indexes are worthless for them.
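
To check whether any of those have been changed from their defaults, something like this query against pg_settings would show it (a sketch; the settings listed are just the ones mentioned above):

SELECT name, setting, source
FROM pg_settings
WHERE name LIKE 'enable%'
   OR name IN ('work_mem', 'random_page_cost', 'seq_page_cost',
               'default_statistics_target');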

But in any case, as Craig noted, the real elephant in the room is the
huge OFFSET value.  It seems likely that this query is not standing
alone but is meant as one of a series that's supposed to provide
paginated output, and if so the total cost of the series is just going
to be impossible no matter what.  The OP needs to think about using a
cursor or some such to avoid repeating most of the work each time.
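
A minimal sketch of the cursor approach (the cursor name is arbitrary, and it must stay inside a transaction unless declared WITH HOLD):

BEGIN;
DECLARE svo_cur CURSOR FOR
    SELECT c.clause, s.subject, s.object, s.verb, s.subject_type,
           s.object_type, s.doc_id, s.svo_id
    FROM clause2 c
         INNER JOIN svo2 s ON (c.clause_id = s.clause_id AND
                               c.source_id = s.doc_id AND
                               c.sentence_id = s.sentence_id)
         INNER JOIN page_content p ON (s.doc_id = p.crawled_page_id)
    ORDER BY s.svo_id;
FETCH 1000 FROM svo_cur;   -- first page
FETCH 1000 FROM svo_cur;   -- next page; the join is not restarted
CLOSE svo_cur;
COMMIT;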

regards, tom lane
