Re: [PERFORM] Reason of Slowness of query

2011-03-23 Thread Chetan Suttraway
On Wed, Mar 23, 2011 at 11:58 AM, Adarsh Sharma wrote:

>  Dear all,
>
> I have 2 tables in my database, named clause2 (4900 MB) & page_content (1582
> MB).
>
> My table definitions are:
>
> *page_content :-
>
> *CREATE TABLE page_content
> (
>   content_id integer,
>   wkb_geometry geometry,
>   link_level integer,
>   isprocessable integer,
>   isvalid integer,
>   isanalyzed integer,
>   islocked integer,
>   content_language character(10),
>   url_id integer,
>   publishing_date character(40),
>   heading character(150),
>   category character(150),
>   crawled_page_url character(500),
>   keywords character(500),
>   dt_stamp timestamp with time zone,
>   "content" character varying,
>   crawled_page_id bigint,
>   id integer
> )
> WITH (
>   OIDS=FALSE
> );
>
> *Indexes on it :-*
> CREATE INDEX idx_page_id  ON page_content  USING btree  (crawled_page_id);
> CREATE INDEX idx_page_id_content   ON page_content  USING btree
> (crawled_page_id, content_language, publishing_date, isprocessable);
> CREATE INDEX pgweb_idx  ON page_content   USING gin
> (to_tsvector('english'::regconfig, content::text));
>
> *clause 2:-
> *CREATE TABLE clause2
> (
>   id bigint NOT NULL DEFAULT nextval('clause_id_seq'::regclass),
>   source_id integer,
>   sentence_id integer,
>   clause_id integer,
>   tense character varying(30),
>   clause text,
>   CONSTRAINT pk_clause_demo_id PRIMARY KEY (id)
> )WITH ( OIDS=FALSE);
>
> *Indexes on it :
>
> *CREATE INDEX idx_clause2_march10
>   ON clause2
>   USING btree
>   (id, source_id);*
>
> *I perform a join query on it as :
>
> * explain analyze select distinct(p.crawled_page_id) from page_content p ,
> clause2  c where p.crawled_page_id != c.source_id ;
>
> *It takes more than 1 hour to complete. I issued the explain analyze
> command but could not wait for its output, so I send my plain explain
> output instead:
>  QUERY
> PLAN
>
> 
>  Unique  (cost=927576.16..395122387390.13 rows=382659 width=8)
>->  Nested Loop  (cost=927576.16..360949839832.15 rows=13669019023195
> width=8)
>  Join Filter: (p.crawled_page_id <> c.source_id)
>  ->  Index Scan using idx_page_id on page_content p
> (cost=0.00..174214.02 rows=428817 width=8)
>  ->  Materialize  (cost=927576.16..1370855.12 rows=31876196
> width=4)
>->  Seq Scan on clause2 c  (cost=0.00..771182.96
> rows=31876196 width=4)
> (6 rows)
>
>
> Please guide me how to make the above query run faster as I am not able to
> do that.
>
>
> Thanks, Adarsh
>
> *
>
> *
>

Could you try just explaining the below query:
explain  select distinct(p.crawled_page_id) from page_content p where NOT
EXISTS (select 1 from  clause2 c where c.source_id = p.crawled_page_id);

The idea here is to avoid directly using the != (not-equal) operator.



Regards,
Chetan

-- 
Chetan Suttraway
EnterpriseDB, The Enterprise PostgreSQL company.


Re: [PERFORM] Reason of Slowness of query

2011-03-23 Thread Samuel Gendler
On Tue, Mar 22, 2011 at 11:28 PM, Adarsh Sharma wrote:

> *
> *I perform a join query on it as :
>
> * explain analyze select distinct(p.crawled_page_id) from page_content p ,
> clause2  c where p.crawled_page_id != c.source_id ;
>
> *What it takes more than 1 hour to complete. As I issue the explain
> analyze command and cannot able to wait for output but I send my explain
> output as :
>


please describe what your query is trying to select, as it is possible the
query isn't doing what you think it is.  Joining 2 tables where id1 != id2
will create a near cross product of the two tables, such that every row from
the first table is matched with every single row from the second table that
doesn't have a matching id.  Then you are looking for distinct values on
that potentially enormous set of rows.

db_v2=# select * from table1;
 id | value
----+-------
  1 | 1
  2 | 2
  3 | 3
(3 rows)

db_v2=# select * from table2;
 id | value
----+-------
  1 | 4
  2 | 5
  3 | 6
(3 rows)

db_v2=# select t1.id, t1.value, t2.id, t2.value from table1 t1, table2 t2
where t1.id != t2.id;
 id | value | id | value
----+-------+----+-------
  1 | 1 |  2 | 5
  1 | 1 |  3 | 6
  2 | 2 |  1 | 4
  2 | 2 |  3 | 6
  3 | 3 |  1 | 4
  3 | 3 |  2 | 5

So if you have a couple of million rows in each table, you are selecting
distinct over a potentially huge set of data.   If you are actually trying
to find all ids from one table which have no match at all in the other
table, then you need an entirely different query:

db_v2=# insert into table2 (value) values (7);
INSERT 0 1

db_v2=# select * from table2;
 id | value
----+-------
  1 | 4
  2 | 5
  3 | 6
  4 | 7

db_v2=# select t2.id, t2.value from table2 t2 where not exists (select 1
from table1 t1 where t1.id = t2.id);
 id | value
----+-------
  4 | 7


Re: [PERFORM] Reason of Slowness of query

2011-03-23 Thread Vitalii Tymchyshyn

On 23.03.11 08:28, Adarsh Sharma wrote:

*
*I perform a join query on it as :

* explain analyze select distinct(p.crawled_page_id) from page_content 
p , clause2  c where p.crawled_page_id != c.source_id ;*
Your query is wrong. This query will return every *crawled_page_id* if
clause2 has more than 1 distinct source_id. This is because the DB will be
able to find a clause with a source_id different from any given
crawled_page_id. You need to use "not exists" or "not in".
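That point is easy to demonstrate on toy data. The sketch below uses Python's sqlite3 with stand-in tables (assumed names, not the real schema): once clause2 holds two distinct source_ids, the != join keeps every page id.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE page_content (crawled_page_id INTEGER);
    CREATE TABLE clause2 (source_id INTEGER);
    INSERT INTO page_content VALUES (1), (2), (3);
    INSERT INTO clause2 VALUES (1), (2);  -- two distinct source_ids
""")

# Every crawled_page_id differs from at least one of the two source_ids,
# so the inequality join returns all of them.
rows = con.execute("""
    SELECT DISTINCT p.crawled_page_id
    FROM page_content p, clause2 c
    WHERE p.crawled_page_id != c.source_id
    ORDER BY 1
""").fetchall()
print(rows)  # [(1,), (2,), (3,)] -- every id comes back
```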


Best regards, Vitalii Tymchyshyn.


Re: [PERFORM] Reason of Slowness of query

2011-03-23 Thread Chetan Suttraway
On Wed, Mar 23, 2011 at 12:50 PM, Samuel Gendler
wrote:

> On Tue, Mar 22, 2011 at 11:28 PM, Adarsh Sharma 
> wrote:
>
>> *
>> *I perform a join query on it as :
>>
>> * explain analyze select distinct(p.crawled_page_id) from page_content p
>> , clause2  c where p.crawled_page_id != c.source_id ;
>>
>> *What it takes more than 1 hour to complete. As I issue the explain
>> analyze command and cannot able to wait for output but I send my explain
>> output as :
>>
>
>
> please describe what your query is trying to select, as it is possible that
> query isn't doing what you think it is.  joining 2 tables where id1 != id2
> will create a cross multiple of the two tables such that every row from the
> first table is matched with every single row from the second table that
> doesn't have a matching id.  Then you are looking for distinct values on
> that potentially enormous set of rows.
>
> db_v2=# select * from table1;
>  id | value
> ----+-------
>   1 | 1
>   2 | 2
>   3 | 3
> (3 rows)
>
> db_v2=# select * from table2;
>  id | value
> ----+-------
>   1 | 4
>   2 | 5
>   3 | 6
> (3 rows)
>
> db_v2=# select t1.id, t1.value, t2.id, t2.value from table1 t1, table2 t2
> where t1.id != t2.id;
>  id | value | id | value
> ----+-------+----+-------
>   1 | 1 |  2 | 5
>   1 | 1 |  3 | 6
>   2 | 2 |  1 | 4
>   2 | 2 |  3 | 6
>   3 | 3 |  1 | 4
>   3 | 3 |  2 | 5
>
> So if you have a couple of million rows in each table, you are selecting
> distinct over a potentially huge set of data.   If you are actually trying
> to find all ids from one table which have no match at all in the other
> table, then you need an entirely different query:
>
> db_v2=# insert into table2 (value) values (7);
> INSERT 0 1
>
> db_v2=# select * from table2;
>  id | value
> ----+-------
>   1 | 4
>   2 | 5
>   3 | 6
>   4 | 7
>
> db_v2=# select t2.id, t2.value from table2 t2 where not exists (select 1
> from table1 t1 where t1.id = t2.id);
>  id | value
> ----+-------
>   4 | 7
>
>

Check this setup:
pg=# create table t1(a int, b int);
CREATE TABLE
pg=# create index t1_b on t1(b);
CREATE INDEX
pg=# create table t2(c int, d int);
CREATE TABLE
pg=# create index t2_cd on t2(c,d);
CREATE INDEX
pg=# explain select distinct(b) from t1,t2 where t1.b !=t2.d;
  QUERY
PLAN
---
 Unique  (cost=0.00..80198.86 rows=200 width=4)
   ->  Nested Loop  (cost=0.00..68807.10 rows=4556702 width=4)
 Join Filter: (t1.b <> t2.d)
 ->  Index Scan using t1_b on t1  (cost=0.00..76.35 rows=2140
width=4)
 ->  Materialize  (cost=0.00..42.10 rows=2140 width=4)
   ->  Seq Scan on t2  (cost=0.00..31.40 rows=2140 width=4)
(6 rows)


pg=# explain select distinct(b) from t1 where NOT EXISTS (select 1 from t2
where t2.d=t1.b);
   QUERY PLAN

 HashAggregate  (cost=193.88..193.89 rows=1 width=4)
   ->  Hash Anti Join  (cost=58.15..193.88 rows=1 width=4)
 Hash Cond: (t1.b = t2.d)
 ->  Seq Scan on t1  (cost=0.00..31.40 rows=2140 width=4)
 ->  Hash  (cost=31.40..31.40 rows=2140 width=4)
   ->  Seq Scan on t2  (cost=0.00..31.40 rows=2140 width=4)
(6 rows)

The cost seems to be on the higher side, but maybe on your system, with
index scans on t2 and t1, the cost will be lower.

Another query which forced an index scan was:
pg=# explain select distinct(b) from t1,t2 where t1.b > t2.d union all
select distinct(b) from t1,t2 where t1.b < t2.d;
   ->  Unique  (cost=0.00..50246.37 rows=200 width=4)
 ->  Nested Loop  (cost=0.00..46430.04 rows=1526533 width=4)
   ->  Index Scan using t1_b on t1  (cost=0.00..76.35 rows=2140
width=4)
   ->  Index Scan using t2_d on t2  (cost=0.00..12.75 rows=713
width=4)
 Index Cond: (public.t1.b > public.t2.d)
   ->  Unique  (cost=0.00..50246.37 rows=200 width=4)
 ->  Nested Loop  (cost=0.00..46430.04 rows=1526533 width=4)
   ->  Index Scan using t1_b on t1  (cost=0.00..76.35 rows=2140
width=4)
   ->  Index Scan using t2_d on t2  (cost=0.00..12.75 rows=713
width=4)
 Index Cond: (public.t1.b < public.t2.d)
(11 rows)


This looks acceptable.
Please try the above query with your setup and post the explain output.


-- 
Regards,
Chetan Suttraway
EnterpriseDB, The Enterprise PostgreSQL company.


Re: [PERFORM] Reason of Slowness of query

2011-03-23 Thread Adarsh Sharma

Thanks Chetan, here is the output of your updated query :


*explain  select distinct(p.crawled_page_id) from page_content p where 
NOT EXISTS (select 1 from  clause2 c where c.source_id = p.crawled_page_id);


*
 QUERY 
PLAN  
---

HashAggregate  (cost=1516749.47..1520576.06 rows=382659 width=8)
  ->  Hash Anti Join  (cost=1294152.41..1515791.80 rows=383071 width=8)
Hash Cond: (p.crawled_page_id = c.source_id)
->  Seq Scan on page_content p  (cost=0.00..87132.17 
rows=428817 width=8)

->  Hash  (cost=771182.96..771182.96 rows=31876196 width=4)
  ->  Seq Scan on clause2 c  (cost=0.00..771182.96 
rows=31876196 width=4)

(6 rows)

And my explain analyze output is :

 QUERY 
PLAN

HashAggregate  (cost=1516749.47..1520576.06 rows=382659 width=8) 
(actual time=5.181..56669.270 rows=72 loops=1)
  ->  Hash Anti Join  (cost=1294152.41..1515791.80 rows=383071 width=8) 
(actual time=45740.789..56665.816 rows=74 loops=1)

Hash Cond: (p.crawled_page_id = c.source_id)
->  Seq Scan on page_content p  (cost=0.00..87132.17 
rows=428817 width=8) (actual time=0.012..715.915 rows=428467 loops=1)
->  Hash  (cost=771182.96..771182.96 rows=31876196 width=4) 
(actual time=45310.524..45310.524 rows=31853083 loops=1)
  ->  Seq Scan on clause2 c  (cost=0.00..771182.96 
rows=31876196 width=4) (actual time=0.055..23408.884 rows=31853083 loops=1)

Total runtime: 56687.660 ms
(7 rows)

But is there any option to tune it further? And one more thing: the output
varies from 6 to 7 rows.



Thanks & best Regards,
Adarsh Sharma






Chetan Suttraway wrote:



On Wed, Mar 23, 2011 at 11:58 AM, Adarsh Sharma
<adarsh.sha...@orkash.com> wrote:


Dear all,

I have 2 tables in my database name clause2( 4900 MB) &
page_content(1582 MB).

My table definations are as :

*page_content :-

*CREATE TABLE page_content
(
  content_id integer,
  wkb_geometry geometry,
  link_level integer,
  isprocessable integer,
  isvalid integer,
  isanalyzed integer,
  islocked integer,
  content_language character(10),
  url_id integer,
  publishing_date character(40),
  heading character(150),
  category character(150),
  crawled_page_url character(500),
  keywords character(500),
  dt_stamp timestamp with time zone,
  "content" character varying,
  crawled_page_id bigint,
  id integer
)
WITH (
  OIDS=FALSE
);

*Indexes on it :-*
CREATE INDEX idx_page_id  ON page_content  USING btree 
(crawled_page_id);
CREATE INDEX idx_page_id_content   ON page_content  USING btree 
(crawled_page_id, content_language, publishing_date, isprocessable);
CREATE INDEX pgweb_idx  ON page_content   USING gin  
(to_tsvector('english'::regconfig, content::text));


*clause 2:-
*CREATE TABLE clause2
(
  id bigint NOT NULL DEFAULT nextval('clause_id_seq'::regclass),
  source_id integer,
  sentence_id integer,
  clause_id integer,
  tense character varying(30),
  clause text,
  CONSTRAINT pk_clause_demo_id PRIMARY KEY (id)
)WITH ( OIDS=FALSE);

*Indexes on it :

*CREATE INDEX idx_clause2_march10
  ON clause2
  USING btree
  (id, source_id);*

*I perform a join query on it as :

* explain analyze select distinct(p.crawled_page_id) from
page_content p , clause2  c where p.crawled_page_id != c.source_id ;

*What it takes more than 1 hour to complete. As I issue the
explain analyze command and cannot able to wait for output but I
send my explain output as :
 QUERY
PLAN  


 Unique  (cost=927576.16..395122387390.13 rows=382659 width=8)
   ->  Nested Loop  (cost=927576.16..360949839832.15
rows=13669019023195 width=8)
 Join Filter: (p.crawled_page_id <> c.source_id)
 ->  Index Scan using idx_page_id on page_content p 
(cost=0.00..174214.02 rows=428817 width=8)

 ->  Materialize  (cost=927576.16..1370855.12
rows=31876196 width=4)
   ->  Seq Scan on clause2 c  (cost=0.00..771182.96
rows=31876196 width=4)
(6 rows)


Please guide me how to make the above query run faster as I am not
able to do that.


Thanks, Adarsh

*

*


Could you try just explaining the below query:
explain  select distinct(p.crawled_page_id) from page_content p where NOT
EXISTS (select 1 from  clause2 c where c.source_id = p.crawled_page_id);

[PERFORM] Re-Reason of Slowness of Query

2011-03-23 Thread Adarsh Sharma

Thanks Chetan. After my lunch break, I tried the steps below:

*My original query was :
*explain analyze select distinct(p.crawled_page_id) from page_content p 
, clause2  c where p.crawled_page_id != c.source_id


which hangs, because it is the wrong query to fetch the desired output.

*Next updated query by Chetan Suttraway:*

explain analyze select distinct(p.crawled_page_id) from page_content p
where NOT EXISTS (select 1 from  clause2 c where c.source_id = 
p.crawled_page_id);


   
QUERY 
PLAN 
-
HashAggregate  (cost=100278.16..104104.75 rows=382659 width=8) (actual 
time=7192.843..7195.923 rows=72 loops=1)
  ->  Nested Loop Anti Join  (cost=0.00..99320.46 rows=383079 width=8) 
(actual time=0.040..7192.426 rows=74 loops=1)
->  Seq Scan on page_content p  (cost=0.00..87132.17 
rows=428817 width=8) (actual time=0.009..395.599 rows=428467 loops=1)
->  Index Scan using idx_clause2_source_id on clause2 c  
(cost=0.00..18.18 rows=781 width=4) (actual time=0.014..0.014 rows=1 
loops=428467)

  Index Cond: (c.source_id = p.crawled_page_id)
Total runtime: 7199.748 ms
(6 rows)

I think it is very much faster, but I don't understand the query:

*explain select distinct(b) from t1,t2 where t1.b > t2.d union all
select distinct(b) from t1,t2 where t1.b < t2.d;

*As I transform it into my format as:

explain select distinct(p.crawled_page_id) from page_content p , clause2 
c where p.crawled_page_id > c.source_id union all  select 
distinct(p.crawled_page_id) from page_content p,clause2 c where 
p.crawled_page_id < c.source_id;


QUERY 
PLAN 
-

Append  (cost=0.00..296085951076.34 rows=765318 width=8)
  ->  Unique  (cost=0.00..148042971711.58 rows=382659 width=8)
->  Nested Loop  (cost=0.00..136655213119.84 rows=4555103436696 
width=8)
  ->  Index Scan using idx_page_id on page_content p  
(cost=0.00..174214.02 rows=428817 width=8)
  ->  Index Scan using idx_clause2_source_id on clause2 c  
(cost=0.00..185898.05 rows=10622488 width=4)

Index Cond: (p.crawled_page_id > c.source_id)
  ->  Unique  (cost=0.00..148042971711.58 rows=382659 width=8)
->  Nested Loop  (cost=0.00..136655213119.84 rows=4555103436696 
width=8)
  ->  Index Scan using idx_page_id on page_content p  
(cost=0.00..174214.02 rows=428817 width=8)
  ->  Index Scan using idx_clause2_source_id on clause2 c  
(cost=0.00..185898.05 rows=10622488 width=4)

Index Cond: (p.crawled_page_id < c.source_id)
(11 rows)

I don't think this is correct, because it produces 11 rows of output.

Any further suggestions, Please guide.

Thanks & best Regards,
Adarsh Sharma



Re: [PERFORM] Re-Reason of Slowness of Query

2011-03-23 Thread Vitalii Tymchyshyn

On 23.03.11 11:17, Adarsh Sharma wrote:


I think it is very much faster but I don't understand the query :

*explain select distinct(b) from t1,t2 where t1.b > t2.d union all
select distinct(b) from t1,t2 where t1.b < t2.d;
*

I don't understand it either. What are you trying to get? Is it
select distinct(b) from t1 where b > (select min(d) from t2) or b <
(select max(d) from t2)

?

Can you explain in words, not SQL, what you expect to retrieve?

Best regards, Vitalii Tymchyshyn


Re: [PERFORM] Re-Reason of Slowness of Query

2011-03-23 Thread Adarsh Sharma
I just want to retrieve the ids from page_content which do not have
any entry in the clause2 table.


Thanks , Adarsh

Vitalii Tymchyshyn wrote:

On 23.03.11 11:17, Adarsh Sharma wrote:


I think it is very much faster but I don't understand the query :

*explain select distinct(b) from t1,t2 where t1.b > t2.d union all
select distinct(b) from t1,t2 where t1.b < t2.d;
*

I don't understand it too. What are you trying to get? Is it
select distinct(b) from t1 where  b > (select min(d) from t2)* *or b  
< (select max(d) from t2)

?

Can you explain in words, not SQL, what do you expect do retrieve?

Best regards, Vitalii Tymchyshyn




Re: [PERFORM] Re-Reason of Slowness of Query

2011-03-23 Thread Vitalii Tymchyshyn

On 23.03.11 12:10, Adarsh Sharma wrote:
I just want to retrieve that id 's from page_content which do not have 
any entry in clause2 table.



Then
select distinct(p.crawled_page_id) from page_content p
 where NOT EXISTS (select 1 from  clause2 c where c.source_id = 
p.crawled_page_id);

is correct query.

Best regards, Vitalii Tymchyshyn.


Re: [PERFORM] Re-Reason of Slowness of Query

2011-03-23 Thread Adarsh Sharma

Vitalii Tymchyshyn wrote:

On 23.03.11 12:10, Adarsh Sharma wrote:
I just want to retrieve that id 's from page_content which do not 
have any entry in clause2 table.



Then
select distinct(p.crawled_page_id) from page_content p
 where NOT EXISTS (select 1 from  clause2 c where c.source_id = 
p.crawled_page_id);

is correct query.



I can't understand how *select 1 from clause2 c where c.source_id =
p.crawled_page_id* works, but I get my output.


What is the significance of 1 here.

Thanks , Adarsh
**

Best regards, Vitalii Tymchyshyn.




Re: [PERFORM] Re-Reason of Slowness of Query

2011-03-23 Thread Vitalii Tymchyshyn

On 23.03.11 12:19, Adarsh Sharma wrote:

Vitalii Tymchyshyn wrote:

On 23.03.11 12:10, Adarsh Sharma wrote:
I just want to retrieve that id 's from page_content which do not 
have any entry in clause2 table.



Then
select distinct(p.crawled_page_id) from page_content p
 where NOT EXISTS (select 1 from  clause2 c where c.source_id = 
p.crawled_page_id);

is correct query.



I can't understand how*select 1 from  clause2 c where c.source_id = 
p.crawled_page_id works too, *i get my output .


What is the significance of 1 here.
No significance. You can put anything there, e.g. "*". It is simply an
arbitrary constant. EXISTS checks whether there were any rows; it does not
matter which columns are there or what is in those columns.
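A small demonstration of that: whatever appears in the EXISTS select list, the result is the same. This is a sketch in Python's sqlite3 with a toy stand-in for clause2, not the real database.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE clause2 (source_id INTEGER, clause TEXT);
    INSERT INTO clause2 VALUES (1, 'a'), (3, 'b');
""")

# The EXISTS select list is never materialized; only "did any row
# qualify?" matters, so 1, *, a column, or even NULL all behave alike.
q = "SELECT EXISTS (SELECT {} FROM clause2 WHERE source_id = ?)"
results = []
for sel in ("1", "*", "source_id", "NULL"):
    hit = con.execute(q.format(sel), (3,)).fetchone()[0]   # 3 is present
    miss = con.execute(q.format(sel), (2,)).fetchone()[0]  # 2 is absent
    results.append((hit, miss))
print(results)  # [(1, 0), (1, 0), (1, 0), (1, 0)]
```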


Best regards, Vitalii Tymchyshyn



Re: [PERFORM] Re-Reason of Slowness of Query

2011-03-23 Thread tv
> I just want to retrieve that id 's from page_content which do not have
> any entry in clause2 table.

In that case the query probably does not work (at least the query you've
sent in the first post) as it will return even those IDs that have at
least one other row in 'clause2' (not matching the != condition). At least
that's how I understand it.

So instead of this

select distinct(p.crawled_page_id)
from page_content p, clause2 c where p.crawled_page_id != c.source_id ;

you should probably do this

select distinct(p.crawled_page_id)
from page_content p left join clause2 c on (p.crawled_page_id =
c.source_id) where (c.source_id is null);

I guess this will be much more efficient too.
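For what it's worth, the two anti-join formulations (LEFT JOIN ... IS NULL and NOT EXISTS) can be checked for equivalence on toy data. The sketch below uses Python's sqlite3 with stand-in tables, not the real PostgreSQL schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE page_content (crawled_page_id INTEGER);
    CREATE TABLE clause2 (source_id INTEGER);
    INSERT INTO page_content VALUES (1), (2), (3), (4);
    INSERT INTO clause2 VALUES (1), (3), (3);
""")

# Formulation 1: outer join, keep the rows that found no partner.
left_join = con.execute("""
    SELECT DISTINCT p.crawled_page_id
    FROM page_content p LEFT JOIN clause2 c
      ON p.crawled_page_id = c.source_id
    WHERE c.source_id IS NULL ORDER BY 1
""").fetchall()

# Formulation 2: explicit anti-join via NOT EXISTS.
not_exists = con.execute("""
    SELECT DISTINCT p.crawled_page_id
    FROM page_content p
    WHERE NOT EXISTS (SELECT 1 FROM clause2 c
                      WHERE c.source_id = p.crawled_page_id)
    ORDER BY 1
""").fetchall()

print(left_join, not_exists)  # both [(2,), (4,)]
```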

regards
Tomas


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Re-Reason of Slowness of Query

2011-03-23 Thread Chetan Suttraway
On Wed, Mar 23, 2011 at 3:49 PM, Adarsh Sharma wrote:

>  Vitalii Tymchyshyn wrote:
>
> On 23.03.11 12:10, Adarsh Sharma wrote:
>
> I just want to retrieve that id 's from page_content which do not have any
> entry in clause2 table.
>
>  Then
> select distinct(p.crawled_page_id) from page_content p
>  where NOT EXISTS (select 1 from  clause2 c where c.source_id =
> p.crawled_page_id);
> is correct query.
>
>
> I can't understand how* select 1 from  clause2 c where c.source_id =
> p.crawled_page_id works too, *i get my output .
>
> What is the significance of 1 here.
>
> Thanks , Adarsh
> **
>
> Best regards, Vitalii Tymchyshyn.
>
>
>
It's the inverted logic for finding crawled_page_id values not matching
source_id.
Actually, the idea was to force an index scan on clause2, though.


-- 
Regards,
Chetan Suttraway
EnterpriseDB, The Enterprise PostgreSQL company.


Re: [PERFORM] Re-Reason of Slowness of Query

2011-03-23 Thread Chetan Suttraway
On Wed, Mar 23, 2011 at 4:08 PM,  wrote:

> > I just want to retrieve that id 's from page_content which do not have
> > any entry in clause2 table.
>
> In that case the query probably does not work (at least the query you've
> sent in the first post) as it will return even those IDs that have at
> least one other row in 'clause2' (not matching the != condition). At least
> that's how I understand it.
>
true.


> So instead of this
>
> select distinct(p.crawled_page_id)
> from page_content p, clause2 c where p.crawled_page_id != c.source_id ;
>
> you should probably do this
>
> select distinct(p.crawled_page_id)
> from page_content p left join clause2 c on (p.crawled_page_id =
> c.source_id) where (c.source_id is null);
>
> I guess this will be much more efficient too.
>
>
This looks like it gives the expected results. Also note that the WHERE
clause "is null" is really required and is not an
optional predicate.
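To see why the IS NULL filter is required: without it, the left join keeps every page id, matched or not. A quick sketch in Python's sqlite3 with toy stand-in tables (assumed data, not the real schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE page_content (crawled_page_id INTEGER);
    CREATE TABLE clause2 (source_id INTEGER);
    INSERT INTO page_content VALUES (1), (2), (3);
    INSERT INTO clause2 VALUES (1);
""")

# A LEFT JOIN preserves all left-side rows; only the IS NULL filter
# narrows the result to the rows that found no match.
base = """
    SELECT DISTINCT p.crawled_page_id
    FROM page_content p LEFT JOIN clause2 c
      ON p.crawled_page_id = c.source_id
    {} ORDER BY 1
"""
without = con.execute(base.format("")).fetchall()
with_filter = con.execute(base.format("WHERE c.source_id IS NULL")).fetchall()
print(without)      # [(1,), (2,), (3,)] -- every id, matched or not
print(with_filter)  # [(2,), (3,)]       -- only the unmatched ids
```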



> regards
> Tomas
>
>
>



-- 
Regards,
Chetan Suttraway
EnterpriseDB, The Enterprise PostgreSQL company.


Re: [PERFORM] Re-Reason of Slowness of Query

2011-03-23 Thread Adarsh Sharma

Thank you all for your nice support.

Let me conclude the results; the results below were obtained after running
the needed queries:


*First Option :

*pdc_uima=# explain analyze select distinct(p.crawled_page_id)
pdc_uima-# from page_content p left join clause2 c on (p.crawled_page_id =
pdc_uima(# c.source_id) where (c.source_id is null);

QUERY 
PLAN 
-
HashAggregate  (cost=100278.16..104104.75 rows=382659 width=8) (actual 
time=87927.000..87930.084 rows=72 loops=1)
  ->  Nested Loop Anti Join  (cost=0.00..99320.46 rows=383079 width=8) 
(actual time=0.191..87926.546 rows=74 loops=1)
->  Seq Scan on page_content p  (cost=0.00..87132.17 
rows=428817 width=8) (actual time=0.027..528.978 rows=428467 loops=1)
->  Index Scan using idx_clause2_source_id on clause2 c  
(cost=0.00..18.18 rows=781 width=4) (actual time=0.202..0.202 rows=1 
loops=428467)

  Index Cond: (p.crawled_page_id = c.source_id)
Total runtime: 87933.882 ms :-(
(6 rows)

*Second Option :

*pdc_uima=# explain analyze select distinct(p.crawled_page_id) from 
page_content p
pdc_uima-#  where NOT EXISTS (select 1 from  clause2 c where c.source_id 
= p.crawled_page_id);

QUERY 
PLAN 
-
HashAggregate  (cost=100278.16..104104.75 rows=382659 width=8) (actual 
time=7047.259..7050.261 rows=72 loops=1)
  ->  Nested Loop Anti Join  (cost=0.00..99320.46 rows=383079 width=8) 
(actual time=0.039..7046.826 rows=74 loops=1)
->  Seq Scan on page_content p  (cost=0.00..87132.17 
rows=428817 width=8) (actual time=0.008..388.976 rows=428467 loops=1)
->  Index Scan using idx_clause2_source_id on clause2 c  
(cost=0.00..18.18 rows=781 width=4) (actual time=0.013..0.013 rows=1 
loops=428467)

  Index Cond: (c.source_id = p.crawled_page_id)
Total runtime: 7054.074 ms :-)
(6 rows)


Thanks & best Regards,
Adarsh Sharma




Chetan Suttraway wrote:



On Wed, Mar 23, 2011 at 4:08 PM, <t...@fuzzy.cz> wrote:

> I just want to retrieve that id 's from page_content which do
not have
> any entry in clause2 table.

In that case the query probably does not work (at least the query
you've
sent in the first post) as it will return even those IDs that have at
least one other row in 'clause2' (not matching the != condition).
At least
that's how I understand it.

true.
 


So instead of this

select distinct(p.crawled_page_id)
from page_content p, clause2 c where p.crawled_page_id !=
c.source_id ;

you should probably do this

select distinct(p.crawled_page_id)
from page_content p left join clause2 c on (p.crawled_page_id =
c.source_id) where (c.source_id is null);

I guess this will be much more efficient too.


This looks like to give expected results. Also note that the where 
clause "is null" is really required and is not an

optional predicate.

 


regards
Tomas






--
Regards,
Chetan Suttraway
EnterpriseDB, The Enterprise PostgreSQL company.








Re: [PERFORM] Re-Reason of Slowness of Query

2011-03-23 Thread Vitalii Tymchyshyn

On 23.03.11 13:21, Adarsh Sharma wrote:

Thank U all, for U'r Nice Support.

Let me Conclude the results, below results are obtained after finding 
the needed queries :


*First Option :

*pdc_uima=# explain analyze select distinct(p.crawled_page_id)
pdc_uima-# from page_content p left join clause2 c on (p.crawled_page_id =
pdc_uima(# c.source_id) where (c.source_id is null);
 
QUERY PLAN

-
 HashAggregate  (cost=100278.16..104104.75 rows=382659 width=8) 
(actual time=87927.000..87930.084 rows=72 loops=1)
   ->  Nested Loop Anti Join  (cost=0.00..99320.46 rows=383079 
width=8) (actual time=0.191..87926.546 rows=74 loops=1)
 ->  Seq Scan on page_content p  (cost=0.00..87132.17 
rows=428817 width=8) (actual time=0.027..528.978 rows=428467 loops=1)
 ->  Index Scan using idx_clause2_source_id on clause2 c  
(cost=0.00..18.18 rows=781 width=4) (actual time=0.202..0.202 rows=1 
loops=428467)

   Index Cond: (p.crawled_page_id = c.source_id)
 Total runtime: 87933.882 ms:-(
(6 rows)

*Second Option :

*pdc_uima=# explain analyze select distinct(p.crawled_page_id) from 
page_content p
pdc_uima-#  where NOT EXISTS (select 1 from  clause2 c where 
c.source_id = p.crawled_page_id);
 
QUERY PLAN

-
 HashAggregate  (cost=100278.16..104104.75 rows=382659 width=8) 
(actual time=7047.259..7050.261 rows=72 loops=1)
   ->  Nested Loop Anti Join  (cost=0.00..99320.46 rows=383079 
width=8) (actual time=0.039..7046.826 rows=74 loops=1)
 ->  Seq Scan on page_content p  (cost=0.00..87132.17 
rows=428817 width=8) (actual time=0.008..388.976 rows=428467 loops=1)
 ->  Index Scan using idx_clause2_source_id on clause2 c  
(cost=0.00..18.18 rows=781 width=4) (actual time=0.013..0.013 rows=1 
loops=428467)

   Index Cond: (c.source_id = p.crawled_page_id)
 Total runtime: 7054.074 ms :-)
(6 rows)



Actually the plans are equal, so I suppose it depends on which one was run
first :). The slow query operates on data mostly on disk, while the fast
one on data already in memory.


Best regards, Vitalii Tymchyshyn


Re: [PERFORM] Re-Reason of Slowness of Query

2011-03-23 Thread Chetan Suttraway
On Wed, Mar 23, 2011 at 4:51 PM, Vitalii Tymchyshyn wrote:

>  On 23.03.11 13:21, Adarsh Sharma wrote:
>
> Thank U all, for U'r Nice Support.
>
> Let me Conclude the results, below results are obtained after finding the
> needed queries :
>
> *First Option :
>
> *pdc_uima=# explain analyze select distinct(p.crawled_page_id)
> pdc_uima-# from page_content p left join clause2 c on (p.crawled_page_id =
> pdc_uima(# c.source_id) where (c.source_id is null);
>  QUERY
> PLAN
>
> -
>  HashAggregate  (cost=100278.16..104104.75 rows=382659 width=8) (actual
> time=87927.000..87930.084 rows=72 loops=1)
>->  Nested Loop Anti Join  (cost=0.00..99320.46 rows=383079 width=8)
> (actual time=0.191..87926.546 rows=74 loops=1)
>  ->  Seq Scan on page_content p  (cost=0.00..87132.17 rows=428817
> width=8) (actual time=0.027..528.978 rows=428467 loops=1)
>  ->  Index Scan using idx_clause2_source_id on clause2 c
> (cost=0.00..18.18 rows=781 width=4) (actual time=0.202..0.202 rows=1
> loops=428467)
>Index Cond: (p.crawled_page_id = c.source_id)
>  Total runtime: 87933.882 ms :-(
> (6 rows)
>
> *Second Option :
>
> *pdc_uima=# explain analyze select distinct(p.crawled_page_id) from
> page_content p
> pdc_uima-#  where NOT EXISTS (select 1 from  clause2 c where c.source_id =
> p.crawled_page_id);
>  QUERY
> PLAN
>
> -
>  HashAggregate  (cost=100278.16..104104.75 rows=382659 width=8) (actual
> time=7047.259..7050.261 rows=72 loops=1)
>->  Nested Loop Anti Join  (cost=0.00..99320.46 rows=383079 width=8)
> (actual time=0.039..7046.826 rows=74 loops=1)
>  ->  Seq Scan on page_content p  (cost=0.00..87132.17 rows=428817
> width=8) (actual time=0.008..388.976 rows=428467 loops=1)
>  ->  Index Scan using idx_clause2_source_id on clause2 c
> (cost=0.00..18.18 rows=781 width=4) (actual time=0.013..0.013 rows=1
> loops=428467)
>Index Cond: (c.source_id = p.crawled_page_id)
>  Total runtime: 7054.074 ms :-)
> (6 rows)
>
>
> Actually the plans are equal, so I suppose it depends on what were run
> first :). Slow query operates with data mostly on disk, while fast one with
> data in memory.
>
Yeah, maybe the easiest way is to start a fresh session and fire the
queries.


>  Best regards, Vitalii Tymchyshyn
>



-- 
Regards,
Chetan Suttraway
EnterpriseDB, The Enterprise PostgreSQL company.


Re: [PERFORM] Re-Reason of Slowness of Query

2011-03-23 Thread Adarsh Sharma

Vitalii Tymchyshyn wrote:

On 23.03.11 13:21, Adarsh Sharma wrote:

Thank U all, for U'r Nice Support.

Let me Conclude the results, below results are obtained after finding 
the needed queries :


*First Option :

*pdc_uima=# explain analyze select distinct(p.crawled_page_id) from 
page_content p left join clause2 c on (p.crawled_page_id = 
c.source_id) where (c.source_id is null);
 
QUERY 
PLAN  


-
 HashAggregate  (cost=100278.16..104104.75 rows=382659 width=8) 
(actual time=87927.000..87930.084 rows=72 loops=1)
   ->  Nested Loop Anti Join  (cost=0.00..99320.46 rows=383079 
width=8) (actual time=0.191..87926.546 rows=74 loops=1)
 ->  Seq Scan on page_content p  (cost=0.00..87132.17 
rows=428817 width=8) (actual time=0.027..528.978 rows=428467 loops=1)
 ->  Index Scan using idx_clause2_source_id on clause2 c  
(cost=0.00..18.18 rows=781 width=4) (actual time=0.202..0.202 rows=1 
loops=428467)

   Index Cond: (p.crawled_page_id = c.source_id)
 Total runtime: 87933.882 ms :-(
(6 rows)

*Second Option :

*pdc_uima=# explain analyze select distinct(p.crawled_page_id) from 
page_content p
pdc_uima-#  where NOT EXISTS (select 1 from  clause2 c where 
c.source_id = p.crawled_page_id);
 
QUERY 
PLAN  


-
 HashAggregate  (cost=100278.16..104104.75 rows=382659 width=8) 
(actual time=7047.259..7050.261 rows=72 loops=1)
   ->  Nested Loop Anti Join  (cost=0.00..99320.46 rows=383079 
width=8) (actual time=0.039..7046.826 rows=74 loops=1)
 ->  Seq Scan on page_content p  (cost=0.00..87132.17 
rows=428817 width=8) (actual time=0.008..388.976 rows=428467 loops=1)
 ->  Index Scan using idx_clause2_source_id on clause2 c  
(cost=0.00..18.18 rows=781 width=4) (actual time=0.013..0.013 rows=1 
loops=428467)

   Index Cond: (c.source_id = p.crawled_page_id)
 Total runtime: 7054.074 ms :-)
(6 rows)



Actually the plans are equal, so I suppose it depends on which one was run 
first :). The slow query operates with data mostly on disk, while the fast 
one works with data in memory.


Yes, you're absolutely right; if I run it again, it displays the output as:

pdc_uima=# explain analyze select distinct(p.crawled_page_id) from 
page_content p left join clause2 c on (p.crawled_page_id = c.source_id) 
where (c.source_id is null);


   
                                  QUERY PLAN
---------------------------------------------------------------------------
HashAggregate  (cost=100278.16..104104.75 rows=382659 width=8) (actual 
time=7618.452..7621.427 rows=72 loops=1)
  ->  Nested Loop Anti Join  (cost=0.00..99320.46 rows=383079 width=8) 
(actual time=0.131..7618.043 rows=74 loops=1)
->  Seq Scan on page_content p  (cost=0.00..87132.17 
rows=428817 width=8) (actual time=0.020..472.811 rows=428467 loops=1)
->  Index Scan using idx_clause2_source_id on clause2 c  
(cost=0.00..18.18 rows=781 width=4) (actual time=0.015..0.015 rows=1 
loops=428467)

  Index Cond: (p.crawled_page_id = c.source_id)
Total runtime: 7637.132 ms
(6 rows)

I will let you know after a fresh session start.
Then the true result will come, and further tuning, if required, can be done.

Best Regards, Adarsh


Best regards, Vitalii Tymchyshyn




Re: [PERFORM] Re-Reason of Slowness of Query

2011-03-23 Thread Adarsh Sharma



Actually the plans are equal, so I suppose it depends on which one was
run first :). The slow query operates with data mostly on disk, while
the fast one works with data in memory.

Yeah, maybe the easiest way is to start a fresh session and fire the 
queries.



After the fresh start, the results obtained are:

pdc_uima=# explain analyze select distinct(p.crawled_page_id)
pdc_uima-#  from page_content p left join clause2 c on (p.crawled_page_id =
pdc_uima(#  c.source_id) where (c.source_id is null);

                                  QUERY PLAN
---------------------------------------------------------------------------
HashAggregate  (cost=100278.16..104104.75 rows=382659 width=8) (actual 
time=7725.132..7728.341 rows=72 loops=1)
  ->  Nested Loop Anti Join  (cost=0.00..99320.46 rows=383079 width=8) 
(actual time=0.115..7724.713 rows=74 loops=1)
->  Seq Scan on page_content p  (cost=0.00..87132.17 
rows=428817 width=8) (actual time=0.021..472.199 rows=428467 loops=1)
->  Index Scan using idx_clause2_source_id on clause2 c  
(cost=0.00..18.18 rows=781 width=4) (actual time=0.015..0.015 rows=1 
loops=428467)

  Index Cond: (p.crawled_page_id = c.source_id)
Total runtime: 7731.840 ms
(6 rows)

pdc_uima=#  explain analyze select distinct(p.crawled_page_id) 
from page_content p
pdc_uima-#   where NOT EXISTS (select 1 from  clause2 c where 
c.source_id = p.crawled_page_id);

                                  QUERY PLAN
---------------------------------------------------------------------------
HashAggregate  (cost=100278.16..104104.75 rows=382659 width=8) (actual 
time=6192.249..6195.368 rows=72 loops=1)
  ->  Nested Loop Anti Join  (cost=0.00..99320.46 rows=383079 width=8) 
(actual time=0.036..6191.838 rows=74 loops=1)
->  Seq Scan on page_content p  (cost=0.00..87132.17 
rows=428817 width=8) (actual time=0.008..372.489 rows=428467 loops=1)
->  Index Scan using idx_clause2_source_id on clause2 c  
(cost=0.00..18.18 rows=781 width=4) (actual time=0.012..0.012 rows=1 
loops=428467)

  Index Cond: (c.source_id = p.crawled_page_id)
Total runtime: 6198.567 ms
(6 rows)


The second query seems to have a slight upper hand.

Would it be possible to tune it further?
My postgresql.conf parameters are as follows : ( Total RAM = 16 GB )

shared_buffers = 4GB
max_connections=700
effective_cache_size = 6GB
work_mem=16MB
maintenance_work_mem=64MB

I am thinking of changing to:

work_mem=64MB
maintenance_work_mem=256MB

Will that have any effect?


Thanks & best Regards,
Adarsh Sharma



Best regards, Vitalii Tymchyshyn




--
Regards,
Chetan Suttraway
EnterpriseDB, The Enterprise PostgreSQL company.








Re: [PERFORM] Re-Reason of Slowness of Query

2011-03-23 Thread tv
>
>> Actually the plans are equal, so I suppose it depends on which one was
>> run first :). The slow query operates with data mostly on disk, while
>> the fast one works with data in memory.
>>
>> yeah. maybe the easiest way, is to start a fresh session and fire the
>> queries.
>
>
> After the fresh start , the results obtained are :

As Chetan Suttraway already pointed out, the execution plans are exactly
the same. And by "exactly" I mean there's no difference in evaluating
those two queries.

The difference is due to cached data - not just in shared buffers (which
will be lost on a Postgres restart) but also in the filesystem cache
(which is managed by the kernel, not Postgres).

So the first execution had to load (some of) the data into shared buffers,
while the second execution already had a lot of data in shared buffers.
That's why the first query ran in 7.7 sec while the second took 6.2 sec.

>> The second query seems to have a slight upper hand.

Again, there's no difference between those two queries, they're exactly
the same. It's just a matter of which of them is executed first.

> Would it be possible to tune it further?

I don't think so. The only possibility I see is to add a flag column to the
page_content table and update it using a trigger (when something is
inserted into or deleted from clause2). Then you don't need to do the join
at all.
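A minimal sketch of that flag-and-trigger idea might look like the
following. Everything here is illustrative: the column name has_clause,
the function and trigger names, and the backfill step are my assumptions,
not from the thread, and the trigger adds write overhead on clause2.

```sql
-- Hypothetical flag column, kept in sync by a trigger on clause2.
ALTER TABLE page_content ADD COLUMN has_clause boolean DEFAULT false;

-- One-time backfill:
UPDATE page_content p
   SET has_clause = EXISTS (SELECT 1 FROM clause2 c
                             WHERE c.source_id = p.crawled_page_id);

CREATE OR REPLACE FUNCTION sync_has_clause() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE page_content SET has_clause = true
         WHERE crawled_page_id = NEW.source_id;
    ELSIF TG_OP = 'DELETE' THEN
        -- Clear the flag only if no other clause2 row still references it.
        UPDATE page_content p SET has_clause = false
         WHERE p.crawled_page_id = OLD.source_id
           AND NOT EXISTS (SELECT 1 FROM clause2 c
                            WHERE c.source_id = OLD.source_id);
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER clause2_sync_flag
AFTER INSERT OR DELETE ON clause2
FOR EACH ROW EXECUTE PROCEDURE sync_has_clause();

-- The anti-join then collapses to a plain scan of one table:
SELECT DISTINCT crawled_page_id FROM page_content WHERE NOT has_clause;
```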

> My postgresql.conf parameters are as follows : ( Total RAM = 16 GB )
>
> shared_buffers = 4GB
> max_connections=700
> effective_cache_size = 6GB
> work_mem=16MB
> maintenance_work_mem=64MB
>
> I am thinking of changing to:
>
> work_mem=64MB
> maintenance_work_mem=256MB
>
> Will that have any effect?

Generally a good idea, but we don't know if there are other processes
running on the same machine and what kind of system this is (how many
users are there, what kind of queries do they run). If there are a lot of
users, keep work_mem low. If there are just a few users, decrease
max_connections, bump up work_mem, and consider increasing
shared_buffers.

Maintenance_work_mem is used for vacuum/create index etc. so it really
does not affect regular queries.

Some of those values (e.g. work_mem/maintenance_work_mem) are dynamic, so
you can set them for the current connection and see how it affects the
queries.

Just do something like

db=# SET work_mem = '32MB';
db=# EXPLAIN ANALYZE SELECT ...

But I don't think this will improve the query we've been talking about.

regards
Tomas


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Reason of Slowness of query

2011-03-23 Thread Vitalii Tymchyshyn

23.03.11 09:30, Adarsh Sharma wrote:

Thanks Chetan, here is the output of your updated query:


explain select distinct(p.crawled_page_id) from page_content p where 
NOT EXISTS (select 1 from  clause2 c where c.source_id = 
p.crawled_page_id);

  QUERY PLAN
---
 HashAggregate  (cost=1516749.47..1520576.06 rows=382659 width=8)
   ->  Hash Anti Join  (cost=1294152.41..1515791.80 rows=383071 width=8)
 Hash Cond: (p.crawled_page_id = c.source_id)
 ->  Seq Scan on page_content p  (cost=0.00..87132.17 
rows=428817 width=8)

 ->  Hash  (cost=771182.96..771182.96 rows=31876196 width=4)
   ->  Seq Scan on clause2 c  (cost=0.00..771182.96 
rows=31876196 width=4)

(6 rows)

And my explain analyze output is :

  QUERY PLAN

 HashAggregate  (cost=1516749.47..1520576.06 rows=382659 width=8) 
(actual time=5.181..56669.270 rows=72 loops=1)
   ->  Hash Anti Join  (cost=1294152.41..1515791.80 rows=383071 
width=8) (actual time=45740.789..56665.816 rows=74 loops=1)

 Hash Cond: (p.crawled_page_id = c.source_id)
 ->  Seq Scan on page_content p  (cost=0.00..87132.17 
rows=428817 width=8) (actual time=0.012..715.915 rows=428467 loops=1)
 ->  Hash  (cost=771182.96..771182.96 rows=31876196 width=4) 
(actual time=45310.524..45310.524 rows=31853083 loops=1)
   ->  Seq Scan on clause2 c  (cost=0.00..771182.96 
rows=31876196 width=4) (actual time=0.055..23408.884 rows=31853083 
loops=1)

 Total runtime: 56687.660 ms
(7 rows)

But is there any option to tune it further? And one more thing: the 
reported output row count varies from 6 to 7.

You need an index on source_id to prevent the seq scan, like this:
CREATE INDEX idx_clause2_source_id
  ON clause2
  (source_id);

Best regards, Vitalii Tymchyshyn



[PERFORM] buffercache/bgwriter

2011-03-23 Thread Uwe Bartels
Hi,

I have very bad bgwriter statistics on a server which has been running for
many weeks, and it is still the same after a recent restart.
There are roughly 50% of buffers written by the backend processes and the
rest by checkpoints.
The statistics below are from a server with 140GB RAM, 32GB shared_buffers
and a runtime of one hour.

As you can see in the pg_buffercache view, most buffers have no
usagecount - so they are as free (or even virgin) as they can be.
At the same time, 53% of the dirty buffers are written by the
backend processes.

I want to tune the database to achieve a ratio of at most 10% backend
writes vs. 90% checkpoint or bgwriter writes.
But I don't understand why postgres is unable to fetch a free buffer.
Does anybody have an idea?

I'm running postgres 8.4.4 64 Bit on linux.

Best Regards,
Uwe

background writer stats
 checkpoints_timed | checkpoints_req | buffers_checkpoint | buffers_clean |
maxwritten_clean | buffers_backend | buffers_alloc
---+-++---+--+-+---
 3 |   0 |  99754 | 0
|0 |  115307 |246173
(1 row)


background writer relative stats
 checkpoints_timed | minutes_between_checkpoint | buffers_checkpoint |
buffers_clean | buffers_backend | total_writes | avg_checkpoint_write
---+++---+-+--+--
 100%  | 10 | 46%|
0%| 53% | 0.933 MB/s   | 259.000 MB
(1 row)

postgres=# select usagecount,count(*),isdirty from pg_buffercache group by
isdirty,usagecount order by isdirty,usagecount;
 usagecount |  count  | isdirty
+-+-
  1 |   31035 | f
  2 |   13109 | f
  3 |  184290 | f
  4 |6581 | f
  5 |  912068 | f
  1 |   6 | t
  2 |  35 | t
  3 |  48 | t
  4 |  53 | t
  5 |   43066 | t
| 3004013 |
(11 rows)
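The relative stats above can be recomputed directly from pg_stat_bgwriter
with something along these lines (a sketch; it assumes the 8.4-era columns
shown in the raw stats):

```sql
-- Share of buffer writes done by backends vs. checkpoints vs. bgwriter.
SELECT buffers_checkpoint,
       buffers_clean,
       buffers_backend,
       round(100.0 * buffers_backend /
             NULLIF(buffers_checkpoint + buffers_clean + buffers_backend, 0),
             1) AS backend_write_pct
  FROM pg_stat_bgwriter;
```

Run before and after the workload window to get per-interval percentages
rather than since-startup totals.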


Re: [PERFORM] good old VACUUM FULL

2011-03-23 Thread Shaun Thomas

On 03/23/2011 01:16 AM, Scott Marlowe wrote:


Then either cluster failed (did you get an error message?) or the table
was not bloated.  Given that it looks like it was greatly reduced in
size by the vacuum full, I'd guess cluster failed for some reason.


Or it just bloated again. Remember, he still hasn't changed his 
max_fsm_pages setting, and that table apparently experiences *very* high 
turnover.


A 25x bloat factor isn't unheard of for such a table. We have one that 
needs to have autovacuum or be manually vacuumed frequently because it 
experiences several thousand update/deletes per minute. The daily 
turnover of that particular table is around 110x. If our fsm settings 
were too low, or we didn't vacuum regularly, I could easily see that 
table quickly becoming unmanageable. I fear for his django_session table 
for similar reasons.


Felix, I know you don't want to "experiment" with kernel parameters, but 
you *need* to increase your max_fsm_pages setting.


--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
stho...@peak6.com

__

See  http://www.peak6.com/email_disclaimer.php
for terms and conditions related to this email



Re: [PERFORM] buffercache/bgwriter

2011-03-23 Thread Jochen Erwied
Wednesday, March 23, 2011, 1:51:31 PM you wrote:

[rearranged for quoting]

> background writer stats
>  checkpoints_timed | checkpoints_req | buffers_checkpoint | buffers_clean |
> maxwritten_clean | buffers_backend | buffers_alloc
> ---+-++---+--+-+---
>  3 |   0 |  99754 | 0
> |0 |  115307 |246173
> (1 row)

buffers_clean = 0 ?!

> But I don't understand how postgres is unable to fetch a free buffer.
> Does any body have an idea?

It somehow looks like the bgwriter is completely disabled. What are the 
relevant settings in your postgresql.conf?


-- 
Jochen Erwied |   home: joc...@erwied.eu +49-208-38800-18, FAX: -19
Sauerbruchstr. 17 |   work: j...@mbs-software.de  +49-2151-7294-24, FAX: -50
D-45470 Muelheim  | mobile: jochen.erw...@vodafone.de   +49-173-5404164




Re: [PERFORM] Re-Reason of Slowness of Query

2011-03-23 Thread Shaun Thomas

On 03/23/2011 04:17 AM, Adarsh Sharma wrote:


explain analyze select distinct(p.crawled_page_id) from page_content
p where NOT EXISTS (select 1 from clause2 c where c.source_id =
p.crawled_page_id);


You know... I'm surprised nobody has mentioned this, but DISTINCT is 
very slow unless you have a fairly recent version of Postgres that 
replaces it with something faster. Try this:


EXPLAIN ANALYZE
SELECT p.crawled_page_id
  FROM page_content p
 WHERE NOT EXISTS (
 SELECT 1
   FROM clause2 c
  WHERE c.source_id = p.crawled_page_id
   )
 GROUP BY p.crawled_page_id;

Or if you like the cleaner query without a sub-select:

EXPLAIN ANALYZE
SELECT p.crawled_page_id
  FROM page_content p
  LEFT JOIN clause2 c ON (c.source_id = p.crawled_page_id)
 WHERE c.source_id IS NULL
 GROUP BY p.crawled_page_id;

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
stho...@peak6.com




Re: [PERFORM] buffercache/bgwriter

2011-03-23 Thread Uwe Bartels
Hi Jochen,

yes, I had that impression too.
But it is running ...and has almost no effect. I changed all parameters to
the most aggressive values, but even before I restarted the server the
percentage of writes by the bgwriter was less than 1 percent.

postgres=# select name,setting from pg_settings where name like 'bgw%';
  name   | setting
-+-
 bgwriter_delay  | 10
 bgwriter_lru_maxpages   | 1000
 bgwriter_lru_multiplier | 10

Best...
Uwe

Uwe Bartels
Systemarchitect - Freelancer
mailto: uwe.bart...@gmail.com
tel: +49 172 3899006
profile: https://www.xing.com/profile/Uwe_Bartels
website: http://www.uwebartels.com



On 23 March 2011 14:19, Jochen Erwied wrote:

> Wednesday, March 23, 2011, 1:51:31 PM you wrote:
>
> [rearranged for quoting]
>
> > background writer stats
> >  checkpoints_timed | checkpoints_req | buffers_checkpoint | buffers_clean
> |
> > maxwritten_clean | buffers_backend | buffers_alloc
> >
> ---+-++---+--+-+---
> >  3 |   0 |  99754 | 0
> > |0 |  115307 |246173
> > (1 row)
>
> buffers_clean = 0 ?!
>
> > But I don't understand how postgres is unable to fetch a free buffer.
> > Does any body have an idea?
>
> Somehow looks like the bgwriter is completely disabled. How are the
> relevant settings in your postgresql.conf?
>
>
> --
> Jochen Erwied |   home: joc...@erwied.eu +49-208-38800-18, FAX:
> -19
> Sauerbruchstr. 17 |   work: j...@mbs-software.de  +49-2151-7294-24, FAX:
> -50
> D-45470 Muelheim  | mobile: jochen.erw...@vodafone.de
> +49-173-5404164
>
>


Re: [PERFORM] Re-Reason of Slowness of Query

2011-03-23 Thread tv
> On 03/23/2011 04:17 AM, Adarsh Sharma wrote:
>
>> explain analyze select distinct(p.crawled_page_id) from page_content
>> p where NOT EXISTS (select 1 from clause2 c where c.source_id =
>> p.crawled_page_id);
>
> You know... I'm surprised nobody has mentioned this, but DISTINCT is
> very slow unless you have a fairly recent version of Postgres that
> replaces it with something faster. Try this:

Nobody mentioned that because the explain plan already uses a hash
aggregate (instead of the old sort):

 HashAggregate  (cost=100278.16..104104.75 rows=382659 width=8) (actual
time=7047.259..7050.261 rows=72 loops=1)

which means this is at least 8.4. Plus the 'distinct' step uses less than
1% of total time, so even if you improve it the impact will be minimal.

regards
Tomas




Re: [PERFORM] Re-Reason of Slowness of Query

2011-03-23 Thread Shaun Thomas

On 03/23/2011 09:16 AM, t...@fuzzy.cz wrote:


which means this is at least 8.4. Plus the 'distinct' step uses less than
1% of total time, so even if you improve it the impact will be minimal.


Haha. Noted. I guess I'm still on my original crusade against DISTINCT. 
I was pulling it out of so much old code it's been fused to my DNA. 
Actually, we're still on 8.2 so... :)


--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
stho...@peak6.com




Re: [PERFORM] buffercache/bgwriter

2011-03-23 Thread tv
> Hi,
>
> I have very bad bgwriter statistics on a server which runs since many
> weeks
> and it is still the same after a recent restart.
> There are roughly 50% of buffers written by the backend processes and the
> rest by checkpoints.
> The statistics below are from a server with 140GB RAM, 32GB shared_buffers
> and a runtime of one hour.
>
> As you can see in the pg_buffercache view that there are most buffers
> without usagecount - so they are as free or even virgen as they can be.
> At the same time I have 53% percent of the dirty buffers written by the
> backend process.

There are some nice old threads dealing with this - see for example

http://postgresql.1045698.n5.nabble.com/Bgwriter-and-pg-stat-bgwriter-buffers-clean-aspects-td2071472.html

http://postgresql.1045698.n5.nabble.com/tuning-bgwriter-in-8-4-2-td1926854.html

and there even some nice external links to more detailed explanation

http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm

regards
Tomas




Re: [PERFORM] buffercache/bgwriter

2011-03-23 Thread Uwe Bartels
Hi Thomas,

thanks, but there was no new information in there for me.
I already know this article
http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm and
others on his website.

Best...
Uwe


On 23 March 2011 15:41,  wrote:

> > Hi,
> >
> > I have very bad bgwriter statistics on a server which runs since many
> > weeks
> > and it is still the same after a recent restart.
> > There are roughly 50% of buffers written by the backend processes and the
> > rest by checkpoints.
> > The statistics below are from a server with 140GB RAM, 32GB
> shared_buffers
> > and a runtime of one hour.
> >
> > As you can see in the pg_buffercache view that there are most buffers
> > without usagecount - so they are as free or even virgen as they can be.
> > At the same time I have 53% percent of the dirty buffers written by the
> > backend process.
>
> There are some nice old threads dealing with this - see for example
>
>
> http://postgresql.1045698.n5.nabble.com/Bgwriter-and-pg-stat-bgwriter-buffers-clean-aspects-td2071472.html
>
>
> http://postgresql.1045698.n5.nabble.com/tuning-bgwriter-in-8-4-2-td1926854.html
>
> and there even some nice external links to more detailed explanation
>
> http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm
>
> regards
> Tomas
>
>


Re: [PERFORM] buffercache/bgwriter

2011-03-23 Thread Nicholson, Brad (Toronto, ON, CA)


> -Original Message-
> From: pgsql-performance-ow...@postgresql.org [mailto:pgsql-performance-
> ow...@postgresql.org] On Behalf Of t...@fuzzy.cz
> Sent: Wednesday, March 23, 2011 10:42 AM
> To: Uwe Bartels
> Cc: pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] buffercache/bgwriter
> 
> > Hi,
> >
> > I have very bad bgwriter statistics on a server which runs since many
> > weeks
> > and it is still the same after a recent restart.
> > There are roughly 50% of buffers written by the backend processes and
> the
> > rest by checkpoints.
> > The statistics below are from a server with 140GB RAM, 32GB
> shared_buffers
> > and a runtime of one hour.
> >
> > As you can see in the pg_buffercache view that there are most buffers
> > without usagecount - so they are as free or even virgen as they can
> be.
> > At the same time I have 53% percent of the dirty buffers written by
> the
> > backend process.
> 
> There are some nice old threads dealing with this - see for example
> 
> http://postgresql.1045698.n5.nabble.com/Bgwriter-and-pg-stat-bgwriter-
> buffers-clean-aspects-td2071472.html
> 
> http://postgresql.1045698.n5.nabble.com/tuning-bgwriter-in-8-4-2-
> td1926854.html
> 
> and there even some nice external links to more detailed explanation
> 
> http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm

The interesting question here is - with 3 million unallocated buffers, why is 
the DB evicting buffers (buffers_backend column) instead of allocating the 
unallocated buffers?

Brad.



Re: [PERFORM] buffercache/bgwriter

2011-03-23 Thread Uwe Bartels
Hi Brad,

Yes, that's the question.
In the source code in freelist.c there is something that I don't understand.

This is the first try to get a free page. The second try scans used buffers.
What makes me wonder is why postgres is checking for
<<buf->refcount == 0 && buf->usage_count == 0>>
when usage_count is supposed to be 0 initially.

while (StrategyControl->firstFreeBuffer >= 0)
{
    buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
    Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);

    /* Unconditionally remove buffer from freelist */
    StrategyControl->firstFreeBuffer = buf->freeNext;
    buf->freeNext = FREENEXT_NOT_IN_LIST;

    /*
     * If the buffer is pinned or has a nonzero usage_count, we cannot use
     * it; discard it and retry.  (This can only happen if VACUUM put a
     * valid buffer in the freelist and then someone else used it before
     * we got to it.  It's probably impossible altogether as of 8.3, but
     * we'd better check anyway.)
     */
    LockBufHdr(buf);
    if (buf->refcount == 0 && buf->usage_count == 0)
    {
        if (strategy != NULL)
            AddBufferToRing(strategy, buf);
        return buf;
    }
    UnlockBufHdr(buf);
}


Best...
Uwe



On 23 March 2011 15:58, Nicholson, Brad (Toronto, ON, CA)  wrote:

>
>
> > -Original Message-
> > From: pgsql-performance-ow...@postgresql.org [mailto:pgsql-performance-
> > ow...@postgresql.org] On Behalf Of t...@fuzzy.cz
> > Sent: Wednesday, March 23, 2011 10:42 AM
> > To: Uwe Bartels
> > Cc: pgsql-performance@postgresql.org
> > Subject: Re: [PERFORM] buffercache/bgwriter
> >
> > > Hi,
> > >
> > > I have very bad bgwriter statistics on a server which runs since many
> > > weeks
> > > and it is still the same after a recent restart.
> > > There are roughly 50% of buffers written by the backend processes and
> > the
> > > rest by checkpoints.
> > > The statistics below are from a server with 140GB RAM, 32GB
> > shared_buffers
> > > and a runtime of one hour.
> > >
> > > As you can see in the pg_buffercache view that there are most buffers
> > > without usagecount - so they are as free or even virgen as they can
> > be.
> > > At the same time I have 53% percent of the dirty buffers written by
> > the
> > > backend process.
> >
> > There are some nice old threads dealing with this - see for example
> >
> > http://postgresql.1045698.n5.nabble.com/Bgwriter-and-pg-stat-bgwriter-
> > buffers-clean-aspects-td2071472.html
> >
> > http://postgresql.1045698.n5.nabble.com/tuning-bgwriter-in-8-4-2-
> > td1926854.html
> >
> > and there even some nice external links to more detailed explanation
> >
> > http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm
>
> The interesting question here is - with 3 million unallocated buffers, why
> is the DB evicting buffers (buffers_backend column) instead of allocating
> the unallocated buffers?
>
> Brad.
>


Re: [PERFORM] buffercache/bgwriter

2011-03-23 Thread Jeff Janes
On Wed, Mar 23, 2011 at 6:19 AM, Jochen Erwied
 wrote:
> Wednesday, March 23, 2011, 1:51:31 PM you wrote:
>
> [rearranged for quoting]
>
>> background writer stats
>>  checkpoints_timed | checkpoints_req | buffers_checkpoint | buffers_clean |
>> maxwritten_clean | buffers_backend | buffers_alloc
>> ---+-++---+--+-+---
>>                  3 |               0 |              99754 |             0
>> |                0 |          115307 |        246173
>> (1 row)
>
> buffers_clean = 0 ?!
>
>> But I don't understand how postgres is unable to fetch a free buffer.
>> Does any body have an idea?
>
> Somehow looks like the bgwriter is completely disabled. How are the
> relevant settings in your postgresql.conf?

I suspect the work load is entirely bulk inserts, and is using a
Buffer Access Strategy.  By design, bulk inserts generally write out
their own buffers.
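If that theory is right, it should be visible in pg_stat_bgwriter around a
bulk load. A rough way to check (the table and file names here are
hypothetical):

```sql
-- Snapshot the counters, run a representative bulk load, then compare.
SELECT buffers_checkpoint, buffers_clean, buffers_backend
  FROM pg_stat_bgwriter;                        -- note the values

COPY staging_table FROM '/tmp/bulk_data.csv' WITH CSV;  -- any bulk insert

SELECT buffers_checkpoint, buffers_clean, buffers_backend
  FROM pg_stat_bgwriter;
-- If a Buffer Access Strategy ring is in use, buffers_backend should
-- account for most of the increase while buffers_clean stays flat.
```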

Cheers,

Jeff



Re: [PERFORM] buffercache/bgwriter

2011-03-23 Thread Uwe Bartels
On 23 March 2011 16:36, Jeff Janes  wrote:

> On Wed, Mar 23, 2011 at 6:19 AM, Jochen Erwied
>  wrote:
> > Wednesday, March 23, 2011, 1:51:31 PM you wrote:
> >
> > [rearranged for quoting]
> >
> >> background writer stats
> >>  checkpoints_timed | checkpoints_req | buffers_checkpoint |
> buffers_clean |
> >> maxwritten_clean | buffers_backend | buffers_alloc
> >>
> ---+-++---+--+-+---
> >>  3 |   0 |  99754 |
> 0
> >> |0 |  115307 |246173
> >> (1 row)
> >
> > buffers_clean = 0 ?!
> >
> >> But I don't understand how postgres is unable to fetch a free buffer.
> >> Does any body have an idea?
> >
> > Somehow looks like the bgwriter is completely disabled. How are the
> > relevant settings in your postgresql.conf?
>
> I suspect the work load is entirely bulk inserts, and is using a
> Buffer Access Strategy.  By design, bulk inserts generally write out
> their own buffers.
>
> Cheers,
>
> Jeff
>

Yes, that's true. We are converting databases from one schema into another
with a lot of computing in between.
But most of the written data is accessed again soon for other conversions.
OK, that sounds very simple and thus trustworthy ;).

So everything is fine and there is no need/potential for optimization?

Best...
Uwe


Re: [PERFORM] buffercache/bgwriter

2011-03-23 Thread Jeff Janes
On Wed, Mar 23, 2011 at 8:26 AM, Uwe Bartels  wrote:
> Hi Brad,
>
> yes. that's the question
> in the source code in freelist.c there is something that I don't understand.
>
> This is the first try to get a free page. The second try scans used buffers.
> What makes me wonder is why postgres is checking for
> <<buf->refcount == 0 && buf->usage_count == 0>>
> where usage_count is supposed to be 0 initially.

The code comment preceding that check seems to explain that it is
probably not needed but simply done from an abundance of caution.

>         /*
>          * If the buffer is pinned or has a nonzero usage_count, we cannot
> use
>          * it; discard it and retry.  (This can only happen if VACUUM put a
>          * valid buffer in the freelist and then someone else used it before
>          * we got to it.  It's probably impossible altogether as of 8.3, but
>          * we'd better check anyway.)

Seems like maybe an Assert would be called for.

Cheers,

Jeff



[PERFORM] Shouldn't we have a way to avoid "risky" plans?

2011-03-23 Thread Josh Berkus
Folks,

Yet more evidence that we need some way to assess query plans which are
high-risk and avoid them (or have Yet Another GUC):

 Merge Join  (cost=29.16..1648.00 rows=382 width=78) (actual
time=57215.167..57215.216 rows=1 loops=1)
   Merge Cond: (rn.node_id = device_nodes.node_id)
   ->  Nested Loop  (cost=0.00..11301882.40 rows=6998 width=62) (actual
time=57209.291..57215.030 rows=112 loops=1)
 Join Filter: (node_ep.node_id = rn.node_id)
 ->  Nested Loop  (cost=0.00..11003966.85 rows=90276 width=46)
(actual time=0.027..52792.422 rows=90195 loops=1)
   ->  Index Scan using ix_ne_ns on node_ep
(cost=0.00..1545943.45 rows=32606992 width=26) (actual
time=0.010..7787.043 rows=32606903 loops=1)
   ->  Index Scan using ix_nefp_eid on ep_fp
(cost=0.00..0.28 rows=1 width=20) (actual time=0.001..0.001 rows=0
loops=32606903)
 Index Cond: (ep_fp.ep_id = node_ep.ep_id)
 ->  Materialize  (cost=0.00..5.30 rows=220 width=16) (actual
time=0.000..0.019 rows=220 loops=90195)
   ->  Seq Scan on mytable rn  (cost=0.00..4.20 rows=220
width=16) (actual time=0.008..0.043 rows=220 loops=1)
   ->  Sort  (cost=28.18..28.21 rows=12 width=16) (actual
time=0.164..0.165 rows=10 loops=1)
 Sort Key: device_nodes.node_id
 Sort Method:  quicksort  Memory: 25kB
 ->  Index Scan using ix_dn_did on device_nodes
(cost=0.00..27.96 rows=12 width=16) (actual time=0.086..0.134 rows=10
loops=1)
   Index Cond: (dev_id = 18165)
 Total runtime: 57215.329 ms


AFAICT, what's happening in this query is that PostgreSQL's statistics
on the device_nodes and several other tables are slightly out of date
(as in 5% of the table).  Thus it thinks that nothing will match the
list of node_ids in "mytable", and that it can exit the merge join early
and ignore the whole huge cost of the join plan.  This particular form
of out-of-dateness will be fixed in 9.1 (it's due to values being higher
than the highest histogram bucket in pg_stat), but not all forms will be.

It really seems like we should be able to detect an obvious high-risk
situation like this one.  Or maybe we're just being too optimistic about
discarding subplans?
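One way to spot this kind of staleness before it bites is to compare each
table's churn with its last analyze time. This is only a sketch using the
standard statistics views; the table names are the ones from the plan above:

```sql
-- When were the planner's statistics last refreshed for these tables?
SELECT relname, n_live_tup, n_dead_tup, last_analyze, last_autoanalyze
  FROM pg_stat_user_tables
 WHERE relname IN ('device_nodes', 'node_ep', 'mytable')
 ORDER BY relname;

-- Then refresh the stale ones by hand:
ANALYZE device_nodes;
```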

BTW, the optimal plan for this query (post-analyze) is this one:

 Nested Loop  (cost=0.00..213068.26 rows=12 width=78) (actual
time=0.374..0.514 rows=1 loops=1)
   Join Filter: (device_nodes.node_id = rn.node_id)
   ->  Seq Scan on mytable rn  (cost=0.00..4.20 rows=220 width=16)
(actual time=0.013..0.050 rows=220 loops=1)
   ->  Materialize  (cost=0.00..213024.49 rows=12 width=62) (actual
time=0.001..0.002 rows=1 loops=220)
 ->  Nested Loop  (cost=0.00..213024.43 rows=12 width=62)
(actual time=0.077..0.278 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..211740.04 rows=4428
width=42) (actual time=0.070..0.269 rows=1 loops=1)
 ->  Index Scan using ix_dn_did on device_nodes
(cost=0.00..51.92 rows=13 width=16) (actual time=0.058..0.115 rows=10
loops=1)
   Index Cond: (dev_id = 18165)
 ->  Index Scan using ix_ne_ns on node_ep
(cost=0.00..16137.45 rows=11700 width=26) (actual time=0.014..0.014
rows=0 loops=10)
   Index Cond: (node_ep.node_id =
device_nodes.node_id)
   ->  Index Scan using ix_nefp_eid on ep_fp
(cost=0.00..0.28 rows=1 width=20) (actual time=0.006..0.007 rows=1 loops=1)
 Index Cond: (ep_fp.ep_id = node_ep.ep_id);


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com



Re: [PERFORM] Shouldn't we have a way to avoid "risky" plans?

2011-03-23 Thread Claudio Freire
On Wed, Mar 23, 2011 at 2:12 PM, Josh Berkus  wrote:
> Folks,
>
>...
> It really seems like we should be able to detect an obvious high-risk
> situation like this one.  Or maybe we're just being too optimistic about
> discarding subplans?

Why not let the GEQO learn from past mistakes?

If a post-mortem analysis of queries could somehow be done and accounted
for, then these kinds of mistakes would be a one-time occurrence.

Ideas:
 *  you estimate cost IFF there's no past experience.
 *  if rowcount estimates miss by much, a correction cache could be
populated with extra (volatile - ie in shared memory) statistics
 *  or, if rowcount estimates miss by much, autoanalyze could be scheduled
 *  consider plan bailout: execute a tempting plan, if it takes too
long or its effective cost raises well above the expected cost, bail
to a safer plan
 *  account for worst-case performance when evaluating plans
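
For instance, the kind of rowcount misestimate these ideas target is visible by comparing the estimated and actual row counts in EXPLAIN ANALYZE output (the table and predicate below are hypothetical, just to illustrate the check):

```sql
-- Hypothetical example: compare the planner's estimate against reality.
EXPLAIN ANALYZE
SELECT * FROM orders WHERE status = 'pending';
-- If a node shows e.g. "rows=382659" estimated but "rows=72" actual,
-- the estimate missed by several orders of magnitude -- exactly the
-- situation where a correction cache or a forced autoanalyze could help.
```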



Re: [PERFORM] buffercache/bgwriter

2011-03-23 Thread Cédric Villemain
2011/3/23 Uwe Bartels :
> On 23 March 2011 16:36, Jeff Janes  wrote:
>>
>> On Wed, Mar 23, 2011 at 6:19 AM, Jochen Erwied
>>  wrote:
>> > Wednesday, March 23, 2011, 1:51:31 PM you wrote:
>> >
>> > [rearranged for quoting]
>> >
>> >> background writer stats
>> >>  checkpoints_timed | checkpoints_req | buffers_checkpoint |
>> >> buffers_clean |
>> >> maxwritten_clean | buffers_backend | buffers_alloc
>> >>
>> >> ---+-++---+--+-+---
>> >>                  3 |               0 |              99754 |
>> >> 0
>> >> |                0 |          115307 |        246173
>> >> (1 row)
>> >
>> > buffers_clean = 0 ?!
>> >
>> >> But I don't understand how postgres is unable to fetch a free buffer.
>> >> Does any body have an idea?
>> >
>> > Somehow looks like the bgwriter is completely disabled. How are the
>> > relevant settings in your postgresql.conf?
>>
>> I suspect the work load is entirely bulk inserts, and is using a
>> Buffer Access Strategy.  By design, bulk inserts generally write out
>> their own buffers.
>>
>> Cheers,
>>
>> Jeff
>
> Yes. that's true. We are converting databases from one schema into another
> with a lot of computing in between.
> But most of the written data is accessed soon for other conversions.
> OK. That sounds very simple and thus trustable ;).

yes, it is.

>
> So everything is fine and there is no need/potential for optimization?
>

There is probably room for improvement. Without much more thought, I
would suggest:

 * review the buffer access strategy to increase the ring buffer size
when there are a lot of free buffers
 * have a bgwriter working just behind the seqscan (and probably a
bigger pool of buffers anyway)
 * do not use the special buffer strategy when the buffer cache has
more than X% of free pages
 * add more :)

I believe it should be possible to get a good improvement for special
cases that are easily identifiable, like yours.
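
As a rough way to check whether a workload is in this situation, the bgwriter statistics quoted above can be queried directly (pg_stat_bgwriter is a standard view; the interpretation in the comment is a heuristic, not an established rule):

```sql
-- If buffers_clean is 0 while buffers_backend is large, backends (e.g.
-- bulk inserts using a ring Buffer Access Strategy) are writing out
-- their own buffers and the bgwriter is effectively idle.
SELECT checkpoints_timed, checkpoints_req,
       buffers_checkpoint, buffers_clean,
       maxwritten_clean, buffers_backend, buffers_alloc
FROM pg_stat_bgwriter;
```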

-- 
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support



Re: [PERFORM] Shouldn't we have a way to avoid "risky" plans?

2011-03-23 Thread Josh Berkus
On 3/23/11 10:35 AM, Claudio Freire wrote:
>  *  consider plan bailout: execute a tempting plan, if it takes too
> long or its effective cost raises well above the expected cost, bail
> to a safer plan

That would actually solve this particular case.  It would still require
us to have some definition of "safer" though.

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com



[PERFORM] Slow query on CLUTER -ed tables

2011-03-23 Thread Laszlo Nagy

Given two tables:

CREATE TABLE product_price_history
(
  hid bigint NOT NULL,
  hdate timestamp without time zone NOT NULL,
  id bigint NOT NULL,
  product_id bigint NOT NULL,
  -- more columns here
  CONSTRAINT pk_product_price_history PRIMARY KEY (hid)
);

CREATE INDEX idx_product_price_history_id_hdate
  ON product_price_history
  USING btree
  (id, hdate);


CREATE TABLE product_price_offer_history
(
  hid bigint NOT NULL,
  product_price_id bigint NOT NULL,
  isfeatured smallint NOT NULL,
  price double precision NOT NULL,
  shipping double precision NOT NULL,
  -- some more columns here
  CONSTRAINT pk_product_price_offer_history PRIMARY KEY (hid, offerno)
);

Stats:

product_price_history - table size = 23 GB, index size = 4 GB, row count = 87 million
product_price_offer_history - table size = 24 GB, index size = 7 GB, row count = 235 million



These tables store historical data of some million products from the 
last year.

The following commands are executed on them daily:

CLUSTER idx_product_price_history_id_hdate on product_price_history;
CLUSTER pk_product_price_offer_history on product_price_offer_history;

Here is a query:

select
  date_part('epoch', min(pph.hdate) )  as hdate_ticks,
  min(ppoh.price+ppoh.shipping) as price_plus_shipping
from
  product_price_history pph
  inner join product_price_offer_history ppoh on ppoh.hid = pph.hid
where pph.id = 37632081
 and ppoh.isfeatured=1
group by ppoh.merchantid,pph.hid,pph.hdate
order by pph.hid asc


I think that the query plan is correct:


"GroupAggregate  (cost=5553554.25..5644888.17 rows=2283348 width=50)"
"  ->  Sort  (cost=5553554.25..5559262.62 rows=2283348 width=50)"
"Sort Key: pph.hid, ppoh.merchantid, pph.hdate"
"->  Nested Loop  (cost=0.00..5312401.66 rows=2283348 width=50)"
"  ->  Index Scan using idx_product_price_history_id_hdate 
on product_price_history pph  (cost=0.00..8279.80 rows=4588 width=16)"

"Index Cond: (id = 37632081)"
"  ->  Index Scan using pk_product_price_offer_history on 
product_price_offer_history ppoh  (cost=0.00..1149.86 rows=498 width=42)"

"Index Cond: (ppoh.hid = pph.hid)"
"Filter: (ppoh.isfeatured = 1)"

So it uses two index scans on the indexes we CLUSTER the tables on. The 
number of rows returned is usually between 100 and 20,000.



Here is the problem. When I first run this query for a given 
identifier, it takes 100 seconds. When I run it again for the same 
identifier, it returns the same rows within one second!


The indexes are very well conditioned: from the 235 million rows, any id 
given occurs at most 20 000 times. It is a btree index, so it should 
already be stored sorted, and the 20 000 rows to be returned should fit 
into a few database pages. Even if they are not in the cache, PostgreSQL 
should be able to read the required pages within a second.


I understand that for an index scan, PostgreSQL also needs to read the 
rows from the table. But since these tables are CLUSTER-ed on those 
specific indexes, all the data needed should fit on a few database pages, 
and PostgreSQL should be able to read them within a second.
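
One way to check whether the tables really are still in index order on disk is the correlation statistic (a sketch; pg_stats correlation is per-column, so this only approximates physical clustering):

```sql
-- Correlation close to 1.0 (or -1.0) means the column's values are laid
-- out on disk in nearly sorted order; values near 0 mean the matching
-- rows are scattered across many pages despite the daily CLUSTER.
SELECT tablename, attname, correlation
FROM pg_stats
WHERE tablename IN ('product_price_history',
                    'product_price_offer_history')
  AND attname IN ('id', 'hid');
```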


Then why does the query take 100 seconds the first time, and just one 
second the second time? My thinking is probably wrong, but I suppose it 
means that the data is spread over thousands of pages on the disk.


How is that possible? What am I doing wrong?

Thanks,

   Laszlo






Re: [PERFORM] Shouldn't we have a way to avoid "risky" plans?

2011-03-23 Thread Claudio Freire
On Wed, Mar 23, 2011 at 5:29 PM, Josh Berkus  wrote:
> On 3/23/11 10:35 AM, Claudio Freire wrote:
>>  *  consider plan bailout: execute a tempting plan, if it takes too
>> long or its effective cost raises well above the expected cost, bail
>> to a safer plan
>
> That would actually solve this particular case.  It would still require
> us to have some definition of "safer" though.

In my head, safer = better worst-case performance.



Re: [PERFORM] Shouldn't we have a way to avoid "risky" plans?

2011-03-23 Thread Justin Pitts
On Wed, Mar 23, 2011 at 1:12 PM, Josh Berkus  wrote:
> AFAICT, what's happening in this query is that PostgreSQL's statistics
> on the device_nodes and several other tables are slightly out of date
> (as in 5% of the table).

What about some manner of query feedback mechanism ( along the lines
of what explain analyze yields ) to track "stats staleness" in
general?

Perhaps I misunderstand the costs of executing EXPLAIN ANALYZE.
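
A crude version of such staleness tracking is already possible from the stats collector (a sketch; the 5% threshold is just an illustration, not a recommended value):

```sql
-- Tables whose dead-row churn exceeds ~5% of their live size, i.e.
-- whose statistics are likely to be stale, with the timestamps of the
-- last manual and automatic analyze for context.
SELECT relname, n_live_tup, n_dead_tup, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE n_dead_tup > 0.05 * GREATEST(n_live_tup, 1);
```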



Re: [PERFORM] Shouldn't we have a way to avoid "risky" plans?

2011-03-23 Thread Tom Lane
Claudio Freire  writes:
> On Wed, Mar 23, 2011 at 5:29 PM, Josh Berkus  wrote:
>> On 3/23/11 10:35 AM, Claudio Freire wrote:
>>>  *  consider plan bailout: execute a tempting plan, if it takes too
>>> long or its effective cost raises well above the expected cost, bail
>>> to a safer plan

>> That would actually solve this particular case.  It would still require
>> us to have some definition of "safer" though.

> In my head, safer = better worst-case performance.

If the planner starts operating on the basis of worst case rather than
expected-case performance, the complaints will be far more numerous than
they are today.

regards, tom lane



Re: [PERFORM] Shouldn't we have a way to avoid "risky" plans?

2011-03-23 Thread Claudio Freire
On Wed, Mar 23, 2011 at 6:00 PM, Tom Lane  wrote:
> Claudio Freire  writes:
>> In my head, safer = better worst-case performance.
>
> If the planner starts operating on the basis of worst case rather than
> expected-case performance, the complaints will be far more numerous than
> they are today.

I imagine so; that's why, if you put my comment in context, I was talking
about picking a safer plan only when the better-on-average one fails
miserably.



Re: [PERFORM] Slow query on CLUTER -ed tables

2011-03-23 Thread Marti Raudsepp
2011/3/23 Laszlo Nagy :
> "GroupAggregate  (cost=5553554.25..5644888.17 rows=2283348 width=50)"
> "  ->  Sort  (cost=5553554.25..5559262.62 rows=2283348 width=50)"
> "        Sort Key: pph.hid, ppoh.merchantid, pph.hdate"
> "        ->  Nested Loop  (cost=0.00..5312401.66 rows=2283348 width=50)"
> "              ->  Index Scan using idx_product_price_history_id_hdate on
> product_price_history pph  (cost=0.00..8279.80 rows=4588 width=16)"
> "                    Index Cond: (id = 37632081)"
> "              ->  Index Scan using pk_product_price_offer_history on
> product_price_offer_history ppoh  (cost=0.00..1149.86 rows=498 width=42)"
> "                    Index Cond: (ppoh.hid = pph.hid)"
> "                    Filter: (ppoh.isfeatured = 1)"

I suspect that, since the matched hid's probably aren't sequential,
many of those ~500 product_price_offer_history rows will be far apart
on disk.

Please show the EXPLAIN ANALYZE output in the slow case, not just
EXPLAIN. Also, PostgreSQL version? What configuration options have you
changed? (http://wiki.postgresql.org/wiki/SlowQueryQuestions)
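
For reference, on 9.0 and later the buffer statistics can be captured in the same run, which would show directly how many pages the first, cold execution has to read from disk:

```sql
-- EXPLAIN (ANALYZE, BUFFERS) reports shared buffer hits vs. reads per
-- node; a large "read" count on the ppoh index scan would confirm that
-- the matched rows are spread over many pages.
EXPLAIN (ANALYZE, BUFFERS)
SELECT date_part('epoch', min(pph.hdate)) AS hdate_ticks,
       min(ppoh.price + ppoh.shipping)    AS price_plus_shipping
FROM product_price_history pph
JOIN product_price_offer_history ppoh ON ppoh.hid = pph.hid
WHERE pph.id = 37632081
  AND ppoh.isfeatured = 1
GROUP BY ppoh.merchantid, pph.hid, pph.hdate
ORDER BY pph.hid;
```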

Regards,
Marti



Re: [PERFORM] Shouldn't we have a way to avoid "risky" plans?

2011-03-23 Thread Josh Berkus

> If the planner starts operating on the basis of worst case rather than
> expected-case performance, the complaints will be far more numerous than
> they are today.

Yeah, I don't think that's the way to go.  The other thought I had was
to accumulate a "risk" stat the same as we accumulate a "cost" stat.

However, I'm thinking that I'm overengineering what seems to be a fairly
isolated problem, in that we might simply need to adjust the costing on
this kind of a plan.

Also, can I say that the cost figures in this plan are extremely
confusing?  Is it really necessary to show them the way we do?

Merge Join  (cost=29.16..1648.00 rows=382 width=78) (actual
time=57215.167..57215.216 rows=1 loops=1)
   Merge Cond: (rn.node_id = device_nodes.node_id)
   ->  Nested Loop  (cost=0.00..11301882.40 rows=6998 width=62) (actual
time=57209.291..57215.030 rows=112 loops=1)
 Join Filter: (node_ep.node_id = rn.node_id)
 ->  Nested Loop  (cost=0.00..11003966.85 rows=90276 width=46)
(actual time=0.027..52792.422 rows=90195 loops=1)

The first time I saw the above, I thought we had some kind of glibc math
bug on the host system.  Costs are supposed to accumulate upwards.

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com



[PERFORM] pg9.0.3 explain analyze running very slow compared to a different box with much less configuration

2011-03-23 Thread DM
Hi All,

EXPLAIN ANALYZE on pg 9.0.3 runs very slowly compared to an old box with a
much weaker configuration, even though the actual query performs much
better than on the old server.

old Server===
OS: CentOS release 5.4 (Final)
Linux Server 2.6.18-164.6.1.el5 #1 SMP Tue Nov 3 16:12:36 EST 2009 x86_64
x86_64 x86_64 GNU/Linux

RAM - 16GB
CPU - 8 Core
disk - 300GB
RAID10 on the disk

Postgresql 9.0.3

Postgres Config:
shared_buffers = 6GB
work_mem = 32MB
maintenance_work_mem = 512MB
effective_cache_size = 12GB

#explain analyze select * from photo;
 QUERY PLAN

 Seq Scan on photo  (cost=0.00..8326849.24 rows=395405824 width=168) (actual
time=5.632..157757.284 rows=395785382 loops=1)
 Total runtime: 187443.850 ms
(2 rows)

newServer===

CentOS release 5.4 (Final)
Linux Server 2.6.18-164.6.1.el5 #1 SMP Tue Nov 3 16:12:36 EST 2009 x86_64
x86_64 x86_64 GNU/Linux

RAM - 64GB
CPU - 12 Core
disk - 1TB
RAID10 on the disk

Postgresql 9.0.3
Postgres Config:
shared_buffers = 16GB
work_mem = 32MB
maintenance_work_mem = 1024MB
effective_cache_size = 12GB


# explain analyze select * from photo;
QUERY PLAN

 Seq Scan on photo  (cost=0.00..8326810.24 rows=395579424 width=165) (actual
time=0.051..316879.928 rows=395648020 loops=1)
 Total runtime: 605703.206 ms
(2 rows)


I read other articles about the same issue but could not find the exact
solution.


I ran gettimeofday() on both machines and got the below results:

Results:

[Old Server]# time /tmp/gtod

real  0m0.915s

user  0m0.914s

sys   0m0.001s

[New Server]# time /tmp/gtod

real  0m7.542s

user  0m7.540s

sys   0m0.001s


I am not sure how to fix this issue; any help would be greatly appreciated.


Thanks

Deepak


Re: [PERFORM] Re-Reason of Slowness of Query

2011-03-23 Thread Adarsh Sharma

t...@fuzzy.cz wrote:

On 03/23/2011 04:17 AM, Adarsh Sharma wrote:



explain analyze select distinct(p.crawled_page_id) from page_content
p where NOT EXISTS (select 1 from clause2 c where c.source_id =
p.crawled_page_id);
  

You know... I'm surprised nobody has mentioned this, but DISTINCT is
very slow unless you have a fairly recent version of Postgres that
replaces it with something faster. Try this:



Nobody mentioned that because the explain plan already uses hash aggregate
(instead of the old sort)

 HashAggregate  (cost=100278.16..104104.75 rows=382659 width=8) (actual
time=7047.259..7050.261 rows=72 loops=1)

which means this is at least 8.4. Plus the 'distinct' step uses less than
1% of total time, so even if you improve it the impact will be minimal.

  


Yes, you're absolutely right. I am using version 8.4 and I am satisfied 
with the query results below:


pdc_uima=# explain analyze select distinct(p.crawled_page_id) from 
page_content p
pdc_uima-#  where NOT EXISTS (select 1 from  clause2 c where c.source_id 
= p.crawled_page_id);

                                  QUERY PLAN
-
HashAggregate  (cost=100278.16..104104.75 rows=382659 width=8) (actual 
time=5149.308..5152.251 rows=72 loops=1)
  ->  Nested Loop Anti Join  (cost=0.00..99320.46 rows=383079 width=8) 
(actual time=0.119..5148.954 rows=74 loops=1)
->  Seq Scan on page_content p  (cost=0.00..87132.17 
rows=428817 width=8) (actual time=0.021..444.487 rows=428467 loops=1)
->  Index Scan using idx_clause2_source_id on clause2 c  
(cost=0.00..18.18 rows=781 width=4) (actual time=0.009..0.009 rows=1 
loops=428467)

  Index Cond: (c.source_id = p.crawled_page_id)
Total runtime: 5155.874 ms
(6 rows)

I don't think the above results can be optimized much further.


Thanks & best Regards,
Adarsh Sharma

regards
Tomas