Re: [PERFORM] Optimizing a VIEW

2008-08-18 Thread Matthew Wakeling

On Fri, 15 Aug 2008, Madison Kelly wrote:
 Below I will post the VIEW and a sample of the query's EXPLAIN ANALYZE. 
Thanks for any tips/help/clue-stick-beating you may be able to share!


This query looks incredibly expensive:


   SELECT

...

   FROM
   customer a,
   history.customer_data b,
   history.customer_data c,
   history.customer_data d,
   history.customer_data e,
   history.customer_data f,
   history.customer_data g,
   history.customer_data h,
   history.customer_data i,
   history.customer_data j,
   history.customer_data k,
   history.customer_data l
   WHERE
   a.cust_id=b.cd_cust_id AND
   a.cust_id=c.cd_cust_id AND
   a.cust_id=d.cd_cust_id AND
   a.cust_id=e.cd_cust_id AND
   a.cust_id=f.cd_cust_id AND
   a.cust_id=g.cd_cust_id AND
   a.cust_id=h.cd_cust_id AND
   a.cust_id=i.cd_cust_id AND
   a.cust_id=j.cd_cust_id AND
   a.cust_id=k.cd_cust_id AND
   a.cust_id=l.cd_cust_id AND

...

I would refactor this significantly, so that instead of returning a wide 
result, it would return more than one row per customer. Just do a single 
join between customer and history.customer_data - it will run much faster.
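In sketch form, that refactor might look something like this (cd_variable and cd_value are hypothetical column names standing in for whatever the real customer_data columns are; the actual list would come from the original VIEW definition):

   SELECT
   a.cust_id,
   b.cd_variable,
   b.cd_value
   FROM
   customer a,
   history.customer_data b
   WHERE
   a.cust_id=b.cd_cust_id;

The application then pivots the rows itself instead of making the database perform eleven self-joins.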


Matthew

--
Here we go - the Fairy Godmother redundancy proof.
   -- Computer Science Lecturer



[PERFORM] Cross Join Problem

2008-08-18 Thread Gauri Kanekar
Hi,


Following is the Query :
SELECT sum(id), sum(cd), sum(ad)
   FROM table1 a , table2 b cross join table3 c
   WHERE a.nkey = b.key
 AND a.dkey = c.key
 AND c.date = '2008-02-01'
 AND b.id = 999 ;


We ran this on our production system, which is postgres 8.1.3, and got
the following explain analyse output:

 Aggregate  (cost=11045.52..11045.53 rows=1 width=24) (actual time=79.290..79.291 rows=1 loops=1)
   ->  Nested Loop  (cost=49.98..11043.42 rows=279 width=24) (actual time=1.729..50.498 rows=10473 loops=1)
         ->  Nested Loop  (cost=0.00..6.05 rows=1 width=8) (actual time=0.028..0.043 rows=1 loops=1)
               ->  Index Scan using rnididx on table2 b  (cost=0.00..3.02 rows=1 width=4) (actual time=0.011..0.015 rows=1 loops=1)
                     Index Cond: (id = 999)
               ->  Index Scan using rddtidx on table3 c  (cost=0.00..3.02 rows=1 width=4) (actual time=0.010..0.016 rows=1 loops=1)
                     Index Cond: (date = '2008-02-01 00:00:00'::timestamp without time zone)
         ->  Bitmap Heap Scan on table1 a  (cost=49.98..10954.93 rows=5496 width=32) (actual time=1.694..19.006 rows=10473 loops=1)
               Recheck Cond: ((a.nkey = "outer"."key") AND (a.dkey = "outer"."key"))
               ->  Bitmap Index Scan on rndateidx  (cost=0.00..49.98 rows=5496 width=0) (actual time=1.664..1.664 rows=10473 loops=1)
                     Index Cond: ((a.nkey = "outer"."key") AND (a.dkey = "outer"."key"))
 Total runtime: 79.397 ms

Time: 80.752 ms



The same query on postgres 8.3.3 gives the following explain analyse:

QUERY PLAN
--------------------------------------------------------------------------------
 Aggregate  (cost=1171996.35..1171996.36 rows=1 width=24) (actual time=6360.783..6360.785 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..1171994.28 rows=275 width=24) (actual time=3429.309..6330.424 rows=10473 loops=1)
         Join Filter: (a.nkey = b.key)
         ->  Index Scan using rnididx on table2 b  (cost=0.00..4.27 rows=1 width=4) (actual time=0.030..0.033 rows=1 loops=1)
               Index Cond: (id = 999)
         ->  Nested Loop  (cost=0.00..1169411.17 rows=206308 width=28) (actual time=0.098..4818.450 rows=879480 loops=1)
               ->  Index Scan using rddtidx on table1 c  (cost=0.00..4.27 rows=1 width=4) (actual time=0.031..0.034 rows=1 loops=1)
                     Index Cond: (date = '2008-02-01 00:00:00'::timestamp without time zone)
               ->  Index Scan using rdnetidx on table1 a  (cost=0.00..1156050.51 rows=1068511 width=32) (actual time=0.047..1732.229 rows=879480 loops=1)
                     Index Cond: (a.dkey = c.key)
 Total runtime: 6360.978 ms


The query that used to take only 80.752 ms on postgres 8.1.3 is now
taking 6364.950 ms.

We have run VACUUM ANALYZE on all the tables.

Can anybody help out here? What may be wrong, and why does the query
take so long on postgres 8.3.3?

Is this an 8.3.3 problem, or a problem with cross joins on 8.3.3?

Thanx

-- 
Regards
Gauri


Re: [PERFORM] Cross Join Problem

2008-08-18 Thread Tom Lane
"Gauri Kanekar" <[EMAIL PROTECTED]> writes:
> Following is the Query :
> SELECT sum(id), sum(cd), sum(ad)
>FROM table1 a , table2 b cross join table3 c
>WHERE a.nkey = b.key
>  AND a.dkey = c.key
>  AND c.date = '2008-02-01'
>  AND b.id = 999 ;


> We have fired this on our production system which is postgres 8.1.3, and got
> the following explain analyse of it

>  Aggregate  (cost=11045.52..11045.53 rows=1 width=24) (actual time=79.290..79.291 rows=1 loops=1)
>    ->  Nested Loop  (cost=49.98..11043.42 rows=279 width=24) (actual time=1.729..50.498 rows=10473 loops=1)
>          ->  Nested Loop  (cost=0.00..6.05 rows=1 width=8) (actual time=0.028..0.043 rows=1 loops=1)
>                ->  Index Scan using rnididx on table2 b  (cost=0.00..3.02 rows=1 width=4) (actual time=0.011..0.015 rows=1 loops=1)
>                      Index Cond: (id = 999)
>                ->  Index Scan using rddtidx on table3 c  (cost=0.00..3.02 rows=1 width=4) (actual time=0.010..0.016 rows=1 loops=1)
>                      Index Cond: (date = '2008-02-01 00:00:00'::timestamp without time zone)
>          ->  Bitmap Heap Scan on table1 a  (cost=49.98..10954.93 rows=5496 width=32) (actual time=1.694..19.006 rows=10473 loops=1)
>                Recheck Cond: ((a.nkey = "outer"."key") AND (a.dkey = "outer"."key"))
>                ->  Bitmap Index Scan on rndateidx  (cost=0.00..49.98 rows=5496 width=0) (actual time=1.664..1.664 rows=10473 loops=1)
>                      Index Cond: ((a.nkey = "outer"."key") AND (a.dkey = "outer"."key"))
>  Total runtime: 79.397 ms

No PG release since 7.3 would have voluntarily planned that query that
way.  Maybe you were using join_collapse_limit = 1 to force the join
order?

regards, tom lane



[PERFORM] Slow query with a lot of data

2008-08-18 Thread Moritz Onken

Hi,

I run this query:

select max(a."user"), b.category, count(1)
from result a, domain_categories b
where a."domain" = b."domain"
group by b.category;


The table result contains all websites a user visited, and the table
domain_categories contains all the categories a domain belongs to.
result has 20 million rows and domain_categories about 12 million.
There are 500,000 distinct users.


I have indexes on result.domain, domain_categories.domain, result.user,
and domain_categories.category. result is clustered on user, and
domain_categories on domain.
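
For reference, the setup described above corresponds roughly to this DDL (result_user_idx and domain_categories_domain match index names that appear in the plans below; the other names are illustrative, not from the original post):

   CREATE INDEX result_user_idx ON result ("user");
   CREATE INDEX result_domain_idx ON result ("domain");
   CREATE INDEX domain_categories_domain ON domain_categories ("domain");
   CREATE INDEX domain_categories_category_idx ON domain_categories (category);
   CLUSTER result USING result_user_idx;                      -- 8.3 syntax
   CLUSTER domain_categories USING domain_categories_domain;
   ANALYZE result;
   ANALYZE domain_categories;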


explain analyze says (limited to one user with id 1337):

"HashAggregate  (cost=2441577.16..2441614.72 rows=2504 width=8)  
(actual time=94667.335..94671.508 rows=3361 loops=1)"
"  ->  Merge Join  (cost=2119158.02..2334105.00 rows=14329622 width=8)  
(actual time=63559.938..94621.557 rows=36308 loops=1)"

"Merge Cond: (a.domain = b.domain)"
"->  Sort  (cost=395.52..405.49 rows=3985 width=8) (actual  
time=0.189..0.211 rows=19 loops=1)"

"  Sort Key: a.domain"
"  Sort Method:  quicksort  Memory: 27kB"
"  ->  Index Scan using result_user_idx on result a   
(cost=0.00..157.21 rows=3985 width=8) (actual time=0.027..0.108  
rows=61 loops=1)"

"Index Cond: ("user" = 1337)"
"->  Materialize  (cost=2118752.28..2270064.64 rows=12104989  
width=8) (actual time=46460.599..82336.116 rows=12123161 loops=1)"
"  ->  Sort  (cost=2118752.28..2149014.75 rows=12104989  
width=8) (actual time=46460.592..59595.851 rows=12104989 loops=1)"

"Sort Key: b.domain"
"Sort Method:  external sort  Disk: 283992kB"
"->  Seq Scan on domain_categories b   
(cost=0.00..198151.89 rows=12104989 width=8) (actual  
time=14.352..22572.869 rows=12104989 loops=1)"

"Total runtime: 94817.058 ms"

This is running on a pretty small server with 1 GB of RAM and a slow
SATA disk. shared_buffers is 312MB, max_fsm_pages = 153600, and
everything else is left at the defaults. PostgreSQL v8.3.3 on Ubuntu 8.04.


It would be great if someone could help improve this query. This is  
for a research project at my university.


Thanks in advance,

Moritz




Re: [PERFORM] Slow query with a lot of data

2008-08-18 Thread Matthew Wakeling

On Mon, 18 Aug 2008, Moritz Onken wrote:
I have indexes on result.domain, domain_categories.domain, result.user, 
domain_categories.category. Clustered result on user and domain_categories on 
domain.


"->  Materialize  (cost=2118752.28..2270064.64 rows=12104989 width=8) 
(actual time=46460.599..82336.116 rows=12123161 loops=1)"
"  ->  Sort  (cost=2118752.28..2149014.75 rows=12104989 width=8) 
(actual time=46460.592..59595.851 rows=12104989 loops=1)"

"Sort Key: b.domain"
"Sort Method:  external sort  Disk: 283992kB"
"->  Seq Scan on domain_categories b 
(cost=0.00..198151.89 rows=12104989 width=8) (actual time=14.352..22572.869 
rows=12104989 loops=1)"


This is weird, given you say you have clustered domain_categories on 
domain. Have you analysed? You should be able to run:


EXPLAIN SELECT * from domain_categories ORDER BY domain

and have it say "Index scan" instead of "Seq Scan followed by disc sort".

Matthew

--
Patron: "I am looking for a globe of the earth."
Librarian: "We have a table-top model over here."
Patron: "No, that's not good enough. Don't you have a life-size?"
Librarian: (pause) "Yes, but it's in use right now."



Re: [PERFORM] Slow query with a lot of data

2008-08-18 Thread Moritz Onken


On 18.08.2008, at 16:30, Matthew Wakeling wrote:


This is weird, given you say you have clustered domain_categories on
domain. Have you analysed? You should be able to run:

EXPLAIN SELECT * from domain_categories ORDER BY domain

and have it say "Index scan" instead of "Seq Scan followed by disc sort".


Matthew



Thanks, the index was created but I forgot to run analyze again on  
that table.
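
For completeness, the missing step was simply:

   ANALYZE domain_categories;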


I had a small mistake in my previous SQL query. The corrected version
is this:

explain analyze select a."user", b.category, count(1)
from result a, domain_categories b
where a."domain" = b."domain" and a."user" = 1337
group by a."user", b.category;


(Note the additional GROUP BY column.)

explain analyze:


"HashAggregate  (cost=817397.78..817428.92 rows=2491 width=8) (actual  
time=42874.339..42878.419 rows=3361 loops=1)"
"  ->  Merge Join  (cost=748.47..674365.50 rows=19070970 width=8)  
(actual time=15702.449..42829.388 rows=36308 loops=1)"

"Merge Cond: (b.domain = a.domain)"
"->  Index Scan using domain_categories_domain on  
domain_categories b  (cost=0.00..391453.79 rows=12105014 width=8)  
(actual time=39.018..30166.349 rows=12104989 loops=1)"
"->  Sort  (cost=395.52..405.49 rows=3985 width=8) (actual  
time=0.188..32.345 rows=36309 loops=1)"

"  Sort Key: a.domain"
"  Sort Method:  quicksort  Memory: 27kB"
"  ->  Index Scan using result_user_idx on result a   
(cost=0.00..157.21 rows=3985 width=8) (actual time=0.021..0.101  
rows=61 loops=1)"

"Index Cond: ("user" = 1337)"
"Total runtime: 42881.382 ms"

This is still very slow...





Re: [PERFORM] Slow query with a lot of data

2008-08-18 Thread Matthew Wakeling

On Mon, 18 Aug 2008, Moritz Onken wrote:

"HashAggregate  (cost=817397.78..817428.92 rows=2491 width=8) (actual  
time=42874.339..42878.419 rows=3361 loops=1)"
"  ->  Merge Join  (cost=748.47..674365.50 rows=19070970 width=8) (actual > 
time=15702.449..42829.388 rows=36308 loops=1)"
"Merge Cond: (b.domain = a.domain)"
"->  Index Scan using domain_categories_domain on domain_categories b > 
(cost=0.00..391453.79 rows=12105014 width=8) (actual time=39.018..30166.349 > rows=12104989 
loops=1)"
"->  Sort  (cost=395.52..405.49 rows=3985 width=8) (actual > 
time=0.188..32.345 rows=36309 loops=1)"
"  Sort Key: a.domain"
"  Sort Method:  quicksort  Memory: 27kB"
"  ->  Index Scan using result_user_idx on result a > (cost=0.00..157.21 
rows=3985 width=8) (actual time=0.021..0.101 rows=61 > loops=1)"
"Index Cond: ("user" = 1337)"
"Total runtime: 42881.382 ms"

This is still very slow...


Well, you're getting the database to read the entire contents of the 
domain_categories table in order. That's 12 million rows - a fair amount 
of work.


You may find that removing the "user = 1337" constraint doesn't make the 
query much slower - that's where you get a big win by clustering on 
domain. You might also want to cluster the results table on domain.
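
In sketch form (result_domain here is the index name that shows up in the expected plan later in this thread; substitute the real index on result."domain"):

   CLUSTER result USING result_domain;
   ANALYZE result;   -- refresh statistics after physically reordering the table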


If you want the results for just one user, it would be very helpful to 
have a user column on the domain_categories table, and an index on that 
column. However, that will slow down the query for all users a little.
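
A sketch of that change, with hypothetical column and index names (and note it only works if each domain_categories row maps to a single user):

   ALTER TABLE domain_categories ADD COLUMN "user" integer;
   -- ... populate the new column from the result table ...
   CREATE INDEX domain_categories_user_idx ON domain_categories ("user");
   ANALYZE domain_categories;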


Matthew

--
For every complex problem, there is a solution that is simple, neat, and wrong.
 -- H. L. Mencken 




Re: [PERFORM] Cross Join Problem

2008-08-18 Thread Tom Lane
[ please keep the list cc'd for the archives' sake ]

"Gauri Kanekar" <[EMAIL PROTECTED]> writes:
> On Mon, Aug 18, 2008 at 7:32 PM, Tom Lane <[EMAIL PROTECTED]> wrote:
>> No PG release since 7.3 would have voluntarily planned that query that
>> way.  Maybe you were using join_collapse_limit = 1 to force the join
>> order?

> Yes, We have set join_collapse_limit set to 1.

Ah, so really your question is why join_collapse_limit isn't working as
you expect.  That code changed quite a bit in 8.2, and the way it works
now is that the critical decision occurs while deciding whether to fold
the cross-join (a sub-problem of size 2) into the top-level join
problem.  Which is a decision that's going to be driven by
from_collapse_limit, not join_collapse_limit.

So one way you could make it work is to reduce from_collapse_limit to
less than 3, but I suspect you'd find that that has too many bad
consequences for other queries.  What's probably best is to write the
problem query like this:

FROM table1 a cross join ( table2 b cross join table3 c )

which will cause join_collapse_limit to be the relevant number at both
steps.
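
Applied to the query from upthread, the rewrite reads:

SELECT sum(id), sum(cd), sum(ad)
   FROM table1 a cross join ( table2 b cross join table3 c )
   WHERE a.nkey = b.key
 AND a.dkey = c.key
 AND c.date = '2008-02-01'
 AND b.id = 999 ;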

regards, tom lane



Re: [PERFORM] file system and raid performance

2008-08-18 Thread Mark Wong
On Fri, Aug 15, 2008 at 12:22 PM, Bruce Momjian <[EMAIL PROTECTED]> wrote:
> Mark Wong wrote:
>> On Mon, Aug 4, 2008 at 10:04 PM,  <[EMAIL PROTECTED]> wrote:
>> > On Mon, 4 Aug 2008, Mark Wong wrote:
>> >
>> >> Hi all,
>> >>
>> >> We've thrown together some results from simple i/o tests on Linux
>> >> comparing various file systems, hardware and software raid with a
>> >> little bit of volume management:
>> >>
>> >> http://wiki.postgresql.org/wiki/HP_ProLiant_DL380_G5_Tuning_Guide
>
> Mark, very useful analysis.  I am curious why you didn't test
> 'data=writeback' on ext3;  'data=writeback' is the recommended mount
> method for that file system, though I see that is not mentioned in our
> official documentation.

I have one set of results with ext3 data=writeback, and it appears that
some of the write tests have less throughput than with data=ordered.
For anyone who wants to look at the detailed results:

http://wiki.postgresql.org/wiki/HP_ProLiant_DL380_G5_Tuning_Guide

it's under the "Aggregate Bandwidth (MB/s) - RAID 5 (256KB stripe) -
No partition table" table.
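
For anyone wanting to try it, data=writeback is an ext3 mount option; a sketch with placeholder device and mount point:

   # /etc/fstab entry (device and mount point are placeholders)
   /dev/sdb1  /srv/pgdata  ext3  noatime,data=writeback  0 2

   # or when mounting by hand:
   mount -t ext3 -o data=writeback /dev/sdb1 /srv/pgdata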

Regards,
Mark



Re: [PERFORM] Slow query with a lot of data

2008-08-18 Thread Moritz Onken




Well, you're getting the database to read the entire contents of the  
domain_categories table in order. That's 12 million rows - a fair  
amount of work.


You may find that removing the "user = 1337" constraint doesn't make  
the query much slower - that's where you get a big win by clustering  
on domain. You might also want to cluster the results table on domain.


Running the query for more than one user is indeed not much slower.  
That's what I need. I'm clustering the results table on domain right  
now. But why is this better than clustering it on "user"?





If you want the results for just one user, it would be very helpful  
to have a user column on the domain_categories table, and an index  
on that column. However, that will slow down the query for all users  
a little.


A row in domain_categories can belong to more than one user. But I  
don't need to run this query for only one user anyway.


Thanks so far,



Re: [PERFORM] Slow query with a lot of data

2008-08-18 Thread Matthew Wakeling

On Mon, 18 Aug 2008, Moritz Onken wrote:
Running the query for more than one user is indeed not much slower. That's 
what I need. I'm clustering the results table on domain right now. But why is 
this better than clustering it on "user"?


The reason is the way the merge join algorithm works. It takes two
tables and sorts them both by the join fields, then streams through
both tables producing results as it goes. It's the best join algorithm,
but it does require both tables to be sorted by the same thing, which
is domain in this case. The aggregation on user happens after the join
has been done, and the hash aggregate can accept the users in random
order.


If you look at your last EXPLAIN, you can see that it has to sort the
result table on domain, although it can read domain_categories in
domain order thanks to the clustered index.


"HashAggregate
"  ->  Merge Join
"Merge Cond: (b.domain = a.domain)"
"->  Index Scan using domain_categories_domain on domain_categories b
"->  Sort
"  Sort Key: a.domain"
"  Sort Method:  quicksort  Memory: 27kB"
"  ->  Index Scan using result_user_idx on result a 
"Index Cond: ("user" = 1337)"


Without the user restriction and re-clustering, this should become:

"HashAggregate
"  ->  Merge Join
"Merge Cond: (b.domain = a.domain)"
"->  Index Scan using domain_categories_domain on domain_categories b
"->  Index Scan using result_domain on result a

Matthew

--
Vacuums are nothings. We only mention them to let them know we know
they're there.
