Re: [PERFORM] Value of Quad vs. Dual Processor machine
Fred Moyer wrote:
> One thing I learned after spending about a week comparing the Athlon (2 GHz,
> 333 MHz frontside bus) and Xeon (2.4 GHz, 266 MHz frontside bus) platforms
> was that on average the select queries I was benchmarking ran 30% faster on
> the Athlon (this was with data cached in memory, so it may not apply to
> larger data sets where I/O is the limiting factor). I benchmarked against
> the Opteron 244 when it came out and it came in about the same as the Athlon
> (which makes sense, since both use 333 MHz memory); the results were within
> +/- 5-10% of the Athlon. From testing against a couple of other machines I
> noticed that the memory bus speeds were almost directly proportional to the
> query times under these conditions.

I remember a posting here about the Opteron which essentially said that even
if the Opteron performs on par with the Athlon under a few clients, as load
increases it scales more than 50% better than the Athlon. So that could be
another angle on it. Sorry, no handy URL here.

Shridhar

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster
[PERFORM] Seeking help with a query that takes too long
[I originally posted this using the wrong E-Mail account, so a double
posting may occur if the first message gets released by the moderator later-
sorry!]
Hi-
I have a query that I'm trying to speed up. I haven't been able to come up
with any workable ideas for speeding it up, so I'm seeking some input from
the list.
I'm using version 7.3.2
I have three tables:
case_data (1,947,386 rows)
actor (3,385,669 rows)
actor_case_assignment (8,668,650 rows)
As the names imply, actor_case_assignment contains records that assign an
actor to a case. Actors such as attorneys or judges may have many cases,
while the average actor (we hope) only has one.
What I'm trying to do is link these tables to get back a single row per
actor that shows the actor's name, the number of cases that actor is
assigned to, and if they only have one case, I want the public_id for that
case. This means I have to do a group by to get the case count, but I'm then
forced to use an aggregate function like max on the other fields.
All of the fields ending in "_id" have unique indexes, and
actor_full_name_uppercase is indexed. An analyze is done every night, and the
database is fairly stable in its composition.
Here's the select:
select
actor.actor_id,
max(actor.actor_full_name),
max(case_data.case_public_id),
max(case_data.case_id),
count(case_data.case_id) as case_count
from
actor,
actor_case_assignment,
case_data
where
actor.actor_full_name_uppercase like upper('sanders%')
and actor.actor_id = actor_case_assignment.actor_id
and case_data.case_id = actor_case_assignment.case_id
group by
actor.actor_id
order by
max(actor.actor_full_name),
case_count desc
limit
1000;
Here's the explain analyze:
QUERY PLAN
--
Limit (cost=2214.71..2214.72 rows=1 width=115) (actual
time=120034.61..120035.67 rows=1000 loops=1)
-> Sort (cost=2214.71..2214.72 rows=1 width=115) (actual
time=120034.60..120034.98 rows=1001 loops=1)
Sort Key: max((actor.actor_full_name)::text),
count(case_data.case_id)
-> Aggregate (cost=2214.67..2214.70 rows=1 width=115) (actual
time=119962.80..120011.49 rows=3456 loops=1)
-> Group (cost=2214.67..2214.68 rows=2 width=115) (actual
time=119962.76..119987.04 rows=5879 loops=1)
-> Sort (cost=2214.67..2214.68 rows=2 width=115)
(actual time=119962.74..119965.09 rows=5879 loops=1)
Sort Key: actor.actor_id
-> Nested Loop (cost=0.00..2214.66 rows=2
width=115) (actual time=59.05..119929.71 rows=5879 loops=1)
-> Nested Loop (cost=0.00..2205.26 rows=3
width=76) (actual time=51.46..66089.04 rows=5882 loops=1)
-> Index Scan using
actor_full_name_uppercase on actor (cost=0.00..6.01 rows=1 width=42)
(actual time=37.62..677.44 rows=3501 loops=1)
Index Cond:
((actor_full_name_uppercase >= 'SANDERS'::character varying) AND
(actor_full_name_uppercase < 'SANDERT'::character varying))
Filter:
(actor_full_name_uppercase ~~ 'SANDERS%'::text)
-> Index Scan using
actor_case_assignment_actor_id on actor_case_assignment (cost=0.00..2165.93
rows=2666 width=34) (actual time=16.37..18.67 rows=2 loops=3501)
Index Cond: ("outer".actor_id =
actor_case_assignment.actor_id)
-> Index Scan using case_data_case_id on
case_data (cost=0.00..3.66 rows=1 width=39) (actual time=9.14..9.15 rows=1
loops=5882)
Index Cond: (case_data.case_id =
"outer".case_id)
Total runtime: 120038.60 msec
(17 rows)
Any ideas?
Thanks!
-Nick
-
Nick Fankhauser
[EMAIL PROTECTED] Phone 1.765.965.7363 Fax 1.765.962.9788
doxpop - Court records at your fingertips - http://www.doxpop.com/
---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?
http://www.postgresql.org/docs/faqs/FAQ.html
Re: [PERFORM] Value of Quad vs. Dual Processor machine
On Tue, 11 Nov 2003 21:13:19 -0500, "Chris Field" <[EMAIL PROTECTED]> wrote:
> we are looking at Xeon, We are currently running it on a quad sun v880
> compiled to be 64bit and have been getting dreadful performance. I
> don't think we really have much to gain from going 64bit.

By chance, are you running 7.3.4 on that Sun? If so, try this:

export CFLAGS=-O2
./configure

and rebuild PG. Before 7.4, PG was built with _no_ optimization on Solaris.
Recompiling gives __HUGE__ (notice the underscores) performance gains.

And on to the dual vs. quad question: PG will only use one CPU per
connection/query. So if your machine will have only 1-2 queries running at a
time, those other two processors will sit around idling. However, if you are
going to have a bunch going at once, 4 CPUs will be most useful.

One of the nicest things to do for PG is more RAM and fast I/O. It really
loves those things.

good luck

--
Jeff Trout <[EMAIL PROTECTED]>
http://www.jefftrout.com/
http://www.stuarthamm.net/

---(end of broadcast)---
TIP 8: explain analyze is your friend
Re: [PERFORM] Seeking help with a query that takes too long
On Wed, 12 Nov 2003 08:34:50 -0500, "Nick Fankhauser" <[EMAIL PROTECTED]> wrote:
> -> Index Scan using
> actor_full_name_uppercase on actor (cost=0.00..6.01 rows=1 width=42)
>                                                      ^^
> (actual time=37.62..677.44 rows=3501 loops=1)
>                                 ^
> Index Cond:
> ((actor_full_name_uppercase >= 'SANDERS'::character varying) AND
> (actor_full_name_uppercase < 'SANDERT'::character varying))
> Filter:
> (actor_full_name_uppercase ~~ 'SANDERS%'::text)

Nick, can you find out why this row count estimation is so far off?

\x
SELECT * FROM pg_stats
 WHERE tablename='actor' AND attname='actor_full_name_uppercase';

BTW, there seem to be missing cases:
> -> Nested Loop (cost=0.00..2214.66 rows=2 width=115)
>    (actual time=59.05..119929.71 rows=5879 loops=1)
> -> Nested Loop (cost=0.00..2205.26 rows=3 width=76)
>    (actual time=51.46..66089.04 rows=5882 loops=1)

Servus
 Manfred

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster
[PERFORM] Superior performance in PG 7.4
Hi,

I am trying PG 7.4 RC1 and RC2, and I see a superb performance improvement
compared with 7.3.

Explaining the queries, I see a planner change that, in my case, prefers
nested loops in 7.4 as opposed to hash or merge joins in 7.3. To test, I
disabled hash and merge joins in 7.3, and performance improved greatly using
nested loops...

Both systems are identical in configuration, properly vacuumed and analyzed
before the tests.

Could something be wrong with my tests? [ I hope not :) ]

Alexandre

---(end of broadcast)---
TIP 6: Have you searched our list archives?
http://archives.postgresql.org
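A sketch of how that kind of A/B test is typically done in a psql session,
assuming the planner switches are the ones meant here (they are session-local
and can simply be turned back on afterwards):

    SET enable_hashjoin = off;
    SET enable_mergejoin = off;
    -- then re-run EXPLAIN ANALYZE on the queries being compared
    SET enable_hashjoin = on;
    SET enable_mergejoin = on;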
Re: [PERFORM] Seeking help with a query that takes too long
>(actual time=37.62..677.44 rows=3501 loops=1)
^
> Nick, can you find out why this row count estimation is so far off?
It's actually correct:
prod1=# select count(actor_id) from actor where actor_full_name_uppercase
like 'SANDERS%';
count
---
3501
(1 row)
Of course, I merely chose "SANDERS" arbitrarily as a name that falls
somewhere near the middle of the frequency range for names. SMITH or JONES
would represent a worst-case, and something like KOIZAR would probably be
unique.
Here are the stats:
prod1=# SELECT * FROM pg_stats
prod1-# WHERE tablename='actor' AND attname='actor_full_name_uppercase';
-[ RECORD 1 ]-----+---------------------------------------------------------------
schemaname        | public
tablename         | actor
attname           | actor_full_name_uppercase
null_frac         | 0.00033
avg_width         | 21
n_distinct        | 24215
most_common_vals  | {"STATE OF INDIANA","INDIANA DEPARTMENT OF REVENUE","BARTH CONS SCHOOL CORP","HOWARD COUNTY CLERK","ADVANCED RECOVERY SERVICES","STATE OF INDIANA-DEPT OF REVENUE","ALLIED COLLECTION SERVICE INC","CREDIT BUREAU OF LAPORTE","MIDWEST COLLECTION SVC INC","NCO FINANCIAL SYSTEMS INC"}
most_common_freqs | {0.015,0.014,0.0043,0.0043,0.004,0.0037,0.0033,0.0033,0.0027,0.0027}
histogram_bounds  | {"(POE) ESTELLE, DENISE","BRIEN, LIISI","COTTRELL, CAROL","FAMILY RENTALS","HAYNES, TAMIKA","KESSLER, VICTORIA","MEFFORD, VERNON L","PHILLIPS, GERALD L","SHELTON, ANTOINETTE","TRICARICO, MELISSA SUE","ZUEHLKE, THOMAS L"}
correlation       | -0.00147395
I think this means that the average is 357 cases per actor. As you can see,
the range of assignments varies from people with a single parking ticket to
"State of Indiana", which is party to many thousands of cases.
> BTW, there seem to be missing cases:
> > -> Nested Loop (cost=0.00..2214.66 rows=2 width=115)
> > (actual time=59.05..119929.71 rows=5879 loops=1)
>
> >-> Nested Loop (cost=0.00..2205.26 rows=3 width=76)
> > (actual time=51.46..66089.04 rows=5882 loops=1)
This is expected- We actually aggregate data from many county court
databases, with varying levels of data "cleanliness".
Regards,
-Nick
---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
Re: [PERFORM] Seeking help with a query that takes too long
> > actor_full_name_uppercase on actor (cost=0.00..6.01 rows=1 width=42)
>                                                        ^^
> > (actual time=37.62..677.44 rows=3501 loops=1)
>                                   ^
> Nick, can you find out why this row count estimation is so far off?

Oops - I read this backward - I see what you mean now. That's a good question.
I'm not sure what part of the stats this estimate might be pulled from. The
average is 357, but the most common frequency may be around 1.

-Nick

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match
Re: [PERFORM] Seeking help with a query that takes too long
"Nick Fankhauser" <[EMAIL PROTECTED]> writes: >> Nick, can you find out why this row count estimation is so far off? > It's actually correct: Sure, the 3501 was the "actual". The estimate was 1 row, which was pretty far off :-( > Here are the stats: It looks like you are running with the default statistics target (10). Try boosting it to 100 or even more for this column (see ALTER TABLE SET STATISTICS, then re-ANALYZE) and see if the estimate gets better. I think the major problem is likely here: > n_distinct| 24215 which is no doubt much too small (do you have an idea of the number of distinct actor_full_name_uppercase values?) regards, tom lane ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html
Re: [PERFORM] Value of Quad vs. Dual Processor machine
As another post pointed out, you need to set CFLAGS to get optimization under
Solaris on that flavor of PostgreSQL. Also, PostgreSQL tends to get its best
performance from the free Unixes, Linux and BSD. Those are available for Sun
SPARCs, but PostgreSQL in 64-bit mode on those boxes is still a bit cutting
edge. It might be worth a try to set up the Sun to dual boot either BSD or
Linux, test PostgreSQL under that environment to see how it works, and
compare it to Sun after you've set the CFLAGS and recompiled.

On Tue, 11 Nov 2003, Chris Field wrote:
> we are looking at Xeon, We are currently running it on a quad sun v880
> compiled to be 64bit and have been getting dreadful performance. I don't
> think we really have much to gain from going 64bit.
>
> - Original Message -
> From: "Ron Johnson" <[EMAIL PROTECTED]>
> To: "PgSQL Performance ML" <[EMAIL PROTECTED]>
> Sent: Tuesday, November 11, 2003 8:24 PM
> Subject: Re: [PERFORM] Value of Quad vs. Dual Processor machine
>
> > On Tue, 2003-11-11 at 17:32, Chris Field wrote:
> > > We are getting ready to spec out a new machine and are wondering about
> > > the wisdom of buying a quad versus a dual processor machine. Seeing as
> > > how postgres is not a threaded application, and this server will only be
> > > used for log/transaction analysis (it will only ever have a few large
> > > queries running), is there any performance to be gained, and if so is
> > > it worth the large cost? Any thoughts/experience are much
> > > appreciated...
> >
> > Xeon or Opteron? The faster Opterons *really* blaze, especially
> > in 64-bit mode. As others have said, though, RAM and I/O are most
> > important.
> >
> > --
> > Ron Johnson, Jr. [EMAIL PROTECTED]
> > Jefferson, LA USA
> >
> > "As I like to joke, I may have invented it, but Microsoft made it
> > popular"
> > David Bradley, regarding Ctrl-Alt-Del
> >
> > ---(end of broadcast)---
> > TIP 5: Have you checked our extensive FAQ?
> >
> > http://www.postgresql.org/docs/faqs/FAQ.html
>
> ---(end of broadcast)---
> TIP 7: don't forget to increase your free space map settings

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster
Re: [PERFORM] *very* slow query to summarize data for a month ...
On Wed, 12 Nov 2003, Greg Stark wrote:
> "Marc G. Fournier" <[EMAIL PROTECTED]> writes:
> > Just as a side note, just doing a straight scan for the records, with no
> > SUM()/GROUP BY involved, with the month_trunc() index is still >8k msec:
>
> One of the other advantages of these aggregate tables was that we could
> purge the old data much sooner with much less resistance from the business,
> since the reports were all still available and a lot of ad-hoc queries
> could still be done without the raw data anyway.

Actually, what I've done is do this at the 'load stage' ... but same concept ...

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster
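For what it's worth, a hypothetical illustration of the aggregate-table idea
being discussed (the table and column names are invented for the example, not
taken from the thread): a small per-month summary is built from the raw rows,
so reports can run against it and the raw data can be purged sooner.

    CREATE TABLE traffic_month_summary AS
    SELECT date_trunc('month', ts) AS month,
           count(*)                AS hits,
           sum(bytes)              AS total_bytes
    FROM   raw_traffic
    GROUP  BY date_trunc('month', ts);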
Re: [PERFORM] Seeking help with a query that takes too long
> It looks like you are running with the default statistics target (10).
> Try boosting it to 100 or even more for this column (see ALTER TABLE
> SET STATISTICS, then re-ANALYZE) and see if the estimate gets better.
Here are the results & a few more clues:
prod1=# alter table actor alter column actor_full_name_uppercase set
statistics 1000;
ALTER TABLE
prod1=# analyze actor;
ANALYZE
prod1=# select count(distinct actor_full_name_uppercase) from actor;
count
-
1453371
(1 row)
prod1=# select count(actor_id) from actor;
count
-
3386359
(1 row)
This indicates to me that 1 isn't too shabby as an estimate if the whole
name is specified, but I'm not sure how this gets altered in the case of a
"LIKE".
prod1=# \x
Expanded display is on.
prod1=# SELECT * FROM pg_stats
prod1-# WHERE tablename='actor' AND attname='actor_full_name_uppercase';
schemaname  | public
tablename   | actor
attname     | actor_full_name_uppercase
null_frac   | 0.000586667
avg_width   | 21
n_distinct  | -0.14701
correlation | -0.00211291
Question: What does it mean when n_distinct is negative?
New results of explain analyze:
QUERY PLAN
--
Limit (cost=252683.61..252683.68 rows=28 width=116) (actual
time=169377.32..169378.39 rows=1000 loops=1)
-> Sort (cost=252683.61..252683.68 rows=29 width=116) (actual
time=169377.31..169377.69 rows=1001 loops=1)
Sort Key: max((actor.actor_full_name)::text),
count(case_data.case_id)
-> Aggregate (cost=252678.57..252682.91 rows=29 width=116)
(actual time=169305.79..169354.50 rows=3456 loops=1)
-> Group (cost=252678.57..252680.01 rows=289 width=116)
(actual time=169305.76..169330.00 rows=5879 loops=1)
-> Sort (cost=252678.57..252679.29 rows=289
width=116) (actual time=169305.75..169308.15 rows=5879 loops=1)
Sort Key: actor.actor_id
-> Nested Loop (cost=0.00..252666.74 rows=289
width=116) (actual time=89.27..169273.51 rows=5879 loops=1)
-> Nested Loop (cost=0.00..251608.11
rows=289 width=77) (actual time=57.73..92753.49 rows=5882 loops=1)
-> Index Scan using
actor_full_name_uppercase on actor (cost=0.00..456.88 rows=113 width=42)
(actual time=32.80..3197.28 rows=3501 loops=1)
Index Cond:
((actor_full_name_uppercase >= 'SANDERS'::character varying) AND
(actor_full_name_uppercase < 'SANDERT'::character varying))
Filter:
(actor_full_name_uppercase ~~ 'SANDERS%'::text)
-> Index Scan using
actor_case_assignment_actor_id on actor_case_assignment (cost=0.00..2181.29
rows=2616 width=35) (actual time=22.26..25.57 rows=2 loops=3501)
Index Cond: ("outer".actor_id =
actor_case_assignment.actor_id)
-> Index Scan using case_data_case_id on
case_data (cost=0.00..3.65 rows=1 width=39) (actual time=13.00..13.00
rows=1 loops=5882)
Index Cond: (case_data.case_id =
"outer".case_id)
Total runtime: 169381.38 msec
(17 rows)
---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [PERFORM] Seeking help with a query that takes too long
"Nick Fankhauser" <[EMAIL PROTECTED]> writes: > This indicates to me that 1 isn't too shabby as an estimate if the whole > name is specified, but I'm not sure how this gets altered in the case of a > "LIKE" For a pattern like "SANDERS%", the estimate is basically a range estimate for this condition: > ((actor_full_name_uppercase >= 'SANDERS'::character varying) AND > (actor_full_name_uppercase < 'SANDERT'::character varying)) > n_distinct| -0.14701 > Question: What does it mean when n_distinct is negative? It means that the number of distinct values is estimated as a fraction of the table size, rather than an absolute number. In this case 14.7% of the table size, which is a bit off compared to the correct value of 43% (1453371/3386359), but at least it's of the right order of magnitude now ... >-> Index Scan using > actor_full_name_uppercase on actor (cost=0.00..456.88 rows=113 width=42) > (actual time=32.80..3197.28 rows=3501 loops=1) Hmm. Better, but not enough better to force a different plan choice. You might have to resort to brute force, like "set enable_nestloop=false". Just out of curiosity, what do you get if you do that? regards, tom lane ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
[PERFORM] performance optimizations
I'm moving a webmail service over to use a PostgreSQL database for storage
and wanted to get any tips for optimizing performance. The machine will be a
multiprocessor (either 2 or 4 CPU) system with a RAID array. What layout
should be used? I was thinking about using a RAID 1+0 array to hold the
database, but since I can use different array types, would it be better to
use 1+0 for the WAL logs and RAID 5 for the database?

The database gets fairly heavy activity (the system handles about 500MB of
incoming and about 750MB of outgoing email daily). I have a fairly free rein
in regards to the system's layout as well as how the applications will
interact with the database, since I'm writing the code.

--
Suchandra Thapa <[EMAIL PROTECTED]>
Re: [PERFORM] performance optimizations
On Wed, 2003-11-12 at 12:34, Suchandra Thapa wrote:
> I'm moving a webmail service over to use a postgresql database for
> storage and wanted to get any tips for optimizing performance. The
> machine will be a multiprocessor (either 2 or 4 cpu) system with a raid
> array. What layout should be used? I was thinking about using a
> raid 1+0 array to hold the database but since I can use different array
> types, would it be better to use 1+0 for the wal logs and a raid 5 for
> the database?

How much total storage? If you have (or will have) more than ~6 disks, go
for RAID 5; otherwise RAID 10 is probably appropriate.

> The database gets fairly heavy activity (the system handles about 500MB
> of incoming and about 750MB of outgoing emails daily). I have a fairly
> free rein in regards to the system's layout as well as how the
> applications will interact with the database since I'm writing the
> code.

Is this archived permanently -- ~450GB of annual data? Or is the data
removed upon delivery?

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster
Re: [PERFORM] Seeking help with a query that takes too long
> You might have to resort to brute force, like "set enable_nestloop=false".
> Just out of curiosity, what do you get if you do that?
I get a different plan, but similar execution time:
Limit (cost=323437.13..323437.13 rows=1 width=115) (actual
time=170921.89..170922.95 rows=1000 loops=1)
-> Sort (cost=323437.13..323437.13 rows=1 width=115) (actual
time=170921.89..170922.26 rows=1001 loops=1)
Sort Key: max((actor.actor_full_name)::text),
count(case_data.case_id)
-> Aggregate (cost=323437.08..323437.12 rows=1 width=115) (actual
time=170849.94..170898.06 rows=3457 loops=1)
-> Group (cost=323437.08..323437.09 rows=3 width=115)
(actual time=170849.90..170873.60 rows=5880 loops=1)
-> Sort (cost=323437.08..323437.08 rows=3 width=115)
(actual time=170847.97..170850.21 rows=5880 loops=1)
Sort Key: actor.actor_id
-> Hash Join (cost=25.29..323437.06 rows=3
width=115) (actual time=122873.80..170814.27 rows=5880 loops=1)
Hash Cond: ("outer".case_id =
"inner".case_id)
-> Seq Scan on case_data
(cost=0.00..60368.16 rows=1947116 width=39) (actual time=12.95..43542.25
rows=1947377 loops=1)
-> Hash (cost=25.28..25.28 rows=3
width=76) (actual time=122844.40..122844.40 rows=0 loops=1)
-> Hash Join (cost=6.02..25.28
rows=3 width=76) (actual time=24992.70..122810.32 rows=5883 loops=1)
Hash Cond: ("outer".actor_id =
"inner".actor_id)
-> Seq Scan on
actor_case_assignment (cost=0.00..209980.49 rows=8669349 width=34) (actual
time=9.13..85504.05 rows=8670467 loops=1)
-> Hash (cost=6.01..6.01
rows=1 width=42) (actual time=24926.56..24926.56 rows=0 loops=1)
-> Index Scan using
actor_full_name_uppercase on actor (cost=0.00..6.01 rows=1 width=42)
(actual time=51.67..24900.53 rows=3502 loops=1)
Index Cond:
((actor_full_name_uppercase >= 'SANDERS'::character varying) AND
(actor_full_name_uppercase < 'SANDERT'::character varying))
Filter:
(actor_full_name_uppercase ~~ 'SANDERS%'::text)
Total runtime: 170925.93 msec
(19 rows)
-Nick
---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly
Re: [PERFORM] performance optimizations
On Wed, 2003-11-12 at 12:23, Rod Taylor wrote:
> On Wed, 2003-11-12 at 12:34, Suchandra Thapa wrote:
> > I'm moving a webmail service over to use a postgresql database for
> > storage and wanted to get any tips for optimizing performance. The
> > machine will be a multiprocessor (either 2 or 4 cpu) system with a raid
> > array. What layout should be used? I was thinking about using a
> > raid 1+0 array to hold the database but since I can use different array
> > types, would it be better to use 1+0 for the wal logs and a raid 5 for
> > the database?
>
> How much total storage? If you have (or will have) more than ~6 disks, go
> for RAID 5; otherwise RAID 10 is probably appropriate.

I'm not sure, but I believe there are about 6-8 10K SCSI drives on the
system. There is quite a bit of storage to spare currently, so I think

> > The database gets fairly heavy activity (the system handles about 500MB
> > of incoming and about 750MB of outgoing emails daily). I have a fairly
> > free rein in regards to the system's layout as well as how the
> > applications will interact with the database since I'm writing the
> > code.
>
> Is this archived permanently -- ~450GB of annual data? Or is the data
> removed upon delivery?

No, it's more like Hotmail. Some users may keep mail for a longer term, but
a lot of the mail probably gets deleted fairly quickly. The database load
will be mixed, with insertions due to deliveries, queries by the webmail
system, and deletions from POP and webmail.

--
Suchandra Thapa <[EMAIL PROTECTED]>
Re: [PERFORM] performance optimizations
Suchandra Thapa <[EMAIL PROTECTED]> writes:
> I was thinking about using a raid 1+0 array to hold the
> database but since I can use different array types, would it be
> better to use 1+0 for the wal logs and a raid 5 for the database?

It has been recommended on this list that getting a RAID controller with a
battery-backed cache is pretty essential to getting good performance. Search
the list archives for lots more discussion about RAID configurations.

-Neil

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match
Re: [PERFORM] Seeking help with a query that takes too long
On Wed, 12 Nov 2003 13:27:53 -0500, "Nick Fankhauser" <[EMAIL PROTECTED]> wrote:
>> You might have to resort to brute force, like "set enable_nestloop=false".

> -> Seq Scan on
> actor_case_assignment (cost=0.00..209980.49 rows=8669349 width=34) (actual
> time=9.13..85504.05 rows=8670467 loops=1)

Does actor_case_assignment contain more columns than just the two ids? If
yes, do these additional fields account for ca. 70 bytes per tuple? If not,
try

VACUUM FULL ANALYSE actor_case_assignment;

> -> Index Scan using
> actor_full_name_uppercase on actor (cost=0.00..6.01 rows=1 width=42)
> (actual time=51.67..24900.53 rows=3502 loops=1)

This same index scan on actor has been much faster in your previous postings
(677 ms, 3200 ms), probably due to caching effects. 7 ms per tuple returned
looks like a lot of disk seeks are involved. Is clustering actor on
actor_full_name_uppercase an option, or would this slow down other queries?

Servus
 Manfred

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match
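If clustering turns out to be an option, a minimal sketch of what that might
look like on 7.3, using the index name from the plan above (note that CLUSTER
takes an exclusive lock and rewrites the table, so it is best done during a
quiet period, and an ANALYZE afterwards is a good idea):

    CLUSTER actor_full_name_uppercase ON actor;
    ANALYZE actor;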
Re: [PERFORM] performance optimizations
On Wed, 2003-11-12 at 16:29, Neil Conway wrote:
> Suchandra Thapa <[EMAIL PROTECTED]> writes:
> > I was thinking about using a raid 1+0 array to hold the
> > database but since I can use different array types, would it be
> > better to use 1+0 for the wal logs and a raid 5 for the database?
>
> It has been recommended on this list that getting a RAID controller
> with a battery-backed cache is pretty essential to getting good
> performance. Search the list archives for lots more discussion about
> RAID configurations.

The server is already using a RAID controller with battery-backed RAM and
the cache set to write-back (the server is on a UPS, so power failures
shouldn't cause problems). I'll look at the list archives for RAID
information.

--
Suchandra Thapa <[EMAIL PROTECTED]>
[PERFORM] Query question
Hi,

I have coded some improvements to phpPgAdmin that I think are pretty cool.
Basically, once you are browsing the results of an arbitrary SELECT query,
you can still sort by columns, regardless of the underlying ORDER BY of the
SELECT. I do this like this:

SELECT * FROM (arbitrary subquery) AS sub ORDER BY 1,3;

Now, this all works fine, but I want to know if this is efficient or not.
Does doing a select of a select cause serious performance degradation?

Chris

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [PERFORM] performance optimizations
> > How much total storage? If you have (or will have) more than ~6 disks, go
> > for RAID 5; otherwise RAID 10 is probably appropriate.
>
> I'm not sure but I believe there are about 6-8 10K scsi drives on the
> system. There is quite a bit of storage to spare currently so I think

I see. With 8 drives, you'll probably want to go with RAID 5. It grows beyond
that point fairly well with a decent controller card. Be sure to have some
battery-backed write cache on the RAID card (128MB goes a long way).

> > > The database gets fairly heavy activity (the system handles about 500MB
> > > of incoming and about 750MB of outgoing emails daily). I have a fairly
>
> No, it's more like hotmail. Some users may keep mail for a longer term
> but a lot of the mail probably gets deleted fairly quickly. The
> database load will be mixed with insertions due to deliveries, queries
> by the webmail system, and deletions from pop and webmail.

You might consider having the mail server gzip the emails prior to injection
into the database (turn off compression in PostgreSQL) and decompress the
data on the webserver for display to the client. Now you have about 7 times
the number of emails in memory. It's easier to toss another webserver at the
problem than to make the database bigger. Take the savings in CPU on the DB
and add it to RAM.

1200MB of compressed mail is about 200MB? Assume email descriptive material
(subject, from, etc.), account structure, indexes... so about 400MB for one
day's worth of information?

You may want to consider keeping the compressed email in a separate table
from the information describing it. That way the descriptive information is
more likely to be in RAM, whereas the body probably doesn't matter as much
(you view bodies one at a time, while subjects tend to be listed all at once).
See the sketch after this message.

Most clients will be interested in, say, the last 7 days' worth of data?
Great. Start out with 4GB RAM on a good dual-CPU box -- Opterons seem to work
quite well -- and make sure the motherboard can hold double that in memory
for an upgrade sometime next year when you've become popular.

I firmly believe lots of RAM is the answer to most I/O issues until you start
getting into large sets of active data (>50GB). 64GB of RAM is fairly cheap
compared to the ongoing maintenance of the 30+ drive system required to get
decent throughput.

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster
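A hypothetical sketch of that two-table layout (table and column names are
invented for illustration, not from the original schema): the small
descriptive row stays hot in RAM, while the compressed body, fetched one
message at a time, lives in its own table.

    -- descriptive data that gets listed many rows at a time
    CREATE TABLE message (
        message_id   serial PRIMARY KEY,
        account_id   integer NOT NULL,
        sender       text,
        subject      text,
        received_at  timestamp DEFAULT now()
    );

    -- the large part, fetched one message at a time
    CREATE TABLE message_body (
        message_id   integer PRIMARY KEY REFERENCES message (message_id),
        body_gz      bytea    -- gzipped by the mail server before insertion
    );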
Re: [PERFORM] Query question
Chris,

> SELECT * FROM (arbitrary subquery) AS sub ORDER BY 1,3;
>
> Now, this all works fine, but I want to know if this is efficient or not.
> Does doing a select of a select cause serious performance degradation?

It would be better if you could strip out the inner sort, but I can
understand why that might not be possible in all cases. The only thing
you're adding to the query is a second SORT step, so it shouldn't require
any more time/memory than the query's first SORT did.

--
Josh Berkus
Aglio Database Solutions
San Francisco

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly
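A small illustration of that point (it uses a system catalog only so that it
runs anywhere): the outer ORDER BY simply adds one more Sort node on top of
whatever plan the inner query already produced.

    EXPLAIN
    SELECT * FROM (
        SELECT relname, relpages FROM pg_class ORDER BY relname
    ) AS sub
    ORDER BY 2;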
Re: [PERFORM] performance optimizations
On Wed, 2003-11-12 at 22:35, Rod Taylor wrote:
> You may want to consider keeping the compressed email in a separate
> table from the information describing it. That way the descriptive
> information is more likely to be in RAM, whereas the body probably
> doesn't matter as much (you view bodies one at a time, while subjects
> tend to be listed all at once).

Thanks for the suggestions. Splitting the load between several machines was
the original intent of moving the storage from the file system to a database.
I believe the schema I'm already using splits out the body due to the size of
some attachments. Luckily, the code already gzips the email body and
abbreviates common email headers, so storing compressed emails isn't a
problem.

> Most clients will be interested in, say, the last 7 days' worth of data?
> Great. Start out with 4GB RAM on a good dual-CPU box -- Opterons seem to
> work quite well -- and make sure the motherboard can hold double that in
> memory for an upgrade sometime next year when you've become popular.

Unfortunately, the hardware available is pretty much fixed in regards to the
system. I can play around with the RAID configurations and have some limited
choice in regards to the RAID controller and number of drives, but that's
about all in terms of hardware.

> I firmly believe lots of RAM is the answer to most I/O issues until you
> start getting into large sets of active data (>50GB). 64GB of RAM is
> fairly cheap compared to the ongoing maintenance of the 30+ drive system
> required to get decent throughput.

The current file system holding the user and email information indicates the
current data is about 64GB (70K accounts; I'm not sure how many are active,
but 50% might be a good guess). This seems to be somewhat of a steady state,
however.

--
Suchandra Thapa <[EMAIL PROTECTED]>
