Re: [PERFORM] Value of Quad vs. Dual Processor machine

2003-11-12 Thread Shridhar Daithankar
Fred Moyer wrote:
One thing I learned after spending about a week comparing the Athlon (2
GHz, 333 MHz frontside bus) and Xeon (2.4 GHz, 266 MHz frontside bus)
platforms was that on average the select queries I was benchmarking ran
30% faster on the Athlon (this was with data cached in memory, so it may
not apply to larger data sets where I/O is the limiting factor).
I benchmarked against the Opteron 244 when it came out and it came in
about the same as the Athlon (which makes sense, since both used 333 MHz
memory).  The results were within +/- 5-10% of the Athlon's.  From testing
against a couple of other machines I noticed that the memory bus speeds
were almost directly proportional to the query times under these
conditions.
I remember a posting here about the Opteron which essentially said that even if
the Opteron performs on par with the Athlon under a few clients, it scales more
than 50% better than the Athlon as the load increases.

So that could be another shot at it. Sorry, no handy URL here.

 Shridhar

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


[PERFORM] Seeking help with a query that takes too long

2003-11-12 Thread Nick Fankhauser
[I originally posted this using the wrong E-Mail account, so a double
posting may occur if the first message gets released by the moderator later-
sorry!]

Hi-

I have a query that I'm trying to speed up. I haven't been able to come up
with any workable ideas for speeding it up, so I'm seeking some input from
the list.

I'm using version 7.3.2

I have three tables:

case_data (1,947,386 rows)
actor (3,385,669 rows)
actor_case_assignment (8,668,650 rows)

As the names imply, actor_case_assignment contains records that assign an
actor to a case. Actors such as attorneys or judges may have many cases,
while the average actor (we hope) only has one.

What I'm trying to do is link these tables to get back a single row per
actor that shows the actor's name, the number of cases that actor is
assigned to, and if they only have one case, I want the public_id for that
case. This means I have to do a group by to get the case count, but I'm then
forced to use an aggregate function like max on the other fields.

All of the fields ending in "_id" have unique indexes, and
actor_full_name_uppercase is indexed. An analyze is done every night & the
database is fairly stable in its composition.

Here's the select:

  select
actor.actor_id,
max(actor.actor_full_name),
max(case_data.case_public_id),
max(case_data.case_id),
count(case_data.case_id) as case_count
  from
actor,
actor_case_assignment,
case_data
  where
actor.actor_full_name_uppercase like upper('sanders%')
and actor.actor_id = actor_case_assignment.actor_id
and case_data.case_id = actor_case_assignment.case_id
  group by
actor.actor_id
  order by
max(actor.actor_full_name),
case_count desc
  limit
1000;


Here's the explain analyze:


QUERY PLAN


--
 Limit  (cost=2214.71..2214.72 rows=1 width=115) (actual
time=120034.61..120035.67 rows=1000 loops=1)
   ->  Sort  (cost=2214.71..2214.72 rows=1 width=115) (actual
time=120034.60..120034.98 rows=1001 loops=1)
 Sort Key: max((actor.actor_full_name)::text),
count(case_data.case_id)
 ->  Aggregate  (cost=2214.67..2214.70 rows=1 width=115) (actual
time=119962.80..120011.49 rows=3456 loops=1)
   ->  Group  (cost=2214.67..2214.68 rows=2 width=115) (actual
time=119962.76..119987.04 rows=5879 loops=1)
 ->  Sort  (cost=2214.67..2214.68 rows=2 width=115)
(actual time=119962.74..119965.09 rows=5879 loops=1)
   Sort Key: actor.actor_id
   ->  Nested Loop  (cost=0.00..2214.66 rows=2
width=115) (actual time=59.05..119929.71 rows=5879 loops=1)
 ->  Nested Loop  (cost=0.00..2205.26 rows=3
width=76) (actual time=51.46..66089.04 rows=5882 loops=1)
   ->  Index Scan using
actor_full_name_uppercase on actor  (cost=0.00..6.01 rows=1 width=42)
(actual time=37.62..677.44 rows=3501 loops=1)
 Index Cond:
((actor_full_name_uppercase >= 'SANDERS'::character varying) AND
(actor_full_name_uppercase < 'SANDERT'::character varying))
 Filter:
(actor_full_name_uppercase ~~ 'SANDERS%'::text)
   ->  Index Scan using
actor_case_assignment_actor_id on actor_case_assignment  (cost=0.00..2165.93
rows=2666 width=34) (actual time=16.37..18.67 rows=2 loops=3501)
 Index Cond: ("outer".actor_id =
actor_case_assignment.actor_id)
 ->  Index Scan using case_data_case_id on
case_data  (cost=0.00..3.66 rows=1 width=39) (actual time=9.14..9.15 rows=1
loops=5882)
   Index Cond: (case_data.case_id =
"outer".case_id)
 Total runtime: 120038.60 msec
(17 rows)


Any ideas?

Thanks!
 -Nick


-
Nick Fankhauser

[EMAIL PROTECTED]  Phone 1.765.965.7363  Fax 1.765.962.9788
doxpop - Court records at your fingertips - http://www.doxpop.com/



---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html


Re: [PERFORM] Value of Quad vs. Dual Processor machine

2003-11-12 Thread Jeff
On Tue, 11 Nov 2003 21:13:19 -0500
"Chris Field" <[EMAIL PROTECTED]> wrote:

> We are looking at Xeon. We are currently running it on a quad Sun v880
> compiled to be 64bit and have been getting dreadful performance.  I
> don't think we really have much to gain from going 64bit.
> 
> 
By chance, are you running 7.3.4 on that sun?
If so, try this:
export CFLAGS=-O2
./configure

and rebuild PG.

Before 7.4, PG was built with _no_ optimization on Solaris.
Recompiling gives __HUGE__ (notice the underscores) performance gains.

And onto the dual vs quad.

PG will only use 1 cpu / connection / query. 

So if your machine will have only 1-2 queries running at a time, those other 2
procs will sit around idling.  However, if you are going to have a bunch
going, 4 CPUs will be most useful.  One of the nicest things to do for
PG is more RAM and fast I/O. It really loves those things.

good luck


-- 
Jeff Trout <[EMAIL PROTECTED]>
http://www.jefftrout.com/
http://www.stuarthamm.net/

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [PERFORM] Seeking help with a query that takes too long

2003-11-12 Thread Manfred Koizar
On Wed, 12 Nov 2003 08:34:50 -0500, "Nick Fankhauser"
<[EMAIL PROTECTED]> wrote:
>   ->  Index Scan using
>actor_full_name_uppercase on actor  (cost=0.00..6.01 rows=1 width=42)
  ^^
>(actual time=37.62..677.44 rows=3501 loops=1)
^
> Index Cond:
>((actor_full_name_uppercase >= 'SANDERS'::character varying) AND
>(actor_full_name_uppercase < 'SANDERT'::character varying))
> Filter:
>(actor_full_name_uppercase ~~ 'SANDERS%'::text)

Nick, can you find out why this row count estimation is so far off?

\x
SELECT * FROM pg_stats
 WHERE tablename='actor' AND attname='actor_full_name_uppercase';

BTW, there seem to be missing cases:
>  ->  Nested Loop  (cost=0.00..2214.66 rows=2 width=115)
>   (actual time=59.05..119929.71 rows=5879 loops=1)
   
>->  Nested Loop  (cost=0.00..2205.26 rows=3 width=76)
> (actual time=51.46..66089.04 rows=5882 loops=1)


Servus
 Manfred

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


[PERFORM] Superior performance in PG 7.4

2003-11-12 Thread alexandre :: aldeia digital
Hi,

I am trying PG 7.4 RC1 and RC2 and I see a superb performance
improvement compared with 7.3.

Explaining the queries, I see a planner change that, in my case,
prefers Nested Loops in 7.4 as opposed to the Hash or Merge Joins chosen in 7.3.

To test, I disabled Hash and Merge Joins in 7.3, and performance
improved greatly using nested loops...
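
(For reference, disabling those join methods for one session amounts to
something like the following; the queries are then re-run and the planner
falls back to nested loops:)

  SET enable_hashjoin = off;   -- disable hash joins for this session
  SET enable_mergejoin = off;  -- disable merge joins for this session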

Both systems have identical configurations and were properly vacuumed and
analyzed before the tests.

Could something be wrong with my tests? [ I hope not :) ]

Alexandre


---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] Seeking help with a query that takes too long

2003-11-12 Thread Nick Fankhauser


>(actual time=37.62..677.44 rows=3501 loops=1)
^

> Nick, can you find out why this row count estimation is so far off?

It's actually correct:

prod1=# select count(actor_id) from actor where actor_full_name_uppercase
like 'SANDERS%';
 count
---
  3501
(1 row)

Of course, I merely chose "SANDERS" arbitrarily as a name that falls
somewhere near the middle of the frequency range for names. SMITH or JONES
would represent a worst-case, and something like KOIZAR would probably be
unique.


Here are the stats:

prod1=# SELECT * FROM pg_stats
prod1-# WHERE tablename='actor' AND attname='actor_full_name_uppercase';
-[ RECORD 1 ]-----+---------------------------------------------------------------
schemaname| public
tablename | actor
attname   | actor_full_name_uppercase
null_frac | 0.00033
avg_width | 21
n_distinct| 24215
most_common_vals  | {"STATE OF INDIANA","INDIANA DEPARTMENT OF
REVENUE","BARTH CONS SCHOOL CORP","HOWARD COUNTY CLERK","ADVANCED RECOVERY
SERVICES","STATE OF INDIANA-DEPT OF REVENUE","ALLIED COLLECTION SERVICE
INC","CREDIT BUREAU OF LAPORTE","MIDWEST COLLECTION SVC INC","NCO FINANCIAL
SYSTEMS INC"}
most_common_freqs | {0.015,0.014,0.0043,0.0043,0.004,0.0037,0.0033,0.0033,0.0027,0.0027}
histogram_bounds  | {"(POE) ESTELLE, DENISE","BRIEN, LIISI","COTTRELL,
CAROL","FAMILY RENTALS","HAYNES, TAMIKA","KESSLER, VICTORIA","MEFFORD,
VERNON L","PHILLIPS, GERALD L","SHELTON, ANTOINETTE","TRICARICO, MELISSA
SUE","ZUEHLKE, THOMAS L"}
correlation   | -0.00147395


I think this means that the average is 357 cases per actor (roughly 8,668,650
assignments / 24,215 distinct names). As you can see, the range of assignments
varies from people with a single parking ticket to "State of Indiana", which is
party to many thousands of cases.


> BTW, there seem to be missing cases:
> >  ->  Nested Loop  (cost=0.00..2214.66 rows=2 width=115)
> >   (actual time=59.05..119929.71 rows=5879 loops=1)
>
> >->  Nested Loop  (cost=0.00..2205.26 rows=3 width=76)
> > (actual time=51.46..66089.04 rows=5882 loops=1)

This is expected- We actually aggregate data from many county court
databases, with varying levels of data "cleanliness".

Regards,
-Nick




---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])


Re: [PERFORM] Seeking help with a query that takes too long

2003-11-12 Thread Nick Fankhauser

> >actor_full_name_uppercase on actor  (cost=0.00..6.01 rows=1 width=42)
>   ^^
> >(actual time=37.62..677.44 rows=3501 loops=1)
> ^
> Nick, can you find out why this row count estimation is so far off?
^

Oops- I read this backward- I see what you mean now. That's a good question.
I'm not sure what part of the stats this estimate might be pulled from. The
average is 357, but the most common frequency may be around 1.

-Nick



---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [PERFORM] Seeking help with a query that takes too long

2003-11-12 Thread Tom Lane
"Nick Fankhauser" <[EMAIL PROTECTED]> writes:
>> Nick, can you find out why this row count estimation is so far off?

> It's actually correct:

Sure, the 3501 was the "actual".  The estimate was 1 row, which was
pretty far off :-(

> Here are the stats:

It looks like you are running with the default statistics target (10).
Try boosting it to 100 or even more for this column (see ALTER TABLE
SET STATISTICS, then re-ANALYZE) and see if the estimate gets better.
I think the major problem is likely here:
> n_distinct| 24215
which is no doubt much too small (do you have an idea of the number
of distinct actor_full_name_uppercase values?)

regards, tom lane

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html


Re: [PERFORM] Value of Quad vs. Dual Processor machine

2003-11-12 Thread scott.marlowe
As another post pointed out, you need to set CFLAGS to get optimization
under Solaris on that version of PostgreSQL.

Also, PostgreSQL tends to get its best performance from the free Unixes,
Linux and BSD.  Those are available for Sun SPARCs, but PostgreSQL in 64-bit
mode on those boxes is still a bit cutting edge.

It might be worth setting up the Sun to dual-boot to either BSD or Linux
and testing PostgreSQL under that environment, to see how it works and
compare it to Solaris after you've set CFLAGS and recompiled.

On Tue, 11 Nov 2003, Chris Field wrote:

> We are looking at Xeon. We are currently running it on a quad Sun v880
> compiled to be 64bit and have been getting dreadful performance.  I don't
> think we really have much to gain from going 64bit.
> 
> 
> - Original Message - 
> From: "Ron Johnson" <[EMAIL PROTECTED]>
> To: "PgSQL Performance ML" <[EMAIL PROTECTED]>
> Sent: Tuesday, November 11, 2003 8:24 PM
> Subject: Re: [PERFORM] Value of Quad vs. Dual Processor machine
> 
> 
> > On Tue, 2003-11-11 at 17:32, Chris Field wrote:
> > > We are getting ready to spec out a new machine and are wondering about
> > > the wisdom of buying a quad versus a dual processor machine.  Seing as
> > > how postgres in not a threaded application, and this server will only be
> > > used for log/transaction analysis (it will only ever have a few large
> > > queries running).  Is there any performance to be gained, and if so is
> > > it worth the large cost?  Any thoughts/experience are much
> > > appreciated...
> >
> > Xeon or Opteron?  The faster Opterons *really* blaze, especially
> > in 64-bit mode.  As others have said, though, RAM and I/O are most
> > important.
> >
> > -- 
> > -
> > Ron Johnson, Jr. [EMAIL PROTECTED]
> > Jefferson, LA USA
> >
> > "As I like to joke, I may have invented it, but Microsoft made it
> > popular"
> > David Bradley, regarding Ctrl-Alt-Del
> >
> >
> > ---(end of broadcast)---
> > TIP 5: Have you checked our extensive FAQ?
> >
> >http://www.postgresql.org/docs/faqs/FAQ.html
> >
> >
> 
> 
> ---(end of broadcast)---
> TIP 7: don't forget to increase your free space map settings
> 
> 


---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PERFORM] *very* slow query to summarize data for a month ...

2003-11-12 Thread Marc G. Fournier


On Wed, 12 Nov 2003, Greg Stark wrote:

>
> "Marc G. Fournier" <[EMAIL PROTECTED]> writes:
>
> > Just as a side note, just doing a straight scan for the records, with no
> > SUM()/GROUP BY involved, with the month_trunc() index is still >8k msec:
>
> One of the other advantages of these aggregate tables was that we could
> purge the old data much sooner with much less resistance from the
> business, since the reports were all still available and a lot of ad-hoc
> queries could still be done without the raw data anyway.

Actually, what I've done is do this at the 'load stage' ... but same
concept ...
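
(A minimal sketch of the kind of summary table being described; the table and
column names are hypothetical, not the actual schema from this thread:)

  -- one row per month instead of one row per raw record
  CREATE TABLE traffic_monthly (
    month  timestamp NOT NULL PRIMARY KEY,
    bytes  bigint NOT NULL DEFAULT 0
  );

  -- rolled up at load time, so monthly reports never have to scan the raw table
  UPDATE traffic_monthly
     SET bytes = bytes + 12345   -- 12345 stands in for the loaded record's size
   WHERE month = date_trunc('month', now());
  -- (INSERT the month's row first if the UPDATE matches nothing)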


---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PERFORM] Seeking help with a query that takes too long

2003-11-12 Thread Nick Fankhauser

> It looks like you are running with the default statistics target (10).
> Try boosting it to 100 or even more for this column (see ALTER TABLE
> SET STATISTICS, then re-ANALYZE) and see if the estimate gets better.


Here are the results & a few more clues:

prod1=# alter table actor alter column actor_full_name_uppercase set
statistics 1000;
ALTER TABLE
prod1=# analyze actor;
ANALYZE
prod1=# select count(distinct actor_full_name_uppercase) from actor;
  count
-
 1453371
(1 row)

prod1=# select count(actor_id) from actor;
  count
-
 3386359
(1 row)

This indicates to me that 1 isn't too shabby as an estimate if the whole
name is specified, but I'm not sure how this gets altered in the case of a
"LIKE".


prod1=# \x
Expanded display is on.
prod1=# SELECT * FROM pg_stats
prod1-# WHERE tablename='actor' AND attname='actor_full_name_uppercase';



schemaname| public
tablename | actor
attname   | actor_full_name_uppercase
null_frac | 0.000586667
avg_width | 21
n_distinct| -0.14701



correlation   | -0.00211291


Question: What does it mean when n_distinct is negative?

New results of explain analyze:



QUERY PLAN


--
 Limit  (cost=252683.61..252683.68 rows=28 width=116) (actual
time=169377.32..169378.39 rows=1000 loops=1)
   ->  Sort  (cost=252683.61..252683.68 rows=29 width=116) (actual
time=169377.31..169377.69 rows=1001 loops=1)
 Sort Key: max((actor.actor_full_name)::text),
count(case_data.case_id)
 ->  Aggregate  (cost=252678.57..252682.91 rows=29 width=116)
(actual time=169305.79..169354.50 rows=3456 loops=1)
   ->  Group  (cost=252678.57..252680.01 rows=289 width=116)
(actual time=169305.76..169330.00 rows=5879 loops=1)
 ->  Sort  (cost=252678.57..252679.29 rows=289
width=116) (actual time=169305.75..169308.15 rows=5879 loops=1)
   Sort Key: actor.actor_id
   ->  Nested Loop  (cost=0.00..252666.74 rows=289
width=116) (actual time=89.27..169273.51 rows=5879 loops=1)
 ->  Nested Loop  (cost=0.00..251608.11
rows=289 width=77) (actual time=57.73..92753.49 rows=5882 loops=1)
   ->  Index Scan using
actor_full_name_uppercase on actor  (cost=0.00..456.88 rows=113 width=42)
(actual time=32.80..3197.28 rows=3501 loops=1)
 Index Cond:
((actor_full_name_uppercase >= 'SANDERS'::character varying) AND
(actor_full_name_uppercase < 'SANDERT'::character varying))
 Filter:
(actor_full_name_uppercase ~~ 'SANDERS%'::text)
   ->  Index Scan using
actor_case_assignment_actor_id on actor_case_assignment  (cost=0.00..2181.29
rows=2616 width=35) (actual time=22.26..25.57 rows=2 loops=3501)
 Index Cond: ("outer".actor_id =
actor_case_assignment.actor_id)
 ->  Index Scan using case_data_case_id on
case_data  (cost=0.00..3.65 rows=1 width=39) (actual time=13.00..13.00
rows=1 loops=5882)
   Index Cond: (case_data.case_id =
"outer".case_id)
 Total runtime: 169381.38 msec
(17 rows)



---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PERFORM] Seeking help with a query that takes too long

2003-11-12 Thread Tom Lane
"Nick Fankhauser" <[EMAIL PROTECTED]> writes:
> This indicates to me that 1 isn't too shabby as an estimate if the whole
> name is specified, but I'm not sure how this gets altered in the case of a
> "LIKE"

For a pattern like "SANDERS%", the estimate is basically a range estimate
for this condition:

> ((actor_full_name_uppercase >= 'SANDERS'::character varying) AND
> (actor_full_name_uppercase < 'SANDERT'::character varying))

> n_distinct| -0.14701

> Question: What does it mean when n_distinct is negative?

It means that the number of distinct values is estimated as a fraction
of the table size, rather than an absolute number.  In this case 14.7%
of the table size, which is a bit off compared to the correct value
of 43% (1453371/3386359), but at least it's of the right order of
magnitude now ...
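
(A one-off way to check the real fraction, using the counts Nick already
posted; this is just an ad-hoc sanity query, not something ANALYZE runs:)

  SELECT count(DISTINCT actor_full_name_uppercase)::float8 / count(*)
    FROM actor;
  -- with Nick's numbers: 1453371 / 3386359 is about 0.43, vs. the 0.147 estimated now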

>->  Index Scan using
> actor_full_name_uppercase on actor  (cost=0.00..456.88 rows=113 width=42)
> (actual time=32.80..3197.28 rows=3501 loops=1)

Hmm.  Better, but not enough better to force a different plan choice.

You might have to resort to brute force, like "set enable_nestloop=false".
Just out of curiosity, what do you get if you do that?

regards, tom lane

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])


[PERFORM] performance optimizations

2003-11-12 Thread Suchandra Thapa
I'm moving a webmail service over to use a postgresql database for
storage and wanted to get any tips for optimizing performance.  The
machine will be a multiprocessor (either 2 or 4 cpu ) system with a raid
array.  What layout should be used?  I was thinking about using a
RAID 1+0 array to hold the database, but since I can use different array
types, would it be better to use 1+0 for the WAL logs and RAID 5 for
the database?

The database gets fairly heavy activity (the system handles about 500MB
of incoming and about 750MB of outgoing emails daily).  I have a fairly
free rein in regards to the system's layout as well as how the
applications will interact with the database since I'm writing the
code.  


-- 
Suchandra Thapa <[EMAIL PROTECTED]>




Re: [PERFORM] performance optimizations

2003-11-12 Thread Rod Taylor
On Wed, 2003-11-12 at 12:34, Suchandra Thapa wrote:
> I'm moving a webmail service over to use a postgresql database for
> storage and wanted to get any tips for optimizing performance.  The
> machine will be a multiprocessor (either 2 or 4 cpu ) system with a raid
> array.  What layout should be used?  I was thinking using about using a
> raid 1+0 array to hold the database but since I can use different array
> types, would it be better to use 1+0 for the wal logs and a raid 5 for
> the database?

How much in total storage?  If you have (or will have) > ~6 disks, go
for RAID 5 otherwise 10 is probably appropriate.

> The database gets fairly heavy activity (the system handles about 500MB
> of incoming and about 750MB of outgoing emails daily).  I have a fairly
> free rein in regards to the system's layout as well as how the
> applications will interact with the database since I'm writing the
> code.

These are archived permanently -- ~450GB of annual data? Or is the data
removed upon delivery?


---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PERFORM] Seeking help with a query that takes too long

2003-11-12 Thread Nick Fankhauser

> You might have to resort to brute force, like "set enable_nestloop=false".
> Just out of curiosity, what do you get if you do that?

I get a different plan, but similar execution time:


 Limit  (cost=323437.13..323437.13 rows=1 width=115) (actual
time=170921.89..170922.95 rows=1000 loops=1)
   ->  Sort  (cost=323437.13..323437.13 rows=1 width=115) (actual
time=170921.89..170922.26 rows=1001 loops=1)
 Sort Key: max((actor.actor_full_name)::text),
count(case_data.case_id)
 ->  Aggregate  (cost=323437.08..323437.12 rows=1 width=115) (actual
time=170849.94..170898.06 rows=3457 loops=1)
   ->  Group  (cost=323437.08..323437.09 rows=3 width=115)
(actual time=170849.90..170873.60 rows=5880 loops=1)
 ->  Sort  (cost=323437.08..323437.08 rows=3 width=115)
(actual time=170847.97..170850.21 rows=5880 loops=1)
   Sort Key: actor.actor_id
   ->  Hash Join  (cost=25.29..323437.06 rows=3
width=115) (actual time=122873.80..170814.27 rows=5880 loops=1)
 Hash Cond: ("outer".case_id =
"inner".case_id)
 ->  Seq Scan on case_data
(cost=0.00..60368.16 rows=1947116 width=39) (actual time=12.95..43542.25
rows=1947377 loops=1)
 ->  Hash  (cost=25.28..25.28 rows=3
width=76) (actual time=122844.40..122844.40 rows=0 loops=1)
   ->  Hash Join  (cost=6.02..25.28
rows=3 width=76) (actual time=24992.70..122810.32 rows=5883 loops=1)
 Hash Cond: ("outer".actor_id =
"inner".actor_id)
 ->  Seq Scan on
actor_case_assignment  (cost=0.00..209980.49 rows=8669349 width=34) (actual
time=9.13..85504.05 rows=8670467 loops=1)
 ->  Hash  (cost=6.01..6.01
rows=1 width=42) (actual time=24926.56..24926.56 rows=0 loops=1)
   ->  Index Scan using
actor_full_name_uppercase on actor  (cost=0.00..6.01 rows=1 width=42)
(actual time=51.67..24900.53 rows=3502 loops=1)
 Index Cond:
((actor_full_name_uppercase >= 'SANDERS'::character varying) AND
(actor_full_name_uppercase < 'SANDERT'::character varying))
 Filter:
(actor_full_name_uppercase ~~ 'SANDERS%'::text)
 Total runtime: 170925.93 msec
(19 rows)


-Nick



---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] performance optimizations

2003-11-12 Thread Suchandra Thapa
On Wed, 2003-11-12 at 12:23, Rod Taylor wrote:
> On Wed, 2003-11-12 at 12:34, Suchandra Thapa wrote:
> > I'm moving a webmail service over to use a postgresql database for
> > storage and wanted to get any tips for optimizing performance.  The
> > machine will be a multiprocessor (either 2 or 4 cpu ) system with a raid
> > array.  What layout should be used?  I was thinking using about using a
> > raid 1+0 array to hold the database but since I can use different array
> > types, would it be better to use 1+0 for the wal logs and a raid 5 for
> > the database?
> 
> How much in total storage?  If you have (or will have) > ~6 disks, go
> for RAID 5 otherwise 10 is probably appropriate.

I'm not sure but I believe there are about 6-8 10K scsi drives on the
system.   There is quite a bit of storage to spare currently so I think 

> > The database gets fairly heavy activity (the system handles about 500MB
> > of incoming and about 750MB of outgoing emails daily).  I have a fairly
> > free rein in regards to the system's layout as well as how the
> > applications will interact with the database since I'm writing the
> > code.
> 
> These are archived permanently -- ~450GB of annual data? Or is the data
> removed upon delivery?

No, it's more like Hotmail.  Some users may keep mail for a longer term
but a lot of the mail probably gets deleted fairly quickly.  The
database load will be mixed, with insertions due to deliveries, queries
by the webmail system, and deletions from POP and webmail.

-- 
Suchandra Thapa <[EMAIL PROTECTED]>




Re: [PERFORM] performance optimizations

2003-11-12 Thread Neil Conway
Suchandra Thapa <[EMAIL PROTECTED]> writes:
> I was thinking using about using a raid 1+0 array to hold the
> database but since I can use different array types, would it be
> better to use 1+0 for the wal logs and a raid 5 for the database?

It has been recommended on this list that getting a RAID controller
with a battery-backed cache is pretty essential to getting good
performance. Search the list archives for lots more discussion about
RAID configurations.

-Neil


---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [PERFORM] Seeking help with a query that takes too long

2003-11-12 Thread Manfred Koizar
On Wed, 12 Nov 2003 13:27:53 -0500, "Nick Fankhauser"
<[EMAIL PROTECTED]> wrote:
>
>> You might have to resort to brute force, like "set enable_nestloop=false".

> ->  Seq Scan on
>actor_case_assignment  (cost=0.00..209980.49 rows=8669349 width=34) (actual
>time=9.13..85504.05 rows=8670467 loops=1)

Does actor_case_assignment contain more columns than just the two ids?
If yes, do these additional fields account for ca. 70 bytes per tuple?
If not, try
VACUUM FULL ANALYSE actor_case_assignment;

>   ->  Index Scan using
>actor_full_name_uppercase on actor  (cost=0.00..6.01 rows=1 width=42)
>(actual time=51.67..24900.53 rows=3502 loops=1)

This same index scan on actor has been much faster in your previous
postings (677ms, 3200ms), probably due to caching effects.  7ms per
tuple returned looks like a lot of disk seeks are involved.  Is
clustering actor on actor_full_name_uppercase an option or would this
slow down other queries?
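
(If it is an option, a sketch using the 7.3 syntax and the index name from the
plan; note that CLUSTER takes an exclusive lock, rewrites the table, and the
physical ordering is not maintained for rows added later:)

  CLUSTER actor_full_name_uppercase ON actor;
  ANALYZE actor;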

Servus
 Manfred

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [PERFORM] performance optimizations

2003-11-12 Thread Suchandra Thapa
On Wed, 2003-11-12 at 16:29, Neil Conway wrote:
> Suchandra Thapa <[EMAIL PROTECTED]> writes:
> > I was thinking using about using a raid 1+0 array to hold the
> > database but since I can use different array types, would it be
> > better to use 1+0 for the wal logs and a raid 5 for the database?
> 
> It has been recommended on this list that getting a RAID controller
> with a battery-backed cache is pretty essential to getting good
> performance. Search the list archives for lots more discussion about
> RAID configurations.

The server is already using a RAID controller with battery-backed RAM
and the cache set to write back (the server is on a UPS, so power
failures shouldn't cause problems).  I'll look at the list archives
for RAID information.

-- 
Suchandra Thapa <[EMAIL PROTECTED]>




[PERFORM] Query question

2003-11-12 Thread Christopher Kings-Lynne
Hi,

I have coded some improvements to phpPgAdmin that I think are pretty 
cool.  Basically, once you are browsing the results of an arbitrary
SELECT query, you can still sort by columns, regardless of the 
underlying ORDER BY of the SELECT.

I do this like this:

SELECT * FROM (arbitrary subquery) AS sub ORDER BY 1,3;

Now, this all works fine, but I want to know if this is efficient or not.

Does doing a select of a select cause serious performance degradation?

Chris



---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PERFORM] performance optimizations

2003-11-12 Thread Rod Taylor
> > How much in total storage?  If you have (or will have) > ~6 disks, go
> > for RAID 5 otherwise 10 is probably appropriate.
> 
> I'm not sure but I believe there are about 6-8 10K scsi drives on the
> system.   There is quite a bit of storage to spare currently so I think 

I see.. With 8 drives, you'll probably want to go with RAID 5. It grows
beyond that point fairly well with a decent controller card. Be sure to
have some battery backed write cache on the raid card (128MB goes a long
way).

> > > The database gets fairly heavy activity (the system handles about 500MB
> > > of incoming and about 750MB of outgoing emails daily).  I have a fairly

> No, it's more like hotmail.  Some users may keep mail for a longer term
> but a lot of the mail probably gets deleted fairly quickly.  The
> database load will be mixed with a insertions due to deliveries, queries
> by the webmail system, and deletions from pop and webmail.

You might consider having the mail server gzip the emails prior to
injection into the database (and turn off compression in PostgreSQL), then
decompress the data on the webserver for display to the client. That way you
can hold roughly 7 times as many emails in memory.

It's easier to toss a webserver at the problem than to make the database
bigger. Take the savings in CPU on the DB and put it toward RAM.

1200MB of mail, compressed, is about 200MB? Add the email descriptive
material (subject, from, etc.), account structure, indexes... so about
400MB for one day's worth of information?

You may want to consider keeping the compressed email in a separate
table from the information describing it. That way the descriptive
information is more likely to stay in RAM, while the body matters less
(you view bodies one at a time, whereas subjects tend to be listed
all at once).
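
(A minimal sketch of that split, with hypothetical table and column names; the
SET STORAGE EXTERNAL step keeps PostgreSQL from trying to re-compress a body
that the mail server has already gzipped:)

  -- header/summary data, likely to stay cached in RAM
  CREATE TABLE email (
    email_id     serial PRIMARY KEY,
    account_id   integer NOT NULL,
    sender       text,
    subject      text,
    received_at  timestamp NOT NULL DEFAULT now()
  );

  -- bodies kept apart so they don't crowd the cache
  CREATE TABLE email_body (
    email_id  integer PRIMARY KEY REFERENCES email,
    body_gz   bytea NOT NULL   -- gzipped by the mail server before insertion
  );

  -- store the column out-of-line but uncompressed (it is compressed already)
  ALTER TABLE email_body ALTER COLUMN body_gz SET STORAGE EXTERNAL;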

Most clients will be interested in, say, the last 7 days' worth of data?
Great. Start out with 4GB of RAM on a good dual-CPU machine -- Opterons seem
to work quite well -- and make sure the motherboard can hold double that in
memory for an upgrade sometime next year when you've become popular.

I firmly believe lots of RAM is the answer to most I/O issues until you
start getting into large sets of active data (>50GB). 64GB of RAM is fairly
cheap compared to the ongoing maintenance of the 30+ drive system required
to get decent throughput.


---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PERFORM] Query question

2003-11-12 Thread Josh Berkus
Chris,

> SELECT * FROM (arbitrary subquery) AS sub ORDER BY 1,3;
>
> Now, this all works fine, but I want to know if this is efficient or not.
>
> Does doing a select of a select cause serious performance degradation?

It would be better if you could strip out the inner sort, but I can understand 
why that might not be possible in all cases.

The only thing you're adding to the query is a second SORT step, so it 
shouldn't require any more time/memory than the query's first SORT did.
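
(A quick way to confirm this for any particular query is to EXPLAIN the
wrapped form; the table and columns below are made up for illustration. The
plan should simply be the subquery's own plan with one extra Sort on top:)

  EXPLAIN
  SELECT * FROM (SELECT id, name, created FROM items ORDER BY created) AS sub
  ORDER BY 1, 3;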

-- 
Josh Berkus
Aglio Database Solutions
San Francisco

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] performance optimizations

2003-11-12 Thread Suchandra Thapa
On Wed, 2003-11-12 at 22:35, Rod Taylor wrote:
> You may want to consider keeping the compressed email in a separate
> table than the information describing it. It would mean descriptive
> information is more likely to be in RAM, where the body probably doesn't
> matter as much (you view them 1 at a time, subjects tend to be listed
> all at once).

Thanks for the suggestions.  Splitting the load between several machines
was the original intent of moving the storage from the file system to a
database.  I believe the schema I'm already using splits out the body
due to the size of some attachments.  Luckily the code already gzips the
email body and abbreviates common email headers so storing compressed
emails isn't a problem.  

> Most clients will be interested in say the last 7 days worth of data? 
> Great.. Start out with 4GB ram on a good Dual CPU -- Opterons seem to
> work quite well -- and make sure the motherboard can hold double that in
> memory for an upgrade sometime next year when you've become popular.

Unfortunately, the hardware available is pretty much fixed in regards to
the system.  I can play around with the RAID configurations and have
some limited choice in regards to the RAID controller and the number of
drives, but that's about all in terms of hardware.

> I firmly believe lots of RAM is the answer to most IO issues until you
> start getting into large sets of active data (>50GB). 64GB ram is fairly
> cheap compared to ongoing maintenance of the 30+ drive system required
> to get decent throughput.

The current file system holding the user and email information indicates
the current data is about 64GB (70K accounts; I'm not sure how many are
active, but 50% might be a good guess).  This seems to be somewhat of a
steady state, however.

-- 
Suchandra Thapa <[EMAIL PROTECTED]>

