Re: [PERFORM] postgresql geqo optimization

2006-02-10 Thread Steinar H. Gunderson
On Fri, Feb 10, 2006 at 08:46:14PM -0500, uwcssa wrote:
> I am wondering if anyone here ever had complex queries that the GEQO fails
> to work properly, i.e.,  finds a terrible query plan as compared to one
> found by DP optimizer (by forcing Postgresql always uses DP).This is
> important to me since I am trying to see what type of queries will be worth
> spending a lot of time doing a thorough DP optimization (if it is going to
> be executed again and again).

There have been a few problems earlier on this list which might have been the
geqo's fault; search the list archives for "geqo" or "genetic", and you
should be able to find them quite easily.

/* Steinar */
-- 
Homepage: http://www.sesse.net/



Re: [PERFORM] [HACKERS] What do the Windows pg hackers out there like for dev

2006-02-10 Thread Mark Kirkwood

Ron wrote:
Subject line says it all.  I'm going to be testing changes under both 
Linux and WinXP, so I'm hoping those of you that do M$ hacking will pass 
along your list of suggestions and/or favorite (and hated so I know what 
to avoid) tools.




Testing only? So you really only need to build and run on Windows...

I was doing exactly this about a year ago and used MinGW. The only 
annoyance was that I could compile everything on Linux in about 3 
minutes (P4 2.8GHz), but had to wait about 60-90 minutes for the same 
thing on Windows 2003 Server! (also a P4 2.8GHz...). So I used to build 
a 'go for coffee' task into the build and test cycle.


Cheers

Mark



[PERFORM] postgresql geqo optimization

2006-02-10 Thread uwcssa
I have a question with regard to the GEQO optimizer of PostgreSQL.

For complex queries joining more than 12 tables (12 is the default value), the PostgreSQL optimizer will by default not use the dynamic-programming (DP) style optimizer. Instead, it uses a genetic algorithm to compute a sub-optimal query plan. The reason is that GEQO takes sub-seconds to find a query plan, while the DP-style optimizer can take minutes or even hours to optimize a complex query with a large join degree.

I am wondering if anyone here has ever had complex queries for which GEQO fails to work properly, i.e., finds a terrible query plan compared to the one found by the DP optimizer (obtained by forcing PostgreSQL to always use DP). This is important to me, since I am trying to see what types of queries are worth spending a lot of time on a thorough DP optimization (if the query is going to be executed again and again).
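For concreteness, a minimal way to compare the two planners on a given query is something like the following sketch (the SELECT is only a placeholder for the complex query in question; geqo and geqo_threshold are the relevant settings):

    -- force the exhaustive (DP) planner for this session, regardless of join size
    SET geqo = off;
    EXPLAIN ANALYZE SELECT ... ;   -- the complex query

    -- back to the default behaviour: GEQO kicks in at geqo_threshold relations
    SET geqo = on;
    SET geqo_threshold = 12;
    EXPLAIN ANALYZE SELECT ... ;   -- same query, now planned by GEQO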

 
thanks a lot!


Re: [PERFORM] joining two tables slow due to sequential scan

2006-02-10 Thread Dave Dutcher
OK, if I'm reading this correctly, it looks like the planner is choosing
a sequential scan because it expects 48,000 rows for that
patientidentifier, but it's actually only getting 3.  The planner has the
number of rows right for the sequential scan, so it seems like the stats
are up to date.  I would try increasing the stats for the
patientidentifier column with 'alter table ... set statistics ...' or
increasing the default_statistics_target for the whole DB.  Once you
have changed the stats, I believe you need to run ANALYZE again.
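As a sketch of what that would look like here (the target value of 200 is just a starting point to experiment with):

    -- raise the statistics target for the skewed column, then refresh the stats
    ALTER TABLE clinicaldocuments ALTER COLUMN patientidentifier SET STATISTICS 200;
    ANALYZE clinicaldocuments;

    -- or raise it database-wide in postgresql.conf:
    -- default_statistics_target = 100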



-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Tim Jones
Sent: Friday, February 10, 2006 4:59 PM
To: Tom Lane
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] joining two tables slow due to sequential scan 

OK, here is the real db.

The first query I had seems to make no sense, because it is only fast if
I limit the rows, since almost all rows have status = 'AC'.

For the second query: both tables have about 10 million rows and it takes
a long time, as you can see, but this person only has approx 160 total
documents.


 QUERY PLAN  

---
 Hash Join  (cost=84813.14..1510711.97 rows=48387 width=555) (actual
time=83266.854..91166.315 rows=3 loops=1)
   Hash Cond: ("outer".documentidentifier =
"inner".dssdocumentidentifier)
   ->  Seq Scan on documentversions  (cost=0.00..269141.98 rows=9677398
width=415) (actual time=0.056..49812.459 rows=9677398 loops=1)
   ->  Hash  (cost=83660.05..83660.05 rows=48036 width=140) (actual
time=10.833..10.833 rows=3 loops=1)
 ->  Bitmap Heap Scan on clinicaldocuments
(cost=301.13..83660.05 rows=48036 width=140) (actual time=0.243..0.258
rows=3 loops=1)
   Recheck Cond: (patientidentifier = 690193)
   ->  Bitmap Index Scan on ix_cdocpid  (cost=0.00..301.13
rows=48036 width=0) (actual time=0.201..0.201 rows=3 loops=1)
 Index Cond: (patientidentifier = 690193)
 Total runtime: 91166.540 ms


Tim Jones
Healthcare Project Manager
Optio Software, Inc.
(770) 576-3555

-Original Message-
From: Tom Lane [mailto:[EMAIL PROTECTED] 
Sent: Friday, February 10, 2006 5:52 PM
To: Tim Jones
Cc: Scott Marlowe; Dave Dutcher; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] joining two tables slow due to sequential scan 

"Tim Jones" <[EMAIL PROTECTED]> writes:
> QUERY PLAN
> 'Hash Join  (cost=899.83..4384.17 rows=482 width=1350) (actual
> time=0.203..0.203 rows=0 loops=1)'
> ...
> 'Total runtime: 0.392 ms'

Hardly seems like evidence of a performance problem ...

regards, tom lane



Re: [PERFORM] joining two tables slow due to sequential scan

2006-02-10 Thread Tim Jones
OK, here is the real db.

The first query I had seems to make no sense, because it is only fast if
I limit the rows, since almost all rows have status = 'AC'.

For the second query: both tables have about 10 million rows and it takes
a long time, as you can see, but this person only has approx 160 total
documents.


 QUERY PLAN  

---
 Hash Join  (cost=84813.14..1510711.97 rows=48387 width=555) (actual
time=83266.854..91166.315 rows=3 loops=1)
   Hash Cond: ("outer".documentidentifier =
"inner".dssdocumentidentifier)
   ->  Seq Scan on documentversions  (cost=0.00..269141.98 rows=9677398
width=415) (actual time=0.056..49812.459 rows=9677398 loops=1)
   ->  Hash  (cost=83660.05..83660.05 rows=48036 width=140) (actual
time=10.833..10.833 rows=3 loops=1)
 ->  Bitmap Heap Scan on clinicaldocuments
(cost=301.13..83660.05 rows=48036 width=140) (actual time=0.243..0.258
rows=3 loops=1)
   Recheck Cond: (patientidentifier = 690193)
   ->  Bitmap Index Scan on ix_cdocpid  (cost=0.00..301.13
rows=48036 width=0) (actual time=0.201..0.201 rows=3 loops=1)
 Index Cond: (patientidentifier = 690193)
 Total runtime: 91166.540 ms


Tim Jones
Healthcare Project Manager
Optio Software, Inc.
(770) 576-3555

-Original Message-
From: Tom Lane [mailto:[EMAIL PROTECTED] 
Sent: Friday, February 10, 2006 5:52 PM
To: Tim Jones
Cc: Scott Marlowe; Dave Dutcher; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] joining two tables slow due to sequential scan 

"Tim Jones" <[EMAIL PROTECTED]> writes:
> QUERY PLAN
> 'Hash Join  (cost=899.83..4384.17 rows=482 width=1350) (actual
> time=0.203..0.203 rows=0 loops=1)'
> ...
> 'Total runtime: 0.392 ms'

Hardly seems like evidence of a performance problem ...

regards, tom lane



Re: [PERFORM] joining two tables slow due to sequential scan

2006-02-10 Thread Tom Lane
"Tim Jones" <[EMAIL PROTECTED]> writes:
> QUERY PLAN
> 'Hash Join  (cost=899.83..4384.17 rows=482 width=1350) (actual
> time=0.203..0.203 rows=0 loops=1)'
> ...
> 'Total runtime: 0.392 ms'

Hardly seems like evidence of a performance problem ...

regards, tom lane



Re: [PERFORM] joining two tables slow due to sequential scan

2006-02-10 Thread Scott Marlowe
On Fri, 2006-02-10 at 16:43, Tim Jones wrote:
> oops
> 
> QUERY PLAN
> 'Hash Join  (cost=899.83..4384.17 rows=482 width=1350) (actual
> time=0.203..0.203 rows=0 loops=1)'
> '  Hash Cond: ("outer".documentidentifier =
> "inner".dssdocumentidentifier)'
> '  ->  Seq Scan on documentversions  (cost=0.00..2997.68 rows=96368
> width=996) (actual time=0.007..0.007 rows=1 loops=1)'
> '  ->  Hash  (cost=898.62..898.62 rows=482 width=354) (actual
> time=0.161..0.161 rows=0 loops=1)'
> '->  Bitmap Heap Scan on clinicaldocuments  (cost=4.69..898.62
> rows=482 width=354) (actual time=0.159..0.159 rows=0 loops=1)'
> '  Recheck Cond: (patientidentifier = 123)'
> '  ->  Bitmap Index Scan on ix_cdocpid  (cost=0.00..4.69
> rows=482 width=0) (actual time=0.153..0.153 rows=0 loops=1)'
> 'Index Cond: (patientidentifier = 123)'
> 'Total runtime: 0.392 ms'
> 
> note I have done these on a smaller db than what I am using but the
> plans are the same 


Hmmm.  We really need to see what's happening on the real database to
see what's going wrong.  i.e. if the real database thinks it'll get 30
rows and it gets back 5,000,000 that's a problem.

The query planner in pgsql is cost based, so until you have real data
underneath it, and analyze it, you can't really say how it will behave
for you.  I.e. small test sets don't work.



Re: [PERFORM] joining two tables slow due to sequential scan

2006-02-10 Thread Tom Lane
"Tim Jones" <[EMAIL PROTECTED]> writes:
> QUERY PLAN
> 'Hash Join  (cost=899.83..4384.17 rows=482 width=1350)'
> '  Hash Cond: ("outer".documentidentifier =
> "inner".dssdocumentidentifier)'

This is not EXPLAIN ANALYZE output.  Also, the rowcount estimates
seem far enough off in the other query to make me wonder how long
it's been since you ANALYZEd the tables...

More generally, though, I don't see anything particularly wrong
with this query plan.  You're selecting enough of the table that
an indexscan isn't necessarily a good plan.
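As a quick sketch of how to check both points, using the table names from this thread (123 is just the example id used earlier):

    -- refresh planner statistics, then look at the estimates again
    ANALYZE clinicaldocuments;
    ANALYZE documentversions;

    EXPLAIN ANALYZE
    SELECT *
      FROM clinicaldocuments c
      JOIN documentversions d ON c.dssdocumentidentifier = d.documentidentifier
     WHERE c.patientidentifier = 123;

    -- for diagnosis only: see what plan is picked when seqscans are penalized
    SET enable_seqscan = off;
    -- ... rerun the EXPLAIN ANALYZE above ...
    SET enable_seqscan = on;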

regards, tom lane



Re: [PERFORM] joining two tables slow due to sequential scan

2006-02-10 Thread Tim Jones
oops

QUERY PLAN
'Hash Join  (cost=899.83..4384.17 rows=482 width=1350) (actual
time=0.203..0.203 rows=0 loops=1)'
'  Hash Cond: ("outer".documentidentifier =
"inner".dssdocumentidentifier)'
'  ->  Seq Scan on documentversions  (cost=0.00..2997.68 rows=96368
width=996) (actual time=0.007..0.007 rows=1 loops=1)'
'  ->  Hash  (cost=898.62..898.62 rows=482 width=354) (actual
time=0.161..0.161 rows=0 loops=1)'
'->  Bitmap Heap Scan on clinicaldocuments  (cost=4.69..898.62
rows=482 width=354) (actual time=0.159..0.159 rows=0 loops=1)'
'  Recheck Cond: (patientidentifier = 123)'
'  ->  Bitmap Index Scan on ix_cdocpid  (cost=0.00..4.69
rows=482 width=0) (actual time=0.153..0.153 rows=0 loops=1)'
'Index Cond: (patientidentifier = 123)'
'Total runtime: 0.392 ms'

note I have done these on a smaller db than what I am using but the
plans are the same 


Tim Jones
Healthcare Project Manager
Optio Software, Inc.
(770) 576-3555

-Original Message-
From: Scott Marlowe [mailto:[EMAIL PROTECTED] 
Sent: Friday, February 10, 2006 5:39 PM
To: Tim Jones
Cc: Dave Dutcher; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] joining two tables slow due to sequential scan

On Fri, 2006-02-10 at 16:37, Tim Jones wrote:
> for first query
> 
> QUERY PLAN
> 'Limit  (cost=4.69..88.47 rows=10 width=1350) (actual
> time=32.195..32.338 rows=10 loops=1)'
> '  ->  Nested Loop  (cost=4.69..4043.09 rows=482 width=1350) (actual
> time=32.190..32.316 rows=10 loops=1)'
> '->  Bitmap Heap Scan on documentversions  (cost=4.69..1139.40
> rows=482 width=996) (actual time=32.161..32.171 rows=10 loops=1)'
> '  Recheck Cond: (documentstatus = ''AC''::bpchar)'
> '  ->  Bitmap Index Scan on ix_docstatus  (cost=0.00..4.69
> rows=482 width=0) (actual time=31.467..31.467 rows=96368 loops=1)'
> 'Index Cond: (documentstatus = ''AC''::bpchar)'
> '->  Index Scan using ix_cdocdid on clinicaldocuments
> (cost=0.00..6.01 rows=1 width=354) (actual time=0.006..0.007 rows=1 
> loops=10)'
> '  Index Cond: ("outer".documentidentifier =
> clinicaldocuments.dssdocumentidentifier)'
>  
> 
>  for second query
> 
> QUERY PLAN
> 'Hash Join  (cost=899.83..4384.17 rows=482 width=1350)'
> '  Hash Cond: ("outer".documentidentifier = 
> "inner".dssdocumentidentifier)'
> '  ->  Seq Scan on documentversions  (cost=0.00..2997.68 rows=96368 
> width=996)'
> '  ->  Hash  (cost=898.62..898.62 rows=482 width=354)'
> '->  Bitmap Heap Scan on clinicaldocuments  (cost=4.69..898.62
> rows=482 width=354)'
> '  Recheck Cond: (patientidentifier = 123)'
> '  ->  Bitmap Index Scan on ix_cdocpid  (cost=0.00..4.69
> rows=482 width=0)'
> 'Index Cond: (patientidentifier = 123)'

OK, the first one is explain analyze, but the second one is just plain
explain...



Re: [PERFORM] Large Database Design Help

2006-02-10 Thread Scott Marlowe
On Fri, 2006-02-10 at 16:39, Ragnar wrote:
> On Fri, 2006-02-10 at 11:24 +0100, Markus Schaber wrote:
> 
> > For lots non-read-only database workloads, RAID5 is a performance
> > killer. Raid 1/0 might be better, or having two mirrors of two disks
> > each, the first mirror holding system, swap, and the PostgreSQL WAL
> > files, the second one holding the data.
> 
> I was under the impression that it is preferable to keep the WAL on 
> its own spindles with no other activity there, to take full advantage
> of the sequential nature of the WAL writes.
> 
> That would mean one mirror for the WAL, and one for the rest.
> This, of course, may sometimes be too much wasted disk space, as the WAL
> typically will not use a whole disk, so you might partition this mirror
> into a small ext2 filesystem for WAL, and use the rest for files seldom 
> accessed, such as backups. 

Well, on most database servers, the actual access to the OS and swap
drives should drop to about zero over time, so this is a workable
solution if you've only got enough drives / drive slots for two mirrors.



Re: [PERFORM] joining two tables slow due to sequential scan

2006-02-10 Thread Scott Marlowe
On Fri, 2006-02-10 at 16:37, Tim Jones wrote:
> for first query
> 
> QUERY PLAN
> 'Limit  (cost=4.69..88.47 rows=10 width=1350) (actual
> time=32.195..32.338 rows=10 loops=1)'
> '  ->  Nested Loop  (cost=4.69..4043.09 rows=482 width=1350) (actual
> time=32.190..32.316 rows=10 loops=1)'
> '->  Bitmap Heap Scan on documentversions  (cost=4.69..1139.40
> rows=482 width=996) (actual time=32.161..32.171 rows=10 loops=1)'
> '  Recheck Cond: (documentstatus = ''AC''::bpchar)'
> '  ->  Bitmap Index Scan on ix_docstatus  (cost=0.00..4.69
> rows=482 width=0) (actual time=31.467..31.467 rows=96368 loops=1)'
> 'Index Cond: (documentstatus = ''AC''::bpchar)'
> '->  Index Scan using ix_cdocdid on clinicaldocuments
> (cost=0.00..6.01 rows=1 width=354) (actual time=0.006..0.007 rows=1
> loops=10)'
> '  Index Cond: ("outer".documentidentifier =
> clinicaldocuments.dssdocumentidentifier)'
>  
> 
>  for second query
> 
> QUERY PLAN
> 'Hash Join  (cost=899.83..4384.17 rows=482 width=1350)'
> '  Hash Cond: ("outer".documentidentifier =
> "inner".dssdocumentidentifier)'
> '  ->  Seq Scan on documentversions  (cost=0.00..2997.68 rows=96368
> width=996)'
> '  ->  Hash  (cost=898.62..898.62 rows=482 width=354)'
> '->  Bitmap Heap Scan on clinicaldocuments  (cost=4.69..898.62
> rows=482 width=354)'
> '  Recheck Cond: (patientidentifier = 123)'
> '  ->  Bitmap Index Scan on ix_cdocpid  (cost=0.00..4.69
> rows=482 width=0)'
> 'Index Cond: (patientidentifier = 123)'

OK, the first one is explain analyze, but the second one is just plain
explain...



Re: [PERFORM] Large Database Design Help

2006-02-10 Thread Ragnar
On Fri, 2006-02-10 at 11:24 +0100, Markus Schaber wrote:

> For lots non-read-only database workloads, RAID5 is a performance
> killer. Raid 1/0 might be better, or having two mirrors of two disks
> each, the first mirror holding system, swap, and the PostgreSQL WAL
> files, the second one holding the data.

I was under the impression that it is preferable to keep the WAL on 
its own spindles with no other activity there, to take full advantage
of the sequential nature of the WAL writes.

That would mean one mirror for the WAL, and one for the rest.
This, of course, may sometimes be too much wasted disk space, as the WAL
typically will not use a whole disk, so you might partition this mirror
into a small ext2 filesystem for WAL, and use the rest for files seldom 
accessed, such as backups. 

gnari





Re: [PERFORM] joining two tables slow due to sequential scan

2006-02-10 Thread Tim Jones
for first query

QUERY PLAN
'Limit  (cost=4.69..88.47 rows=10 width=1350) (actual
time=32.195..32.338 rows=10 loops=1)'
'  ->  Nested Loop  (cost=4.69..4043.09 rows=482 width=1350) (actual
time=32.190..32.316 rows=10 loops=1)'
'->  Bitmap Heap Scan on documentversions  (cost=4.69..1139.40
rows=482 width=996) (actual time=32.161..32.171 rows=10 loops=1)'
'  Recheck Cond: (documentstatus = ''AC''::bpchar)'
'  ->  Bitmap Index Scan on ix_docstatus  (cost=0.00..4.69
rows=482 width=0) (actual time=31.467..31.467 rows=96368 loops=1)'
'Index Cond: (documentstatus = ''AC''::bpchar)'
'->  Index Scan using ix_cdocdid on clinicaldocuments
(cost=0.00..6.01 rows=1 width=354) (actual time=0.006..0.007 rows=1
loops=10)'
'  Index Cond: ("outer".documentidentifier =
clinicaldocuments.dssdocumentidentifier)'
 
 
 for second query

QUERY PLAN
'Hash Join  (cost=899.83..4384.17 rows=482 width=1350)'
'  Hash Cond: ("outer".documentidentifier =
"inner".dssdocumentidentifier)'
'  ->  Seq Scan on documentversions  (cost=0.00..2997.68 rows=96368
width=996)'
'  ->  Hash  (cost=898.62..898.62 rows=482 width=354)'
'->  Bitmap Heap Scan on clinicaldocuments  (cost=4.69..898.62
rows=482 width=354)'
'  Recheck Cond: (patientidentifier = 123)'
'  ->  Bitmap Index Scan on ix_cdocpid  (cost=0.00..4.69
rows=482 width=0)'
'Index Cond: (patientidentifier = 123)'


thnx

Tim Jones
Healthcare Project Manager
Optio Software, Inc.
(770) 576-3555
 



From: Dave Dutcher [mailto:[EMAIL PROTECTED] 
Sent: Friday, February 10, 2006 5:15 PM
To: Tim Jones; pgsql-performance@postgresql.org
Subject: RE: [PERFORM] joining two tables slow due to sequential scan



What version of postgres are you using?  Can you post the output from
EXPLAIN ANALYZE?

 

 

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Tim Jones
Sent: Friday, February 10, 2006 4:07 PM
To: pgsql-performance@postgresql.org
Subject: [PERFORM] joining two tables slow due to sequential scan

I am trying to join two tables and keep getting a sequential scan in the
plan even though there is an index on the columns I am joining on.
Basically this is the deal ... I have two tables with docid in them,
which is what I am using for the join.

ClinicalDocs ... (no primary key) though it does not help if I make
docid the primary key
docid integer (index)
patientid integer (index)
visitid integer (index)
 ...

Documentversions
docid integer (index)
docversionnumber (index)
docversionidentifier (primary key)

It seems to do an index scan if I put the primary key as docid.  This is
what occurs when I link on the patid from ClinicalDocs to the patient
table.  However, I cannot make docid the primary key because it gets
repeated depending on how many versions of a document I have.  I have
tried using a foreign key on documentversions with no success.

In addition, this query

select * from documentversions join clinicaldocuments on
documentversions.documentidentifier
= clinicaldocuments.dssdocumentidentifier where
documentversions.documentstatus = 'AC';

does an index scan,

but if I change the order, e.g.

select * from clinicaldocuments join documentversions on
clinicaldocuments.dssdocumentidentifier
= documentversions.documentidentifier where
clinicaldocuments.patientidentifier = 123;

it does a sequential scan.  What I need is the bottom query, and it is
extremely slow ... Any ideas?

Tim Jones
Healthcare Project Manager
Optio Software, Inc.
(770) 576-3555




Re: [PERFORM] joining two tables slow due to sequential scan

2006-02-10 Thread Scott Marlowe
On Fri, 2006-02-10 at 16:35, Tim Jones wrote:
> OK.  I'm gonna make a couple of guesses here:
> 
> 1:  clinicaldocuments.patientidentifier is an int8 and you're running
> 7.4 or before.
> 
> -- nope int4  and 8.1
> 
> 2: There are more rows with clinicaldocuments.patientidentifier= 123
> than with documentversions.documentstatus = 'AC'.
> 
> -- nope generally speaking all statuses are 'AC'
> 
> 3: documentversions.documentidentifier and
> clinicaldocuments.dssdocumentidentifier are not the same type.
> 
> -- nope both int4

OK then, I guess we'll need to see the explain analyze output of both of
those queries.



Re: [PERFORM] joining two tables slow due to sequential scan

2006-02-10 Thread Tim Jones


OK.  I'm gonna make a couple of guesses here:

1:  clinicaldocuments.patientidentifier is an int8 and you're running
7.4 or before.

-- nope int4  and 8.1

2: There are more rows with clinicaldocuments.patientidentifier= 123
than with documentversions.documentstatus = 'AC'.

-- nope generally speaking all statuses are 'AC'

3: documentversions.documentidentifier and
clinicaldocuments.dssdocumentidentifier are not the same type.

-- nope both int4

Any of those things true?



Re: [PERFORM] joining two tables slow due to sequential scan

2006-02-10 Thread Scott Marlowe
On Fri, 2006-02-10 at 16:06, Tim Jones wrote:
>  
> I am trying to join two tables and keep getting a sequential scan in
> the plan even though there is an index on the columns I am joining
> on.  Basically this the deal  ... I have two tables with docid in them
> which is what I am using for the join.  
>  

SNIP
 
> select * from documentversions join clinicaldocuments on
> documentversions.documentidentifier
> = clinicaldocuments.dssdocumentidentifier where
> documentversions.documentstatus = 'AC'; 
>  
> does index scan 
> but if I change the order e.g
>  
> select * from clinicaldocuments join documentversions on
> clinicaldocuments.dssdocumentidentifier
> = documentversions .documentidentifier where
> clinicaldocuments.patientidentifier= 123;

OK.  I'm gonna make a couple of guesses here:

1:  clinicaldocuments.patientidentifier is an int8 and you're running
7.4 or before.
2: There are more rows with clinicaldocuments.patientidentifier= 123
than with documentversions.documentstatus = 'AC'.
3: documentversions.documentidentifier and 
clinicaldocuments.dssdocumentidentifier are not the same type.

Any of those things true?



Re: [PERFORM] joining two tables slow due to sequential scan

2006-02-10 Thread Dave Dutcher









What version of postgres are you using?  Can you post the
output from EXPLAIN ANALYZE?

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Tim Jones
Sent: Friday, February 10, 2006 4:07 PM
To: pgsql-performance@postgresql.org
Subject: [PERFORM] joining two tables slow due to sequential scan

I am trying to join two tables and keep getting a sequential scan in the
plan even though there is an index on the columns I am joining on.
Basically this is the deal ... I have two tables with docid in them,
which is what I am using for the join.

ClinicalDocs ... (no primary key) though it does not help if I make
docid the primary key
docid integer (index)
patientid integer (index)
visitid integer (index)
 ...

Documentversions
docid integer (index)
docversionnumber (index)
docversionidentifier (primary key)

It seems to do an index scan if I put the primary key as docid.  This is
what occurs when I link on the patid from ClinicalDocs to the patient
table.  However, I cannot make docid the primary key because it gets
repeated depending on how many versions of a document I have.  I have
tried using a foreign key on documentversions with no success.

In addition, this query

select * from documentversions join clinicaldocuments on
documentversions.documentidentifier
= clinicaldocuments.dssdocumentidentifier where
documentversions.documentstatus = 'AC';

does an index scan,

but if I change the order, e.g.

select * from clinicaldocuments join documentversions on
clinicaldocuments.dssdocumentidentifier
= documentversions.documentidentifier where
clinicaldocuments.patientidentifier = 123;

it does a sequential scan.  What I need is the bottom query, and it is
extremely slow ... Any ideas?

Tim Jones
Healthcare Project Manager
Optio Software, Inc.
(770) 576-3555

[PERFORM] joining two tables slow due to sequential scan

2006-02-10 Thread Tim Jones



 
I am trying to join two tables and keep getting a sequential scan in the
plan even though there is an index on the columns I am joining on.
Basically this is the deal ... I have two tables with docid in them,
which is what I am using for the join.

ClinicalDocs ... (no primary key) though it does not help if I make
docid the primary key
docid integer (index)
patientid integer (index)
visitid integer (index)
 ...

Documentversions
docid integer (index)
docversionnumber (index)
docversionidentifier (primary key)

It seems to do an index scan if I put the primary key as docid.  This is
what occurs when I link on the patid from ClinicalDocs to the patient
table.  However, I cannot make docid the primary key because it gets
repeated depending on how many versions of a document I have.  I have
tried using a foreign key on documentversions with no success.

In addition, this query

select * from documentversions join clinicaldocuments on
documentversions.documentidentifier
= clinicaldocuments.dssdocumentidentifier where
documentversions.documentstatus = 'AC';

does an index scan,

but if I change the order, e.g.

select * from clinicaldocuments join documentversions on
clinicaldocuments.dssdocumentidentifier
= documentversions.documentidentifier where
clinicaldocuments.patientidentifier = 123;

it does a sequential scan.  What I need is the bottom query, and it is
extremely slow ... Any ideas?

Tim Jones
Healthcare Project Manager
Optio Software, Inc.
(770) 576-3555


Re: [PERFORM] help required in design of database

2006-02-10 Thread Steinar H. Gunderson
On Fri, Feb 10, 2006 at 12:20:34PM -0800, david drummard wrote:
> 1) create a new table every time a new feed file comes in. Create table with
> indexes. Use the copy command to dump the data into the table.
> 2) rename the current table to some old table name and rename the new table
> to current table name so that applications can access them directly.

That sounds like a working plan.

> Should i create indexes before or after import ? I need to do this in
> shortest period of time so that the data is always uptodate. Note that
> incremental updates are not possible since almost every row will be changed
> in the new file.

You should create indexes after the import. Remember to pump up your memory
settings (maintenance_work_mem) if you want this to be quick.
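As a sketch of the load sequence (the file path and the memory figure are illustrative; in 8.1 maintenance_work_mem is given in kB):

    -- bigger sort memory makes the index builds much faster
    SET maintenance_work_mem = 524288;   -- 512 MB, adjust to your RAM

    COPY datatablenew FROM '/path/to/feed.csv' WITH CSV;

    CREATE INDEX a_idx ON datatablenew (a);
    -- ... the remaining indexes ...

    ANALYZE datatablenew;   -- so the planner has fresh statistics on the new table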

/* Steinar */
-- 
Homepage: http://www.sesse.net/



[PERFORM] help required in design of database

2006-02-10 Thread david drummard
Hi,

I have a unique requirement. I have a feed of 2.5 - 3 million rows of
data which arrives every half an hour. Each row has 2 small string
values (about 50 chars each) and 10 int values. I need searchability
and the ability to run arbitrary queries on any of these values. This
means I have to create an index on every column. The feed comes in as a
comma-separated text file. Here is what I am planning to do:

1) create a new table every time a new feed file comes in. Create table
with indexes. Use the copy command to dump the data into the table. 
2) rename the current table to some old table name and rename the new
table to current table name so that applications can access them
directly. 

Note that these are read-only tables and it is fine if step 2 takes a
small amount of time (it is not a mission-critical table, hence a small
downtime of some seconds is fine).

My question is: what is the best way to do step (1) so that after the
copy is done, the table is fully indexed and properly balanced and
optimized for queries? Should I create the indexes before or after the
import? I need to do this in the shortest period of time so that the
data is always up to date. Note that incremental updates are not
possible, since almost every row will be changed in the new file.

my table creation script looks like this

create table datatablenew(fe varchar(40), va varchar(60), a int, b int,
c int, d int, e int, f int, g int, h int, i int, j int, k int, l int, m
int, n int, o int, p int, q real);
create index fe_idx on datatablenew using hash (fe);
create index va_idx on datatablenew using hash(va);
create index a_idx on datatablenew (a);
..
create index q_idx on datatablenew(q);
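For step (2), the swap itself would presumably be just a pair of renames in one transaction, roughly like this (the 'datatable' / 'datatable_old' names are only illustrative):

    BEGIN;
    ALTER TABLE datatable RENAME TO datatable_old;
    ALTER TABLE datatablenew RENAME TO datatable;
    COMMIT;

    DROP TABLE datatable_old;   -- once no sessions are still using it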


Please advise.

thanks
vijay


Re: [PERFORM] Large Database Design Help

2006-02-10 Thread PFC



was origionally designed for Postgres 7.0 on a PIII 500Mhz and some


Argh.


1) The database is very large, the largest table has 40 million tuples.


Are these simple types (like a few ints, text...)?
How much space does it use on disk? Can it fit in RAM?

2) The database needs to import 10's of thousands of tuples each night  
quickly.  The current method is VERY slow.


	You bet, COMMIT'ing after each insert or update is about the worst that  
can be done. It works fine on MySQL/MyISAM (which doesn't know about  
commit...) so I'd guess the system designer had a previous experience with  
MySQL.


My advice would be:

- get a decent machine with some RAM (I guess you already knew this)...

Now, the update.

I would tend to do this :

- Generate a text file with your update data, using whatever tool you like  
best (perl, php, python, java...)

- CREATE TEMPORARY TABLE blah ...
- COPY blah FROM your update file.

	COPY is super fast. I think temporary tables don't write to the xlog, so  
they are also very fast. This should not take more than a few seconds for  
a few 10 K's of simple rows on modern hardware. It actually takes a  
fraction of a second on my PC for about 9K rows with 5 INTEGERs on them.


	You can also add constraints on your temporary table, to sanitize your  
data, in order to be reasonably sure that the following updates will work.


	The data you feed to copy should be correct, or it will rollback. This is  
your script's job to escape everything.


Now you got your data in the database. You have several options :

	- You are confident that the UPDATE will work without being rolled back  
by some constraint violation. Therefore, you issue a big joined UPDATE to  
update all the rows in your main table which are also in your temp table.  
Then you issue an INSERT INTO ... SELECT ... to insert the ones which were  
not already in the big table.
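A rough sketch of that strategy (the main/temp table and column names here are made up for illustration):

    BEGIN;

    -- 1) update the rows that already exist in the main table
    UPDATE main_table
       SET col1 = t.col1, col2 = t.col2
      FROM temp_update t
     WHERE main_table.id = t.id;

    -- 2) insert the rows that were not already there
    INSERT INTO main_table (id, col1, col2)
    SELECT t.id, t.col1, t.col2
      FROM temp_update t
     WHERE NOT EXISTS (SELECT 1 FROM main_table m WHERE m.id = t.id);

    COMMIT;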


	Joined updates can be slow if your RAM is too small and it has to thrash  
the disk looking for every tuple around.
	You can cheat and CLUSTER your main table (say, once a week), so it is  
all in index order. Then you arrange your update data so it is in the same  
order (for instance, you SELECT INTO another temp table, with an ORDER BY  
corresponding to the CLUSTER on the main table). Having both in the same  
order will help reducing random disk accesses.


	- If you don't like this method, then you might want to use the same
strategy as before (ie. a zillion queries), but write it in PL/pgSQL
instead. PL/pgSQL is a lot faster, because everything is already parsed
and planned beforehand. So you could do the following:


- for each row in the temporary update table :
- UPDATE the corresponding row in the main table
- IF FOUND, then cool, it was updated, nothing more to do.
 You don't need to SELECT in order to know if the row is there.
 UPDATE does it for you, without the race condition.
- IF NOT FOUND, then insert.
This has a race condition.
You know your application, so you'll know if it matters or not.
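In PL/pgSQL, that loop might look roughly like this (table, column and function names are invented for illustration):

    CREATE OR REPLACE FUNCTION merge_updates() RETURNS void AS $$
    DECLARE
        r RECORD;
    BEGIN
        FOR r IN SELECT * FROM temp_update LOOP
            UPDATE main_table SET col1 = r.col1 WHERE id = r.id;
            IF NOT FOUND THEN
                -- this is where the race condition mentioned above lives
                INSERT INTO main_table (id, col1) VALUES (r.id, r.col1);
            END IF;
        END LOOP;
    END;
    $$ LANGUAGE plpgsql;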

What do you think ?

3) I can't import new records with a COPY or drop my indexes b/c some of  
them are new records (INSERTS) and some are altered records (UPDATES)  
and the only way I can think of to identify these records is to perform  
a select for each record.


	Yes and no; if you must do this, then use PL/pgSQL, it's a lot faster.
And skip the SELECT.

Also, use the latest version. It really rocks.
	Like many said on the list, put pg_xlog on its own physical disk, with  
ext2fs.


3) Wrap each load into a transaction ( tens of thousands of records per  
load )


	That's the idea. The first strategy here (big update) uses one  
transaction anyway. For the other one, your choice. You can either do it  
all in 1 transaction, or in bunches of 1000 rows... but 1 row at a time  
would be horrendously slow.


Regards,

P.F.C



Re: [PERFORM] Basic Database Performance

2006-02-10 Thread PFC

We are running a prototype of a system running on
PHP/Postgresql on an Intel Xeon 2ghz server, 1GB RAM, 40GB hard drive,


I think this is a decent server...

Now, I guess you are using Apache and PHP like everyone.

Know these facts :

	- A client connection means an apache process (think HTTP 1.1  
Keep-Alives...)
	- The PHP interpreter in mod_php will be active during all the time it  
takes to receive the request, parse it, generate the dynamic page, and  
send it to the client to the last byte (because it is sent streaming). So,  
a php page that might take 10 ms to generate will actually hog an  
interpreter for between 200 ms and 1 second, depending on client ping time  
and other network latency figures.
	- This is actually on-topic for this list, because it will also hog a  
postgres connection and server process during all that time. Thus, it will  
most probably be slow and unscalable.


The solutions I use are simple :

	First, use lighttpd instead of apache. Not only is it simpler to use and  
configure, it uses a lot less RAM and resources, is faster, lighter, etc.  
It uses an asynchronous model. It's there on my server, a crap Celeron,  
pushing about 100 hits/s, and it sits at 4% CPU and 18 megabytes of RAM in  
the top. It's impossible to overload this thing unless you benchmark it on  
gigabit lan, with 100 bytes files.


	Then, plug php in, using the fast-cgi protocol. Basically php spawns a  
process pool, and you chose the size of this pool. Say you spawn 20 PHP  
interpreters for instance.


	When a PHP page is requested, lighttpd asks the process pool to generate  
it. Then, a PHP interpreter from the pool does the job, and hands the page  
over to lighttpd. This is very fast. lighttpd handles the slow  
transmission of the data to the client, while the PHP interpreter goes  
back to the pool to service another request.


	This gives you database connection pooling for free, actually. The  
connections are limited to the number of processes in the pool, so you  
won't get hundreds of them all over the place. You can use php's  
persistent connections without worries. You don't need to configure a  
connection pool. It just works (TM).


	Also you might want to use eaccelerator on your PHP. It precompiles your  
PHP pages, so you don't lose time on parsing. Page time on my site went  
from 50-200 ms to 5-20 ms just by installing this. It's free.


Try this and you might realize that, after all, postgres was fast enough!






[PERFORM] What do the Windows pg hackers out there like for dev tools?

2006-02-10 Thread Ron
Subject line says it all.  I'm going to be testing changes under both 
Linux and WinXP, so I'm hoping those of you that do M$ hacking will 
pass along your list of suggestions and/or favorite (and hated so I 
know what to avoid) tools.


TiA,
Ron





Re: [PERFORM] 10+hrs vs 15min because of just one index

2006-02-10 Thread Aaron Turner
On 2/10/06, Matthew T. O'Connor  wrote:
> Aaron Turner wrote:
> > So I'm trying to figure out how to optimize my PG install (8.0.3) to
> > get better performance without dropping one of my indexes.
>
> What about something like this:
>
> begin;
> drop slow_index_name;
> update;
> create index slow_index_name;
> commit;
> vacuum;

Right.  That's exactly what I'm doing to get the update to occur in 15
minutes.  Unfortunately though, I'm basically at the point of every
time I insert/update into that table I have to drop the index which is
making my life very painful (having to de-dupe records in RAM in my
application is a lot faster but also more complicated/error prone).

Basically, I need some way to optimize PG so that I don't have to drop
that index every time.

Suggestions?

--
Aaron Turner
http://synfin.net/



Re: [PERFORM] 10+hrs vs 15min because of just one index

2006-02-10 Thread Matthew T. O'Connor

Aaron Turner wrote:

So I'm trying to figure out how to optimize my PG install (8.0.3) to
get better performance without dropping one of my indexes.


What about something like this:

begin;
drop slow_index_name;
update;
create index slow_index_name;
commit;
vacuum;
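Spelled out against the table described in the original post (the table and index names here are guesses, and 42 is just an illustrative new groupid):

    BEGIN;
    DROP INDEX data_unique_idx;            -- the unique index on the char(48) data column
    UPDATE big_table SET groupid = 42;     -- the bulk update
    CREATE UNIQUE INDEX data_unique_idx ON big_table (data);
    COMMIT;

    VACUUM ANALYZE big_table;              -- vacuum must run outside the transaction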

Matt



Re: [PERFORM] 10+hrs vs 15min because of just one index

2006-02-10 Thread Aaron Turner
On 2/10/06, hubert depesz lubaczewski <[EMAIL PROTECTED]> wrote:
> On 2/10/06, Aaron Turner <[EMAIL PROTECTED]> wrote:
> > So I'm trying to figure out how to optimize my PG install (8.0.3) to
> > get better performance without dropping one of my indexes.
> > Basically, I have a table of 5M records with 3 columns:
> > pri_key (SERIAL)
> > data char(48)
> > groupid integer
> > there is an additional unique index on the data column.
> > The problem is that when I update the groupid column for all the
> > records, the query takes over 10hrs (after that I just canceled the
> > update).  Looking at iostat, top, vmstat shows I'm horribly disk IO
> > bound (for data not WAL, CPU 85-90% iowait) and not swapping.
> > Dropping the unique index on data (which isn't used in the query),
>
> for such a large update i would suggest to go with different scenario:
> split update into packets (1, or 5 rows at the time)
> and do:
> update packet
> vacuum table
> for all packets. and then reindex the table. should work much nicer.

The problem is that all 5M records are being updated by a single
UPDATE statement, not 5M individual statements.   Also, vacuum can't
run inside of a transaction.

On a side note, is there any performance information on updating
indexes (via insert/update) relative to the size of the column?  Obviously,
char(48) is larger than most for indexing purposes, but I wonder if
performance drops linearly or exponentially as the column width
increases.  Right now my column is hexadecimal... if I stored it as a
binary representation it would be smaller.

Thanks,
Aaron



Re: [PERFORM] pgbench output

2006-02-10 Thread Gourish Singbal
 
Hi All,

Here are some of the results I got after running pgbench against postgresql 7.4.5 and postgresql 8.1.2, with the same parameter values in the postgresql.conf file.

[EMAIL PROTECTED]:/newdisk/postgres/data> /usr/local/pgsql7.4.5/bin/pgbench -c 10 -t 1 regression
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 10
number of clients: 10
number of transactions per client: 1
number of transactions actually processed: 10/10
tps = 80.642615 (including connections establishing)
tps = 80.650638 (excluding connections establishing)

[EMAIL PROTECTED]:/newdisk/postgres/data> /usr/local/pgsql/bin/pgbench -c 10 -t 1 regression
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 10
number of clients: 10
number of transactions per client: 1
number of transactions actually processed: 10/10
tps = 124.134926 (including connections establishing)
tps = 124.148749 (excluding connections establishing)

Conclusion: so please correct me if I am wrong ... this result set shows that postgresql 8.1.2 performed better than 7.4.5 in this benchmark, since 8.1.2 was able to complete more transactions per second.

On 2/7/06, Jim C. Nasby <[EMAIL PROTECTED]> wrote:
> Well, it tells you how many transactions per second it was able to do.
> Do you have specific questions?
>
> On Thu, Feb 02, 2006 at 12:39:59PM +0530, Pradeep Parmar wrote:
> > Hi,
> >
> > I'm fairly new to PostgreSQL. I was trying pgbench, but could not
> > understand the output. Can anyone help me out to understand the output
> > of pgbench?
> >
> > Pradeep
> --
> Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
> Pervasive Software  http://pervasive.com
> work: 512-231-6117
> vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

--
Best,
Gourish Singbal


Re: [PERFORM] Basic Database Performance

2006-02-10 Thread Markus Schaber
Hi, James,

James Dey wrote:

> Apologies if this is a novice queston, but I think it is a performance
> one nevertheless. We are running a prototype of a system running on
> PHP/Postgresql on an Intel Xeon 2ghz server, 1GB RAM, 40GB hard drive,
> as a test bench. The system will be used for tens of thousands of users,
> and at the moment we are testing on a base of around 400 users
> concurrently during the day.

The first thing that comes into my mind here is "connection pooling /
recycling".

Try to make sure that connections are reused between http requests.
Reopening the connection on every http request will break your system,
as the backend startup time is rather high.

> During the day, the system is incredibly slow to a point where it is
> unusable. The reason we are testing on such as small server is to test
> performance under pressure, and my estimation is that spec should handle
> thousands of users.

Note that amount of data, concurrent users, hardware and speed don't
always scale linearly.

> The server spikes from 5% usage to 95% up and down. The system is a very
> simple e-learning and management system and has not given us any issues
> to date, only since we’ve been testing with more users has it done so.
> The fact that 400 users doing inserts and queries every few minutes is
> very concerning, I would like to know if I could be tweaking some config
> settings.

You should make sure that you run vacuum / analyze regularly (either
autovacuum, or vacuum full at night when you have no users on the system).

Use statement logging or other profiling means to isolate the slow
queries, and EXPLAIN ANALYZE them to see what goes wrong. Create the
needed indices, and drop unneeded ones. (insert usual performance tuning
tips here...)
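Concretely, something along these lines (the values are only starting points to tune from):

    -- run regularly, e.g. from cron, if autovacuum is not enabled:
    VACUUM ANALYZE;

    -- postgresql.conf: log every statement slower than one second,
    -- then EXPLAIN ANALYZE the ones that show up
    -- log_min_duration_statement = 1000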

> We are running PG 7.4 on a Debian Sarge server, and will be upgrading to
> pg8.0 on a new server, but have some migration issues (that’s for
> another list!)

Ignore 8.0 and go to 8.1 directly.


HTH,
Markus

-- 
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf. | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org



Re: [PERFORM] Large Database Design Help

2006-02-10 Thread Markus Schaber
Hi, Henry,

Orion Henry wrote:

> 1) The database is very large, the largest table has 40 million tuples.

I'm afraid this doesn't qualify as '_very_ large' yet, but it
definitely is large enough to have some deep thoughts about it. :-)

> 1) The data is easily partitionable by client ID.  In an attempt to keep
> the indexes small and the inserts fast one table was made per client
> ID.  Thus the primary table in the database (the one with 40 million
> tuples) is really 133 tables each ending with a three digit suffix.  
> The largest of these client tables has 8 million of the 40 million
> tuples.   The system started with around a half dozen clients and is now
> a huge pain to manage with so many tables.  I was hoping new hardware
> and new postgres features would allow for this data to be merged safely
> into a single table.

It possibly is a good idea to merge them.

If you decide to keep them separated for whatever reason, you might want
to use schemas instead of three-digit suffixes. Together with
appropriately named users or 'set search_path', this may help you to
simplify your software.

In case you want to keep separate tables, but need some reports touching
all tables from time to time, table inheritance may help you. Just
create a base table, and then inherit all user specific tables from that
base table. Of course, this can be combined with the schema approach by
having the child tables in their appropriate schemas.

> 2) The imports are not done inside of transactions.  I'm assuming the
> system designers excluded this for a reason.  Will I run into problems
> performing tens of thousands of inserts and updates inside a single
> transaction?

Yes, it should give you a huge boost. Every commit has to flush the WAL
out to disk, which takes at least one disk spin. So on a simple 7200 RPM
disk, you cannot have more than 120 transactions/second.

It may make sense to split such a bulk load into transactions of some
tens of thousands of rows, but that depends on how easy it is for your
application to resume in the middle of the bulk if the connection
aborts, and how much concurrent access you have on the backend.

> 3) The current code that bulk loads data into the database is a loop
> that looks like this:
> 
>$result = exe("INSERT INTO $table ($name_str) SELECT
> $val_str WHERE NOT EXISTS (SELECT 1 FROM $table WHERE $keys)");
>if ($result == 0)
>{
>$result = exe("UPDATE $table SET $non_keys WHERE
> $keys");
>} 
> Is there a faster way to bulk load data when it's not known ahead of
> time if it's a new record or an updated record?

Perhaps the easiest way might be to issue the update first. Update
returns a row count of the updated rows. If it is 0, you have to insert
the row.

This can even be encapsulated into a "before insert" trigger on the
table, which tries the update and ignores the insert if the update
succeeded. This way, you can even use COPY on the client side.

We're using this approach for one of our databases, where a client side
crash can result in occasional duplicates being COPYed to the table.
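A rough sketch of such a trigger (table, column and function names are invented; the real ones depend on the schema):

    CREATE OR REPLACE FUNCTION upsert_on_insert() RETURNS trigger AS $$
    BEGIN
        UPDATE main_table
           SET value = NEW.value
         WHERE id = NEW.id;
        IF FOUND THEN
            RETURN NULL;   -- the row already existed and was updated; skip the INSERT
        END IF;
        RETURN NEW;        -- no existing row; let the INSERT proceed
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER main_table_upsert
    BEFORE INSERT ON main_table
    FOR EACH ROW EXECUTE PROCEDURE upsert_on_insert();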

> Dual Opteron 246, 4 disk SCSI RAID5, 4GB of RAM

For lots non-read-only database workloads, RAID5 is a performance
killer. Raid 1/0 might be better, or having two mirrors of two disks
each, the first mirror holding system, swap, and the PostgreSQL WAL
files, the second one holding the data. Don't forget to tune the
postgresql settings appropriately. :-)

> # du -sh /var/lib/postgres/data/
> 16G /var/lib/postgres/data/ 

Your database seems to be small enough to fit on a single disk, so the
two mirrors approach I described above will be the best IMHO.

> ( the current database is PG 7.4 - I intend to upgrade it to 8.1 if and
> when I do this refactoring )

This is a very good idea, 8.1 is miles ahead of 7.4 in many aspects.

> ( the current OS is Debian Unstable but I intend to be running RHEL 4.0
> if and when I do this refactoring )

This should not make too much difference.

HTH,
Markus
-- 
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf. | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org



Re: [PERFORM] Large Database Design Help

2006-02-10 Thread Markus Schaber
Hi, Greg,

Greg Stark wrote:

>>(Aside question: if I were to find a way to use COPY and I were loading
>>data on a single client_id, would dropping just the indexes for that client_id
>>accelerate the load?)
> Dropping indexes would accelerate the load but unless you're loading a large
> number of records relative the current size I'm not sure it would be a win
> since you would then have to rebuild the index for the entire segment.

And, additionally, rebuilding a partial index with "WHERE client_id=42"
needs a full table scan, which is very slow, so temporarily dropping the
indices will not be useful if you merge the tables.

Btw, I don't know whether PostgreSQL can make use of partial indices
when building other partial indices. If yes, you could temporarily drop
all but one of the partial indices for a specific client.


HTH,
Markus

-- 
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf. | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org



Re: [PERFORM] Basic Database Performance

2006-02-10 Thread James Dey
Sorry about that

James Dey

tel +27 11 704-1945
cell+27 82 785-5102
fax +27 11 388-8907
mail[EMAIL PROTECTED]


-Original Message-
From: Richard Huxton [mailto:[EMAIL PROTECTED] 
Sent: 10 February 2006 11:50 AM
To: James Dey
Cc: 'Postgresql Performance'
Subject: Re: [PERFORM] Basic Database Performance


Don't forget to cc: the list.

James Dey wrote:
> Hi Richard,
> 
> Firstly, thanks a million for the reply.
> 
> To answer your questions:
> 1. Are you limited by CPU, memory or disk i/o?
> I am not limited, but would like to get the most out of the config I have
in
> order to be able to know what I'll get, when I scale up.

But you said: "During the day, the system is incredibly slow to a point 
where it is unusable". So presumably one or more of cpu,memory or disk 
i/o is the problem.

> 2. Are you happy your config settings are good?   How do you know?
> I'm not, and would appreciate any help with these.

If you have a look here, there is an introduction for 7.4
  http://www.varlena.com/varlena/GeneralBits/Tidbits/index.php
For 8.x you might find the following more useful.
  http://www.powerpostgresql.com/PerfList

> 3. Are there particular queries that are causing the problem, or lock 
> contention?
> Not that I can see

What is the balance between activity on Apache/PHP/PostgreSQL?

-- 
   Richard Huxton
   Archonet Ltd




Re: [PERFORM] Basic Database Performance

2006-02-10 Thread Richard Huxton


Don't forget to cc: the list.

James Dey wrote:

Hi Richard,

Firstly, thanks a million for the reply.

To answer your questions:
1. Are you limited by CPU, memory or disk i/o?
I am not limited, but would like to get the most out of the config I have in
order to be able to know what I'll get, when I scale up.


But you said: "During the day, the system is incredibly slow to a point 
where it is unusable". So presumably one or more of cpu,memory or disk 
i/o is the problem.



2. Are you happy your config settings are good?   How do you know?
I'm not, and would appreciate any help with these.


If you have a look here, there is an introduction for 7.4
 http://www.varlena.com/varlena/GeneralBits/Tidbits/index.php
For 8.x you might find the following more useful.
 http://www.powerpostgresql.com/PerfList

3. Are there particular queries that are causing the problem, or lock 
contention?

Not that I can see


What is the balance between activity on Apache/PHP/PostgreSQL?

--
  Richard Huxton
  Archonet Ltd



Re: [PERFORM] Basic Database Performance

2006-02-10 Thread Richard Huxton

James Dey wrote:


Apologies if this is a novice queston, but I think it is a performance one
nevertheless. We are running a prototype of a system running on
PHP/Postgresql on an Intel Xeon 2ghz server, 1GB RAM, 40GB hard drive, as a
test bench. The system will be used for tens of thousands of users, and at
the moment we are testing on a base of around 400 users concurrently during
the day.


OK, that's 400 web-users, so presumably a fraction of that for 
concurrent database connections.



During the day, the system is incredibly slow to a point where it is
unusable. The reason we are testing on such as small server is to test
performance under pressure, and my estimation is that spec should handle
thousands of users.


It'll depend on what the users are doing.
It'll depend on what your code is doing
It'll depend on how you've configured PostgreSQL.


The server spikes from 5% usage to 95% up and down.


Usage? Do you mean CPU?

> The system is a very

simple e-learning and management system and has not given us any issues to
date, only since we've been testing with more users has it done so. The fact
that 400 users doing inserts and queries every few minutes is very
concerning, I would like to know if I could be tweaking some config
settings.


You haven't said what config settings you're working with.

OK - the main questions have to be:
1. Are you limited by CPU, memory or disk i/o?
2. Are you happy your config settings are good?
   How do you know?
3. Are there particular queries that are causing the problem, or lock 
contention?



We are running PG 7.4 on a Debian Sarge server, and will be upgrading to
pg8.0 on a new server, but have some migration issues (that's for another
list!)


Go straight to 8.1 - no point in upgrading half-way. If you don't like 
compiling from source it's in backports.org


--
  Richard Huxton
  Archonet Ltd



Re: [PERFORM] 10+hrs vs 15min because of just one index

2006-02-10 Thread hubert depesz lubaczewski
On 2/10/06, Aaron Turner <[EMAIL PROTECTED]> wrote:
> So I'm trying to figure out how to optimize my PG install (8.0.3) to
> get better performance without dropping one of my indexes.
> Basically, I have a table of 5M records with 3 columns:
> pri_key (SERIAL)
> data char(48)
> groupid integer
> there is an additional unique index on the data column.
> The problem is that when I update the groupid column for all the
> records, the query takes over 10hrs (after that I just canceled the
> update).  Looking at iostat, top, vmstat shows I'm horribly disk IO
> bound (for data not WAL, CPU 85-90% iowait) and not swapping.
> Dropping the unique index on data (which isn't used in the query),

For such a large update I would suggest going with a different scenario:
split the update into packets (1, or 5 rows at a time)
and for each packet do:
update packet
vacuum table
Then, once all packets are done, reindex the table. Should work much nicer.
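As a sketch, batching on the serial key from the original post (the range size and the new groupid value are only illustrative):

    -- repeat, shifting the key range, until the whole table is covered
    UPDATE big_table
       SET groupid = 42
     WHERE pri_key BETWEEN 1 AND 50000;
    VACUUM big_table;

    -- ... next range ...

    REINDEX TABLE big_table;   -- once all the batches are done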

depesz



[PERFORM] Basic Database Performance

2006-02-10 Thread James Dey








Hi Guys,

Apologies if this is a novice question, but I think it is a performance
one nevertheless. We are running a prototype of a system running on
PHP/PostgreSQL on an Intel Xeon 2GHz server, 1GB RAM, 40GB hard drive, as
a test bench. The system will be used for tens of thousands of users, and
at the moment we are testing on a base of around 400 users concurrently
during the day.

During the day, the system is incredibly slow, to the point where it is
unusable. The reason we are testing on such a small server is to test
performance under pressure, and my estimation is that this spec should
handle thousands of users.

The server spikes from 5% usage to 95%, up and down. The system is a very
simple e-learning and management system and has not given us any issues
to date; only since we've been testing with more users has it done so.
The fact that 400 users doing inserts and queries every few minutes is a
problem is very concerning, and I would like to know if I could be
tweaking some config settings.

We are running PG 7.4 on a Debian Sarge server, and will be upgrading to
pg8.0 on a new server, but have some migration issues (that's for another
list!)

Any help would be greatly appreciated!

All the very best,

James Dey

tel   +27 11 704-1945
cell  +27 82 785-5102
fax   +27 11 388-8907
mail  [EMAIL PROTECTED]


[PERFORM] 10+hrs vs 15min because of just one index

2006-02-10 Thread Aaron Turner
So I'm trying to figure out how to optimize my PG install (8.0.3) to
get better performance without dropping one of my indexes.

Basically, I have a table of 5M records with 3 columns:

pri_key (SERIAL)
data char(48)
groupid integer

there is an additional unique index on the data column.

The problem is that when I update the groupid column for all the
records, the query takes over 10hrs (after that I just canceled the
update).  Looking at iostat, top, vmstat shows I'm horribly disk IO
bound (for data not WAL, CPU 85-90% iowait) and not swapping.

Dropping the unique index on data (which isn't used in the query),
running the update and recreating the index  runs in under 15 min. 
Hence it's pretty clear to me that the index is the problem and
there's really nothing worth optimizing in my query.

As I understand from #postgresql, doing an UPDATE on one column causes
all indexes for the effected row to have to be updated due to the way
PG replaces the old row with a new one for updates.  This seems to
explain why dropping the unique index on data solves the performance
problem.

interesting settings:
shared_buffers = 32768
maintenance_work_mem = 262144
fsync = true
wal_sync_method = open_sync
wal_buffers = 512
checkpoint_segments = 30
effective_cache_size = 1
work_mem =  (1024 i think?)

box:
Linux 2.6.9-11EL (CentOS 4.1)
2x Xeon 3.4 HT
2GB of RAM (but Apache and other services are running)
4 disk raid 10 (74G Raptor) for data
4 disk raid 10 (7200rpm) for WAL

other than throwing more spindles at the problem, any suggestions?

Thanks,
Aaron

--
Aaron Turner
http://synfin.net/
