Re: [PERFORM] Large objects performance

2007-04-22 Thread Ulrich Cech

Hello Alexandre,

We have an application that is supposed to sign documents and store them 
somewhere.


I developed a relatively simple file archive with PostgreSQL (a web 
application with JSF for the user interface). The major structure is one 
table with some keyword fields, and 3 blob fields (because exactly 3 
files belong to one record). I have to deal with millions of files (95% 
are about 2-5 KB, 5% are greater than 1 MB).
The great advantage is that I don't have to communicate with the file 
system (try to open a directory with 300T files on a Windows system... 
it's horrible, even on the command line).
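
As a rough sketch, a structure like the one described might look as follows 
(table and column names are invented for illustration, and bytea is just one 
way to store the blob fields; this is not Ulrich's actual schema):

CREATE TABLE file_archive (
    id        serial PRIMARY KEY,
    keyword1  text,
    keyword2  text,
    keyword3  text,
    file_a    bytea,  -- exactly 3 files belong to one record
    file_b    bytea,
    file_c    bytea
);

CREATE INDEX file_archive_keyword1_idx ON file_archive (keyword1);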


The database is now 12 GB, but searching via the web interface takes at 
most 5 seconds (most searches are faster). The one disadvantage is 
the backup (I use pg_dump once a week, which needs about 10 hours). For 
now this is acceptable for me, but I want to look at Slony or port 
everything to a Linux machine.


Ulrich



Re: [PERFORM] Redundant sub query triggers slow nested loop left join

2007-04-22 Thread henk de wit

Since you have two redundant tests, the selectivity is being
double-counted, leading to a too-small rows estimate and a not very
appropriate choice of join plan.


I see, thanks for the explanation. I did notice though that in the second 
case, with 1 redundant test removed, the estimate is still low:


Hash Left Join (cost=1449.99..2392.68 rows=2 width=714) (actual time=24.257..25.292 rows=553 loops=1)


In that case the prediction is 2 rows, which is only 1 row more than in the 
previous case. Yet the plan is much better and performance improved 
dramatically. Is there a reason/explanation for that?



FWIW, CVS HEAD does get rid of the duplicate conditions for the common
case of mergejoinable equality operators --- but it's not explicitly
looking for duplicate conditions, rather this is falling out of a new
method for making transitive equality deductions.


This sounds very interesting Tom. Is there some documentation somewhere 
where I can read about this new method?






Re: [PERFORM] Redundant sub query triggers slow nested loop left join

2007-04-22 Thread Tom Lane
henk de wit [EMAIL PROTECTED] writes:
 In that case the prediction is 2 rows, which is only 1 row more than in the 
 previous case. Yet the plan is much better and performance improved 
 dramatically. Is there a reason/explanation for that?

Well, it's just an estimated-cost comparison.  If there's only one row
then a nestloop join looks like the best way since it requires no extra
overhead.  But as soon as you get to two rows, the other side of the
join would have to be executed twice, and that's more expensive than
doing it once and setting up a hash table.  In the actual event, with
359 rows out of the scan, the nestloop way is just horrid because it
repeats the other side 359 times :-(

It strikes me that it might be interesting to use a minimum rowcount
estimate of two rows, not one, for any case where we can't actually
prove there is at most one row (ie, the query conditions match a unique
index).  That is probably enough to discourage this sort of brittle
behavior ... though no doubt there'd still be cases where it's the
wrong thing.  We do not actually have any code right now to make such
proofs, but there's been some discussion recently about adding such
logic in support of removing useless outer joins.

 FWIW, CVS HEAD does get rid of the duplicate conditions for the common
 case of mergejoinable equality operators --- but it's not explicitly
 looking for duplicate conditions, rather this is falling out of a new
 method for making transitive equality deductions.

 This sounds very interesting Tom. Is there some documentation somewhere 
 where I can read about this new method?

Check the archives for mention of equivalence classes, notably these
two threads:
http://archives.postgresql.org/pgsql-hackers/2007-01/msg00568.php
http://archives.postgresql.org/pgsql-hackers/2007-01/msg00826.php

regards, tom lane



Re: [PERFORM] Redundant sub query triggers slow nested loop left join

2007-04-22 Thread henk de wit



In the actual event, with
359 rows out of the scan, the nestloop way is just horrid because it
repeats the other side 359 times :-(


Indeed. :(

Btw, I tried removing the redundant check in the larger query (the one 
from which I extracted the part shown earlier), but it only performs 
worse after that. The more redundant checks I remove, the slower the query 
gets. I figure the original designer of the query inserted those checks to 
quickly limit the number of rows involved in the nested loop. Of course, the 
problem is probably not the number of rows involved, but the unfortunate 
choice of the nested loop.


I spent a few hours today trying to figure it all out, but I'm pretty 
stuck at the moment.


For what it's worth, this is the plan PG 8.2 comes up with right after I 
remove the same check that made the isolated query in the opening post so 
much faster:


Sort  (cost=6006.54..6006.55 rows=1 width=597) (actual time=14561.499..14561.722 rows=553 loops=1)
  Sort Key: public.banners_links.id
  ->  Nested Loop Left Join  (cost=3917.68..6006.53 rows=1 width=597) (actual time=64.723..14559.811 rows=553 loops=1)
        Join Filter: (public.banners_links.id = public.fetch_banners.banners_links_id)
        ->  Nested Loop Left Join  (cost=3917.68..6002.54 rows=1 width=527) (actual time=64.607..14509.291 rows=553 loops=1)
              Join Filter: (public.banners_links.id = reward_ratings.banner_id)
              ->  Nested Loop Left Join  (cost=2960.36..4395.12 rows=1 width=519) (actual time=52.761..8562.575 rows=553 loops=1)
                    Join Filter: (public.banners_links.id = banners_banner_types.banner_id)
                    ->  Nested Loop Left Join  (cost=2000.60..2956.57 rows=1 width=484) (actual time=32.026..304.700 rows=359 loops=1)
                          Join Filter: (public.banners_links.id = ecpc_per_banner_link.banners_links_id)
                          ->  Nested Loop  (cost=124.58..1075.70 rows=1 width=468) (actual time=9.793..187.724 rows=359 loops=1)
                                ->  Nested Loop Left Join  (cost=124.58..1067.42 rows=1 width=89) (actual time=9.786..184.671 rows=359 loops=1)
                                      Join Filter: (public.banners_links.id = users_banners_tot_sub.banner_id)
                                      ->  Hash Left Join  (cost=107.97..1050.78 rows=1 width=81) (actual time=6.119..7.605 rows=359 loops=1)
                                            Hash Cond: (public.banners_links.id = special_deals.id)
                                            Filter: (special_deals.special_deal IS NULL)
                                            ->  Bitmap Heap Scan on banners_links  (cost=11.43..954.03 rows=2 width=73) (actual time=0.128..1.069 rows=359 loops=1)
                                                  Recheck Cond: (merchant_id = 5631)
                                                  Filter: ((status)::text = '0'::text)
                                                  ->  Bitmap Index Scan on banners_links_merchant_id_idx  (cost=0.00..11.43 rows=424 width=0) (actual time=0.089..0.089 rows=424 loops=1)
                                                        Index Cond: (merchant_id = 5631)
                                            ->  Hash  (cost=86.93..86.93 rows=769 width=16) (actual time=5.982..5.982 rows=780 loops=1)
                                                  ->  Subquery Scan special_deals  (cost=69.62..86.93 rows=769 width=16) (actual time=4.179..5.414 rows=780 loops=1)
                                                        ->  HashAggregate  (cost=69.62..79.24 rows=769 width=16) (actual time=4.179..4.702 rows=780 loops=1)
                                                              ->  Seq Scan on banner_deals  (cost=0.00..53.75 rows=3175 width=16) (actual time=0.006..1.480 rows=3175 loops=1)
                                      ->  HashAggregate  (cost=16.61..16.62 rows=1 width=24) (actual time=0.011..0.292 rows=424 loops=359)
                                            ->  Nested Loop  (cost=0.00..16.60 rows=1 width=24) (actual time=0.029..3.096 rows=424 loops=1)
                                                  ->  Index Scan using users_banners_affiliate_id_idx on users_banners  (cost=0.00..8.30 rows=1 width=16) (actual time=0.021..0.523 rows=424 loops=1)
                                                        Index Cond: ((affiliate_id = 5631) AND (affiliate_id = 5631))
                                                        Filter: ((status)::text = '3'::text)
                                                  ->  Index Scan using users_banners_id_idx on users_banners_rotation  (cost=0.00..8.29 rows=1 width=16) (actual time=0.003..0.004 rows=1 loops=424)
                                                        Index Cond: (users_banners_rotation.users_banners_id = users_banners.id)
                                ->  Index Scan using banners_org_id_banner.idx on banners_org  (cost=0.00..8.27 rows=1

Re: [PERFORM] Redundant sub query triggers slow nested loop left join

2007-04-22 Thread Tom Lane
henk de wit [EMAIL PROTECTED] writes:
 I understand the above looks like a complicated mess, but would you 
 have any pointers of what I could possibly do next to force a better plan?

Taking a closer look, it seems the problem is the underestimation of the
number of rows resulting from this relation scan:

  ->  Bitmap Heap Scan on banners_links  (cost=11.43..954.03 rows=2 width=73) (actual time=0.128..1.069 rows=359 loops=1)
        Recheck Cond: (merchant_id = 5631)
        Filter: ((status)::text = '0'::text)
        ->  Bitmap Index Scan on banners_links_merchant_id_idx  (cost=0.00..11.43 rows=424 width=0) (actual time=0.089..0.089 rows=424 loops=1)
              Index Cond: (merchant_id = 5631)

You might be able to improve matters by increasing the statistics target
for this table.  I have a bad feeling though that the problem may be
lack of cross-column statistics --- the thing is evidently assuming
that only about 1 in 200 rows have status = '0', which might be accurate
as a global average but not for this particular merchant.  What exactly
is the relationship between status and merchant_id, anyway?
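
For example, raising the per-column target for the names that appear in the 
plan above and re-analyzing would look roughly like this (1000 is the maximum 
target in 8.2):

ALTER TABLE banners_links ALTER COLUMN status SET STATISTICS 1000;
ALTER TABLE banners_links ALTER COLUMN merchant_id SET STATISTICS 1000;
ANALYZE banners_links;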

regards, tom lane



Re: [PERFORM] Redundant sub query triggers slow nested loop left join

2007-04-22 Thread henk de wit

You might be able to improve matters by increasing the statistics target
for this table.


I have tried to increase the statistics for the status column to the maximum 
of 1000. After that I performed an analyze, vacuum analyze and vacuum full 
analyze on the table. Unfortunately this didn't seem to make any difference.



I have a bad feeling though that the problem may be
lack of cross-column statistics


I assume this isn't a thing that can be tweaked/increased in PG 8.2?


--- the thing is evidently assuming
that only about 1 in 200 rows have status = '0', which might be accurate
as a global average but not for this particular merchant.  What exactly
is the relationship between status and merchant_id, anyway?


The meaning is that a banners_link belongs to a merchant with the id 
merchant_id. A banners_link can be disabled (status 1) or enabled (status 
0). Globally about 1/3 of the banners_links have status 0 and 2/3 have 
status 1. The 1 in 200 estimate is indeed way off.


For the merchant in question the numbers are a bit different though, but not 
that much. Out of 424 rows total, 359 have status 0 and 65 have status 1.







Re: [PERFORM] Redundant sub query triggers slow nested loop left join

2007-04-22 Thread henk de wit
One other interesting thing to note: if I remove the banners_links.status = 
0 condition from the query altogether, the execution times improve 
dramatically again. The results are of course not correct then, but if worse 
comes to worst I can always remove the unwanted rows in a procedural 
language (it's a simple case of iterating over a result set and omitting rows 
with status 1). Of course this would not really be a neat solution.


Anyway, the plan without the status = 0 condition now looks like this:

Sort  (cost=6058.87..6058.88 rows=2 width=597) (actual time=305.869..306.138 rows=658 loops=1)
  Sort Key: public.banners_links.id
  ->  Nested Loop Left Join  (cost=5051.23..6058.86 rows=2 width=597) (actual time=69.956..304.259 rows=658 loops=1)
        Join Filter: (public.banners_links.id = public.fetch_banners.banners_links_id)
        ->  Nested Loop Left Join  (cost=5048.26..6051.92 rows=2 width=527) (actual time=69.715..249.122 rows=658 loops=1)
              Join Filter: (public.banners_links.id = reward_ratings.banner_id)
              ->  Nested Loop Left Join  (cost=3441.91..4441.39 rows=2 width=519) (actual time=57.795..235.954 rows=658 loops=1)
                    Join Filter: (public.banners_links.id = ecpc_per_banner_link.banners_links_id)
                    ->  Nested Loop  (cost=1563.28..2554.02 rows=2 width=503) (actual time=35.359..42.018 rows=658 loops=1)
                          ->  Hash Left Join  (cost=1563.28..2545.93 rows=2 width=124) (actual time=35.351..37.987 rows=658 loops=1)
                                Hash Cond: (public.banners_links.id = users_banners_tot_sub.banner_id)
                                ->  Hash Left Join  (cost=1546.63..2529.27 rows=2 width=116) (actual time=30.757..32.552 rows=658 loops=1)
                                      Hash Cond: (public.banners_links.id = banners_banner_types.banner_id)
                                      ->  Hash Left Join  (cost=108.08..1090.62 rows=2 width=81) (actual time=6.087..7.085 rows=424 loops=1)
                                            Hash Cond: (public.banners_links.id = special_deals.id)
                                            Filter: (special_deals.special_deal IS NULL)
                                            ->  Bitmap Heap Scan on banners_links  (cost=11.54..952.02 rows=424 width=73) (actual time=0.125..0.514 rows=424 loops=1)
                                                  Recheck Cond: (merchant_id = 5631)
                                                  ->  Bitmap Index Scan on banners_links_merchant_id_idx  (cost=0.00..11.43 rows=424 width=0) (actual time=0.089..0.089 rows=424 loops=1)
                                                        Index Cond: (merchant_id = 5631)
                                            ->  Hash  (cost=86.93..86.93 rows=769 width=16) (actual time=5.951..5.951 rows=780 loops=1)
                                                  ->  Subquery Scan special_deals  (cost=69.62..86.93 rows=769 width=16) (actual time=4.164..5.389 rows=780 loops=1)
                                                        ->  HashAggregate  (cost=69.62..79.24 rows=769 width=16) (actual time=4.164..4.670 rows=780 loops=1)
                                                              ->  Seq Scan on banner_deals  (cost=0.00..53.75 rows=3175 width=16) (actual time=0.005..1.496 rows=3175 loops=1)
                                      ->  Hash  (cost=1432.13..1432.13 rows=514 width=43) (actual time=24.661..24.661 rows=658 loops=1)
                                            ->  Hash Join  (cost=959.77..1432.13 rows=514 width=43) (actual time=1.780..24.147 rows=658 loops=1)
                                                  Hash Cond: (banners_banner_types.type_id = banner_types.id)
                                                  ->  Hash IN Join  (cost=957.32..1422.52 rows=540 width=16) (actual time=1.738..23.332 rows=658 loops=1)
                                                        Hash Cond: (banners_banner_types.banner_id = public.banners_links.id)
                                                        ->  Seq Scan on banners_banner_types  (cost=0.00..376.40 rows=22240 width=16) (actual time=0.005..10.355 rows=22240 loops=1)
                                                        ->  Hash  (cost=952.02..952.02 rows=424 width=8) (actual time=0.808..0.808 rows=424 loops=1)
                                                              ->  Bitmap Heap Scan on banners_links  (cost=11.54..952.02 rows=424 width=8) (actual time=0.114..0.515 rows=424 loops=1)
                                                                    Recheck Cond: (merchant_id = 5631)
                                                                    ->  Bitmap Index Scan on banners_links_merchant_id_idx  (cost=0.00..11.43 rows=424 width=0) (actual time=0.085..0.085 rows=424 loops=1)
                                                                          Index Cond: (merchant_id = 5631)

Re: [PERFORM] Redundant sub query triggers slow nested loop left join

2007-04-22 Thread Tom Lane
henk de wit [EMAIL PROTECTED] writes:
 --- the thing is evidently assuming
 that only about 1 in 200 rows have status = '0', which might be accurate
 as a global average but not for this particular merchant.  What exactly
 is the relationship between status and merchant_id, anyway?

 The meaning is that a banners_link belongs to a merchant with the id 
 merchant_id. A banners_link can be disabled (status 1) or enabled (status 
 0). Globally about 1/3 of the banners_links have status 0 and 2/3 have 
 status 1. The 1 in 200 estimate is indeed way off.

Well, that's darn odd.  It should not be getting that so far wrong.
What's the datatype of the status column exactly (I'm guessing varchar
but maybe not)?  Would you show us the pg_stats row for the status column?
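
Something along these lines will show that row (assuming banners_links is the 
table from the plan above):

SELECT null_frac, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'banners_links' AND attname = 'status';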

 One interesting other thing to note; if I remove the banners_links.status = 
 0 condition from the query altogether the execution times improve 
 dramatically again.

Right, because it was dead on about how many merchant_id = 5631 rows
there are.  The estimation error is creeping in where it guesses how
selective the status filter is.  It should be using the global fraction
of status = 0 rows for that, but it seems to be using a default estimate
instead (1/200 is actually the default eqsel estimate now that I think
about it).  I'm not sure why, but I think it must have something to do
with the subquery structure of your query.  Were you showing us the
whole truth about your query, or were there details you left out?

regards, tom lane



Fwd: [PERFORM] Strangely Variable Query Performance

2007-04-22 Thread Robins Tharakan

Hi Tom / Steve,

I am one of the silent readers of the performance issues that come up on this
list (and are discussed in detail)... just like this one.

If and when you do come up with a solution, please do post some details
about it here... (I say that because, for obvious reasons, things seem to have
gone off the air after Tom's last email, and one can understand that).
But an analysis, or at least a pointer, may be of help to someone (like me)
reading this list.

Thanks
Robins

-- Forwarded message --
From: Tom Lane [EMAIL PROTECTED]
Date: Apr 13, 2007 10:08 AM
Subject: Re: [PERFORM] Strangely Variable Query Performance
To: Steve [EMAIL PROTECTED]
Cc: Scott Marlowe [EMAIL PROTECTED], PostgreSQL Performance 
pgsql-performance@postgresql.org

Steve [EMAIL PROTECTED] writes:

On Thu, 12 Apr 2007, Tom Lane wrote:

I'm still not having any luck reproducing the failure here.  Grasping at
straws again, I wonder if it's got something to do with the order in
which the planner examines the indexes --- which is OID order.  Could
you send me the results of



Here you go:


Nope, still doesn't fail for me.  [ baffled and annoyed... ]  There must
be something about your situation that's relevant but we aren't
recognizing it.  Are you in a position to let me ssh into your machine
and try to debug it?  Or other options like sending me a dump of your
database?   I'm about out of steam for tonight, but let's talk about it
off-list tomorrow.

   regards, tom lane



--
Robins


[PERFORM] Help with TOAST Compression

2007-04-22 Thread David Hinkle
I have a table where I store email; the bodies are mostly kept in a
TOAST table.  The TOAST table is 940 MB in size.  The whole database
is about 1.2 GB in size.  When I back the database up using pg_dump in
custom output mode, I pipe the output into gzip.  My backups are only
about 600 MB in size.  From this, I assume that the TOAST table
isn't getting compressed.

 

I am keeping the bodies in a column of type bytea.  

 

Is there any way I can tell for sure if the messages from this column
are being stored compressed?  I know I can set the compression settings
using the ALTER TABLE ... ALTER COLUMN ... SET STORAGE syntax, but is there a
way I can see what this value is currently set to?
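
As a rough sketch, both things can be checked from SQL; here 'mail' and 'body' 
stand in for the actual table and column names:

-- current storage mode for the column: p = plain, m = main, e = external, x = extended
SELECT attname, attstorage
FROM pg_attribute
WHERE attrelid = 'mail'::regclass AND attname = 'body';

-- compare the on-disk (possibly compressed) size with the raw datum length
SELECT pg_column_size(body) AS stored_bytes, octet_length(body) AS raw_bytes
FROM mail
LIMIT 10;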

 

David   




[PERFORM] postgres: 100% CPU utilization

2007-04-22 Thread Sergey Tsukinovsky
Hi,

 

I'm currently dealing with performance issues of postgres and looking
for some advice.

 

Platform

Postgres: 7.0.2

OS: FreeBSD 4.4

DB: size about 50 MB; the most frequently updated tables average 1000-2000
rows each, and there are not many of them, about 15 in total

 

Description

My current system load keeps the postgres CPU utilization at the level
of 90-100%. 

'vacuumdb' results in a sharp drop of the CPU usage down to 25-30%, but
not for a long period of time - it gets back to 100% within 30 minutes.

Disk I/O during the test stays at about 0.5 MB/s

 

Questions:

1. When reading the 'vacuum analyze' output, how do I identify which of
the actions had the most effect on reducing the CPU usage - the garbage
cleaning or the statistics recalculation for the analyzer?

2. What would be the recommended set of parameters to tune in order
to improve performance over time, instead of resorting to vacuuming
every 30 minutes or so?

3. Is it safe to run 'vacuum' as frequently as every 15-30 minutes?

4. Suggestions?

 

I know that 7.0.2 is an old version and therefore ran the same test on
7.3.18 - the performance behavior was similar. 

 

Thank you in advance,

Sergey


 



[PERFORM] Odd problem with planner choosing seq scan

2007-04-22 Thread Colin McGuigan
I have two tables, staff (530 rows) and location (2.5 million rows).  I 
do a query that joins the two together, as so:


SELECT s.ProprietorId, l.LocationId, s.RoleId
FROM Location l
INNER JOIN (
SELECT *
FROM Staff
) s ON l.ProprietorId = s.ProprietorId
WHERE s.UserId = 123456
AND s.LocationId IS NULL

Ignore the fact that it's a subquery -- the query plan is the same if 
it's a straight JOIN, and I'm going to use the subquery to demonstrate 
something interesting.


Anyways, this takes ~45 seconds to run, and returns 525 rows (just about 
1 per record in the Staff table; 5 records are not for that user and so 
are excluded).  The EXPLAIN is:


Nested Loop  (cost=243.50..34315.32 rows=10286 width=12)
  ->  Subquery Scan s  (cost=0.00..21.93 rows=1 width=8)
        Filter: ((userid = 123456) AND (locationid IS NULL))
        ->  Limit  (cost=0.00..15.30 rows=530 width=102)
              ->  Seq Scan on staff  (cost=0.00..15.30 rows=530 width=102)
  ->  Bitmap Heap Scan on location l  (cost=243.50..34133.68 rows=12777 width=8)
        Recheck Cond: (s.proprietorid = l.proprietorid)
        ->  Bitmap Index Scan on idx_location_proprietorid_locationid  (cost=0.00..240.30 rows=12777 width=0)
              Index Cond: (s.proprietorid = l.proprietorid)

The EXPLAIN ANALYZE is:

Hash Join  (cost=23.16..129297.25 rows=2022281 width=12) (actual time=62.315..48632.406 rows=525 loops=1)
  Hash Cond: (l.proprietorid = staff.proprietorid)
  ->  Seq Scan on location l  (cost=0.00..101337.11 rows=2057111 width=8) (actual time=0.056..44504.431 rows=2057111 loops=1)
  ->  Hash  (cost=16.63..16.63 rows=523 width=8) (actual time=46.411..46.411 rows=525 loops=1)
        ->  Seq Scan on staff  (cost=0.00..16.63 rows=523 width=8) (actual time=0.022..45.428 rows=525 loops=1)
              Filter: ((userid = 123456) AND (locationid IS NULL))
Total runtime: 48676.282 ms

Now, the interesting thing is, if I add LIMIT 5000 into that inner 
subquery on the staff table, it no longer seq scans location, and the 
whole thing runs in less than a second.


SELECT s.ProprietorId, l.LocationId, s.RoleId
FROM Location l
INNER JOIN (
SELECT *
FROM Staff
		LIMIT 5000	-- Only change; remember, this table only has 530 rows

) s ON l.ProprietorId = s.ProprietorId
WHERE s.UserId = 123456
AND s.LocationId IS NULL

EXPLAIN:

Nested Loop  (cost=243.50..34315.32 rows=10286 width=12)
  ->  Subquery Scan s  (cost=0.00..21.93 rows=1 width=8)
        Filter: ((userid = 123456) AND (locationid IS NULL))
        ->  Limit  (cost=0.00..15.30 rows=530 width=102)
              ->  Seq Scan on staff  (cost=0.00..15.30 rows=530 width=102)
  ->  Bitmap Heap Scan on location l  (cost=243.50..34133.68 rows=12777 width=8)
        Recheck Cond: (s.proprietorid = l.proprietorid)
        ->  Bitmap Index Scan on idx_location_proprietorid_locationid  (cost=0.00..240.30 rows=12777 width=0)
              Index Cond: (s.proprietorid = l.proprietorid)

EXPLAIN ANALYZE:

Nested Loop  (cost=243.50..34315.32 rows=10286 width=12) (actual time=74.097..569.372 rows=525 loops=1)
  ->  Subquery Scan s  (cost=0.00..21.93 rows=1 width=8) (actual time=16.452..21.092 rows=525 loops=1)
        Filter: ((userid = 123456) AND (locationid IS NULL))
        ->  Limit  (cost=0.00..15.30 rows=530 width=102) (actual time=16.434..19.128 rows=530 loops=1)
              ->  Seq Scan on staff  (cost=0.00..15.30 rows=530 width=102) (actual time=16.429..17.545 rows=530 loops=1)
  ->  Bitmap Heap Scan on location l  (cost=243.50..34133.68 rows=12777 width=8) (actual time=1.027..1.029 rows=1 loops=525)
        Recheck Cond: (s.proprietorid = l.proprietorid)
        ->  Bitmap Index Scan on idx_location_proprietorid_locationid  (cost=0.00..240.30 rows=12777 width=0) (actual time=0.151..0.151 rows=1 loops=525)
              Index Cond: (s.proprietorid = l.proprietorid)
Total runtime: 570.868 ms

This confuses me.  As far as I can tell, the EXPLAIN output is the same 
regardless of whether LIMIT 5000 is in there or not.  However, I don't 
know a) why the EXPLAIN ANALYZE plan is different in the first case, 
where there is no LIMIT 5000, or b) why adding a LIMIT 5000 onto a table 
would change anything when the table has only 530 rows in it. 
Furthermore, I can repeat this experiment over and over, so I know that 
it's not caching.  Removing the LIMIT 5000 returns performance to ~45 
seconds.


I've ANALYZEd both tables, so I'm relatively certain statistics are up 
to date.  This is test data, so there are no ongoing 
inserts/updates/deletes -- only selects.


I'd really prefer this query to run in < 1 second rather than > 45, but I'd 
like to do that without hacks like adding in pointless LIMIT clauses.


Any help would be much appreciated.

--Colin McGuigan


[PERFORM] seeking advise on char vs text or varchar in search table

2007-04-22 Thread chrisj

I have a table that contains a column for keywords that I expect to become
quite large and will be used for web searches.  I will either index the
column, or come up with a simple hashing algorithm, add the hash key to the
table, and index that column.

I am thinking the max length in the keyword column I need to support is 30,
but the average would be less than 10.

Any suggestions on whether to use char(30), varchar(30) or text would be
appreciated.  I am looking for the best performance option, not necessarily
the most economical on disk.

Or any other suggestions would be greatly appreciated.




Re: [PERFORM] Basic Q on superfluous primary keys

2007-04-22 Thread Craig A. James

Merlin Moncure wrote:

In the context of this debate, I see this argument all the time, with
the implied suffix: 'If only we used integer keys we would not have
had this problem...'.  Either the customer identifies parts with a
part number or they don't...and if they do identify parts with a
number and recycle the numbers, you have a problem...period.


On the contrary.  You create a new record with the same part number.  You mark the old 
part number obsolete.  Everything else (the part's description, and all the 
relationships it's in, such as order history, catalog inclusion, revision history, 
etc.) is unaffected.  New orders are placed against the new part number's DB record; for 
safety the old part number can have a trigger that prevents new orders from being placed.

Since the part number is NOT the primary key, duplicate part numbers are not a 
problem.  If you had instead used the part number as the primary key, you'd be 
dead in the water.
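
As a rough sketch of the kind of schema being described (the names are invented 
for illustration, not taken from Craig's system):

CREATE TABLE parts (
    part_id     serial PRIMARY KEY,       -- surrogate key; other tables reference this
    part_number text NOT NULL,            -- customer-supplied number, may be reused
    description text,
    obsolete    boolean NOT NULL DEFAULT false
);

CREATE TABLE order_lines (
    order_id  integer NOT NULL,
    part_id   integer NOT NULL REFERENCES parts(part_id),
    quantity  integer NOT NULL
);

-- Reusing a part number just means inserting a new parts row; existing
-- order_lines keep pointing at the old part_id and are unaffected.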

You can argue that the customer is making a dumb decision by reusing catalog 
numbers, and I'd agree.  But they do it, and as database designers we have to 
handle it.  In my particular system, we aggregate information from several 
hundred companies, and this exact scenario happens frequently.  Since we're 
only aggregating information, we have no control over the data that these 
companies provide.  If we'd used catalog numbers for primary keys, we'd have 
big problems.

Craig







Re: [PERFORM] postgres: 100% CPU utilization

2007-04-22 Thread Tom Lane
Sergey Tsukinovsky [EMAIL PROTECTED] writes:
 I'm currently dealing with performance issues of postgres and looking
 for some advice.

 Postgres: 7.0.2

Stop right there.  You have *no* business asking for help on an
installation you have not updated in more than six years.

regards, tom lane



Re: [PERFORM] seeking advise on char vs text or varchar in search table

2007-04-22 Thread Merlin Moncure

On 4/20/07, chrisj [EMAIL PROTECTED] wrote:


I have a table that contains a column for keywords that I expect to become
quite large and will be used for web searches.  I will either index the
column or come up with a simple hashing algorithm add the hash key to the
table and index that column.

I am thinking the max length in the keyword column I need to support is 30,
but the average would be less than 10.

Any suggestions on whether to use char(30), varchar(30) or text, would be
appreciated.  I am looking for the best performance option, not necessarily
the most economical on disk.


Don't use char... it always pads the string out to the declared length.  It
also has no real advantage over varchar in any practical situation.
Think of varchar as text with a maximum length... it's no faster or
slower, but the database will throw out entries based on length (which
can be a good or a bad thing)... in this case, text feels better.
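
In other words, something like the following is the sort of definition being 
suggested (names invented for illustration; the CHECK is optional, shown only 
as the text-with-a-cap alternative to varchar(30)):

CREATE TABLE keywords (
    id      serial PRIMARY KEY,
    keyword text NOT NULL CHECK (length(keyword) <= 30)
);

CREATE INDEX keywords_keyword_idx ON keywords (keyword);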

Have you looked at tsearch2, gist, etc?

merlin
