Re: [PERFORM] application of KNN code to US zipcode searches?
Hello,

I want to report that I have now solved the challenges I ran into using KNN for US zipcode searching. I've found the new approach to be not only viable, but to benchmark about 3x faster for our own real-world application than the previous approach we used, involving cube_distance() and earth_box(). Here are some details about my research so far.

To evaluate it, I installed PostgreSQL 9.1 and a current PostGIS 2.0 snapshot (not yet released as stable).

A primary challenge I had to solve was that KNN is designed for a slightly different problem than the one I needed to solve. I need to answer the question: "What are all the objects in zipcodes within 50 miles of a given zipcode?" However, KNN only directly provides a performance boost to this variation: "What are the N nearest objects to this point?" Just adding a WHERE clause to check the 50-mile rule would erase the benefits of KNN, which works through an ORDER BY clause.

I solved my issue with a WITH clause that creates a pseudo-table called nearby_zipcodes. In this example, I select all the zipcodes that are within 50 miles of the 47374 zipcode. The trick I've employed is that I've set the LIMIT value to 286 -- exactly the number of zipcodes within 50 miles of 47374. My plan is to add a column to my zipcodes table for each of the small number of distances I need to search. Then, when I load new zipcodes I can pre-compute how many zipcodes would be found at each distance.

This approach would not have worked without a WITH clause, or some equivalent, because the number of objects within the radius is not known, but the number of nearby zipcodes is fixed.
This approach lets me get the performance benefits of KNN while still returning exactly those objects within 50 miles of my target zipcode, by JOINing on the nearby_zipcodes table:

  WITH nearby_zipcodes AS (
      SELECT zipcode,
             st_distance_sphere(lonlat_point,
                 (SELECT lonlat_point FROM zipcodes WHERE zipcode = '47374')
             ) / 1609.344 AS radius
        FROM zipcodes
    ORDER BY lonlat_point <->
             (SELECT lonlat_point FROM zipcodes WHERE zipcode = '47374')
       LIMIT 286
  )
  SELECT ...

You might also notice that st_distance_sphere() doesn't mean exactly the same thing as the <-> operator. That's something I could refine going forward.

That's what I've got so far. How could I improve this further?

For reference, here are the key parts of the zipcodes table:

  # \d zipcodes
          Table "public.zipcodes"
       Column    |         Type         | Modifiers
  --------------+----------------------+-----------
   zipcode      | character varying(5) | not null
   lonlat_point | geometry(Point,4326) |
  Indexes:
      "zipcodes_pkey" PRIMARY KEY, btree (zipcode)
      "lonlat_point_idx" gist (lonlat_point)

Thanks for the peer review!

    Mark Stosberg

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
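The gap between st_distance_sphere() and the degree-based point operator can be quantified outside the database. This is a minimal sketch in plain Python (not part of the thread; the coordinates near zipcode 47374 are invented for illustration) showing why a raw-degree distance drifts from true miles as you move east-west:

```python
import math

EARTH_RADIUS_MI = 3958.8  # mean Earth radius, miles

def haversine_miles(lon1, lat1, lon2, lat2):
    """Great-circle distance, the quantity st_distance_sphere() reports."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2)
         * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MI * math.asin(math.sqrt(a))

def point_op_degrees(lon1, lat1, lon2, lat2):
    """What a flat point-distance operator measures: Euclidean distance
    in raw degrees, with no correction for latitude."""
    return math.hypot(lon2 - lon1, lat2 - lat1)

# Hypothetical coordinates near zipcode 47374 (Richmond, IN):
origin = (-84.890, 39.829)
north  = (-84.890, 40.829)  # one degree due north
east   = (-83.890, 39.829)  # one degree due east

# A degree of latitude is ~69 miles everywhere, but a degree of
# longitude shrinks with cos(latitude) -- only ~53 miles here.
print(round(haversine_miles(*origin, *north)))  # 69
print(round(haversine_miles(*origin, *east)))   # 53
# The degree-based operator cannot tell the two apart:
print(point_op_degrees(*origin, *north) == point_op_degrees(*origin, *east))  # True
```

Within a 50-mile circle the distortion is modest at mid-latitudes, which is why the refinement can be deferred, but it is worth measuring before trusting the cutoff.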
[PERFORM] application of KNN code to US zipcode searches?
We perform over 1,000,000 searches each day for "adoptable shelter pets near your zipcode". We already have adequate performance for these searches using the cube contrib, but the new KNN work in 9.1 seemed like it might be a promising way to speed this up even further. I installed PostgreSQL 9.1 on my laptop to evaluate it, using this post as a reference: http://www.depesz.com/index.php/2010/12/11/waiting-for-9-1-knngist/

The first task was to translate a geo-spatial search to use the new KNN syntax. I'm most familiar with two approaches to geo-spatial searching with PostgreSQL. The first is the older earthdistance approach, using point types and the <@> operator. The second is the one we are using now, which uses a cube type, the cube_distance() and earth_box() methods, and a GiST index on the cube type.

Immediately there is a hurdle in that KNN only appears to work with point types and the <-> operator, which does simple point-to-point distance instead of distance-around-the-earth. Still, I thought that could be enough of an approximation to test the waters.

I started with some real-world queries that involved some table joins, and when those failed to show improvement, I worked with some reduced-test-case queries. While I could confirm the new GiST index was being used on the point type, I couldn't get a query to benchmark better when it was invoked. I'm wondering if perhaps US zipcode searches aren't a good use of this technology, perhaps because the data set is too small (about 40,000 zipcodes).

Given that we can already do GiST-indexed searches with the cube type that provide reasonably good approximations for zipcode-radius searches, are others planning to eventually apply the KNN work to US zipcode searches? Sample EXPLAIN output and query times are below.
    Mark

  EXPLAIN ANALYZE
  SELECT zipcode, lon_lat <-> '(-118.412426,34.096629)' AS radius
    FROM zipcodes;

   Seq Scan on zipcodes  (cost=0.00..1257.54 rows=41483 width=22) (actual time=0.019..84.543 rows=41483 loops=1)
   Total runtime: 148.129 ms

  EXPLAIN ANALYZE
  SELECT zipcode, lon_lat <-> '(-118.412426,34.096629)' AS radius
    FROM zipcodes
   ORDER BY lon_lat <-> '(-118.412426,34.096629)';

   Index Scan using zipcodes_knn on zipcodes  (cost=0.00..5365.93 rows=41483 width=22) (actual time=0.451..141.590 rows=41483 loops=1)
     Order By: (lon_lat <-> '(-118.412426,34.096629)'::point)
   Total runtime: 206.392 ms
Re: [PERFORM] application of KNN code to US zipcode searches?
  I thought the benefit of KNN was that you could retrieve the rows in distance order, so that a query for the closest 20 locations (for example) would be very fast. I wouldn't have expected it to be helpful when you're selecting all the rows regardless of distance.

Kevin,

Thanks for the feedback. You are right that my reduced test case wasn't a good approximation. I added a limit, to simulate finding the 100 zipcodes closest to 90210. Below I compare 4 approaches to the same query:

1. Cube search
2. Earthdistance search
3. Simple point distance (no index)
4. Simple point distance (KNN)

Now KNN benchmarks to be almost 100x faster! That's very promising. Then there's only the issue that simple point distance is not expected to be a good enough approximation of earth distances. Perhaps that can be solved by pre-computing coordinates based on the lat/long pairs, much like the map projections used to present a curved surface on a flat map? Given that it's OK to be a few miles off, it seems we have some leeway here. Recommendations?
    Mark

  -- 1. Cube search
  EXPLAIN ANALYZE
  SELECT zipcode,
         cube_distance('(-2513120.64361786, -4645511.0460328, 3575538.9507084)',
                       zipcodes.earth_coords) / 1609.344 AS radius
    FROM zipcodes ORDER BY radius LIMIT 100;

   Limit  (cost=2946.70..2946.95 rows=100 width=62) (actual time=167.650..168.064 rows=100 loops=1)
     ->  Sort  (cost=2946.70..3050.40 rows=41483 width=62) (actual time=167.644..167.829 rows=100 loops=1)
           Sort Key: ((cube_distance('(-2513120.64361786, -4645511.0460328, 3575538.9507084)'::cube, earth_coords) / 1609.344::double precision))
           Sort Method:  top-N heapsort  Memory: 20kB
           ->  Seq Scan on zipcodes  (cost=0.00..1361.24 rows=41483 width=62) (actual time=0.030..90.807 rows=41483 loops=1)
   Total runtime: 168.300 ms

  -- 2. Earthdistance search
  EXPLAIN ANALYZE
  SELECT zipcode, lon_lat <@> '(-118.412426,34.096629)' AS radius
    FROM zipcodes
   ORDER BY lon_lat <@> '(-118.412426,34.096629)' LIMIT 100;

   Limit  (cost=2842.99..2843.24 rows=100 width=22) (actual time=187.995..188.451 rows=100 loops=1)
     ->  Sort  (cost=2842.99..2946.70 rows=41483 width=22) (actual time=187.989..188.149 rows=100 loops=1)
           Sort Key: ((lon_lat <@> '(-118.412426,34.096629)'::point))
           Sort Method:  top-N heapsort  Memory: 20kB
           ->  Seq Scan on zipcodes  (cost=0.00..1257.54 rows=41483 width=22) (actual time=0.033..108.203 rows=41483 loops=1)
   Total runtime: 188.660 ms

  -- 3. Simple point distance, no GiST index
  EXPLAIN ANALYZE
  SELECT zipcode, lon_lat <-> '(-118.412426,34.096629)' AS radius
    FROM zipcodes
   ORDER BY lon_lat <-> '(-118.412426,34.096629)' LIMIT 100;

   Limit  (cost=2842.99..2843.24 rows=100 width=22) (actual time=160.574..161.057 rows=100 loops=1)
     ->  Sort  (cost=2842.99..2946.70 rows=41483 width=22) (actual time=160.568..160.691 rows=100 loops=1)
           Sort Key: ((lon_lat <-> '(-118.412426,34.096629)'::point))
           Sort Method:  top-N heapsort  Memory: 20kB
           ->  Seq Scan on zipcodes  (cost=0.00..1257.54 rows=41483 width=22) (actual time=0.027..84.610 rows=41483 loops=1)
   Total runtime: 161.226 ms

  -- 4. Simple point distance, KNN-GiST index
  EXPLAIN ANALYZE
  SELECT zipcode, lon_lat <-> '(-118.412426,34.096629)' AS radius
    FROM zipcodes
   ORDER BY lon_lat <-> '(-118.412426,34.096629)' LIMIT 100;

   Limit  (cost=0.00..12.94 rows=100 width=22) (actual time=0.447..1.892 rows=100 loops=1)
     ->  Index Scan using zipcodes_knn on zipcodes  (cost=0.00..5365.93 rows=41483 width=22) (actual time=0.440..1.407 rows=100 loops=1)
           Order By: (lon_lat <-> '(-118.412426,34.096629)'::point)
   Total runtime: 2.198 ms
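One cheap version of the "pre-computing coordinates" idea is to scale longitudes by cos(latitude) once at load time, so that plain Euclidean distance -- the only thing the KNN operator computes -- sorts candidates in nearly the right order. This is a hedged sketch in plain Python, not the method used in the thread; the comparison points near the 90210 coordinates are invented:

```python
import math

def haversine_mi(lon1, lat1, lon2, lat2):
    """Reference earth distance in miles."""
    R = 3958.8
    p1, p2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin((p2 - p1) / 2) ** 2
         + math.cos(p1) * math.cos(p2)
         * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * R * math.asin(math.sqrt(a))

def project(lon, lat, ref_lat):
    """Equirectangular projection: shrink longitude by cos(ref_lat).
    Done once at load time, this leaves plain Euclidean math for KNN."""
    return (lon * math.cos(math.radians(ref_lat)), lat)

def flat(p, q):
    return math.hypot(q[0] - p[0], q[1] - p[1])

center = (-118.412426, 34.096629)   # the 90210 point used above
a = (-117.30, 34.10)                # ~1.11 degrees east of center
b = (-118.41, 35.10)                # ~1.00 degrees north of center

# In true miles, a is the closer point (~64 mi vs ~69 mi)...
print(round(haversine_mi(*center, *a)), round(haversine_mi(*center, *b)))
# ...but raw degree distance gets the order backwards:
print(flat(center, a) > flat(center, b))   # True
# After the one-time projection the ordering is right again:
pc, pa, pb = (project(*p, center[1]) for p in (center, a, b))
print(flat(pc, pa) < flat(pc, pb))         # True
```

The projection uses a single reference latitude, so it degrades for a nationwide table; a proper projection per region (or a per-row pre-projected column) would tighten it up.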
Re: [PERFORM] application of KNN code to US zipcode searches?
  PostGIS, geometry columns, and UTM... I'm not sure where they are wrt adding KNN support, but it's something they've been anxious to have for a while, so I expect support will come quickly.

I've looked into this a little more. One approach seems to be to project the lat/long pairs onto a flat plane using the Albers projection (which would be a one-time calculation), and then the current KNN point/distance calculations could be used.

Here's a Perl module that references the Albers projection (although it's not yet clear to me how to use it): http://search.cpan.org/dist/PDL/

And a Wikipedia page on various calculation possibilities: http://en.wikipedia.org/wiki/Geographical_distance#Flat-surface_formulae

Further suggestions welcome.

Thanks,

    Mark
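For the record, the one-time Albers calculation is only a few lines of trigonometry. This is a sketch of the spherical Albers equal-area conic (the standard parallels are the ones conventionally used for the contiguous US -- an assumption, not something from the thread), checking how closely flat distance on the projected plane tracks true earth distance:

```python
import math

R_MI = 3958.8  # spherical earth radius in miles

def albers_xy(lon, lat, lon0=-96.0, lat0=23.0, sp1=29.5, sp2=45.5):
    """Spherical Albers equal-area conic projection. Returns flat (x, y)
    in miles, suitable for populating a pre-projected point column."""
    p0, p1, p2 = (math.radians(v) for v in (lat0, sp1, sp2))
    phi = math.radians(lat)
    lam = math.radians(lon - lon0)
    n = (math.sin(p1) + math.sin(p2)) / 2
    C = math.cos(p1) ** 2 + 2 * n * math.sin(p1)
    rho = R_MI * math.sqrt(C - 2 * n * math.sin(phi)) / n
    rho0 = R_MI * math.sqrt(C - 2 * n * math.sin(p0)) / n
    theta = n * lam
    return rho * math.sin(theta), rho0 - rho * math.cos(theta)

def haversine_mi(lon1, lat1, lon2, lat2):
    p1, p2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin((p2 - p1) / 2) ** 2
         + math.cos(p1) * math.cos(p2)
         * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * R_MI * math.asin(math.sqrt(a))

# Two hypothetical points about 53 miles apart in Indiana:
a, b = (-84.89, 39.83), (-83.89, 39.83)
ax, ay = albers_xy(*a)
bx, by = albers_xy(*b)
flat_mi = math.hypot(bx - ax, by - ay)
true_mi = haversine_mi(*a, *b)
print(flat_mi / true_mi)  # close to 1.0 at this latitude
```

Albers is equal-area rather than distance-preserving, so the ratio drifts by a percent or two between the standard parallels -- well within the "a few miles off" tolerance mentioned earlier in the thread.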
Re: [PERFORM] application of KNN code to US zipcode searches?
I tried again to use KNN for a real-world query, and this time I was able to get it to deliver approximately a 6x speed-up vs the cube search or earthdistance methods (from 300 ms down to 50 ms). I had to make some notable changes for the KNN index to be considered.

- Of course, I had to switch to using the basic point/distance calculation. As previously noted, this still needs more work to confirm the accuracy and to get the distance reported in miles.

- The query planner didn't like it when the ORDER BY referred to a column value instead of a static value, even though I believe it should know that the column value never changes. See this pseudo-query where we look up the coordinates for 90210 once:

    EXPLAIN ANALYZE
    SELECT pets.pet_id,
           zipcodes.lon_lat <-> center.lon_lat AS radius
      FROM (SELECT lon_lat FROM zipcodes WHERE zipcode = '90210') AS center,
           pets
      JOIN shelters USING (shelter_id)
      JOIN zipcodes USING (zipcode)
     ORDER BY zipcodes.lon_lat <-> center.lon_lat
     LIMIT 1000;

  This didn't use the KNN index until I changed the "center.lon_lat" in the ORDER BY to an explicit point value. I'm not sure if that's expected, or something I should take up with -hackers. This could be worked around by doing an initial query to look up this value and then feeding a static value into this query. That's not ideal, but the combination would still be faster.

- I had to drop the part of the WHERE clause which restricted the results to shelters within 50 miles of the target zipcode. However, I could set the LIMIT so high that I would get back enough pets, and then the application could trim out the results. Or perhaps I could push this query down into a sub-select, and let PostgreSQL do a second pass to throw out some of the results.

In any case, with a real-world speed-up of 6x, this looks like it will be worth it to us to continue to investigate.
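For the "push this query down into a sub-select" idea, one possible shape is sketched below. This is untested and hedged: the table names follow the pseudo-query, the point literal stands in for the looked-up 90210 coordinates, and the outer 50-mile cutoff leans on earthdistance's <@> operator (miles between points), which is independent of the KNN <-> ordering:

```sql
-- Untested sketch: KNN orders a generous candidate set cheaply,
-- and an outer query re-applies the 50-mile rule.
SELECT *
  FROM (
        SELECT pets.pet_id,
               zipcodes.lon_lat
          FROM pets
          JOIN shelters USING (shelter_id)
          JOIN zipcodes USING (zipcode)
      ORDER BY zipcodes.lon_lat <-> '(-118.412426,34.096629)'::point
         LIMIT 1000  -- set high enough to cover any 50-mile result set
       ) AS candidates
 WHERE (lon_lat <@> '(-118.412426,34.096629)'::point) < 50;
```

The LIMIT in the inner query is the weak point: it must be generous enough that trimming in the outer query never starves the result set.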
Re: [PERFORM] application of KNN code to US zipcode searches?
On 02/17/2011 03:17 PM, Oleg Bartunov wrote:

  Mark, we are investigating pgsphere (http://pgsphere.projects.postgresql.org/) to see if we could add KNN support.

Great, thanks Oleg. I'll be happy to test it when something is ready.

    Mark
[PERFORM] Re: getting the most of out multi-core systems for repeated complex SELECT statements
On 02/03/2011 10:57 AM, gnuo...@rcn.com wrote:

  For PG to prosper in the future, it has to embrace the multi-core/processor/SSD machine at the query level.

As the person who brought up the original concern, I'll add that multi-core at the query level really isn't important for us. Most of our PostgreSQL usage is through a web application which fairly automatically takes advantage of multiple cores, because there are several parallel connections.

A smaller but important piece of what we do is a cron script that needs to run hundreds of thousands of variations of the same complex SELECT as fast as it can. What honestly would have helped most is not technical at all -- it would have been some documentation on how to take advantage of multiple cores for this case. It looks like it's going to be trivial: divide up the data with a modulo, and run multiple parallel cron scripts that each process a slice of the data. A benchmark showed that this approach sped up our processing 3x when splitting the application 4 ways across 4 processors. (I think we failed to achieve a 4x improvement because the server was already busy handling some other tasks.)

Part of our case is likely fairly common *today*: many servers are multi-core now, but people don't necessarily understand how to take advantage of that if it doesn't happen automatically.

    Mark
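The modulo split described above can be sketched in a few lines. This is an illustration, not the actual script -- it assumes each saved search has an integer primary key; worker k of N handles only rows where id % N == k, and each worker would run as its own cron job with its own database connection (the SQL equivalent being WHERE id % 4 = 0, ... = 1, and so on):

```python
# Partition saved-search IDs across parallel workers with a modulo.
NUM_WORKERS = 4

def ids_for_worker(slot, all_ids, workers=NUM_WORKERS):
    """The slice of IDs owned by one worker process."""
    return [i for i in all_ids if i % workers == slot]

saved_search_ids = list(range(100_000))
slices = [ids_for_worker(k, saved_search_ids) for k in range(NUM_WORKERS)]

# Each search lands in exactly one slice: nothing skipped, nothing run twice.
print([len(s) for s in slices])  # [25000, 25000, 25000, 25000]
```

The appeal of the modulo over an odd/even split is that it generalizes to any worker count without re-thinking the partitioning.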
[PERFORM] getting the most of out multi-core systems for repeated complex SELECT statements
Each night we run over 100,000 saved searches against PostgreSQL 9.0.x. These are all complex SELECTs using cube functions to perform a geo-spatial search to help people find adoptable pets at shelters.

All of our machines in development and production have at least 2 cores in them, and I'm wondering about the best way to maximally engage all the processors. Right now we simply run the searches serially. I realize PostgreSQL may be taking some advantage of the multiple cores in this arrangement, but I'm seeking advice about the possibility and methods for running the searches in parallel.

One naive approach I considered was to use parallel cron scripts: one would run the "odd" searches and the other would run the "even" searches. This would be easy to implement, but perhaps there is a better way. To those who have covered this area already: what's the best way to put multiple cores to use when running repeated SELECTs with PostgreSQL?

Thanks!

    Mark
Re: [PERFORM] getting the most of out multi-core systems for repeated complex SELECT statements
On 02/03/2011 10:54 AM, Oleg Bartunov wrote:

  Mark, you could try the gevel module to get the structure of a GiST index and check whether items are distributed more or less homogeneously (see the different levels). You can visualize the index like http://www.sai.msu.su/~megera/wiki/Rtree_Index Also, if your searches are neighbourhood searches, then you could try KNN, available in the 9.1 development version.

Oleg,

Those are interesting details to consider. I read more about KNN here: http://www.depesz.com/index.php/2010/12/11/waiting-for-9-1-knngist/

Will I be able to use it to improve the performance of finding nearby zipcodes? It sounds like KNN has great potential for performance improvements!

    Mark
[PERFORM] debugging handle exhaustion and 15 min/ 5mil row delete
Hello,

We've been a satisfied user of PostgreSQL for several years, and use it to power a national pet adoption website: http://www.adoptapet.com/

Recently we've had a regularly-timed middle-of-the-night problem where database handles are exhausted for a very brief period. In tracking it down, I have found that the event seems to correspond to a time when a cron script is deleting from a large logging table, but I'm not certain if this is the cause or a correlation.

We are deleting about 5 million rows from a time-based logging table that is replicated by Slony. We are currently using a single DELETE statement, which takes about 15 minutes to run. There is no RI on the table, but the use of Slony means that a trigger call and action is made for every row deleted, which causes a corresponding insertion in another table so the deletion can be replicated to the slave.

My questions:

- Could this kind of activity lead to an upward spiral in database handle usage?

- Would it be advisable to use several small DELETE statements instead, to delete rows in batches of 1,000? We could use the recipe for this that was posted earlier to this list:

    delete from table
    where pk in (select pk from table where delete_condition limit X);

Partitions seem attractive here, but aren't easy to use with Slony. Perhaps once we migrate to PostgreSQL 9.0 and the hot standby feature we can consider that.

Thanks for your help!

    Mark Stosberg            Principal Developer
    m...@summersault.com     Summersault, LLC     765-939-9301 ext 202
    database driven websites http://www.summersault.com/
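The batched-delete recipe quoted above is easy to demonstrate end-to-end. In this sketch SQLite stands in for PostgreSQL so the example is self-contained; the table name and delete condition are placeholders, not the real logging schema:

```python
import sqlite3

# Build a throwaway logging table with 5,000 rows, 4,000 of which match
# the (placeholder) delete condition.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE activity_log (id INTEGER PRIMARY KEY, level INTEGER)")
conn.executemany("INSERT INTO activity_log VALUES (?, ?)",
                 [(i, i % 10) for i in range(5000)])

BATCH = 1000
batches = 0
while True:
    # delete from table where pk in
    #   (select pk from table where delete_condition limit X)
    cur = conn.execute(
        "DELETE FROM activity_log WHERE id IN "
        "  (SELECT id FROM activity_log WHERE level < 8 LIMIT ?)", (BATCH,))
    conn.commit()  # each batch commits separately, keeping transactions short
    batches += 1
    if cur.rowcount < BATCH:
        break

print(batches)  # 5 -- four full batches of 1000, then a final short pass
print(conn.execute("SELECT count(*) FROM activity_log").fetchone()[0])  # 1000
```

With Slony in the picture, the short transactions also mean the replication trigger's shadow-table inserts are flushed in small pieces instead of one 5-million-row burst.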
Re: [PERFORM] debugging handle exhaustion and 15 min/ 5mil row delete
  You can use TRUNCATE instead of DELETE. TRUNCATE is more efficient and faster than DELETE.

Thanks for the suggestion. However, TRUNCATE is not compatible with Slony, and we also have some rows which remain in the table.

  Now, we need more information about your system to give you a certain solution:

  Are you using a RAID controller for your data?

Yes.

  Have you separated the xlog directory from the data directory?

No.

  Which is your operating system?

FreeBSD.

  Which is your architecture?

i386.

Thanks for the feedback. I'm going to try batching the deletes for now, an approach that has worked well for some of our other long-running deletes.

    Mark
[PERFORM] Fw: Help me put 2 Gigs of RAM to use
Hello,

PostgreSQL has served us very well powering a busy national pet adoption website. Now I'd like to tune our setup further to get more out of the hardware. What I'm noticing is that while the FreeBSD server has 4 Gigs of memory, there are rarely ever more than 2 in use -- the memory use graphs as being rather constant. My goal is to make good use of those 2 Gigs of memory to improve performance and reduce CPU usage. The server has 4 2.33 GHz processors in it, and RAIDed 15k RPM SCSI disks.

Here are some current memory-related settings from our postgresql.conf file. (We currently run 8.2, but are planning an upgrade to 8.4 soon.) Do you see any obvious suggestions for improvement? I find the file a bit hard to read because of the lack of units in the examples, but perhaps that's already been addressed in later versions.

  max_connections      = 400     # Seems to be enough for us
  shared_buffers       = 8192
  effective_cache_size = 1000
  work_mem             = 4096
  maintenance_work_mem = 160MB

Thanks for your suggestions!

    Mark

[I tried to post this yesterday but didn't see it come through. This message is a second attempt.]

--
    Mark Stosberg            Principal Developer
    m...@summersault.com     Summersault, LLC     765-939-9301 ext 202
    database driven websites http://www.summersault.com/
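It's worth working the unitless settings out explicitly. In 8.2, shared_buffers and effective_cache_size are counted in pages and work_mem in KB when no unit is given; assuming the default 8 KB block size (an assumption -- it is a compile-time option), the values above are smaller than they may look:

```python
PAGE_KB = 8  # default PostgreSQL block size, in KB (compile-time option)

shared_buffers_mb = 8192 * PAGE_KB / 1024        # 8192 pages
effective_cache_mb = 1000 * PAGE_KB / 1024       # 1000 pages
work_mem_mb = 4096 / 1024                        # work_mem is in KB

print(shared_buffers_mb)   # 64.0   -> 64 MB of a 4 GB machine
print(effective_cache_mb)  # 7.8125 -> planner told to expect ~8 MB of OS cache
print(work_mem_mb)         # 4.0
```

The effective_cache_size figure stands out: it only informs the planner, but telling it the OS cache is ~8 MB on a box with gigabytes of inactive memory makes index scans look more expensive than they really are.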
Re: [PERFORM] Help me put 2 Gigs of RAM to use
Thanks for the response, Matthew.

On Thu, 10 Dec 2009, Mark Stosberg wrote:

    What I'm noticing is that while the FreeBSD server has 4 Gigs of memory, there are rarely ever more than 2 in use -- the memory use graphs as being rather constant. My goal is to make good use of those 2 Gigs of memory to improve performance and reduce the CPU usage.

  I think you'll find that the RAM is already being used quite effectively as disc cache by the OS. It sounds like the server is actually set up pretty well. You may get slightly better performance by tweaking a thing here or there, but the server needs some OS disc cache to perform well.

In reviewing this, it appears that the OS is only addressing 3 of the 4 Gigs of memory. We'll work on our FreeBSD setup to cure that. Here's how top reports the memory breakdown:

  Mem: 513M Active, 2246M Inact, 249M Wired, 163M Cache, 112M Buf, 7176K Free
  Swap: 9216M Total, 1052K Used, 9215M Free

So perhaps the OS disc cache is represented in the "Inact" memory statistic? I suppose once we have the 4th Gig of memory actually available, that would all be going to the disk cache.

    (We currently run 8.2, but are planning an upgrade to 8.4 soon.)

  Highly recommended.

For performance improvements in particular?

    Mark Stosberg            Principal Developer
    m...@summersault.com     Summersault, LLC     765-939-9301 ext 202
    database driven websites http://www.summersault.com/
Re: [PERFORM] recommendations for web/db connection pooling or DBD::Gofer reviews
  Under heavy load, Apache has the usual failure mode of spawning so many threads/processes and database connections that it just exhausts all the memory on the webserver and also kills the database. As usual, I would use lighttpd as a frontend (also serving static files) to handle the large number of concurrent connections to clients, and then have it funnel this to a reasonable number of perl backends, something like 10-30. I don't know if fastcgi works with perl, but with PHP it certainly works very well. If you can't use fastcgi, use lighttpd as a HTTP proxy and apache with mod_perl behind. The recipe for good handling of heavy load is an asynchronous server (which by design can handle any number of concurrent connections, up to the OS' limit) in front of a small number of dynamic webpage generating threads/processes.

Thanks for the response. To be clear, it sounds like you are advocating a different approach to scaling the number of connections: limiting the number of web server processes. So the front-end proxy would have a high number of max connections, say 200, and it would connect to another httpd/mod_perl server behind it with a lower number of max connections, say 20. If the backend httpd server was busy, the proxy connection to it would just wait in a queue until it was available.

Is that the kind of design you had in mind? That seems like a reasonable option as well. We already have some lightweight Apache servers in use on the project which currently just serve static content.

    Mark
[PERFORM] Re: what worked: performance improvements for geo-spatial searching on FreeBSD
Oleg Bartunov wrote:

  Mark, do you know about our sky segmentation code Q3C? See details: http://www.sai.msu.su/~megera/wiki/SkyPixelization We use it for billions of objects in the database and are quite happy.

Oleg,

Thanks for the response. That sounds interesting, but it's not clear to me how I would put together a geo-spatial search calculating distances around the curvature of the earth using this technique. Is there a SQL sample for this that you could point to?

Also, I didn't recognize the names of the techniques you benchmarked against, RADEC and Rtree -- are either of these related to the earthdistance or cube() based searches I would have used already?

    Mark
Re: [PERFORM] bulk insert performance problem
Christian Bourque wrote:

  Any idea? Are there any other improvements I could do?

Are you using the COPY syntax in the import script, or individual INSERT statements? Using COPY will always be *much* faster. I believe COPY always appends to tables rather than replacing the contents; you can combine this technique with splitting the task into multiple COPY statements, but that has never been necessary in my case -- switching from INSERTs to a COPY statement has always provided the huge performance improvement I needed.

It's easy to confuse pg_dump -d with psql -d ... it's too bad they mean very different things. For pg_dump, -d causes INSERT statements to be generated instead of a COPY statement. It has been a mistake I've made in the past, because I expected it to work like psql -d, where -d means "database name". I suppose the safe thing to do is to avoid using -d altogether!

    Mark
[PERFORM] what worked: performance improvements for geo-spatial searching on FreeBSD
The http://www.1-800-save-a-pet.com site is hosted with FreeBSD and PostgreSQL and has a geo-spatial search as a central feature.

One thing that made a substantial performance improvement was switching from the geo_distance() search in the earthdistance contrib to the cube-based geo-spatial calculations, also available in contrib/. In our case, the slight loss of precision between the two methods didn't matter.

The other thing that made a noticeable performance improvement was upgrading our servers from FreeBSD 4.x or 5.x (old, I know!) to FreeBSD 6.2 or later. We also upgraded these systems from PostgreSQL 8.1 to 8.2 at the same time. I assume the upgrade to 8.2 must be responsible at least in part for the performance gains. The result of these two rounds of updates is that our overall CPU capacity in the cluster seems to be double or triple what it was before.

As the site grows we continue to be very happy with the performance, features and stability of PostgreSQL.

    Mark
[PERFORM] recommendations for web/db connection pooling or DBD::Gofer reviews
When traffic to our PostgreSQL-backed website spikes, the first resource we see being exhausted is the DB slots on the master server (currently set to about 400). I expect that as new Apache/mod_perl children are being put to use, they are creating new database connections.

I'm interested in recommendations to funnel more of that traffic through fewer DB slots, if that's possible. (We could also consider increasing the handles available, since the DB server has some CPU and memory to spare.)

I'm particularly interested in a review of DBD::Gofer, which seems like it would help with this in our Perl application: http://search.cpan.org/dist/DBI/lib/DBD/Gofer.pm

I realize it has limitations, like no transactions, but I think we would still be able to use it selectively in our application.

    Mark
[PERFORM] peer-review of PgBouncer / Slony design
  I would stick to proven postgresql technologies such as pgbouncer.

Thanks for the fast recommendation, Joshua. I'll consider it. Our application is a Slony-replicated web/db project with two slaves. Does this design seem sensible?

- Run one pgbouncer server on the master, with settings to service the master and both slaves.

- We already handle balancing traffic between the slaves separately, so that can remain unchanged.

- Use session pooling for both the master and the slaves. In theory, the slaves should just be doing transaction-less SELECT statements, so a more aggressive setting might be possible, but I believe there might be a leak in the logic where we create a temporary table on the slave in one case.

- Redirect all application connections through pgbouncer.

From graphs we keep, we can see that the slaves currently use a max of about 64 connections -- they are far from maxing out what's possible. So I was trying to think through whether it made sense to bother using the PgBouncer layer with them. I thought of two potential reasons to still use it:

- In the event of a major traffic spike on the web servers, pgbouncer would keep the number of db slots under control.

- Potentially there's a performance gain in having PgBouncer hold the connections open.

Does that analysis seem correct?

For the master's pool size, I thought I would just choose a number that's a little larger than the daily max number of DB slots in use.

    Mark
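A hedged sketch of what the single pgbouncer instance might look like in pgbouncer.ini -- the host names, labels, auth choice, and pool sizes here are invented for illustration, not taken from the actual setup:

```ini
[databases]
; one logical entry per node this bouncer fronts (names are hypothetical)
main   = host=db-master port=5432 dbname=production
slave1 = host=db-slave1 port=5432 dbname=production
slave2 = host=db-slave2 port=5432 dbname=production

[pgbouncer]
listen_addr = *
listen_port = 6432
auth_type = trust            ; placeholder -- use md5/userlist in practice
pool_mode = session          ; safest given the temp-table case on a slave
max_client_conn = 1000       ; web children connect here, cheaply
default_pool_size = 50       ; actual server-side slots per database
```

With session pooling, a client holds its server connection for the life of its session, so the win over raw connections is mostly in absorbing spikes and avoiding connection setup cost -- matching the two reasons listed above.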
[PERFORM] low memory usage reported by 'top' indicates poor tuning?
Hello,

I'm trying to make sense of the memory usage reported by 'top', compared to what pg_database_size shows. Here's one result:

  select pg_size_pretty(pg_database_size('production'));
   pg_size_pretty
  ----------------
   6573 MB

Now, looking at memory use with top, there is a lot of memory that isn't being used on the system:

  Mem: 470M Active, 2064M Inact    (3 Gigs RAM, total)

Overall performance is decent, so maybe there's no problem. However, I wonder if we've under-allocated memory to PostgreSQL. (This is a dedicated FreeBSD DB server.) Some memory settings include:

  shared_buffers = 8192   (we have 450 connections)
  max_fsm_pages  = 125    (we kept getting HINTs to bump it, so we did)

Maybe we should be bumping up sort_mem and vacuum_mem as well? I do sometimes see sorting and vacuuming show up as things I'd like to run faster.

This list has been a great resource for performance tuning help, and I continue to appreciate your help. We've used PostgreSQL on every project we've had a choice on for the last 10 years. (Has it been that long?!) We've never regretted it once.

    Mark

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster
Re: [PERFORM] Writting a search engine for a pgsql DB
Joshua D. Drake wrote:

  Madison Kelly wrote:

    Hi all, I am asking in this list because, at the end of the day, this is a performance question. I am looking at writing a search engine of sorts for my database. I have only ever written very simple search engines before, which amounted to not much more than the query string being used with ILIKE on a pile of columns. This was pretty rudimentary and didn't offer anything like relevance sorting and such (I'd sort by result name, age or whatnot). So I am hoping some of you guys and gals might be able to point me towards some resources or offer some tips or gotchas before I get started on this. I'd really like to come up with a more intelligent search engine that doesn't take two minutes to return results. :) I know, in the end, good indexes and underlying hardware will be important, but a query structure that's as sane as possible helps to start with.

  See search.postgresql.org; you can download all the source from gborg.postgresql.org.

Joshua,

What's the name of the project referred to? There's nothing named "search" hosted on Gborg, according to this project list: http://gborg.postgresql.org/project/projdisplaylist.php

Madison,

For small data sets and simpler searches, the approach you have been using can be appropriate. You may just want a small tool in a regular programming language to help build the query. I wrote such a tool for Perl: http://search.cpan.org/~markstos/SQL-KeywordSearch-1.11/lib/SQL/KeywordSearch.pm

For large or complex searches, a more specialized search system may be appropriate. I suspect that's the kind of tool Joshua is referencing.

    Mark
Re: [PERFORM] Writing a search engine for a pgsql DB
Madison Kelly wrote: I think the more direct question I was trying to get at is: How do you build a 'relevance' search engine? One where results are returned/sorted by relevance of some sort? At this point, the best I can think of would be to perform multiple queries; first matching the whole search term, then the search term starting a row, then ending a row, then anywhere in a row, and scoring the results based on which query they came out on. This seems terribly cumbersome (and probably slow, indexes be damned) though. I'm hoping there is a better way! :) Madison, I think your basic thinking is correct. However, the first select can be done offline -- sometime beforehand. For example, you might create a table called keywords that includes the list of words mined from the other tables, along with references to where the words are found, and how many times they are mentioned. Then, when someone actually searches, the search is primarily on the keywords table, which now provides a way to sort by rank, since the table contains how many times each keyword matches. The final result can be constructed by using the details in the keywords table to pull up the actual records needed. My expectation, however, is that there are enough details to get right in such a system that I would first look at trying a package like tsearch2 to help solve the problem, before trying to write another system like this from scratch. Mark ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
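A minimal sketch of the keywords-table idea described above (all table and column names here are hypothetical):

```sql
-- Built offline by a mining pass over the searchable tables:
CREATE TABLE keywords (
    word       text    NOT NULL,
    table_name text    NOT NULL,   -- which table the word was mined from
    record_id  integer NOT NULL,   -- the row it appears in
    mentions   integer NOT NULL    -- how many times it appears there
);
CREATE INDEX keywords_word_idx ON keywords (word);

-- At search time, rank records by total mentions across the search terms:
SELECT table_name, record_id, sum(mentions) AS rank
  FROM keywords
 WHERE word IN ('dog', 'friendly')
 GROUP BY table_name, record_id
 ORDER BY rank DESC;
```

The ranked (table_name, record_id) pairs are then used to fetch the real rows, so the expensive text matching never touches the base tables at query time.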
Re: [PERFORM] Writing a search engine for a pgsql DB
Now see, this is exactly the kind of sagely advice I was hoping for! :) I'll look into tsearch2, and failing that for some reason, I love the keyword table idea. For example keyword search code, you can try this package: http://downloads.sourceforge.net/cascade/cascade-devel-pieces-1.1.tgz?modtime=999556617&big_mirror=0 There is a keywords subdirectory with the Perl and SQL. I'm sure this code is not ideal in a number of ways: 1. It's from 2001. 2. It doesn't actually function on its own anymore. However, you can read the code and get ideas. 3. I'm sure someone has a better looking/functioning example! Anyway, it's there if you want to take a look. Mark ---(end of broadcast)--- TIP 6: explain analyze is your friend
[PERFORM] Using the 8.2 autovacuum values with 8.1
Thanks to everyone for the feedback about vacuuming. It's been very useful. The pointers to the pgstattuple and pgFouine tools were also helpful. I'm now considering the following plan for trying autovacuuming again with 8.1. I'd like any peer review you have to offer of the following: 1. First, I'll move the settings to match the defaults in 8.2. The ones I noticed in particular were: autovacuum_vacuum_threshold: 1000 -> 500 autovacuum_analyze_threshold: 500 -> 250 autovacuum_vacuum_scale_factor: 0.4 -> 0.2 autovacuum_analyze_scale_factor: 0.2 -> 0.1 2. Try the vacuum cost delay feature, starting with a 20ms value: autovacuum_vacuum_cost_delay = 20 3. Immediately add a row to pg_autovacuum for a huge logging table that would be too slow to vacuum usually. We'll still vacuum it once a week via cron for good measure. 4. I think I'll still keep the nightly cron entry that does a complete vacuum analyze (except for that large table...). Seem like a reasonable plan? Mark ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
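For step 3, the 8.1-era way to give one table its own autovacuum policy is a row in the pg_autovacuum system catalog. A sketch that excludes the table entirely; "activity_log" is a hypothetical table name, and -1 means "use the default":

```sql
-- Exclude the huge logging table from autovacuum (8.1/8.2 catalog layout).
INSERT INTO pg_autovacuum
       (vacrelid, enabled,
        vac_base_thresh, vac_scale_factor,
        anl_base_thresh, anl_scale_factor,
        vac_cost_delay, vac_cost_limit)
VALUES ('activity_log'::regclass, false, -1, -1, -1, -1, -1, -1);
```

Setting enabled = false skips the table; alternatively, leaving it enabled with custom thresholds lets autovacuum visit it, just far less often.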
Re: [PERFORM] How to debug performance problems
Ray Stell wrote: I'd like to have a toolbox prepared for when performance goes south. I'm clueless. Would someone mind providing some detail about how to measure these four items Craig listed: 1. The first thing is to find out which query is taking a lot of time. 2. A long-running transaction keeps vacuum from working. 3. A table grows just enough to pass a threshold in the planner and a drastically different plan is generated. I just ran into a variation of this: 3.5 A table grows so large that VACUUMing it takes an extremely long time, interfering with the general performance of the system. In our case, we think the table had about 36 million rows when it hit that threshold. I'm now working on a better solution for that table. Mark ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
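For item 1, one concrete tool worth having in that toolbox is log_min_duration_statement; the 250 ms threshold below is only an example:

```sql
-- In postgresql.conf:  log_min_duration_statement = 250   (milliseconds)
-- Or per-session, as a superuser, before reproducing the slowness:
SET log_min_duration_statement = 250;
-- Slow statements then appear in the server log with their runtimes,
-- ready to be fed to EXPLAIN ANALYZE or a log analyzer like pgFouine.
```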
[PERFORM] How to avoid vacuuming a huge logging table
Our application has a table that is only logged to, and infrequently used for reporting. There are generally no deletes or updates. Recently, the sheer size (an estimated 36 million rows) caused a serious problem because it prevented a vacuum analyze on the whole database from finishing in a timely manner. As I understand it, a table with this usage pattern wouldn't need to be vacuumed anyway. I'm looking for general advice from people who have faced the same issue. I'm looking at a number of alternatives: 1. Once a month, we could delete and archive old rows, for possible re-import later if we need to report on them. It would seem this would need to be done as proper insert statements for re-importing. (Maybe there is a solution for that with table partitioning?) 2. We could find a way to exclude the table from vacuuming, and let it grow even larger. Putting the table in its own database would accomplish that, but it would be nice to avoid the overhead of a second database connection. 3. Take a really different approach: log in CSV format to text files instead, and only import the date ranges we need on demand if a report is requested on the data. Thanks for any tips. Mark ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
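For options 1 and 3, COPY may be the simplest archive mechanism. A sketch, assuming a hypothetical activity_log table with a logged_at timestamp; note that COPY from a query requires 8.2 (on 8.1 you'd stage the rows in a temporary table and COPY that):

```sql
BEGIN;
-- Archive everything older than a month to CSV for possible re-import:
COPY (SELECT * FROM activity_log
       WHERE logged_at < now() - interval '1 month')
  TO '/archive/activity_log.csv' WITH CSV;
-- Then trim the live table:
DELETE FROM activity_log
 WHERE logged_at < now() - interval '1 month';
COMMIT;
-- Re-import later with:
--   COPY activity_log FROM '/archive/activity_log.csv' WITH CSV;
```

CSV round-trips through COPY directly, so the archive doesn't need to be rewritten as INSERT statements.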
[PERFORM] Auto-Vacuum in 8.1 was ineffective for me. 8.2 may work better?
When I upgraded a busy database system to PostgreSQL 8.1, I was excited about AutoVacuum, and promptly enabled it, and turned off the daily vacuum process. (I set the following, as well as the option to enable autovacuuming: stats_start_collector = true stats_row_level = true) I could see in the logs that related activity was happening, but within a few days, the performance became horrible, and enabling the regular vacuum fixed it. Eventually autovacuum was completely disabled. What could have happened? Is 8.2 more likely to just work in this regard? Is there table-specific tuning that I would have needed to do? I realize getting autovacuuming to work could be one way to exclude the large table I wrote about in a recent post. Mark ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PERFORM] reindex vs 'analyze'
Tom Lane wrote: Mark Stosberg [EMAIL PROTECTED] writes: Your suggestion about the pet_state index was right on. I tried Analyze on it, but still got the same bad estimate. However, I then used reindex on that index, and that fixed the estimate accuracy, which made the query run faster! No, the estimate is about the same, and so is the plan. The data seems to have changed though --- on Monday you had - Bitmap Index Scan on pets_pet_state_idx (cost=0.00..562.50 rows=39571 width=0) (actual time=213.620..213.620 rows=195599 loops=82) Index Cond: ((pet_state)::text = 'available'::text) and now it's - Bitmap Index Scan on pets_pet_state_idx (cost=0.00..285.02 rows=41149 width=0) (actual time=22.043..22.043 rows=40397 loops=82) Index Cond: ((pet_state)::text = 'available'::text) Don't tell me you got 155000 pets adopted out yesterday ... what happened here? [ thinks... ] One possibility is that those were dead but not-yet-vacuumed rows. What's your vacuuming policy on this table? (A bitmap-index-scan plan node will count dead rows as returned, unlike all other plan node types, since we haven't actually visited the heap yet...) Today I noticed a combination of related mistakes here. 1. The Vacuum commands were being logged to a file that didn't exist. I'm not sure if this prevented them from being run. I had copied the cron entry from another machine, but neglected to create /var/log/pgsql: vacuumdb -z --table pets -d saveapet >> /var/log/pgsql/vacuum.log 2>&1 ### However, I again noticed that the row counts were horribly off on the 'pet_state' index, and again used REINDEX to fix it. (Examples below). However, if the VACUUM ANALYZE wasn't actually run, that does seem like it could have been related. I'll have to see how things are tomorrow after a full round of database vacuuming. Mark - Bitmap Index Scan on pets_pet_state_idx (cost=0.00..337.29 rows=39226 width=0) (actual time=77.158..77.158 rows=144956 loops=81) Index Cond: ((pet_state)::text = 'available'::text) Total runtime: 8327.261 ms - Bitmap Index Scan on pets_pet_state_idx (cost=0.00..271.71 rows=39347 width=0) (actual time=15.466..15.466 rows=40109 loops=81) Index Cond: ((pet_state)::text = 'available'::text) Total runtime: 1404.124 ms ---(end of broadcast)--- TIP 6: explain analyze is your friend
[PERFORM] reindex vs 'analyze' (was: Re: cube operations slower than geo_distance() on production server)
Merlin Moncure wrote: On 2/14/07, Tom Lane [EMAIL PROTECTED] wrote: There are two things wrong here: first, that the estimated row count is only 20% of actual; it should certainly not be that far off for such a simple condition. I wonder if your vacuum/analyze procedures are actually working. Second, you mentioned somewhere along the line that 'available' pets are about 10% of all the entries, which means that this indexscan is more than likely entirely counterproductive: it would be cheaper to ignore this index altogether. Tom, Thanks for the generosity of your time. We are using 8.1.3 currently. I have read there are some performance improvements in 8.2, but we have not started evaluating that yet. Your suggestion about the pet_state index was right on. I tried Analyze on it, but still got the same bad estimate. However, I then used reindex on that index, and that fixed the estimate accuracy, which made the query run faster! The cube search now benchmarks faster than the old search in production, taking about 2/3 of the time of the old one. Any ideas why the manual REINDEX did something that analyze didn't? It makes me wonder if there is other tuning like this to do. Attached is the EXPLAIN ANALYZE output from the most recent run, after the re-index. I think switching the index on pet_state to a composite on (pet_state, species_id) might help too. or even better: create function is_pet_available(text) returns bool as $$ select $1='available'; $$ language sql immutable; create index pets_available_species_idx on pets(is_pet_available(pet_state), species_id); Merlin, Thanks for this suggestion. It is not an approach I had used before, and I was interested to try it. However, the new index didn't get chosen. (Perhaps I would need to drop the old one?) However, Tom's suggestions did help. I'll follow up on that in just a moment.
refactor your query to something similar to: SELECT * FROM ( SELECT cube_distance(q.earth_coords, s.earth_coords)/1609.344 as radius FROM pets JOIN shelters_active as shelters USING (shelter_id) JOIN zipcodes s ON shelters.postal_code_for_joining = s.zipcode JOIN zipcodes q ON q.zipcode = '90210' WHERE is_pet_available(pet_state) AND species_id = 1 AND earth_box(q.earth_coords, 10*1609.344) @ s.earth_coords ) p order by radius merlin ---(end of broadcast)--- TIP 6: explain analyze is your friend Sort (cost=5276.93..5277.00 rows=28 width=64) (actual time=1981.830..1984.415 rows=1344 loops=1) Sort Key: (cube_distance(public.zipcodes.earth_coords, public.zipcodes.earth_coords) / 1609.344::double precision) - Nested Loop (cost=291.32..5276.26 rows=28 width=64) (actual time=24.080..1976.479 rows=1344 loops=1) - Nested Loop (cost=2.15..575.79 rows=11 width=68) (actual time=2.637..34.067 rows=131 loops=1) - Nested Loop (cost=2.15..153.48 rows=42 width=73) (actual time=1.939..3.972 rows=240 loops=1) - Index Scan using zipcodes_pkey on zipcodes (cost=0.00..3.01 rows=1 width=32) (actual time=0.283..0.287 rows=1 loops=1) Index Cond: ((zipcode)::text = '90210'::text) - Bitmap Heap Scan on zipcodes (cost=2.15..149.84 rows=42 width=41) (actual time=1.403..2.323 rows=240 loops=1) Recheck Cond: (cube_enlarge((outer.earth_coords)::cube, 16093.4357308298::double precision, 3) @ zipcodes.earth_coords) - Bitmap Index Scan on zip_earth_coords_idx (cost=0.00..2.15 rows=42 width=0) (actual time=1.377..1.377 rows=240 loops=1) Index Cond: (cube_enlarge((outer.earth_coords)::cube, 16093.4357308298::double precision, 3) @ zipcodes.earth_coords) - Index Scan using shelters_postal_code_for_joining_idx on shelters (cost=0.00..10.03 rows=2 width=12) (actual time=0.064..0.118 rows=1 loops=240) Index Cond: ((shelters.postal_code_for_joining)::text = (outer.zipcode)::text) Filter: ((shelter_state)::text = 'active'::text) - Bitmap Heap Scan on pets (cost=289.17..426.86 rows=35 width=4) (actual
time=14.362..14.746 rows=10 loops=131) Recheck Cond: ((pets.shelter_id = outer.shelter_id) AND ((pets.pet_state)::text = 'available'::text)) Filter: (species_id = 1) - BitmapAnd (cost=289.17..289.17 rows=35 width=0) (actual time=14.219..14.219 rows=0 loops=131) - Bitmap Index Scan on pets_shelter_id_idx (cost=0.00..3.89 rows=256 width=0) (actual time=0.188..0.188 rows=168 loops=131) Index Cond: (pets.shelter_id = outer.shelter_id) - Bitmap Index Scan on pets_pet_state_idx (cost=0.00..285.02 rows=41149 width=0) (actual time=22.043..22.043 rows=40397 loops=82) Index Cond: ((pet_state)::text = 'available'::text)
Re: [PERFORM] reindex vs 'analyze'
On Wed, Feb 14, 2007 at 01:07:23PM -0500, Tom Lane wrote: Mark Stosberg [EMAIL PROTECTED] writes: Your suggestion about the pet_state index was right on. I tried Analyze on it, but still got the same bad estimate. However, I then used reindex on that index, and that fixed the estimate accuracy, which made the query run faster! No, the estimate is about the same, and so is the plan. The data seems to have changed though --- on Monday you had - Bitmap Index Scan on pets_pet_state_idx (cost=0.00..562.50 rows=39571 width=0) (actual time=213.620..213.620 rows=195599 loops=82) Index Cond: ((pet_state)::text = 'available'::text) and now it's - Bitmap Index Scan on pets_pet_state_idx (cost=0.00..285.02 rows=41149 width=0) (actual time=22.043..22.043 rows=40397 loops=82) Index Cond: ((pet_state)::text = 'available'::text) Don't tell me you got 155000 pets adopted out yesterday ... what happened here? That seemed to be the difference that the reindex made. The number of rows in the table and the number marked available is roughly unchanged. select count(*) from pets; 304951 (1 row) select count(*) from pets where pet_state = 'available'; --- 39857 It appears just about 400 were marked as adopted yesterday. [ thinks... ] One possibility is that those were dead but not-yet-vacuumed rows. What's your vacuuming policy on this table? It gets vacuum analyzed every two hours throughout most of the day. Nightly we vacuum analyze everything, but most of the time we just do this table. (A bitmap-index-scan plan node will count dead rows as returned, unlike all other plan node types, since we haven't actually visited the heap yet...) Thanks again for your help, Tom. Mark -- . . . . . . . . . . . . . . . . . . . . . . . . . . . Mark Stosberg Principal Developer [EMAIL PROTECTED] Summersault, LLC 765-939-9301 ext 202 database driven websites . . . . . http://www.summersault.com/ . . . . . . . . ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ?
http://www.postgresql.org/docs/faq
Re: [PERFORM] cube operations slower than geo_distance() on production server
Merlin Moncure wrote: On 2/10/07, Mark Stosberg [EMAIL PROTECTED] wrote: With the help of some of this list, I was able to successfully set up and benchmark a cube-based replacement for geo_distance() calculations. On a development box, the cube-based variations benchmarked consistently running in about 1/3 of the time of the geo_distance() equivalents. After setting up the same columns and indexes on a production database, it's a different story. All the cube operations show themselves to be about the same as, or noticeably slower than, the same operations done with geo_distance(). I've stared at the EXPLAIN ANALYZE output as much as I can to figure out what's gone wrong. Could you help? Here's the plan on the production server, which seems too slow. Below is the plan I get on the development server, which is much faster. I tried set enable_nestloop = off, which did change the plan, but not the performance. The production DB has much more data in it, but I still expected comparable results relative to using geo_distance() calculations. any objection to posting the query (and maybe tables, keys, indexes, etc)?
Here's the basic query I'm using: SELECT -- 1609.344 is a constant for meters per mile cube_distance( (SELECT earth_coords from zipcodes WHERE zipcode = '90210') , earth_coords)/1609.344 AS RADIUS FROM pets -- shelters_active is a view where shelter_state = 'active' JOIN shelters_active as shelters USING (shelter_id) -- The zipcode fields here are varchars JOIN zipcodes ON ( shelters.postal_code_for_joining = zipcodes.zipcode ) -- search for just 'dogs' WHERE species_id = 1 AND pet_state='available' AND earth_box( (SELECT earth_coords from zipcodes WHERE zipcode = '90210') , 10*1609.344 ) @ earth_coords ORDER BY RADIUS; All the related columns are indexed: pets.species_id pets.shelter_id pets.pet_state shelters.shelter_id (pk) shelters.postal_code_for_joining shelters.active zipcodes.zipcode (pk) zipcodes.earth_coords The pets table has about 300,000 rows, but only about 10% are available. It sees regular updates and is vacuum analyzed every couple of hours now. The rest of the tables get vacuum analyzed nightly. The shelters table is about 99% shelter_state = 'active'. It's updated infrequently. The zipcodes table has about 40,000 rows in it and doesn't change. I tried a partial index on the pets table WHERE pet_state = 'available'. I could see the index was used, but the performance was unaffected. The EXPLAIN ANALYZE output is attached, to try to avoid mail-client wrapping. The query is running 10 times slower today than on Friday, perhaps because of server load, or because we are at the end of a VACUUM cycle. Thanks for any help!
Mark Sort (cost=6887.03..6887.10 rows=27 width=32) (actual time=17925.098..17927.979 rows=1324 loops=1) Sort Key: (cube_distance($0, zipcodes.earth_coords) / 1609.344::double precision) InitPlan - Index Scan using zipcodes_pkey on zipcodes (cost=0.00..3.01 rows=1 width=32) (actual time=0.028..0.031 rows=1 loops=1) Index Cond: ((zipcode)::text = '90210'::text) - Index Scan using zipcodes_pkey on zipcodes (cost=0.00..3.01 rows=1 width=32) (actual time=0.687..0.692 rows=1 loops=1) Index Cond: ((zipcode)::text = '90210'::text) - Nested Loop (cost=568.82..6880.36 rows=27 width=32) (actual time=346.932..17919.697 rows=1324 loops=1) - Nested Loop (cost=2.15..572.14 rows=9 width=36) (actual time=8.321..43.378 rows=136 loops=1) - Bitmap Heap Scan on zipcodes (cost=2.15..150.05 rows=42 width=41) (actual time=3.442..4.402 rows=240 loops=1) Recheck Cond: (cube_enlarge(($1)::cube, 16093.4357308298::double precision, 3) @ earth_coords) - Bitmap Index Scan on zip_earth_coords_idx (cost=0.00..2.15 rows=42 width=0) (actual time=3.426..3.426 rows=240 loops=1) Index Cond: (cube_enlarge(($1)::cube, 16093.4357308298::double precision, 3) @ earth_coord) - Index Scan using shelters_postal_code_for_joining_idx on shelters (cost=0.00..10.02 rows=2 width=12) (actual time=0.093..0.155 rows=1 loops=240) Index Cond: ((shelters.postal_code_for_joining)::text = (outer.zipcode)::text) - Bitmap Heap Scan on pets (cost=566.67..700.47 rows=34 width=4) (actual time=130.363..131.367 rows=10 loops=136) Recheck Cond: ((pets.shelter_id = outer.shelter_id) AND ((pets.pet_state)::text = 'available'::text)) Filter: (species_id = 1) - BitmapAnd (cost=566.67..566.67 rows=34 width=0) (actual time=129.333..129.333 rows=0 loops=136) - Bitmap Index Scan on pets_shelter_id_idx (cost=0.00..3.92 rows=263 width=0) (actual time=0.164..0.164 rows=178 loops=136) Index Cond: (pets.shelter_id = outer.shelter_id) - Bitmap Index Scan on pets_pet_state_idx (cost=0.00..562.50
Re: [PERFORM] cube operations slower than geo_distance() on production server
Merlin-- Thanks so much for your help. Some follow-ups are below. Merlin Moncure wrote: Here's the basic query I'm using: SELECT -- 1609.344 is a constant for meters per mile cube_distance( (SELECT earth_coords from zipcodes WHERE zipcode = '90210') , earth_coords)/1609.344 AS RADIUS FROM pets -- shelters_active is a view where shelter_state = 'active' JOIN shelters_active as shelters USING (shelter_id) -- The zipcode fields here are varchars JOIN zipcodes ON ( shelters.postal_code_for_joining = zipcodes.zipcode ) -- search for just 'dogs' WHERE species_id = 1 AND pet_state='available' AND earth_box( (SELECT earth_coords from zipcodes WHERE zipcode = '90210') , 10*1609.344 ) @ earth_coords ORDER BY RADIUS; your query looks a bit funky. here are the problems I see. * in your field list, you don't need to re-query the zipcode table. cube_distance( (SELECT earth_coords from zipcodes WHERE zipcode = '90210') , earth_coords)/1609.344 AS RADIUS becomes cube_distance(pets.earth_coords, earth_coords ) / 1609.344 AS RADIUS It may not have been clear from the query, but only the 'zipcodes' table has an 'earth_coords' column. Also, I think your refactoring means something different. My query expresses the number of miles this pet is from 90210, while I think the refactor expresses a distance between a pet and another calculated value. also, don't re-refer to the zipcodes table in the join clause. you are already joining to it: AND earth_box( (SELECT earth_coords from zipcodes WHERE zipcode = '90210') , 10*1609.344) @ earth_coords becomes AND earth_box(zipcodes.earth_coords, 10*1609.344) @ pets.earth_coords I have the same question here as above-- I don't see how the new syntax includes the logic of distance from the 90210 zipcode. * also, does pet_state have any other states than 'available' and 'not available'? if not, you should be using a boolean. if so, you can consider a functional index to convert it to a boolean. Yes, it has three states.
* if you always look up pets by species, we can explore composite index columns on species, available (especially using the above functional suggestion), etc. composite partial (imo) We nearly always search by species. Right now it's mostly dogs and some cats. I searched for references to composite index columns, and didn't find much. Could you provide a direct reference to what you have in mind? Any other ideas appreciated! Mark ---(end of broadcast)--- TIP 6: explain analyze is your friend
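To make the composite idea concrete, here is one form it could take for our case -- a partial composite index covering only the rows that searches actually touch (just a sketch; the planner still has to choose it):

```sql
-- Composite + partial: index the column we always filter on, but only
-- for the ~10% of rows that are 'available'.
CREATE INDEX pets_available_species_idx
    ON pets (species_id)
 WHERE pet_state = 'available';

-- The query must repeat the predicate for the index to be considered:
--   ... WHERE pet_state = 'available' AND species_id = 1
```

Because the index only contains available pets, it stays roughly a tenth the size of a full index on (pet_state, species_id), which helps both lookups and maintenance.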
[PERFORM] cube operations slower than geo_distance() on production server
With the help of some of this list, I was able to successfully set up and benchmark a cube-based replacement for geo_distance() calculations. On a development box, the cube-based variations benchmarked consistently running in about 1/3 of the time of the geo_distance() equivalents. After setting up the same columns and indexes on a production database, it's a different story. All the cube operations show themselves to be about the same as, or noticeably slower than, the same operations done with geo_distance(). I've stared at the EXPLAIN ANALYZE output as much as I can to figure out what's gone wrong. Could you help? Here's the plan on the production server, which seems too slow. Below is the plan I get on the development server, which is much faster. I tried set enable_nestloop = off, which did change the plan, but not the performance. The production DB has much more data in it, but I still expected comparable results relative to using geo_distance() calculations. The production db gets a VACUUM ANALYZE every couple of hours now. Thanks!
Mark Sort (cost=6617.03..6617.10 rows=27 width=32) (actual time=2482.915..2487.008 rows=1375 loops=1) Sort Key: (cube_distance($0, zipcodes.earth_coords) / 1609.344::double precision) InitPlan - Index Scan using zipcodes_pkey on zipcodes (cost=0.00..3.01 rows=1 width=32) (actual time=0.034..0.038 rows=1 loops=1) Index Cond: ((zipcode)::text = '90210'::text) - Index Scan using zipcodes_pkey on zipcodes (cost=0.00..3.01 rows=1 width=32) (actual time=0.435..0.438 rows=1 loops=1) Index Cond: ((zipcode)::text = '90210'::text) - Nested Loop (cost=538.82..6610.36 rows=27 width=32) (actual time=44.660..2476.919 rows=1375 loops=1) - Nested Loop (cost=2.15..572.14 rows=9 width=36) (actual time=4.877..39.037 rows=136 loops=1) - Bitmap Heap Scan on zipcodes (cost=2.15..150.05 rows=42 width=41) (actual time=3.749..4.951 rows=240 loops=1) Recheck Cond: (cube_enlarge(($1)::cube, 16093.4357308298::double precision, 3) @ earth_coords) - Bitmap Index Scan on zip_earth_coords_idx (cost=0.00..2.15 rows=42 width=0) (actual time=3.658..3.658 rows=240 loops=1) Index Cond: (cube_enlarge(($1)::cube, 16093.4357308298::double precision, 3) @ earth_coords) - Index Scan using shelters_postal_code_for_joining_idx on shelters (cost=0.00..10.02 rows=2 width=12) (actual time=0.079..0.133 rows=1 loops=240) Index Cond: ((shelters.postal_code_for_joining)::text = (outer.zipcode)::text) - Bitmap Heap Scan on pets (cost=536.67..670.47 rows=34 width=4) (actual time=16.844..17.830 rows=10 loops=136) Recheck Cond: ((pets.shelter_id = outer.shelter_id) AND ((pets.pet_state)::text = 'available'::text)) Filter: (species_id = 1) - BitmapAnd (cost=536.67..536.67 rows=34 width=0) (actual time=16.621..16.621 rows=0 loops=136) - Bitmap Index Scan on pets_shelter_id_idx (cost=0.00..3.92 rows=263 width=0) (actual time=0.184..0.184 rows=132 loops=136) Index Cond: (pets.shelter_id = outer.shelter_id) - Bitmap Index Scan on
pets_pet_state_idx (cost=0.00..532.50 rows=39571 width=0) (actual time=26.922..26.922 rows=40390 loops=82) Index Cond: ((pet_state)::text = 'available'::text) Total runtime: 2492.852 ms ### Faster plan in development: Sort (cost=7004.53..7004.62 rows=39 width=32) (actual time=54.635..55.450 rows=475 loops=1) Sort Key: (cube_distance($0, earth_distance.earth_coords) / 1609.344::double precision) InitPlan - Bitmap Heap Scan on earth_distance (cost=4.74..624.60 rows=212 width=32) (actual time=0.113..0.115 rows=1 loops=1) Recheck Cond: ((zipcode)::text = '90210'::text) - Bitmap Index Scan on earth_distance_zipcode_idx (cost=0.00..4.74 rows=212 width=0) (actual time=0.101..0.101 rows=2 loops=1) Index Cond: ((zipcode)::text = '90210'::text) - Bitmap Heap Scan on earth_distance (cost=4.74..624.60 rows=212 width=32) (actual time=0.205..0.208 rows=1 loops=1) Recheck Cond: ((zipcode)::text = '90210'::text) - Bitmap Index Scan on earth_distance_zipcode_idx (cost=0.00..4.74 rows=212 width=0) (actual time=0.160..0.160 rows=2 loops=1) Index Cond: ((zipcode)::text = '90210'::text) - Hash Join (cost=618.67..5754.30 rows=39 width=32) (actual time=13.499..52.924 rows=475 loops=1) Hash Cond: (outer.shelter_id = inner.shelter_id) - Bitmap Heap Scan on pets (cost=44.85..5158.42 rows=4298 width=4) (actual time=4.278..34.192 rows=3843 loops=1) Recheck Cond: ((pet_state)::text = 'available'::text)
Re: [PERFORM] Can anyone make this code tighter? Too slow, Please help!
[EMAIL PROTECTED] wrote: I have this function in my C#.NET app that goes out to find the business units for each event and returns a string (for my report). I'm finding that for larger reports it takes too long and times out. Does anyone know how I can speed this process up? Is this code very tight or could it be cleaner? thanks in advance for your help, this is a big issue used on several reports. Perhaps try EXPLAIN ANALYZE on this query, given a valid event ID: SELECT Distinct Companies.Name FROM Companies INNER JOIN ExpenseAllocations ON Companies.ID = ExpenseAllocations.BusinessUnitID WHERE (ExpenseAllocations.EventID = @EventID) ORDER BY Companies.Name DESC ### Do the columns used in the join and WHERE clause have indexes? It's also possible the optimization needs to happen at a different level. Perhaps you are frequently looking up the same results in a large report, or throughout the day. If this part doesn't need to be up-to-second fresh, perhaps your application could cache some of the results of this function, instead of repeatedly asking the database to recompute it. Mark ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
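If the EXPLAIN ANALYZE shows sequential scans, indexes on the join and filter columns would be the first thing to check. A sketch against the column names in the query above (index names are hypothetical):

```sql
-- Supports the WHERE clause:
CREATE INDEX expenseallocations_eventid_idx
    ON ExpenseAllocations (EventID);
-- Supports the join back to Companies:
CREATE INDEX expenseallocations_buid_idx
    ON ExpenseAllocations (BusinessUnitID);
```

Companies.ID is presumably already covered by its primary key.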
[PERFORM] tip: faster sorting for proximity queries by using cube_distance()
Hello, I wanted to share something else I learned in my proximity search work. One of my requirements is to order by the distance that matches are found from the center point. When I did this using earth_distance(), the benefit of the earth_box() technique over the old geo_distance() became minimal as I approached a 250mi radius. Switching to sorting by cube_distance() offered a huge benefit, allowing the earth_distance() query to run in about 100ms vs 300ms for the geo_distance() equivalent. I checked the results that cube_distance() produced versus earth_distance(). cube_distance() is always (not surprisingly) a little smaller, but the difference only grows to about a mile for a 250 mile radius. That's an acceptable margin of error for this application, and may be for others as well. Mark ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
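In other words, keep the accurate earth_distance() in the SELECT list but sort by the cheap straight-line cube_distance(). A sketch against my zipcodes table (the 250-mile radius is the one from the discussion above):

```sql
SELECT z.zipcode,
       earth_distance(c.earth_coords, z.earth_coords) / 1609.344 AS miles
  FROM zipcodes z,
       (SELECT earth_coords FROM zipcodes WHERE zipcode = '90210') c
 WHERE earth_box(c.earth_coords, 250 * 1609.344) @ z.earth_coords
 ORDER BY cube_distance(c.earth_coords, z.earth_coords);  -- chord, not arc
```

cube_distance() measures the straight chord through the earth rather than along the surface, which is why it sorts in the same order but reads slightly short.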
Re: [PERFORM] explain analyze output for review (was: optimizing a geo_distance()...)
Bruno Wolff III wrote: Some people here may be able to tell you more if you show us explain analyze output. Here is my explain analyze output. Some brief context of what's going on. The goal is to find Pets Near You. We join the pets table on the shelters table to get a zipcode, and then join the shelters table with the earth_distance table to get the coordinates of the zipcode. (Is there any significant penalty for using a varchar vs an int for a join?) I've been investigating partial indexes for the pets table. It has about 300,000 rows, but only about 10% are active, and those are the ones we care about. Queries are also frequently made on males vs females, dogs vs cats, or specific ages, and those specific cases seem like possible candidates for partial indexes as well. I played with that approach some, but had trouble coming up with anything that benchmarked faster. If I'm reading the explain analyze output correctly, nearly all of the time spent is related to the 'pets' table, but I can't see what to do about it. Help appreciated!
Mark

 Nested Loop  (cost=11.82..29.90 rows=1 width=0) (actual time=37.601..1910.787 rows=628 loops=1)
   ->  Nested Loop  (cost=6.68..20.73 rows=1 width=24) (actual time=35.525..166.547 rows=1727 loops=1)
         ->  Bitmap Heap Scan on pets  (cost=6.68..14.71 rows=1 width=4) (actual time=35.427..125.594 rows=1727 loops=1)
               Recheck Cond: (((sex)::text = 'f'::text) AND (species_id = 1))
               Filter: ((pet_state)::text = 'available'::text)
               ->  BitmapAnd  (cost=6.68..6.68 rows=2 width=0) (actual time=33.398..33.398 rows=0 loops=1)
                     ->  Bitmap Index Scan on pets_sex_idx  (cost=0.00..3.21 rows=347 width=0) (actual time=14.739..14.739 rows=35579 loops=1)
                           Index Cond: ((sex)::text = 'f'::text)
                     ->  Bitmap Index Scan on pet_species_id_idx  (cost=0.00..3.21 rows=347 width=0) (actual time=16.779..16.779 rows=48695 loops=1)
                           Index Cond: (species_id = 1)
         ->  Index Scan using shelters_pkey on shelters  (cost=0.00..6.01 rows=1 width=28) (actual time=0.012..0.014 rows=1 loops=1727)
               Index Cond: (outer.shelter_id = shelters.shelter_id)
   ->  Bitmap Heap Scan on earth_distance  (cost=5.14..9.15 rows=1 width=9) (actual time=0.984..0.984 rows=0 loops=1727)
         Recheck Cond: ((cube_enlarge(('(-2512840.11676572, 4646218.19036629, 3574817.21369166)'::cube)::cube, 160930.130863421::double precision, 3) @ earth_distance.earth_coords) AND ((outer.postal_code_for_joining)::text = (earth_distance.zipcode)::text))
         ->  BitmapAnd  (cost=5.14..5.14 rows=1 width=0) (actual time=0.978..0.978 rows=0 loops=1727)
               ->  Bitmap Index Scan on earth_coords_idx  (cost=0.00..2.15 rows=42 width=0) (actual time=0.951..0.951 rows=1223 loops=1727)
                     Index Cond: (cube_enlarge(('(-2512840.11676572, 4646218.19036629, 3574817.21369166)'::cube)::cube, 160930.130863421::double precision, 3) @ earth_coords)
               ->  Bitmap Index Scan on earth_distance_zipcode_idx  (cost=0.00..2.74 rows=212 width=0) (actual time=0.015..0.015 rows=1 loops=1727)
                     Index Cond: ((outer.postal_code_for_joining)::text = (earth_distance.zipcode)::text)
 Total runtime: 1913.099 ms

---(end of broadcast)---
TIP 6: explain analyze is your friend
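The plan above does not include the SQL that produced it, but the table and column names in the plan nodes suggest its rough shape. The following is only a hypothetical reconstruction (the join keys, filter values, and the selected column `pet_id` are guesses; the `cube_enlarge` expression is what `earth_box()` expands to in this era of contrib/earthdistance). Note the `earth_distance` bitmap heap scan runs once per outer row (`loops=1727`), which is where most of the 1.9 s goes:

```sql
-- Hypothetical query shape inferred from the plan; names beyond those in the
-- plan (e.g. pet_id, the join columns' direction) are assumptions.
SELECT p.pet_id
FROM pets p
JOIN shelters s ON s.shelter_id = p.shelter_id
JOIN earth_distance e ON e.zipcode = s.postal_code_for_joining
WHERE p.sex = 'f'
  AND p.species_id = 1
  AND p.pet_state = 'available'
  AND cube_enlarge('(-2512840.11676572, 4646218.19036629, 3574817.21369166)'::cube,
                   160930.130863421, 3) @ e.earth_coords;
```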
Re: [PERFORM] optimizing a geo_distance() proximity query
Bruno Wolff III wrote:
> On Sat, Feb 03, 2007 at 14:00:26 -0500, Mark Stosberg [EMAIL PROTECTED] wrote:
>> I'm using geo_distance() from contrib/earthdistance and would like to find a way to speed up the geo distance calculation if possible. This is for a proximity search: "Show me adoptable pets within 250 miles of this zipcode."
>
> If you are using the cube based part of the earth distance package, then you can use gist indexes to speed those searches up.

Thanks for the tip. Any idea what kind of improvement I can expect to see, compared to using geo_distance()?

> There are functions for creating boxes that include all of the points some distance from a fixed point. This is lossy, so you need to recheck if you don't want some points a bit farther away returned. Also you would need to pick a point to be where the zip code is located, rather than using area based zip codes.

This is also interesting. Is this approach practical if I want to index what's near each of about 40,000 US zipcodes, or is the approach mostly useful when there are just a small number of fixed points to search from? I'm going to start installing the cube() and earth_distance() functions today and see where I can get with this approach.

Mark
Re: [PERFORM] optimizing a geo_distance() proximity query (example and benchmark)
Merlin Moncure wrote:
> On 2/5/07, Mark Stosberg [EMAIL PROTECTED] wrote:
>> Thanks for the tip. Any idea what kind of improvement I can expect to see, compared to using geo_distance()?
>
> a lot. be aware that gist takes longer to build than btree, but very fast to search. Index search and filter to box is basically an index lookup (fast!). for mostly static datasets that involve a lot of searching, gist is ideal.

The documentation in contrib/ didn't provide examples of how to create the index or actually run a proximity search. Here's what I figured out to do:

I added a new column as type 'cube':

  ALTER table zipcodes add column earth_coords cube;

Next I converted the old lat/lon data I had stored in a 'point' type to the new format:

  -- Make sure to get lat/lon in the right order for your data model!
  UPDATE zipcodes set earth_coords = ll_to_earth( lon_lat[1], lon_lat[0] );

Now I added a GIST index on the field:

  CREATE index earth_coords_idx on zipcodes using gist (earth_coords);

Finally, I was able to run a query, which I could see used the index (by checking with EXPLAIN ANALYZE):

  select * from zipcodes
  where earth_box('(436198.322855334, 4878562.8732218, 4085386.43843934)'::cube, 16093.44) @ earth_coords;

It's also notable that the units used are meters, not miles like geo_distance(). That's what the magic number of 16093.44 is-- 10 miles converted to meters. When I benchmarked this query against the old geo_distance() variation, it was about 200 times faster (~100 ms vs ~0.5 ms).
However, my next step was to try a more real-world query that involved a more complex WHERE clause and a couple of table joins. So far, that result is coming out /slower/ with the new approach, even though the index is being used. I believe this may be because of the additional results found that are outside of the sphere, but inside the cube. This causes additional rows that need processing in the joined tables.

Could someone post an example of how to further refine this so the results more closely match what geo_distance() returns? Any other indexing or optimization tips would be appreciated.

Mark
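The lossiness described above is inherent to earth_box(): the bounding cube's corners reach farther than the requested radius, so points in the corners pass the index test but are outside the sphere. One standard refinement, sketched here under the assumption that earth_coords was populated with ll_to_earth() as shown earlier (the 41.7/-84.9 coordinates are a made-up example point), is to keep the indexable earth_box() test and add an exact recheck with contrib/earthdistance's earth_distance() function, both in meters:

```sql
-- Sketch: fast lossy index probe, then an exact great-circle recheck.
-- 16093.44 m = 10 miles; the lat/lon literals are hypothetical.
SELECT z.zipcode
FROM zipcodes z
WHERE earth_box(ll_to_earth(41.7, -84.9), 16093.44) @ z.earth_coords      -- uses the GIST index
  AND earth_distance(ll_to_earth(41.7, -84.9), z.earth_coords) <= 16093.44;  -- exact filter
```

Applying the exact filter before joining to other tables should also cut down the extra rows flowing into the joins.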
[PERFORM] Using statement_timeout as a performance tool?
Hello,

I'm working on setting up replication with Slony, and will soon have a slave that a lot of SELECT traffic will be sent to (over 500k/day). The primary query we need to run is somewhat complex, but seems to complete on average in well under a second. However, every so often (less than 1 in 10,000 queries) we'll see the query take 2 or 3 minutes. It's not clear why this is happening-- perhaps there is something else going on that is affecting this query.

I'm considering the use of statement_timeout to limit the time of this particular query, to suppress the rare runaway and avoid tying up the processor for that additional time. I think it may be better to give up, quit spending cycles on it right then, and return an "oops, try again in a few minutes" message instead. From the data we have, it seems like the query has a strong chance of working again.

Is anyone else using statement_timeout as part of an overall performance plan?

Mark
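For reference, statement_timeout can be scoped to just the risky query by setting it on the session around that statement (the value is in milliseconds; 5000 here is an arbitrary example threshold, not a recommendation):

```sql
SET statement_timeout = 5000;   -- cap the next statements at 5 seconds

-- run the occasionally-runaway SELECT here; if it exceeds the cap it fails with:
--   ERROR:  canceling statement due to statement timeout
-- which the application can catch and turn into a "try again later" message

SET statement_timeout = 0;      -- 0 disables the timeout (the default)
```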
Re: [PERFORM] Most effective tuning choices for busy website?
Neil Conway wrote:
> Mark Stosberg wrote:
>> I've used PQA to analyze my queries and am happy overall with how they are running. About 55% of the query time is going to variations of the pet searching query, which seems like where it should be going. The query is frequent and complex. It has already been combed over for appropriate indexing.
>
> It might be worth posting the EXPLAIN ANALYZE and relevant schema definitions for this query, in case there is additional room for optimization.
>
>> Our hardware: Dual 3 GHz processors, 3 GB RAM, running on FreeBSD.
>
> Disk? You are presumably using Xeon processors, right? If so, check the list archives for information on the infamous context switching storm that causes performance problems for some people using SMP Xeons.

I wanted to follow up to report a positive outcome to tuning this Xeon SMP machine on FreeBSD. We applied the following techniques, and saw the average CPU usage drop by about 25%:

- In /etc/sysctl.conf, we set it to use raw RAM for shared memory: kern.ipc.shm_use_phys=1
- We updated our kernel config and postgresql.conf to set shared_buffers to about 8000.
- We disabled hyperthreading in the BIOS, which had a label like "Logical Processors?: Disabled".

I recall there was a tweak my co-worker made that's not on my list. I realize it's not particularly scientific because we changed several things at once... but at least it is working well enough for now.

Mark
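Consolidating the settings described above into the two files involved (a sketch only; the shared_buffers value is the rough figure from the post, measured in 8 kB pages, so 8000 is about 64 MB):

```
# /etc/sysctl.conf (FreeBSD): back SysV shared memory with raw RAM
kern.ipc.shm_use_phys=1

# postgresql.conf: approximately the value described above
shared_buffers = 8000        # 8000 x 8kB pages, roughly 64 MB
```

The kernel config change mentioned alongside this would be raising the SHMMAX/SHMALL limits so the larger shared_buffers segment can be allocated.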
[PERFORM] Most effective tuning choices for busy website?
Hello,

I'm the fellow who was interviewed in the fall about using PostgreSQL on 1-800-Save-A-Pet.com: http://techdocs.postgresql.org/techdocs/interview-stosberg.php

The site traffic continues to grow, and we are now seeing parts of the day where the CPU load (according to MRTG graphs) on the database server is stuck at 100%. I would like to improve this, and I'm not sure where to look first. The machine is a dedicated PostgreSQL server which two web server boxes talk to.

I've used PQA to analyze my queries and am happy overall with how they are running. About 55% of the query time is going to variations of the pet searching query, which seems like where it should be going. The query is frequent and complex. It has already been combed over for appropriate indexing.

I'm more interested at this point in tuning the software and hardware infrastructure, but would like to get a sense about which choices will bring the greatest reward. Let me explain some avenues I'm considering:

- We are currently running 7.4. If I upgrade to 8.0 and DBD::Pg 1.42, then the server-side prepare feature will be available for use. We do run the same queries a number of times.
- PhpPgAds seems to be sucking up about 7.5% of our query time and is unrelated to the core application. We could move this work to another machine. The queries it generates seem like they have some room to be optimized, or simply don't need to be run in some cases. However, I would like to stay out of modifying third-party code and PHP if possible.
- I saw the hardware tip to "Separate the Transaction Log from the Database." We have about 60% SELECT statements and 14% UPDATE statements. Focusing more on SELECT performance seems more important for us.
- We have tried to tune 'shared_buffers' some, but haven't seen a noticeable performance improvement.

Our hardware: Dual 3 GHz processors, 3 GB RAM, running on FreeBSD.
I'm not quite sure how to check our average connection usage, but maybe this is helpful: when I do

  select count(*) from pg_stat_activity;

I get values around 170. We have these values:

  max_connections = 400
  shared_buffers = 4096

Most other values in postgresql.conf are still at their defaults.

Any suggestions as to which avenues might offer the most bang for the buck are appreciated! (I have already found http://www.powerpostgresql.com/PerfList/ and it has been a very helpful source of suggestions.)

Mark
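A raw count from pg_stat_activity includes idle backends. One way to separate the two, sketched for the 7.4/8.0-era view (where idle backends report the literal string '&lt;IDLE&gt;' as their current_query, and current_query is only populated when stats_command_string is on):

```sql
-- Total backends vs. backends actually executing a statement.
-- Assumes stats_command_string = on; otherwise current_query is empty.
SELECT count(*) AS total_connections,
       sum(CASE WHEN current_query <> '<IDLE>' THEN 1 ELSE 0 END) AS active_connections
FROM pg_stat_activity;
```

Sampling this periodically (e.g. from cron) would give a picture of average versus peak usage, which helps decide whether max_connections = 400 is oversized.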