Re: [PERFORM] application of KNN code to US zipcode searches?
Hello,

I want to report that I have now solved the challenges I ran into using KNN for US zipcode searching. I've found the new approach to be not only viable, but to benchmark about 3x faster for our own real-world application than the previous approach we used, involving cube_distance() and earth_box(). Here are some details about my research so far.

To evaluate it, I installed PostgreSQL 9.1 and a current PostGIS 2.0 snapshot (not yet released as stable).

A primary challenge I had to solve was that KNN is designed for a slightly different problem than the one I needed to solve. I need to answer the question: "What are all the objects in zipcodes within 50 miles of a given zipcode?" However, KNN only directly provides a performance boost to this variation: "What are the N nearest objects to this point?" Just adding a WHERE clause to check the 50-mile rule would erase the benefits of KNN, which works through an ORDER BY clause.

I solved my issue with a WITH clause that creates a pseudo-table called nearby_zipcodes. In this example, I select all the zipcodes that are within 50 miles of the 47374 zipcode. The trick I've employed is that I've set the LIMIT value to 286 -- exactly the number of zipcodes within 50 miles of 47374. My plan is to add a column to my zipcodes table for each of the small number of distances I need to search. Then, when I load new zipcodes I can pre-compute how many zipcodes would be found at each distance.

This approach would not have worked without a WITH clause, or some equivalent, because the number of objects within the radius is not known, but the number of nearby zipcodes is fixed.
This approach lets me get the performance benefits of KNN while still returning exactly those objects within 50 miles of my target zipcode, by JOINing on the nearby_zipcodes table:

  WITH nearby_zipcodes AS (
      SELECT zipcode,
             st_distance_sphere(lonlat_point,
                 (SELECT lonlat_point FROM zipcodes WHERE zipcode = '47374')
             ) / 1609.344 AS radius
        FROM zipcodes
    ORDER BY lonlat_point <->
             (SELECT lonlat_point FROM zipcodes WHERE zipcode = '47374')
       LIMIT 286
  )
  SELECT ...

You might also notice that st_distance_sphere() doesn't mean exactly the same thing as the <-> operator. That's something I could refine going forward.

That's what I've got so far. How could I improve this further?

For reference, here are the key parts of the zipcodes table:

  # \d zipcodes
          Table "public.zipcodes"
       Column    |         Type         | Modifiers
  --------------+----------------------+-----------
   zipcode      | character varying(5) | not null
   lonlat_point | geometry(Point,4326) |
  Indexes:
      "zipcodes_pkey" PRIMARY KEY, btree (zipcode)
      "lonlat_point_idx" gist (lonlat_point)

Thanks for the peer review!

    Mark Stosberg

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
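The gap between st_distance_sphere() and the degree-based point operator can be quantified outside the database. This is a minimal sketch in plain Python (not part of the thread; the coordinates near zipcode 47374 are invented for illustration) showing why a raw-degree distance drifts from true miles as you move east-west:

```python
import math

EARTH_RADIUS_MI = 3958.8  # mean Earth radius, miles

def haversine_miles(lon1, lat1, lon2, lat2):
    """Great-circle distance, the quantity st_distance_sphere() reports."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2)
         * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MI * math.asin(math.sqrt(a))

def point_op_degrees(lon1, lat1, lon2, lat2):
    """What a flat point-distance operator measures: Euclidean distance
    in raw degrees, with no correction for latitude."""
    return math.hypot(lon2 - lon1, lat2 - lat1)

# Hypothetical coordinates near zipcode 47374 (Richmond, IN):
origin = (-84.890, 39.829)
north  = (-84.890, 40.829)  # one degree due north
east   = (-83.890, 39.829)  # one degree due east

# A degree of latitude is ~69 miles everywhere, but a degree of
# longitude shrinks with cos(latitude) -- only ~53 miles here.
print(round(haversine_miles(*origin, *north)))  # 69
print(round(haversine_miles(*origin, *east)))   # 53
# The degree-based operator cannot tell the two apart:
print(point_op_degrees(*origin, *north) == point_op_degrees(*origin, *east))  # True
```

Within a 50-mile circle the distortion is modest at mid-latitudes, which is why the refinement can be deferred, but it is worth measuring before trusting the cutoff.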
[PERFORM] application of KNN code to US zipcode searches?
We perform over 1,000,000 searches each day for "adoptable shelter pets near your zipcode". We already have adequate performance for these searches using the cube contrib, but the new KNN work in 9.1 seemed like it might be a promising way to speed this up even further. I installed PostgreSQL 9.1 on my laptop to evaluate it, using this post as a reference: http://www.depesz.com/index.php/2010/12/11/waiting-for-9-1-knngist/

The first task was to translate a geo-spatial search to use the new KNN syntax. I'm most familiar with two approaches to geo-spatial searching with PostgreSQL. The first is the older earthdistance approach, using point types and the <@> operator. The second is the one we are using now, which uses a cube type, the cube_distance() and earth_box() methods, and a GiST index on the cube type.

Immediately there is a hurdle in that KNN only appears to work with point types and the <-> operator, which does simple point-to-point distance instead of distance-around-the-earth. Still, I thought that could be enough of an approximation to test the waters.

I started with some real-world queries that involved some table joins, and when those failed to show improvement, I worked with some reduced-test-case queries. While I could confirm the new GiST index was being used on the point type, I couldn't get a query to benchmark better when it was invoked. I'm wondering if perhaps US zipcode searches aren't a good use of this technology, perhaps because the data set is too small (about 40,000 zipcodes).

Given that we can already do GiST-indexed searches with the cube type that provide reasonably good approximations for zipcode-radius searches, are others planning to eventually apply the KNN work to US zipcode searches? Sample EXPLAIN output and query times are below.
    Mark

  EXPLAIN ANALYZE
  SELECT zipcode, lon_lat <-> '(-118.412426,34.096629)' AS radius
    FROM zipcodes;

   Seq Scan on zipcodes  (cost=0.00..1257.54 rows=41483 width=22) (actual time=0.019..84.543 rows=41483 loops=1)
   Total runtime: 148.129 ms

  EXPLAIN ANALYZE
  SELECT zipcode, lon_lat <-> '(-118.412426,34.096629)' AS radius
    FROM zipcodes
   ORDER BY lon_lat <-> '(-118.412426,34.096629)';

   Index Scan using zipcodes_knn on zipcodes  (cost=0.00..5365.93 rows=41483 width=22) (actual time=0.451..141.590 rows=41483 loops=1)
     Order By: (lon_lat <-> '(-118.412426,34.096629)'::point)
   Total runtime: 206.392 ms
Re: [PERFORM] application of KNN code to US zipcode searches?
  I thought the benefit of KNN was that you could retrieve the rows in distance order, so that a query for the closest 20 locations (for example) would be very fast. I wouldn't have expected it to be helpful when you're selecting all the rows regardless of distance.

Kevin,

Thanks for the feedback. You are right that my reduced test case wasn't a good approximation. I added a limit, to simulate finding the 100 zipcodes closest to 90210. Below I compare 4 approaches to the same query:

1. Cube search
2. Earthdistance search
3. Simple point distance (no index)
4. Simple point distance (KNN)

Now KNN benchmarks to be almost 100x faster! That's very promising. Then there's only the issue that simple point distance is not expected to be a good enough approximation of earth distances. Perhaps that can be solved by pre-computing coordinates based on the lat/long pairs, much like the map projections used to present a curved surface on a flat map? Given that it's OK to be a few miles off, it seems we have some leeway here. Recommendations?
    Mark

  -- 1. Cube search
  EXPLAIN ANALYZE
  SELECT zipcode,
         cube_distance('(-2513120.64361786, -4645511.0460328, 3575538.9507084)',
                       zipcodes.earth_coords) / 1609.344 AS radius
    FROM zipcodes ORDER BY radius LIMIT 100;

   Limit  (cost=2946.70..2946.95 rows=100 width=62) (actual time=167.650..168.064 rows=100 loops=1)
     ->  Sort  (cost=2946.70..3050.40 rows=41483 width=62) (actual time=167.644..167.829 rows=100 loops=1)
           Sort Key: ((cube_distance('(-2513120.64361786, -4645511.0460328, 3575538.9507084)'::cube, earth_coords) / 1609.344::double precision))
           Sort Method:  top-N heapsort  Memory: 20kB
           ->  Seq Scan on zipcodes  (cost=0.00..1361.24 rows=41483 width=62) (actual time=0.030..90.807 rows=41483 loops=1)
   Total runtime: 168.300 ms

  -- 2. Earthdistance search
  EXPLAIN ANALYZE
  SELECT zipcode, lon_lat <@> '(-118.412426,34.096629)' AS radius
    FROM zipcodes
   ORDER BY lon_lat <@> '(-118.412426,34.096629)' LIMIT 100;

   Limit  (cost=2842.99..2843.24 rows=100 width=22) (actual time=187.995..188.451 rows=100 loops=1)
     ->  Sort  (cost=2842.99..2946.70 rows=41483 width=22) (actual time=187.989..188.149 rows=100 loops=1)
           Sort Key: ((lon_lat <@> '(-118.412426,34.096629)'::point))
           Sort Method:  top-N heapsort  Memory: 20kB
           ->  Seq Scan on zipcodes  (cost=0.00..1257.54 rows=41483 width=22) (actual time=0.033..108.203 rows=41483 loops=1)
   Total runtime: 188.660 ms

  -- 3. Simple point distance, no GiST index
  EXPLAIN ANALYZE
  SELECT zipcode, lon_lat <-> '(-118.412426,34.096629)' AS radius
    FROM zipcodes
   ORDER BY lon_lat <-> '(-118.412426,34.096629)' LIMIT 100;

   Limit  (cost=2842.99..2843.24 rows=100 width=22) (actual time=160.574..161.057 rows=100 loops=1)
     ->  Sort  (cost=2842.99..2946.70 rows=41483 width=22) (actual time=160.568..160.691 rows=100 loops=1)
           Sort Key: ((lon_lat <-> '(-118.412426,34.096629)'::point))
           Sort Method:  top-N heapsort  Memory: 20kB
           ->  Seq Scan on zipcodes  (cost=0.00..1257.54 rows=41483 width=22) (actual time=0.027..84.610 rows=41483 loops=1)
   Total runtime: 161.226 ms

  -- 4. Simple point distance, KNN-GiST index
  EXPLAIN ANALYZE
  SELECT zipcode, lon_lat <-> '(-118.412426,34.096629)' AS radius
    FROM zipcodes
   ORDER BY lon_lat <-> '(-118.412426,34.096629)' LIMIT 100;

   Limit  (cost=0.00..12.94 rows=100 width=22) (actual time=0.447..1.892 rows=100 loops=1)
     ->  Index Scan using zipcodes_knn on zipcodes  (cost=0.00..5365.93 rows=41483 width=22) (actual time=0.440..1.407 rows=100 loops=1)
           Order By: (lon_lat <-> '(-118.412426,34.096629)'::point)
   Total runtime: 2.198 ms
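One cheap version of the "pre-computing coordinates" idea is to scale longitudes by cos(latitude) once at load time, so that plain Euclidean distance -- the only thing the KNN operator computes -- sorts candidates in nearly the right order. This is a hedged sketch in plain Python, not the method used in the thread; the comparison points near the 90210 coordinates are invented:

```python
import math

def haversine_mi(lon1, lat1, lon2, lat2):
    """Reference earth distance in miles."""
    R = 3958.8
    p1, p2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin((p2 - p1) / 2) ** 2
         + math.cos(p1) * math.cos(p2)
         * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * R * math.asin(math.sqrt(a))

def project(lon, lat, ref_lat):
    """Equirectangular projection: shrink longitude by cos(ref_lat).
    Done once at load time, this leaves plain Euclidean math for KNN."""
    return (lon * math.cos(math.radians(ref_lat)), lat)

def flat(p, q):
    return math.hypot(q[0] - p[0], q[1] - p[1])

center = (-118.412426, 34.096629)   # the 90210 point used above
a = (-117.30, 34.10)                # ~1.11 degrees east of center
b = (-118.41, 35.10)                # ~1.00 degrees north of center

# In true miles, a is the closer point (~64 mi vs ~69 mi)...
print(round(haversine_mi(*center, *a)), round(haversine_mi(*center, *b)))
# ...but raw degree distance gets the order backwards:
print(flat(center, a) > flat(center, b))   # True
# After the one-time projection the ordering is right again:
pc, pa, pb = (project(*p, center[1]) for p in (center, a, b))
print(flat(pc, pa) < flat(pc, pb))         # True
```

The projection uses a single reference latitude, so it degrades for a nationwide table; a proper projection per region (or a per-row pre-projected column) would tighten it up.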
Re: [PERFORM] application of KNN code to US zipcode searches?
  PostGIS, geometry columns, and UTM... I'm not sure where they are wrt adding KNN support, but it's something they've been anxious to have for a while, so I expect support will come quickly.

I've looked into this a little more. One approach seems to be to project the lat/long pairs onto a flat plane using the Albers projection (which would be a one-time calculation), and then the current KNN point/distance calculations could be used.

Here's a Perl module that references the Albers projection (although it's not yet clear to me how to use it): http://search.cpan.org/dist/PDL/

And a Wikipedia page on various calculation possibilities: http://en.wikipedia.org/wiki/Geographical_distance#Flat-surface_formulae

Further suggestions welcome.

Thanks,

    Mark
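For the record, the one-time Albers calculation is only a few lines of trigonometry. This is a sketch of the spherical Albers equal-area conic (the standard parallels are the ones conventionally used for the contiguous US -- an assumption, not something from the thread), checking how closely flat distance on the projected plane tracks true earth distance:

```python
import math

R_MI = 3958.8  # spherical earth radius in miles

def albers_xy(lon, lat, lon0=-96.0, lat0=23.0, sp1=29.5, sp2=45.5):
    """Spherical Albers equal-area conic projection. Returns flat (x, y)
    in miles, suitable for populating a pre-projected point column."""
    p0, p1, p2 = (math.radians(v) for v in (lat0, sp1, sp2))
    phi = math.radians(lat)
    lam = math.radians(lon - lon0)
    n = (math.sin(p1) + math.sin(p2)) / 2
    C = math.cos(p1) ** 2 + 2 * n * math.sin(p1)
    rho = R_MI * math.sqrt(C - 2 * n * math.sin(phi)) / n
    rho0 = R_MI * math.sqrt(C - 2 * n * math.sin(p0)) / n
    theta = n * lam
    return rho * math.sin(theta), rho0 - rho * math.cos(theta)

def haversine_mi(lon1, lat1, lon2, lat2):
    p1, p2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin((p2 - p1) / 2) ** 2
         + math.cos(p1) * math.cos(p2)
         * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * R_MI * math.asin(math.sqrt(a))

# Two hypothetical points about 53 miles apart in Indiana:
a, b = (-84.89, 39.83), (-83.89, 39.83)
ax, ay = albers_xy(*a)
bx, by = albers_xy(*b)
flat_mi = math.hypot(bx - ax, by - ay)
true_mi = haversine_mi(*a, *b)
print(flat_mi / true_mi)  # close to 1.0 at this latitude
```

Albers is equal-area rather than distance-preserving, so the ratio drifts by a percent or two between the standard parallels -- well within the "a few miles off" tolerance mentioned earlier in the thread.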
Re: [PERFORM] application of KNN code to US zipcode searches?
I tried again to use KNN for a real-world query, and this time I was able to get it to deliver approximately a 6x speed-up vs the cube search or earthdistance methods (from 300 ms down to 50 ms). I had to make some notable changes for the KNN index to be considered.

- Of course, I had to switch to using the basic point/distance calculation. As previously noted, this still needs more work to confirm the accuracy and to get the distance reported in miles.

- The query planner didn't like it when the ORDER BY referred to a column value instead of a static value, even though I believe it should know that the column value never changes. See this pseudo-query where we look up the coordinates for 90210 once:

    EXPLAIN ANALYZE
    SELECT pets.pet_id,
           zipcodes.lon_lat <-> center.lon_lat AS radius
      FROM (SELECT lon_lat FROM zipcodes WHERE zipcode = '90210') AS center,
           pets
      JOIN shelters USING (shelter_id)
      JOIN zipcodes USING (zipcode)
     ORDER BY zipcodes.lon_lat <-> center.lon_lat
     LIMIT 1000;

  This didn't use the KNN index until I changed the "center.lon_lat" in the ORDER BY to an explicit point value. I'm not sure if that's expected, or something I should take up with -hackers. This could be worked around by doing an initial query to look up this value and then feeding a static value into this query. That's not ideal, but the combination would still be faster.

- I had to drop the part of the WHERE clause which restricted the results to shelters within 50 miles of the target zipcode. However, I could set the LIMIT so high that I would get back enough pets, and then the application could trim out the results. Or perhaps I could push this query down into a sub-select, and let PostgreSQL do a second pass to throw out some of the results.

In any case, with a real-world speed-up of 6x, this looks like it will be worth it to us to continue to investigate.
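For the "push this query down into a sub-select" idea, one possible shape is sketched below. This is untested and hedged: the table names follow the pseudo-query, the point literal stands in for the looked-up 90210 coordinates, and the outer 50-mile cutoff leans on earthdistance's <@> operator (miles between points), which is independent of the KNN <-> ordering:

```sql
-- Untested sketch: KNN orders a generous candidate set cheaply,
-- and an outer query re-applies the 50-mile rule.
SELECT *
  FROM (
        SELECT pets.pet_id,
               zipcodes.lon_lat
          FROM pets
          JOIN shelters USING (shelter_id)
          JOIN zipcodes USING (zipcode)
      ORDER BY zipcodes.lon_lat <-> '(-118.412426,34.096629)'::point
         LIMIT 1000  -- set high enough to cover any 50-mile result set
       ) AS candidates
 WHERE (lon_lat <@> '(-118.412426,34.096629)'::point) < 50;
```

The LIMIT in the inner query is the weak point: it must be generous enough that trimming in the outer query never starves the result set.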
Re: [PERFORM] application of KNN code to US zipcode searches?
On 02/17/2011 03:17 PM, Oleg Bartunov wrote:

  Mark, we are investigating pgsphere (http://pgsphere.projects.postgresql.org/) to see if we could add KNN support.

Great, thanks Oleg. I'll be happy to test it when something is ready.

    Mark
[PERFORM] Re: getting the most of out multi-core systems for repeated complex SELECT statements
On 02/03/2011 10:57 AM, gnuo...@rcn.com wrote:

  For PG to prosper in the future, it has to embrace the multi-core/processor/SSD machine at the query level.

As the person who brought up the original concern, I'll add that multi-core at the query level really isn't important for us. Most of our PostgreSQL usage is through a web application which fairly automatically takes advantage of multiple cores, because there are several parallel connections.

A smaller but important piece of what we do is a cron script that needs to run hundreds of thousands of variations of the same complex SELECT as fast as it can. What honestly would have helped most is not technical at all -- it would have been some documentation on how to take advantage of multiple cores for this case. It looks like it's going to be trivial: divide up the data with a modulo, and run multiple parallel cron scripts that each process a slice of the data. A benchmark showed that this approach sped up our processing 3x when splitting the application 4 ways across 4 processors. (I think we failed to achieve a 4x improvement because the server was already busy handling some other tasks.)

Part of our case is likely fairly common *today*: many servers are multi-core now, but people don't necessarily understand how to take advantage of that if it doesn't happen automatically.

    Mark
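The modulo split described above can be sketched in a few lines. This is an illustration, not the actual script -- it assumes each saved search has an integer primary key; worker k of N handles only rows where id % N == k, and each worker would run as its own cron job with its own database connection (the SQL equivalent being WHERE id % 4 = 0, ... = 1, and so on):

```python
# Partition saved-search IDs across parallel workers with a modulo.
NUM_WORKERS = 4

def ids_for_worker(slot, all_ids, workers=NUM_WORKERS):
    """The slice of IDs owned by one worker process."""
    return [i for i in all_ids if i % workers == slot]

saved_search_ids = list(range(100_000))
slices = [ids_for_worker(k, saved_search_ids) for k in range(NUM_WORKERS)]

# Each search lands in exactly one slice: nothing skipped, nothing run twice.
print([len(s) for s in slices])  # [25000, 25000, 25000, 25000]
```

The appeal of the modulo over an odd/even split is that it generalizes to any worker count without re-thinking the partitioning.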
[PERFORM] getting the most of out multi-core systems for repeated complex SELECT statements
Each night we run over 100,000 saved searches against PostgreSQL 9.0.x. These are all complex SELECTs using cube functions to perform a geo-spatial search to help people find adoptable pets at shelters.

All of our machines in development and production have at least 2 cores in them, and I'm wondering about the best way to maximally engage all the processors. Right now we simply run the searches serially. I realize PostgreSQL may be taking some advantage of the multiple cores in this arrangement, but I'm seeking advice about the possibility and methods for running the searches in parallel.

One naive approach I considered was to use parallel cron scripts: one would run the "odd" searches and the other would run the "even" searches. This would be easy to implement, but perhaps there is a better way. To those who have covered this area already: what's the best way to put multiple cores to use when running repeated SELECTs with PostgreSQL?

Thanks!

    Mark
Re: [PERFORM] getting the most of out multi-core systems for repeated complex SELECT statements
On 02/03/2011 10:54 AM, Oleg Bartunov wrote:

  Mark, you could try the gevel module to get the structure of a GiST index and check whether items are distributed more or less homogeneously (see the different levels). You can visualize the index like http://www.sai.msu.su/~megera/wiki/Rtree_Index Also, if your searches are neighbourhood searches, then you could try KNN, available in the 9.1 development version.

Oleg,

Those are interesting details to consider. I read more about KNN here: http://www.depesz.com/index.php/2010/12/11/waiting-for-9-1-knngist/

Will I be able to use it to improve the performance of finding nearby zipcodes? It sounds like KNN has great potential for performance improvements!

    Mark
[PERFORM] debugging handle exhaustion and 15 min/ 5mil row delete
Hello,

We've been a satisfied user of PostgreSQL for several years, and use it to power a national pet adoption website: http://www.adoptapet.com/

Recently we've had a regularly-timed middle-of-the-night problem where database handles are exhausted for a very brief period. In tracking it down, I have found that the event seems to correspond to a time when a cron script is deleting from a large logging table, but I'm not certain if this is the cause or a correlation.

We are deleting about 5 million rows from a time-based logging table that is replicated by Slony. We are currently using a single DELETE statement, which takes about 15 minutes to run. There is no RI on the table, but the use of Slony means that a trigger call and action is made for every row deleted, which causes a corresponding insertion in another table so the deletion can be replicated to the slave.

My questions:

- Could this kind of activity lead to an upward spiral in database handle usage?

- Would it be advisable to use several small DELETE statements instead, to delete rows in batches of 1,000? We could use the recipe for this that was posted earlier to this list:

    delete from table
    where pk in (select pk from table where delete_condition limit X);

Partitions seem attractive here, but aren't easy to use with Slony. Perhaps once we migrate to PostgreSQL 9.0 and the hot standby feature we can consider that.

Thanks for your help!

    Mark Stosberg            Principal Developer
    m...@summersault.com     Summersault, LLC     765-939-9301 ext 202
    database driven websites http://www.summersault.com/
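The batched-delete recipe quoted above is easy to demonstrate end-to-end. In this sketch SQLite stands in for PostgreSQL so the example is self-contained; the table name and delete condition are placeholders, not the real logging schema:

```python
import sqlite3

# Build a throwaway logging table with 5,000 rows, 4,000 of which match
# the (placeholder) delete condition.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE activity_log (id INTEGER PRIMARY KEY, level INTEGER)")
conn.executemany("INSERT INTO activity_log VALUES (?, ?)",
                 [(i, i % 10) for i in range(5000)])

BATCH = 1000
batches = 0
while True:
    # delete from table where pk in
    #   (select pk from table where delete_condition limit X)
    cur = conn.execute(
        "DELETE FROM activity_log WHERE id IN "
        "  (SELECT id FROM activity_log WHERE level < 8 LIMIT ?)", (BATCH,))
    conn.commit()  # each batch commits separately, keeping transactions short
    batches += 1
    if cur.rowcount < BATCH:
        break

print(batches)  # 5 -- four full batches of 1000, then a final short pass
print(conn.execute("SELECT count(*) FROM activity_log").fetchone()[0])  # 1000
```

With Slony in the picture, the short transactions also mean the replication trigger's shadow-table inserts are flushed in small pieces instead of one 5-million-row burst.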
Re: [PERFORM] debugging handle exhaustion and 15 min/ 5mil row delete
  You can use TRUNCATE instead of DELETE. TRUNCATE is more efficient and faster than DELETE.

Thanks for the suggestion. However, TRUNCATE is not compatible with Slony, and we also have some rows which remain in the table.

  Now, we need more information about your system to give you a certain solution:

  Are you using a RAID controller for your data?

Yes.

  Have you separated the xlog directory from the data directory?

No.

  Which is your operating system?

FreeBSD.

  Which is your architecture?

i386.

Thanks for the feedback. I'm going to try batching the deletes for now, an approach that has worked well for some of our other long-running deletes.

    Mark
[PERFORM] Fw: Help me put 2 Gigs of RAM to use
Hello,

PostgreSQL has served us very well powering a busy national pet adoption website. Now I'd like to tune our setup further to get more out of the hardware. What I'm noticing is that while the FreeBSD server has 4 Gigs of memory, there are rarely ever more than 2 in use -- the memory use graphs as being rather constant. My goal is to make good use of those 2 Gigs of memory to improve performance and reduce CPU usage. The server has 4 2.33 GHz processors in it, and RAIDed 15k RPM SCSI disks.

Here are some current memory-related settings from our postgresql.conf file. (We currently run 8.2, but are planning an upgrade to 8.4 soon.) Do you see any obvious suggestions for improvement? I find the file a bit hard to read because of the lack of units in the examples, but perhaps that's already been addressed in later versions.

  max_connections      = 400     # Seems to be enough for us
  shared_buffers       = 8192
  effective_cache_size = 1000
  work_mem             = 4096
  maintenance_work_mem = 160MB

Thanks for your suggestions!

    Mark

[I tried to post this yesterday but didn't see it come through. This message is a second attempt.]

--
    Mark Stosberg            Principal Developer
    m...@summersault.com     Summersault, LLC     765-939-9301 ext 202
    database driven websites http://www.summersault.com/
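It's worth working the unitless settings out explicitly. In 8.2, shared_buffers and effective_cache_size are counted in pages and work_mem in KB when no unit is given; assuming the default 8 KB block size (an assumption -- it is a compile-time option), the values above are smaller than they may look:

```python
PAGE_KB = 8  # default PostgreSQL block size, in KB (compile-time option)

shared_buffers_mb = 8192 * PAGE_KB / 1024        # 8192 pages
effective_cache_mb = 1000 * PAGE_KB / 1024       # 1000 pages
work_mem_mb = 4096 / 1024                        # work_mem is in KB

print(shared_buffers_mb)   # 64.0   -> 64 MB of a 4 GB machine
print(effective_cache_mb)  # 7.8125 -> planner told to expect ~8 MB of OS cache
print(work_mem_mb)         # 4.0
```

The effective_cache_size figure stands out: it only informs the planner, but telling it the OS cache is ~8 MB on a box with gigabytes of inactive memory makes index scans look more expensive than they really are.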
Re: [PERFORM] Help me put 2 Gigs of RAM to use
Thanks for the response, Matthew.

On Thu, 10 Dec 2009, Mark Stosberg wrote:

    What I'm noticing is that while the FreeBSD server has 4 Gigs of memory, there are rarely ever more than 2 in use -- the memory use graphs as being rather constant. My goal is to make good use of those 2 Gigs of memory to improve performance and reduce the CPU usage.

  I think you'll find that the RAM is already being used quite effectively as disc cache by the OS. It sounds like the server is actually set up pretty well. You may get slightly better performance by tweaking a thing here or there, but the server needs some OS disc cache to perform well.

In reviewing this, it appears that the OS is only addressing 3 of the 4 Gigs of memory. We'll work on our FreeBSD setup to cure that. Here's how top reports the memory breakdown:

  Mem: 513M Active, 2246M Inact, 249M Wired, 163M Cache, 112M Buf, 7176K Free
  Swap: 9216M Total, 1052K Used, 9215M Free

So perhaps the OS disc cache is represented in the "Inact" memory statistic? I suppose once we have the 4th Gig of memory actually available, that would all be going to the disk cache.

    (We currently run 8.2, but are planning an upgrade to 8.4 soon.)

  Highly recommended.

For performance improvements in particular?

    Mark Stosberg            Principal Developer
    m...@summersault.com     Summersault, LLC     765-939-9301 ext 202
    database driven websites http://www.summersault.com/
Re: [PERFORM] recommendations for web/db connection pooling or DBD::Gofer reviews
  Under heavy load, Apache has the usual failure mode of spawning so many threads/processes and database connections that it just exhausts all the memory on the webserver and also kills the database. As usual, I would use lighttpd as a frontend (also serving static files) to handle the large number of concurrent connections to clients, and then have it funnel this to a reasonable number of perl backends, something like 10-30. I don't know if fastcgi works with perl, but with PHP it certainly works very well. If you can't use fastcgi, use lighttpd as a HTTP proxy and apache with mod_perl behind. The recipe for good handling of heavy load is an asynchronous server (which by design can handle any number of concurrent connections, up to the OS' limit) in front of a small number of dynamic webpage generating threads/processes.

Thanks for the response. To be clear, it sounds like you are advocating a different approach to scaling the number of connections: limiting the number of web server processes. So the front-end proxy would have a high number of max connections, say 200, and it would connect to another httpd/mod_perl server behind it with a lower number of max connections, say 20. If the backend httpd server was busy, the proxy connection to it would just wait in a queue until it was available.

Is that the kind of design you had in mind? That seems like a reasonable option as well. We already have some lightweight Apache servers in use on the project which currently just serve static content.

    Mark
[PERFORM] Re: what worked: performance improvements for geo-spatial searching on FreeBSD
Oleg Bartunov wrote:

  Mark, do you know about our sky segmentation code Q3C? See details: http://www.sai.msu.su/~megera/wiki/SkyPixelization We use it for billions of objects in the database and are quite happy.

Oleg,

Thanks for the response. That sounds interesting, but it's not clear to me how I would put together a geo-spatial search calculating distances around the curvature of the earth using this technique. Is there a SQL sample for this that you could point to?

Also, I didn't recognize the names of the techniques you benchmarked against, RADEC and Rtree -- are either of these related to the earthdistance or cube() based searches I would have used already?

    Mark
Re: [PERFORM] bulk insert performance problem
Christian Bourque wrote:

  Any idea? Are there any other improvements I could do?

Are you using the COPY syntax in the import script, or individual INSERT statements? Using COPY will always be *much* faster. I believe COPY always appends to tables rather than replacing the contents; you can combine this technique with splitting the task into multiple COPY statements, but that has never been necessary in my case -- switching from INSERTs to a COPY statement has always provided the huge performance improvement I needed.

It's easy to confuse pg_dump -d with psql -d ... it's too bad they mean very different things. For pg_dump, -d causes INSERT statements to be generated instead of a COPY statement. It has been a mistake I've made in the past, because I expected it to work like psql -d, where -d means "database name". I suppose the safe thing to do is to avoid using -d altogether!

    Mark
[PERFORM] what worked: performance improvements for geo-spatial searching on FreeBSD
The http://www.1-800-save-a-pet.com site is hosted with FreeBSD and PostgreSQL and has a geo-spatial search as a central feature.

One thing that made a substantial performance improvement was switching from the geo_distance() search in the earthdistance contrib to the cube-based geo-spatial calculations, also available in contrib/. In our case, the slight loss of precision between the two methods didn't matter.

The other thing that made a noticeable performance improvement was upgrading our servers from FreeBSD 4.x or 5.x (old, I know!) to FreeBSD 6.2 or later. We also upgraded these systems from PostgreSQL 8.1 to 8.2 at the same time. I assume the upgrade to 8.2 must be responsible at least in part for the performance gains. The result of these two rounds of updates is that our overall CPU capacity in the cluster seems to be double or triple what it was before.

As the site grows we continue to be very happy with the performance, features and stability of PostgreSQL.

    Mark
[PERFORM] recommendations for web/db connection pooling or DBD::Gofer reviews
When traffic to our PostgreSQL-backed website spikes, the first resource we see being exhausted is the DB slots on the master server (currently set to about 400). I expect that as new Apache/mod_perl children are being put to use, they are creating new database connections.

I'm interested in recommendations to funnel more of that traffic through fewer DB slots, if that's possible. (We could also consider increasing the handles available, since the DB server has some CPU and memory to spare.)

I'm particularly interested in a review of DBD::Gofer, which seems like it would help with this in our Perl application: http://search.cpan.org/dist/DBI/lib/DBD/Gofer.pm

I realize it has limitations, like no transactions, but I think we would still be able to use it selectively in our application.

    Mark
[PERFORM] peer-review of PgBouncer / Slony design
  I would stick to proven postgresql technologies such as pgbouncer.

Thanks for the fast recommendation, Joshua. I'll consider it. Our application is a Slony-replicated web/db project with two slaves. Does this design seem sensible?

- Run one pgbouncer server on the master, with settings to service the master and both slaves.

- We already handle balancing traffic between the slaves separately, so that can remain unchanged.

- Use session pooling for both the master and the slaves. In theory, the slaves should just be doing transaction-less SELECT statements, so a more aggressive setting might be possible, but I believe there might be a leak in the logic where we create a temporary table on the slave in one case.

- Redirect all application connections through pgbouncer.

From graphs we keep, we can see that the slaves currently use a max of about 64 connections -- they are far from maxing out what's possible. So I was trying to think through whether it made sense to bother using the PgBouncer layer with them. I thought of two potential reasons to still use it:

- In the event of a major traffic spike on the web servers, pgbouncer would keep the number of db slots under control.

- Potentially there's a performance gain in having PgBouncer hold the connections open.

Does that analysis seem correct?

For the master's pool size, I thought I would just choose a number that's a little larger than the daily max number of DB slots in use.

    Mark
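A hedged sketch of what the single pgbouncer instance might look like in pgbouncer.ini -- the host names, labels, auth choice, and pool sizes here are invented for illustration, not taken from the actual setup:

```ini
[databases]
; one logical entry per node this bouncer fronts (names are hypothetical)
main   = host=db-master port=5432 dbname=production
slave1 = host=db-slave1 port=5432 dbname=production
slave2 = host=db-slave2 port=5432 dbname=production

[pgbouncer]
listen_addr = *
listen_port = 6432
auth_type = trust            ; placeholder -- use md5/userlist in practice
pool_mode = session          ; safest given the temp-table case on a slave
max_client_conn = 1000       ; web children connect here, cheaply
default_pool_size = 50       ; actual server-side slots per database
```

With session pooling, a client holds its server connection for the life of its session, so the win over raw connections is mostly in absorbing spikes and avoiding connection setup cost -- matching the two reasons listed above.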
[PERFORM] low memory usage reported by 'top' indicates poor tuning?
Hello,

I'm trying to make sense of the memory usage reported by 'top', compared to what pg_database_size shows. Here's one result:

  select pg_size_pretty(pg_database_size('production'));
   pg_size_pretty
  ----------------
   6573 MB

Now, looking at memory use with top, there is a lot of memory that isn't being used on the system:

  Mem: 470M Active, 2064M Inact    (3 Gigs RAM, total)

Overall performance is decent, so maybe there's no problem. However, I wonder if we've under-allocated memory to PostgreSQL. (This is a dedicated FreeBSD DB server.) Some memory settings include:

  shared_buffers = 8192   (we have 450 connections)
  max_fsm_pages  = 125    (we kept getting HINTs to bump it, so we did)

Maybe we should be bumping up sort_mem and vacuum_mem as well? I do sometimes see sorting and vacuuming show up as things I'd like to run faster.

This list has been a great resource for performance tuning help, and I continue to appreciate your help. We've used PostgreSQL on every project we've had a choice on for the last 10 years. (Has it been that long?!) We've never regretted it once.

    Mark

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster
Re: [PERFORM] Writting a search engine for a pgsql DB
Joshua D. Drake wrote:

  Madison Kelly wrote:

    Hi all, I am asking in this list because, at the end of the day, this is a performance question. I am looking at writing a search engine of sorts for my database. I have only ever written very simple search engines before, which amounted to not much more than the query string being used with ILIKE on a pile of columns. This was pretty rudimentary and didn't offer anything like relevance sorting and such (I'd sort by result name, age or whatnot). So I am hoping some of you guys and gals might be able to point me towards some resources or offer some tips or gotchas before I get started on this. I'd really like to come up with a more intelligent search engine that doesn't take two minutes to return results. :) I know, in the end, good indexes and underlying hardware will be important, but a query structure that's as sane as possible helps to start with.

  See search.postgresql.org; you can download all the source from gborg.postgresql.org.

Joshua,

What's the name of the project referred to? There's nothing named "search" hosted on Gborg, according to this project list: http://gborg.postgresql.org/project/projdisplaylist.php

Madison,

For small data sets and simpler searches, the approach you have been using can be appropriate. You may just want a small tool in a regular programming language to help build the query. I wrote such a tool for Perl: http://search.cpan.org/~markstos/SQL-KeywordSearch-1.11/lib/SQL/KeywordSearch.pm

For large or complex searches, a more specialized search system may be appropriate. I suspect that's the kind of tool Joshua is referencing.

    Mark
Re: [PERFORM] Writing a search engine for a pgsql DB
Madison Kelly wrote: I think the more direct question I was trying to get at is: How do you build a 'relevance' search engine? One where results are returned/sorted by relevance of some sort? At this point, the best I can think of would be to perform multiple queries; first matching the whole search term, then the search term starting a row, then ending a row, then anywhere in a row, and scoring the results based on which query they came out on. This seems terribly cumbersome (and probably slow, indexes be damned) though. I'm hoping there is a better way! :) Madison, I think your basic thinking is correct. However, the first select can be done offline -- sometime beforehand. For example, you might create a table called keywords that includes the list of words mined from the other tables, along with references to where the words are found, and how many times they are mentioned. Then, when someone actually searches, the search is primarily on the keywords table, which now provides a way to sort by rank, since the table contains how many times each keyword matches. The final result can be constructed by using the details in the keywords table to pull up the actual records needed. My expectation, however, is that there are enough details to get right in such a system that I would first look at trying a package like tsearch2 to help solve the problem, before trying to write another system like this from scratch. Mark ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
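A minimal sketch of the keywords-table idea described above (all table and column names here are hypothetical):

```sql
-- Built offline by a mining pass over the searchable tables:
CREATE TABLE keywords (
    word       text    NOT NULL,
    table_name text    NOT NULL,   -- which table the word was mined from
    record_id  integer NOT NULL,   -- the row it appears in
    mentions   integer NOT NULL    -- how many times it appears there
);
CREATE INDEX keywords_word_idx ON keywords (word);

-- At search time, rank records by total mentions across the search terms:
SELECT table_name, record_id, sum(mentions) AS rank
  FROM keywords
 WHERE word IN ('dog', 'friendly')
 GROUP BY table_name, record_id
 ORDER BY rank DESC;
```

The ranked (table_name, record_id) pairs are then used to fetch the real rows, so the expensive text matching never touches the base tables at query time.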
Re: [PERFORM] Writing a search engine for a pgsql DB
Now see, this is exactly the kind of sagely advice I was hoping for! :) I'll look into tsearch2, and failing that for some reason, I love the keyword table idea. For example keyword search code, you can try this package: http://downloads.sourceforge.net/cascade/cascade-devel-pieces-1.1.tgz?modtime=999556617&big_mirror=0 There is a keywords subdirectory with the Perl and SQL. I'm sure this code is not ideal in a number of ways: 1. It's from 2001. 2. It doesn't actually function on its own anymore. However, you can read the code and get ideas. 3. I'm sure someone has a better looking/functioning example! Anyway, it's there if you want to take a look. Mark ---(end of broadcast)--- TIP 6: explain analyze is your friend
[PERFORM] Using the 8.2 autovacuum values with 8.1
Thanks to everyone for the feedback about vacuuming. It's been very useful. The pointers to the pgstattuple and pgFouine tools were also helpful. I'm now considering the following plan for trying autovacuuming again with 8.1. I'd like any peer review you have to offer of the following: 1. First, I'll move the settings to match the defaults in 8.2. The ones I noticed in particular were: autovacuum_vacuum_threshold: 1000 -> 500 autovacuum_analyze_threshold: 500 -> 250 autovacuum_vacuum_scale_factor: 0.4 -> 0.2 autovacuum_analyze_scale_factor: 0.2 -> 0.1 2. Try the vacuum cost delay feature, starting with a 20ms value: autovacuum_vacuum_cost_delay = 20 3. Immediately add a row to pg_autovacuum for a huge logging table that would be too slow to vacuum usually. We'll still vacuum it once a week via cron for good measure. 4. I think I'll still keep the nightly cron entry that does a complete vacuum analyze (except for that large table...). Seem like a reasonable plan? Mark ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
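For step 3, the 8.1-era way to give one table its own autovacuum policy is a row in the pg_autovacuum system catalog. A sketch that excludes the table entirely; "activity_log" is a hypothetical table name, and -1 means "use the default":

```sql
-- Exclude the huge logging table from autovacuum (8.1/8.2 catalog layout).
INSERT INTO pg_autovacuum
       (vacrelid, enabled,
        vac_base_thresh, vac_scale_factor,
        anl_base_thresh, anl_scale_factor,
        vac_cost_delay, vac_cost_limit)
VALUES ('activity_log'::regclass, false, -1, -1, -1, -1, -1, -1);
```

Setting enabled = false skips the table; alternatively, leaving it enabled with custom thresholds lets autovacuum visit it, just far less often.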
Re: [PERFORM] How to debug performance problems
Ray Stell wrote: I'd like to have a toolbox prepared for when performance goes south. I'm clueless. Would someone mind providing some detail about how to measure these four items Craig listed: 1. The first thing is to find out which query is taking a lot of time. 2. A long-running transaction keeps vacuum from working. 3. A table grows just enough to pass a threshold in the planner and a drastically different plan is generated. I just ran into a variation of this: 3.5 A table grows so large that VACUUMing it takes an extremely long time, interfering with the general performance of the system. In our case, we think the table had about 36 million rows when it hit that threshold. I'm now working on a better solution for that table. Mark ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
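For item 1, one concrete tool worth having in that toolbox is log_min_duration_statement; the 250 ms threshold below is only an example:

```sql
-- In postgresql.conf:  log_min_duration_statement = 250   (milliseconds)
-- Or per-session, as a superuser, before reproducing the slowness:
SET log_min_duration_statement = 250;
-- Slow statements then appear in the server log with their runtimes,
-- ready to be fed to EXPLAIN ANALYZE or a log analyzer like pgFouine.
```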
[PERFORM] How to avoid vacuuming a huge logging table
Our application has a table that is only logged to, and infrequently used for reporting. There are generally no deletes or updates. Recently, the sheer size (an estimated 36 million rows) caused a serious problem because it prevented a vacuum analyze on the whole database from finishing in a timely manner. As I understand it, a table with this usage pattern wouldn't need to be vacuumed anyway. I'm looking for general advice from people who have faced the same issue. I'm looking at a number of alternatives: 1. Once a month, we could delete and archive old rows, for possible re-import later if we need to report on them. It would seem this would need to be done as proper insert statements for re-importing. (Maybe there is a solution for that with table partitioning?) 2. We could find a way to exclude the table from vacuuming, and let it grow even larger. Putting the table in its own database would accomplish that, but it would be nice to avoid the overhead of a second database connection. 3. Take a really different approach: log in CSV format to text files instead, and only import the date ranges we need on demand if a report is requested on the data. Thanks for any tips. Mark ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
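For options 1 and 3, COPY may be the simplest archive mechanism. A sketch, assuming a hypothetical activity_log table with a logged_at timestamp; note that COPY from a query requires 8.2 (on 8.1 you'd stage the rows in a temporary table and COPY that):

```sql
BEGIN;
-- Archive everything older than a month to CSV for possible re-import:
COPY (SELECT * FROM activity_log
       WHERE logged_at < now() - interval '1 month')
  TO '/archive/activity_log.csv' WITH CSV;
-- Then trim the live table:
DELETE FROM activity_log
 WHERE logged_at < now() - interval '1 month';
COMMIT;
-- Re-import later with:
--   COPY activity_log FROM '/archive/activity_log.csv' WITH CSV;
```

CSV round-trips through COPY directly, so the archive doesn't need to be rewritten as INSERT statements.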
[PERFORM] Auto-Vacuum in 8.1 was ineffective for me. 8.2 may work better?
When I upgraded a busy database system to PostgreSQL 8.1, I was excited about AutoVacuum, and promptly enabled it, and turned off the daily vacuum process. (I set the following, as well as the option to enable autovacuuming: stats_start_collector = true stats_row_level = true) I could see in the logs that related activity was happening, but within a few days, the performance became horrible, and enabling the regular vacuum fixed it. Eventually autovacuum was completely disabled. What could have happened? Is 8.2 more likely to just work in this regard? Is there table-specific tuning that I would have needed to do? I realize getting autovacuuming to work could be one way to exclude the large table I wrote about in a recent post. Mark ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PERFORM] reindex vs 'analyze'
Tom Lane wrote: Mark Stosberg [EMAIL PROTECTED] writes: Your suggestion about the pet_state index was right on. I tried Analyze on it, but still got the same bad estimate. However, I then used reindex on that index, and that fixed the estimate accuracy, which made the query run faster! No, the estimate is about the same, and so is the plan. The data seems to have changed though --- on Monday you had - Bitmap Index Scan on pets_pet_state_idx (cost=0.00..562.50 rows=39571 width=0) (actual time=213.620..213.620 rows=195599 loops=82) Index Cond: ((pet_state)::text = 'available'::text) and now it's - Bitmap Index Scan on pets_pet_state_idx (cost=0.00..285.02 rows=41149 width=0) (actual time=22.043..22.043 rows=40397 loops=82) Index Cond: ((pet_state)::text = 'available'::text) Don't tell me you got 155000 pets adopted out yesterday ... what happened here? [ thinks... ] One possibility is that those were dead but not-yet-vacuumed rows. What's your vacuuming policy on this table? (A bitmap-index-scan plan node will count dead rows as returned, unlike all other plan node types, since we haven't actually visited the heap yet...) Today I noticed a combination of related mistakes here. 1. The Vacuum commands were being logged to a file that didn't exist. I'm not sure if this prevented them from being run. I had copied the cron entry from another machine, but neglected to create /var/log/pgsql: vacuumdb -z --table pets -d saveapet >> /var/log/pgsql/vacuum.log 2>&1 ### However, I again noticed that the row counts were horribly off on the 'pet_state' index, and again used REINDEX to fix it. (Examples below). However, if the VACUUM ANALYZE wasn't actually run, that does seem like it could have been related. I'll have to see how things are tomorrow after a full round of database vacuuming. Mark - Bitmap Index Scan on pets_pet_state_idx (cost=0.00..337.29 rows=39226 width=0) (actual time=77.158..77.158 rows=144956 loops=81) Index Cond: ((pet_state)::text = 'available'::text) Total runtime: 8327.261 ms - Bitmap Index Scan on pets_pet_state_idx (cost=0.00..271.71 rows=39347 width=0) (actual time=15.466..15.466 rows=40109 loops=81) Index Cond: ((pet_state)::text = 'available'::text) Total runtime: 1404.124 ms ---(end of broadcast)--- TIP 6: explain analyze is your friend
[PERFORM] reindex vs 'analyze' (was: Re: cube operations slower than geo_distance() on production server)
Merlin Moncure wrote: On 2/14/07, Tom Lane [EMAIL PROTECTED] wrote: There are two things wrong here: first, that the estimated row count is only 20% of actual; it should certainly not be that far off for such a simple condition. I wonder if your vacuum/analyze procedures are actually working. Second, you mentioned somewhere along the line that 'available' pets are about 10% of all the entries, which means that this indexscan is more than likely entirely counterproductive: it would be cheaper to ignore this index altogether. Tom, Thanks for the generosity of your time. We are using 8.1.3 currently. I have read there are some performance improvements in 8.2, but we have not started evaluating that yet. Your suggestion about the pet_state index was right on. I tried Analyze on it, but still got the same bad estimate. However, I then used reindex on that index, and that fixed the estimate accuracy, which made the query run faster! The cube search now benchmarks faster than the old search in production, taking about 2/3 of the time of the old one. Any ideas why the manual REINDEX did something that analyze didn't? It makes me wonder if there is other tuning like this to do. Attached is the EXPLAIN ANALYZE output from the most recent run, after the re-index. I think switching the index on pet_state to a composite on (pet_state, species_id) might help too. or even better: create function is_pet_available(text) returns bool as $$ select $1='available'; $$ language sql immutable; create index pets_available_species_idx on pets(is_pet_available(pet_state), species_id); Merlin, Thanks for this suggestion. It is not an approach I had used before, and I was interested to try it. However, the new index didn't get chosen. (Perhaps I would need to drop the old one?) However, Tom's suggestions did help. I'll follow up on that in just a moment.
refactor your query to something similar to: SELECT * FROM ( SELECT cube_distance(q.earth_coords, s.earth_coords)/1609.344 as radius FROM pets JOIN shelters_active as shelters USING (shelter_id) JOIN zipcodes s ON shelters.postal_code_for_joining = s.zipcode JOIN zipcodes q ON q.zipcode = '90210' WHERE is_pet_available(pet_state) AND species_id = 1 AND earth_box(q.earth_coords, 10*1609.344) @ s.earth_coords ) p order by radius merlin ---(end of broadcast)--- TIP 6: explain analyze is your friend Sort (cost=5276.93..5277.00 rows=28 width=64) (actual time=1981.830..1984.415 rows=1344 loops=1) Sort Key: (cube_distance(public.zipcodes.earth_coords, public.zipcodes.earth_coords) / 1609.344::double precision) - Nested Loop (cost=291.32..5276.26 rows=28 width=64) (actual time=24.080..1976.479 rows=1344 loops=1) - Nested Loop (cost=2.15..575.79 rows=11 width=68) (actual time=2.637..34.067 rows=131 loops=1) - Nested Loop (cost=2.15..153.48 rows=42 width=73) (actual time=1.939..3.972 rows=240 loops=1) - Index Scan using zipcodes_pkey on zipcodes (cost=0.00..3.01 rows=1 width=32) (actual time=0.283..0.287 rows=1 loops=1) Index Cond: ((zipcode)::text = '90210'::text) - Bitmap Heap Scan on zipcodes (cost=2.15..149.84 rows=42 width=41) (actual time=1.403..2.323 rows=240 loops=1) Recheck Cond: (cube_enlarge((outer.earth_coords)::cube, 16093.4357308298::double precision, 3) @ zipcodes.earth_coords) - Bitmap Index Scan on zip_earth_coords_idx (cost=0.00..2.15 rows=42 width=0) (actual time=1.377..1.377 rows=240 loops=1) Index Cond: (cube_enlarge((outer.earth_coords)::cube, 16093.4357308298::double precision, 3) @ zipcodes.earth_coords) - Index Scan using shelters_postal_code_for_joining_idx on shelters (cost=0.00..10.03 rows=2 width=12) (actual time=0.064..0.118 rows=1 loops=240) Index Cond: ((shelters.postal_code_for_joining)::text = (outer.zipcode)::text) Filter: ((shelter_state)::text = 'active'::text) - Bitmap Heap Scan on pets (cost=289.17..426.86 rows=35 width=4) (actual
time=14.362..14.746 rows=10 loops=131) Recheck Cond: ((pets.shelter_id = outer.shelter_id) AND ((pets.pet_state)::text = 'available'::text)) Filter: (species_id = 1) - BitmapAnd (cost=289.17..289.17 rows=35 width=0) (actual time=14.219..14.219 rows=0 loops=131) - Bitmap Index Scan on pets_shelter_id_idx (cost=0.00..3.89 rows=256 width=0) (actual time=0.188..0.188 rows=168 loops=131) Index Cond: (pets.shelter_id = outer.shelter_id) - Bitmap Index Scan on pets_pet_state_idx (cost=0.00..285.02 rows=41149 width=0) (actual time=22.043..22.043 rows=40397 loops=82) Index Cond: ((pet_state)::text = 'available'::text)
Re: [PERFORM] reindex vs 'analyze'
On Wed, Feb 14, 2007 at 01:07:23PM -0500, Tom Lane wrote: Mark Stosberg [EMAIL PROTECTED] writes: Your suggestion about the pet_state index was right on. I tried Analyze on it, but still got the same bad estimate. However, I then used reindex on that index, and that fixed the estimate accuracy, which made the query run faster! No, the estimate is about the same, and so is the plan. The data seems to have changed though --- on Monday you had - Bitmap Index Scan on pets_pet_state_idx (cost=0.00..562.50 rows=39571 width=0) (actual time=213.620..213.620 rows=195599 loops=82) Index Cond: ((pet_state)::text = 'available'::text) and now it's - Bitmap Index Scan on pets_pet_state_idx (cost=0.00..285.02 rows=41149 width=0) (actual time=22.043..22.043 rows=40397 loops=82) Index Cond: ((pet_state)::text = 'available'::text) Don't tell me you got 155000 pets adopted out yesterday ... what happened here? That seemed to be the difference that the reindex made. The number of rows in the table and the number marked available is roughly unchanged. select count(*) from pets; 304951 (1 row) select count(*) from pets where pet_state = 'available'; --- 39857 It appears just about 400 were marked as adopted yesterday. [ thinks... ] One possibility is that those were dead but not-yet-vacuumed rows. What's your vacuuming policy on this table? It gets vacuum analyzed every two hours throughout most of the day. Nightly we vacuum analyze everything, but most of the time we just do this table. (A bitmap-index-scan plan node will count dead rows as returned, unlike all other plan node types, since we haven't actually visited the heap yet...) Thanks again for your help, Tom. Mark -- . . . . . . . . . . . . . . . . . . . . . . . . . . . Mark Stosberg Principal Developer [EMAIL PROTECTED] Summersault, LLC 765-939-9301 ext 202 database driven websites . . . . . http://www.summersault.com/ . . . . . . . . ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ?
http://www.postgresql.org/docs/faq
Re: [PERFORM] cube operations slower than geo_distance() on production server
Merlin Moncure wrote: On 2/10/07, Mark Stosberg [EMAIL PROTECTED] wrote: With the help of some of this list, I was able to successfully set up and benchmark a cube-based replacement for geo_distance() calculations. On a development box, the cube-based variations benchmarked consistently running in about 1/3 of the time of the geo_distance() equivalents. After setting up the same columns and indexes on a production database, it's a different story. All the cube operations show themselves to be about the same as, or noticeably slower than, the same operations done with geo_distance(). I've stared at the EXPLAIN ANALYZE output as much as I can to figure out what's gone wrong. Could you help? Here's the plan on the production server, which seems too slow. Below is the plan I get on the development server, which is much faster. I tried set enable_nestloop = off, which did change the plan, but not the performance. The production DB has much more data in it, but I still expected comparable results relative to using geo_distance() calculations. any objection to posting the query (and maybe tables, keys, indexes, etc)?
Here's the basic query I'm using: SELECT -- 1609.344 is a constant for meters per mile cube_distance( (SELECT earth_coords from zipcodes WHERE zipcode = '90210') , earth_coords)/1609.344 AS RADIUS FROM pets -- shelters_active is a view where shelter_state = 'active' JOIN shelters_active as shelters USING (shelter_id) -- The zipcode fields here are varchars JOIN zipcodes ON ( shelters.postal_code_for_joining = zipcodes.zipcode ) -- search for just 'dogs' WHERE species_id = 1 AND pet_state='available' AND earth_box( (SELECT earth_coords from zipcodes WHERE zipcode = '90210') , 10*1609.344 ) @ earth_coords ORDER BY RADIUS; All the related columns are indexed: pets.species_id pets.shelter_id pets.pet_state shelters.shelter_id (pk) shelters.postal_code_for_joining shelters.active zipcodes.zipcode (pk) zipcodes.earth_coords The pets table has about 300,000 rows, but only about 10% are available. It sees regular updates and is vacuum analyzed every couple of hours now. The rest of the tables get vacuum analyzed nightly. The shelters table is about 99% shelter_state = 'active'. It's updated infrequently. The zipcodes table has about 40,000 rows in it and doesn't change. I tried a partial index on the pets table WHERE pet_state = 'available'. I could see the index was used, but the performance was unaffected. The EXPLAIN ANALYZE output is attached, to try to avoid mail-client wrapping. The query is running 10 times slower today than on Friday, perhaps because of server load, or because we are at the end of a VACUUM cycle. Thanks for any help!
Mark Sort (cost=6887.03..6887.10 rows=27 width=32) (actual time=17925.098..17927.979 rows=1324 loops=1) Sort Key: (cube_distance($0, zipcodes.earth_coords) / 1609.344::double precision) InitPlan - Index Scan using zipcodes_pkey on zipcodes (cost=0.00..3.01 rows=1 width=32) (actual time=0.028..0.031 rows=1 loops=1) Index Cond: ((zipcode)::text = '90210'::text) - Index Scan using zipcodes_pkey on zipcodes (cost=0.00..3.01 rows=1 width=32) (actual time=0.687..0.692 rows=1 loops=1) Index Cond: ((zipcode)::text = '90210'::text) - Nested Loop (cost=568.82..6880.36 rows=27 width=32) (actual time=346.932..17919.697 rows=1324 loops=1) - Nested Loop (cost=2.15..572.14 rows=9 width=36) (actual time=8.321..43.378 rows=136 loops=1) - Bitmap Heap Scan on zipcodes (cost=2.15..150.05 rows=42 width=41) (actual time=3.442..4.402 rows=240 loops=1) Recheck Cond: (cube_enlarge(($1)::cube, 16093.4357308298::double precision, 3) @ earth_coords) - Bitmap Index Scan on zip_earth_coords_idx (cost=0.00..2.15 rows=42 width=0) (actual time=3.426..3.426 rows=240 loops=1) Index Cond: (cube_enlarge(($1)::cube, 16093.4357308298::double precision, 3) @ earth_coord) - Index Scan using shelters_postal_code_for_joining_idx on shelters (cost=0.00..10.02 rows=2 width=12) (actual time=0.093..0.155 rows=1 loops=240) Index Cond: ((shelters.postal_code_for_joining)::text = (outer.zipcode)::text) - Bitmap Heap Scan on pets (cost=566.67..700.47 rows=34 width=4) (actual time=130.363..131.367 rows=10 loops=136) Recheck Cond: ((pets.shelter_id = outer.shelter_id) AND ((pets.pet_state)::text = 'available'::text)) Filter: (species_id = 1) - BitmapAnd (cost=566.67..566.67 rows=34 width=0) (actual time=129.333..129.333 rows=0 loops=136) - Bitmap Index Scan on pets_shelter_id_idx (cost=0.00..3.92 rows=263 width=0) (actual time=0.164..0.164 rows=178 loops=136) Index Cond: (pets.shelter_id = outer.shelter_id) - Bitmap Index Scan on pets_pet_state_idx (cost=0.00..562.50
Re: [PERFORM] cube operations slower than geo_distance() on production server
Merlin-- Thanks so much for your help. Some follow-ups are below. Merlin Moncure wrote: Here's the basic query I'm using: SELECT -- 1609.344 is a constant for meters per mile cube_distance( (SELECT earth_coords from zipcodes WHERE zipcode = '90210') , earth_coords)/1609.344 AS RADIUS FROM pets -- shelters_active is a view where shelter_state = 'active' JOIN shelters_active as shelters USING (shelter_id) -- The zipcode fields here are varchars JOIN zipcodes ON ( shelters.postal_code_for_joining = zipcodes.zipcode ) -- search for just 'dogs' WHERE species_id = 1 AND pet_state='available' AND earth_box( (SELECT earth_coords from zipcodes WHERE zipcode = '90210') , 10*1609.344 ) @ earth_coords ORDER BY RADIUS; your query looks a bit funky. here are the problems I see. * in your field list, you don't need to re-query the zipcode table. cube_distance( (SELECT earth_coords from zipcodes WHERE zipcode = '90210') , earth_coords)/1609.344 AS RADIUS becomes cube_distance(pets.earth_coords, earth_coords ) / 1609.344 AS RADIUS It may not have been clear from the query, but only the 'zipcodes' table has an 'earth_coords' column. Also, I think your refactoring means something different. My query expresses the number of miles this pet is from 90210, while I think the refactor expresses a distance between a pet and another calculated value. also, don't re-refer to the zipcodes table in the join clause. you are already joining to it: AND earth_box( (SELECT earth_coords from zipcodes WHERE zipcode = '90210') , 10*1609.344) @ earth_coords becomes AND earth_box(zipcodes.earth_coords, 10*1609.344) @ pets.earth_coords I have the same question here as above-- I don't see how the new syntax includes the logic of distance from the 90210 zipcode. * also, does pet_state have any other states than 'available' and 'not available'? if not, you should be using a boolean. if so, you can consider a functional index to convert it to a boolean. Yes, it has three states.
* if you always look up pets by species, we can explore composite index columns on species, available (especially using the above functional suggestion), etc. composite partial (imo) We nearly always search by species. Right now it's mostly dogs and some cats. I searched for references to composite index columns, and didn't find much. Could you provide a direct reference to what you have in mind? Any other ideas appreciated! Mark ---(end of broadcast)--- TIP 6: explain analyze is your friend
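To make the composite idea concrete, here is one form it could take for our case -- a partial composite index covering only the rows that searches actually touch (just a sketch; the planner still has to choose it):

```sql
-- Composite + partial: index the column we always filter on, but only
-- for the ~10% of rows that are 'available'.
CREATE INDEX pets_available_species_idx
    ON pets (species_id)
 WHERE pet_state = 'available';

-- The query must repeat the predicate for the index to be considered:
--   ... WHERE pet_state = 'available' AND species_id = 1
```

Because the index only contains available pets, it stays roughly a tenth the size of a full index on (pet_state, species_id), which helps both lookups and maintenance.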
[PERFORM] cube operations slower than geo_distance() on production server
With the help of some of this list, I was able to successfully set up and benchmark a cube-based replacement for geo_distance() calculations. On a development box, the cube-based variations benchmarked consistently running in about 1/3 of the time of the geo_distance() equivalents. After setting up the same columns and indexes on a production database, it's a different story. All the cube operations show themselves to be about the same as, or noticeably slower than, the same operations done with geo_distance(). I've stared at the EXPLAIN ANALYZE output as much as I can to figure out what's gone wrong. Could you help? Here's the plan on the production server, which seems too slow. Below is the plan I get on the development server, which is much faster. I tried set enable_nestloop = off, which did change the plan, but not the performance. The production DB has much more data in it, but I still expected comparable results relative to using geo_distance() calculations. The production db gets a VACUUM ANALYZE every couple of hours now. Thanks!
Mark Sort (cost=6617.03..6617.10 rows=27 width=32) (actual time=2482.915..2487.008 rows=1375 loops=1) Sort Key: (cube_distance($0, zipcodes.earth_coords) / 1609.344::double precision) InitPlan - Index Scan using zipcodes_pkey on zipcodes (cost=0.00..3.01 rows=1 width=32) (actual time=0.034..0.038 rows=1 loops=1) Index Cond: ((zipcode)::text = '90210'::text) - Index Scan using zipcodes_pkey on zipcodes (cost=0.00..3.01 rows=1 width=32) (actual time=0.435..0.438 rows=1 loops=1) Index Cond: ((zipcode)::text = '90210'::text) - Nested Loop (cost=538.82..6610.36 rows=27 width=32) (actual time=44.660..2476.919 rows=1375 loops=1) - Nested Loop (cost=2.15..572.14 rows=9 width=36) (actual time=4.877..39.037 rows=136 loops=1) - Bitmap Heap Scan on zipcodes (cost=2.15..150.05 rows=42 width=41) (actual time=3.749..4.951 rows=240 loops=1) Recheck Cond: (cube_enlarge(($1)::cube, 16093.4357308298::double precision, 3) @ earth_coords) - Bitmap Index Scan on zip_earth_coords_idx (cost=0.00..2.15 rows=42 width=0) (actual time=3.658..3.658 rows=240 loops=1) Index Cond: (cube_enlarge(($1)::cube, 16093.4357308298::double precision, 3) @ earth_coords) - Index Scan using shelters_postal_code_for_joining_idx on shelters (cost=0.00..10.02 rows=2 width=12) (actual time=0.079..0.133 rows=1 loops=240) Index Cond: ((shelters.postal_code_for_joining)::text = (outer.zipcode)::text) - Bitmap Heap Scan on pets (cost=536.67..670.47 rows=34 width=4) (actual time=16.844..17.830 rows=10 loops=136) Recheck Cond: ((pets.shelter_id = outer.shelter_id) AND ((pets.pet_state)::text = 'available'::text)) Filter: (species_id = 1) - BitmapAnd (cost=536.67..536.67 rows=34 width=0) (actual time=16.621..16.621 rows=0 loops=136) - Bitmap Index Scan on pets_shelter_id_idx (cost=0.00..3.92 rows=263 width=0) (actual time=0.184..0.184 rows=132 loops=136) Index Cond: (pets.shelter_id = outer.shelter_id) - Bitmap Index Scan on
pets_pet_state_idx (cost=0.00..532.50 rows=39571 width=0) (actual time=26.922..26.922 rows=40390 loops=82) Index Cond: ((pet_state)::text = 'available'::text) Total runtime: 2492.852 ms ### Faster plan in development: Sort (cost=7004.53..7004.62 rows=39 width=32) (actual time=54.635..55.450 rows=475 loops=1) Sort Key: (cube_distance($0, earth_distance.earth_coords) / 1609.344::double precision) InitPlan - Bitmap Heap Scan on earth_distance (cost=4.74..624.60 rows=212 width=32) (actual time=0.113..0.115 rows=1 loops=1) Recheck Cond: ((zipcode)::text = '90210'::text) - Bitmap Index Scan on earth_distance_zipcode_idx (cost=0.00..4.74 rows=212 width=0) (actual time=0.101..0.101 rows=2 loops=1) Index Cond: ((zipcode)::text = '90210'::text) - Bitmap Heap Scan on earth_distance (cost=4.74..624.60 rows=212 width=32) (actual time=0.205..0.208 rows=1 loops=1) Recheck Cond: ((zipcode)::text = '90210'::text) - Bitmap Index Scan on earth_distance_zipcode_idx (cost=0.00..4.74 rows=212 width=0) (actual time=0.160..0.160 rows=2 loops=1) Index Cond: ((zipcode)::text = '90210'::text) - Hash Join (cost=618.67..5754.30 rows=39 width=32) (actual time=13.499..52.924 rows=475 loops=1) Hash Cond: (outer.shelter_id = inner.shelter_id) - Bitmap Heap Scan on pets (cost=44.85..5158.42 rows=4298 width=4) (actual time=4.278..34.192 rows=3843 loops=1) Recheck Cond: ((pet_state)::text = 'available'::text)
Re: [PERFORM] Can anyone make this code tighter? Too slow, Please help!
[EMAIL PROTECTED] wrote: I have this function in my C#.NET app that goes out to find the business units for each event and returns a string (for my report). I'm finding that for larger reports it takes too long and times out. Does anyone know how I can speed this process up? Is this code very tight or could it be cleaner? thanks in advance for your help, this is a big issue used on several reports. Perhaps try EXPLAIN ANALYZE on this query, given a valid event ID: SELECT Distinct Companies.Name FROM Companies INNER JOIN ExpenseAllocations ON Companies.ID = ExpenseAllocations.BusinessUnitID WHERE (ExpenseAllocations.EventID = @EventID) ORDER BY Companies.Name DESC ### Do the columns used in the join and WHERE clause have indexes? It's also possible the optimization needs to happen at a different level. Perhaps you are frequently looking up the same results in a large report, or throughout the day. If this part doesn't need to be up-to-second fresh, perhaps your application could cache some of the results of this function, instead of repeatedly asking the database to recompute it. Mark ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
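If the EXPLAIN ANALYZE shows sequential scans, indexes on the join and filter columns would be the first thing to check. A sketch against the column names in the query above (index names are hypothetical):

```sql
-- Supports the WHERE clause:
CREATE INDEX expenseallocations_eventid_idx
    ON ExpenseAllocations (EventID);
-- Supports the join back to Companies:
CREATE INDEX expenseallocations_buid_idx
    ON ExpenseAllocations (BusinessUnitID);
```

Companies.ID is presumably already covered by its primary key.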
[PERFORM] tip: faster sorting for proximity queries by using cube_distance()
Hello, I wanted to share something else I learned in my proximity search work. One of my requirements is to order by the distance that matches are found from the center point. When I did this using earth_distance(), the benefit of the earth_box() technique over the old geo_distance() became minimal as I approached a 250mi radius. Switching to sorting by cube_distance() offered a huge benefit, allowing the earth_distance() query to run in about 100ms vs 300ms for the geo_distance() equivalent. I checked the results that cube_distance() produced versus earth_distance(). cube_distance() is always (not surprisingly) a little smaller, but the difference only grows to about a mile for a 250 mile radius. That's an acceptable margin of error for this application, and may be for others as well. Mark ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
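In other words, keep the accurate earth_distance() in the SELECT list but sort by the cheap straight-line cube_distance(). A sketch against my zipcodes table (the 250-mile radius is the one from the discussion above):

```sql
SELECT z.zipcode,
       earth_distance(c.earth_coords, z.earth_coords) / 1609.344 AS miles
  FROM zipcodes z,
       (SELECT earth_coords FROM zipcodes WHERE zipcode = '90210') c
 WHERE earth_box(c.earth_coords, 250 * 1609.344) @ z.earth_coords
 ORDER BY cube_distance(c.earth_coords, z.earth_coords);  -- chord, not arc
```

cube_distance() measures the straight chord through the earth rather than along the surface, which is why it sorts in the same order but reads slightly short.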
Re: [PERFORM] explain analyze output for review (was: optimizing a geo_distance()...)
Bruno Wolff III wrote: Some people here may be able to tell you more if you show us explain analyze output. Here is my explain analyze output. Some brief context of what's going on. The goal is to find Pets Near You. We join the pets table on the shelters table to get a zipcode, and then join the shelters table with the earth_distance table to get the coordinates of the zipcode. (Is there any significant penalty for using a varchar vs an int for a join?) I've been investigating partial indexes for the pets table. It has about 300,000 rows, but only about 10% are active, and those are the ones we care about. Queries are also frequently made on males vs females, dogs vs cats, or specific ages, and those specific cases seem like possible candidates for partial indexes as well. I played with that approach some, but had trouble coming up with anything that benchmarked faster. If I'm reading the explain analyze output correctly, nearly all of the time spent is related to the 'pets' table, but I can't see what to do about it. Help appreciated!
Mark

 Nested Loop  (cost=11.82..29.90 rows=1 width=0) (actual time=37.601..1910.787 rows=628 loops=1)
   ->  Nested Loop  (cost=6.68..20.73 rows=1 width=24) (actual time=35.525..166.547 rows=1727 loops=1)
         ->  Bitmap Heap Scan on pets  (cost=6.68..14.71 rows=1 width=4) (actual time=35.427..125.594 rows=1727 loops=1)
               Recheck Cond: (((sex)::text = 'f'::text) AND (species_id = 1))
               Filter: ((pet_state)::text = 'available'::text)
               ->  BitmapAnd  (cost=6.68..6.68 rows=2 width=0) (actual time=33.398..33.398 rows=0 loops=1)
                     ->  Bitmap Index Scan on pets_sex_idx  (cost=0.00..3.21 rows=347 width=0) (actual time=14.739..14.739 rows=35579 loops=1)
                           Index Cond: ((sex)::text = 'f'::text)
                     ->  Bitmap Index Scan on pet_species_id_idx  (cost=0.00..3.21 rows=347 width=0) (actual time=16.779..16.779 rows=48695 loops=1)
                           Index Cond: (species_id = 1)
         ->  Index Scan using shelters_pkey on shelters  (cost=0.00..6.01 rows=1 width=28) (actual time=0.012..0.014 rows=1 loops=1727)
               Index Cond: (outer.shelter_id = shelters.shelter_id)
   ->  Bitmap Heap Scan on earth_distance  (cost=5.14..9.15 rows=1 width=9) (actual time=0.984..0.984 rows=0 loops=1727)
         Recheck Cond: ((cube_enlarge(('(-2512840.11676572, 4646218.19036629, 3574817.21369166)'::cube)::cube, 160930.130863421::double precision, 3) @ earth_distance.earth_coords) AND ((outer.postal_code_for_joining)::text = (earth_distance.zipcode)::text))
         ->  BitmapAnd  (cost=5.14..5.14 rows=1 width=0) (actual time=0.978..0.978 rows=0 loops=1727)
               ->  Bitmap Index Scan on earth_coords_idx  (cost=0.00..2.15 rows=42 width=0) (actual time=0.951..0.951 rows=1223 loops=1727)
                     Index Cond: (cube_enlarge(('(-2512840.11676572, 4646218.19036629, 3574817.21369166)'::cube)::cube, 160930.130863421::double precision, 3) @ earth_coords)
               ->  Bitmap Index Scan on earth_distance_zipcode_idx  (cost=0.00..2.74 rows=212 width=0) (actual time=0.015..0.015 rows=1 loops=1727)
                     Index Cond: ((outer.postal_code_for_joining)::text = (earth_distance.zipcode)::text)
 Total runtime: 1913.099 ms

---(end of broadcast)---
TIP 6: explain analyze is your friend
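The plan above does not include the SQL that produced it, but the table and column names in the plan nodes suggest its rough shape. The following is only a hypothetical reconstruction (the join keys, filter values, and the selected column `pet_id` are guesses; the `cube_enlarge` expression is what `earth_box()` expands to in this era of contrib/earthdistance). Note the `earth_distance` bitmap heap scan runs once per outer row (`loops=1727`), which is where most of the 1.9 s goes:

```sql
-- Hypothetical query shape inferred from the plan; names beyond those in the
-- plan (e.g. pet_id, the join columns' direction) are assumptions.
SELECT p.pet_id
FROM pets p
JOIN shelters s ON s.shelter_id = p.shelter_id
JOIN earth_distance e ON e.zipcode = s.postal_code_for_joining
WHERE p.sex = 'f'
  AND p.species_id = 1
  AND p.pet_state = 'available'
  AND cube_enlarge('(-2512840.11676572, 4646218.19036629, 3574817.21369166)'::cube,
                   160930.130863421, 3) @ e.earth_coords;
```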
Re: [PERFORM] optimizing a geo_distance() proximity query
Bruno Wolff III wrote:
> On Sat, Feb 03, 2007 at 14:00:26 -0500, Mark Stosberg [EMAIL PROTECTED] wrote:
>> I'm using geo_distance() from contrib/earthdistance and would like to find a way to speed up the geo distance calculation if possible. This is for a proximity search: "Show me adoptable pets within 250 miles of this zipcode."
>
> If you are using the cube based part of the earth distance package, then you can use gist indexes to speed those searches up.

Thanks for the tip. Any idea what kind of improvement I can expect to see, compared to using geo_distance()?

> There are functions for creating boxes that include all of the points some distance from a fixed point. This is lossy, so you need to recheck if you don't want some points a bit farther away returned. Also you would need to pick a point to be where the zip code is located, rather than using area based zip codes.

This is also interesting. Is this approach practical if I want to index what's near each of about 40,000 US zipcodes, or is the approach mostly useful when there are just a small number of fixed points to search from? I'm going to start installing the cube() and earth_distance() functions today and see where I can get with this approach.

Mark
Re: [PERFORM] optimizing a geo_distance() proximity query (example and benchmark)
Merlin Moncure wrote:
> On 2/5/07, Mark Stosberg [EMAIL PROTECTED] wrote:
>> Thanks for the tip. Any idea what kind of improvement I can expect to see, compared to using geo_distance()?
>
> a lot. be aware that gist takes longer to build than btree, but very fast to search. Index search and filter to box is basically an index lookup (fast!). for mostly static datasets that involve a lot of searching, gist is ideal.

The documentation in contrib/ didn't provide examples of how to create the index or actually run a proximity search. Here's what I figured out to do:

I added a new column as type 'cube':

  ALTER table zipcodes add column earth_coords cube;

Next I converted the old lat/lon data I had stored in a 'point' type to the new format:

  -- Make sure to get lat/lon in the right order for your data model!
  UPDATE zipcodes set earth_coords = ll_to_earth( lon_lat[1], lon_lat[0] );

Now I added a GIST index on the field:

  CREATE index earth_coords_idx on zipcodes using gist (earth_coords);

Finally, I was able to run a query, which I could see used the index (by checking with EXPLAIN ANALYZE):

  select * from zipcodes
  where earth_box('(436198.322855334, 4878562.8732218, 4085386.43843934)'::cube, 16093.44) @ earth_coords;

It's also notable that the units used are meters, not miles like geo_distance(). That's what the magic number of 16093.44 is-- 10 miles converted to meters. When I benchmarked this query against the old geo_distance() variation, it was about 200 times faster (~100 ms vs ~0.5 ms).
However, my next step was to try a more real-world query that involved a more complex WHERE clause and a couple of table joins. So far, that result is coming out /slower/ with the new approach, even though the index is being used. I believe this may be because of the additional results found that are outside of the sphere, but inside the cube. This causes additional rows that need processing in the joined tables.

Could someone post an example of how to further refine this so the results more closely match what geo_distance() returns? Any other indexing or optimization tips would be appreciated.

Mark
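The lossiness described above is inherent to earth_box(): the bounding cube's corners reach farther than the requested radius, so points in the corners pass the index test but are outside the sphere. One standard refinement, sketched here under the assumption that earth_coords was populated with ll_to_earth() as shown earlier (the 41.7/-84.9 coordinates are a made-up example point), is to keep the indexable earth_box() test and add an exact recheck with contrib/earthdistance's earth_distance() function, both in meters:

```sql
-- Sketch: fast lossy index probe, then an exact great-circle recheck.
-- 16093.44 m = 10 miles; the lat/lon literals are hypothetical.
SELECT z.zipcode
FROM zipcodes z
WHERE earth_box(ll_to_earth(41.7, -84.9), 16093.44) @ z.earth_coords      -- uses the GIST index
  AND earth_distance(ll_to_earth(41.7, -84.9), z.earth_coords) <= 16093.44;  -- exact filter
```

Applying the exact filter before joining to other tables should also cut down the extra rows flowing into the joins.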
[PERFORM] Using statement_timeout as a performance tool?
Hello,

I'm working on setting up replication with Slony, and will soon have a slave that a lot of SELECT traffic will be sent to (over 500k/day). The primary query we need to run is somewhat complex, but seems to complete on average in well under a second. However, every so often (less than 1 in 10,000 queries) we'll see the query take 2 or 3 minutes. It's not clear why this is happening-- perhaps there is something else going on that is affecting this query.

I'm considering the use of statement_timeout to limit the time of this particular query, to suppress the rare runaway and avoid tying up the processor for that additional time. I think it may be better to give up, quit spending cycles on it right then, and return an "oops, try again in a few minutes" message instead. From the data we have, it seems like the query has a strong chance of working again.

Is anyone else using statement_timeout as part of an overall performance plan?

Mark
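For reference, statement_timeout can be scoped to just the risky query by setting it on the session around that statement (the value is in milliseconds; 5000 here is an arbitrary example threshold, not a recommendation):

```sql
SET statement_timeout = 5000;   -- cap the next statements at 5 seconds

-- run the occasionally-runaway SELECT here; if it exceeds the cap it fails with:
--   ERROR:  canceling statement due to statement timeout
-- which the application can catch and turn into a "try again later" message

SET statement_timeout = 0;      -- 0 disables the timeout (the default)
```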
Re: [PERFORM] Most effective tuning choices for busy website?
Neil Conway wrote:
> Mark Stosberg wrote:
>> I've used PQA to analyze my queries and am happy overall with how they are running. About 55% of the query time is going to variations of the pet searching query, which seems like where it should be going. The query is frequent and complex. It has already been combed over for appropriate indexing.
>
> It might be worth posting the EXPLAIN ANALYZE and relevant schema definitions for this query, in case there is additional room for optimization.
>
>> Our hardware: Dual 3 GHz processors, 3 GB RAM, running on FreeBSD.
>
> Disk? You are presumably using Xeon processors, right? If so, check the list archives for information on the infamous context switching storm that causes performance problems for some people using SMP Xeons.

I wanted to follow up to report a positive outcome to tuning this Xeon SMP machine on FreeBSD. We applied the following techniques, and saw the average CPU usage drop by about 25%:

- In /etc/sysctl.conf, we set it to use raw RAM for shared memory: kern.ipc.shm_use_phys=1
- We updated our kernel config and postgresql.conf to set shared_buffers to about 8000.
- We disabled hyperthreading in the BIOS, which had a label like "Logical Processors?: Disabled".

I recall there was a tweak my co-worker made that's not on my list. I realize it's not particularly scientific because we changed several things at once... but at least it is working well enough for now.

Mark
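Consolidating the settings described above into the two files involved (a sketch only; the shared_buffers value is the rough figure from the post, measured in 8 kB pages, so 8000 is about 64 MB):

```
# /etc/sysctl.conf (FreeBSD): back SysV shared memory with raw RAM
kern.ipc.shm_use_phys=1

# postgresql.conf: approximately the value described above
shared_buffers = 8000        # 8000 x 8kB pages, roughly 64 MB
```

The kernel config change mentioned alongside this would be raising the SHMMAX/SHMALL limits so the larger shared_buffers segment can be allocated.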
[PERFORM] Most effective tuning choices for busy website?
Hello,

I'm the fellow who was interviewed in the fall about using PostgreSQL on 1-800-Save-A-Pet.com: http://techdocs.postgresql.org/techdocs/interview-stosberg.php

The site traffic continues to grow, and we are now seeing parts of the day where the CPU load (according to MRTG graphs) on the database server is stuck at 100%. I would like to improve this, and I'm not sure where to look first. The machine is a dedicated PostgreSQL server which two web server boxes talk to.

I've used PQA to analyze my queries and am happy overall with how they are running. About 55% of the query time is going to variations of the pet searching query, which seems like where it should be going. The query is frequent and complex. It has already been combed over for appropriate indexing.

I'm more interested at this point in tuning the software and hardware infrastructure, but would like to get a sense about which choices will bring the greatest reward. Let me explain some avenues I'm considering:

- We are currently running 7.4. If I upgrade to 8.0 and DBD::Pg 1.42, then the server-side prepare feature will be available for use. We do run the same queries a number of times.
- PhpPgAds seems to be sucking up about 7.5% of our query time and is unrelated to the core application. We could move this work to another machine. The queries it generates seem like they have some room to be optimized, or simply don't need to be run in some cases. However, I would like to stay out of modifying third-party code and PHP if possible.
- I saw the hardware tip to "Separate the Transaction Log from the Database." We have about 60% SELECT statements and 14% UPDATE statements. Focusing more on SELECT performance seems more important for us.
- We have tried to tune 'shared_buffers' some, but haven't seen a noticeable performance improvement.

Our hardware: Dual 3 GHz processors, 3 GB RAM, running on FreeBSD.
I'm not quite sure how to check our average connection usage, but maybe this is helpful: when I do

  select count(*) from pg_stat_activity;

I get values around 170. We have these values:

  max_connections = 400
  shared_buffers = 4096

Most other values in postgresql.conf are still at their defaults.

Any suggestions as to which avenues might offer the most bang for the buck are appreciated! (I have already found http://www.powerpostgresql.com/PerfList/ and it has been a very helpful source of suggestions.)

Mark
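A raw count from pg_stat_activity includes idle backends. One way to separate the two, sketched for the 7.4/8.0-era view (where idle backends report the literal string '&lt;IDLE&gt;' as their current_query, and current_query is only populated when stats_command_string is on):

```sql
-- Total backends vs. backends actually executing a statement.
-- Assumes stats_command_string = on; otherwise current_query is empty.
SELECT count(*) AS total_connections,
       sum(CASE WHEN current_query <> '<IDLE>' THEN 1 ELSE 0 END) AS active_connections
FROM pg_stat_activity;
```

Sampling this periodically (e.g. from cron) would give a picture of average versus peak usage, which helps decide whether max_connections = 400 is oversized.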