Re: [PERFORM] Linux Filesystems again - Ubuntu this time
Mark Kirkwood mark.kirkw...@catalyst.net.nz wrote: Also xfs has seen quite a bit of development in these later kernels, any thoughts on that? We've been using xfs for a few years now with good performance and no problems other than needing to disable write barriers to get good performance out of our battery-backed RAID adapter. -Kevin -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Linux Filesystems again - Ubuntu this time
Kevin, While we're on the topic, do you also disable fsync? We use xfs with battery-backed RAID as well. We have had no issues with xfs. I'm curious whether anyone can comment on their experience (good or bad) using xfs/battery-backed-cache/fsync=off. Thanks, Whit On Tue, Jul 27, 2010 at 9:48 AM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: Mark Kirkwood mark.kirkw...@catalyst.net.nz wrote: Also xfs has seen quite a bit of development in these later kernels, any thoughts on that? We've been using xfs for a few years now with good performance and no problems other than needing to disable write barriers to get good performance out of our battery-backed RAID adapter. -Kevin
Re: [PERFORM] Linux Filesystems again - Ubuntu this time
Whit Armstrong armstrong.w...@gmail.com wrote: While we're on the topic, do you also disable fsync? We only disable fsync during bulk loads, where we would be starting over anyway if there was a failure. Basically, you should never use fsync unless you are OK with losing everything in the database server if you have an OS or hardware failure. We have a few databases where we would consider that if performance wasn't otherwise acceptable, since they are consolidated replicas of off-site source databases, and we have four identical ones in two separate buildings; however, since performance is good with fsync on and it would be a bother to have to copy from one of the other servers in the event of an OS crash, we leave it on. -Kevin
Re: [PERFORM] Linux Filesystems again - Ubuntu this time
Kevin Grittner kevin.gritt...@wicourts.gov wrote: Basically, you should never use fsync unless you are OK with losing everything in the database server if you have an OS or hardware failure. s/use/disable/ -Kevin
Re: [PERFORM] Linux Filesystems again - Ubuntu this time
Thanks. But there is no such risk to turning off write barriers? I'm only specifying noatime for xfs at the moment. Did you get a substantial performance boost from disabling write barriers? Like 10x, or more like 2x? Thanks, Whit On Tue, Jul 27, 2010 at 1:19 PM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: Kevin Grittner kevin.gritt...@wicourts.gov wrote: Basically, you should never use fsync unless you are OK with losing everything in the database server if you have an OS or hardware failure. s/use/disable/ -Kevin
Re: [PERFORM] potential performance gain by query planner optimization
Kneringer, Armin armin.knerin...@fabasoft.com writes: I think I found a potential performance gain if the query planner would be optimized. All tests have been performed with 8.4.1 (and earlier versions) on CentOS 5.3 (x64). The following query will run on my database (~250 GB) for ca. 1600 seconds and the sort will result in a disk merge deploying ca. 200 GB of data to the local disk (ca. 180,000 tmp-files). What have you got work_mem set to? It looks like you must be using an unreasonably large value, else the planner wouldn't have tried to use a hash join here:

->  Hash  (cost=11917516.57..11917516.57 rows=55006045159 width=16)
      ->  Nested Loop  (cost=0.00..11917516.57 rows=55006045159 width=16)
            ->  Seq Scan on atdateval t5  (cost=0.00..294152.40 rows=1859934 width=12)
                  Filter: (attrid = 281479288456447::bigint)
            ->  Index Scan using ind_ataggval on ataggval q1_1  (cost=0.00..6.20 rows=4 width=12)
                  Index Cond: ((q1_1.attrid = 281479288456451::bigint) AND (q1_1.aggval = t5.aggrid))
                  Filter: (q1_1.aggrid = 0)

Also, please try something newer than 8.4.1 --- this might be some already-fixed bug. regards, tom lane
Re: [PERFORM] Linux Filesystems again - Ubuntu this time
Whit Armstrong armstrong.w...@gmail.com wrote: But there is no such risk to turning off write barriers? Supposedly not: http://xfs.org/index.php/XFS_FAQ#Q._Should_barriers_be_enabled_with_storage_which_has_a_persistent_write_cache.3F Did you get a substantial performance boost from disabling write barriers? Like 10x, or more like 2x? It made a huge difference on creation and deletion of disk files. Unfortunately we have some procedures which use a cursor and loop through rows calling a function which creates and drops a temporary table. While I would like to see those transactions rewritten to use sane techniques, they run fast enough without the write barriers to be acceptable to the users, which puts the issue pretty low on the priority list. I don't have the numbers anymore, but I'm sure it was closer to 100 times slower than 10 times. In some workloads you might not notice the difference, although I would watch out for checkpoint behavior. -Kevin
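For context, the barrier setting being discussed here is an xfs mount option; a sketch of what such an fstab entry might look like (the device and mount point are placeholders, and nobarrier is only safe when the controller really does have a persistent, battery-backed write cache):

```
# /etc/fstab -- example xfs mount for a PostgreSQL data directory
# nobarrier disables write barriers; do this ONLY with a battery-backed cache
/dev/sdb1  /var/lib/pgsql  xfs  noatime,nobarrier  0 0
```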
Re: [PERFORM] how to handle a big table for data log
Let me make my problem clearer. Here is a requirement to log data from a set of objects consistently. For example, the object may be a mobile phone and it will report its location every 30s. To record its historical trace, I create a table like

CREATE TABLE log_table ( id integer NOT NULL, data_type integer NOT NULL, data_value double precision, ts timestamp with time zone NOT NULL, CONSTRAINT log_table_pkey PRIMARY KEY (id, data_type, ts) );

In my location log example, the field data_type could be longitude or latitude. I create a primary key (id, data_type, ts) to make my queries more efficient. The major type of queries would ask the latest data_value of a data_type for a given id and timestamp. For this kind of query, I make the following SQL statement:

SELECT * FROM log_table WHERE id=[given id] and data_type='longitude' and (ts = (SELECT max(ts) FROM log_table WHERE id=[given id] and data_type='longitude' and ts <= [given timestamp]));

According to my evaluation, its performance is acceptable. However, I am more concerned about the performance of the insert operation. As I have mentioned, the log_table is growing, so I decided to partition it. Currently, I partition it by date and only keep 60 days. This partitioning is helpful. But when I partition it by data_type (in my case, the number of data_types is limited, say 10), the performance of the insert operation is degraded. I guess this is caused by multiple vacuum/analyze runs on these partitioned data_type log tables. However, if I put all data_type logs together, I can expect that insert performance will also degrade if I want to expand the system to support more mobile phones or more data_types. This is my current situation. Please give me some hints to improve the performance (especially for the insert part). kuopo. On Mon, Jul 19, 2010 at 11:37 PM, Jorge Montero jorge_mont...@homedecorators.com wrote: Large tables, by themselves, are not necessarily a problem. 
The problem is what you might be trying to do with them. Depending on the operations you are trying to do, partitioning the table might help performance or make it worse. What kind of queries are you running? How many days of history are you keeping? Could you post an EXPLAIN ANALYZE output of a query that is being problematic? Given the amount of data you hint at, your server configuration and custom statistics targets for the big tables in question would be useful. kuopo sp...@cs.nctu.edu.tw 7/19/2010 1:27 AM Hi, I have a situation to handle a log table which would accumulate a large amount of logs. This table only involves insert and query operations. To limit the table size, I tried to split this table by date. However, the number of the logs is still large (46 million records per day). To further limit its size, I tried to split this log table by log type. However, this action does not improve the performance. It is much slower than the big-table solution. I guess this is because I need to pay more cost for the auto-vacuum/analyze on all the split tables. Can anyone comment on this situation? Thanks in advance. kuopo.
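The date-based split kuopo describes is usually implemented on 8.4 with inheritance partitioning: one child table per day carrying a CHECK constraint, plus a trigger that redirects inserts. A minimal sketch under those assumptions (table, function, and trigger names are made up for illustration, and the dynamic INSERT is untested):

```sql
-- One child per day; the CHECK constraint lets constraint_exclusion
-- skip irrelevant partitions at query time.
CREATE TABLE log_table_20100727 (
    CHECK (ts >= '2010-07-27' AND ts < '2010-07-28')
) INHERITS (log_table);

-- Route inserts on the parent to the matching daily child.
CREATE OR REPLACE FUNCTION log_table_insert() RETURNS trigger AS $$
BEGIN
    EXECUTE 'INSERT INTO log_table_' || to_char(NEW.ts, 'YYYYMMDD')
         || ' SELECT ($1).*' USING NEW;
    RETURN NULL;  -- the row has already been stored in the child
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER log_table_insert_trig
    BEFORE INSERT ON log_table
    FOR EACH ROW EXECUTE PROCEDURE log_table_insert();
```

Expiring the 60-day window then becomes a cheap DROP TABLE of the oldest child instead of a bulk DELETE plus vacuum.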
Re: [PERFORM] Pooling in Core WAS: Need help in performance tuning.
On Fri, 2010-07-23 at 09:52 -0700, Joshua D. Drake wrote: On Thu, 2010-07-22 at 20:56 +0100, Hannu Krosing wrote: Let's extend this shall we: Avoid adding yet another network hop postgreSQL is multi-process, so you either have a separate pooler process or need to put pooler functionality in postmaster; both ways you still have a two-hop scenario for connect. You may be able to pass the socket to the child process and also keep it, but doing this for both client and db sides seems really convoluted. Which means, right now there is three hops. Reducing one is good. No, it is still two, as postmaster passes the socket to the spawned child postgresql process after login. The process is as follows:

Client --connects--> postmaster --spawns--> postgreSQL server process

then the socket is passed to be used directly, so the use is

Client --talks-to--> postgreSQL server process

When using a pooler it becomes

Client --connects-to--> Pooler --passes-requests-to--> postgreSQL

I see no way to have the pooler select the postgreSQL process, pass the client connection in a way that talks directly to the postgreSQL server process AND be able to get the server connection back once the client is finished with either the request, transaction or connection (depending on pooling mode). Or is there a portable way to pass sockets back and forth between parent and child processes? If so, then pgbouncer could use it as well. Remove of a point of failure rather move the point of failure from external pooler to internal pooler ;) Yes but at that point, it doesn't matter. Reduction of administrative overhead Possibly. But once you start actually using it, you still need to configure and monitor it and do other administrator-y tasks. Yes, but it is inclusive. Integration into our core authentication mechanisms True, although for example having SSL on the client-side connection will be so slow that it hides any performance gains from pooling, at least for short-lived connections. 
Yes, but right now you can't use *any* pooler with LDAP for example. We could if pooling was in core. Your SSL argument doesn't really work because it's true with or without pooling. As the main slowdown in SSL is connection setup, you can get the network security and pooling speedup if you run the pool on the client side and make the pooler-server connection over SSL. Greater flexibility in connection control Yes, poolers can be much more flexible than default postgresql. See for example pgbouncer's PAUSE, RECONFIGURE and RESUME commands :D And, having connection pooling in core does not eliminate the use of an external pool where it makes sense. Probably the easiest way to achieve pooling in core would be adding an option to start pgbouncer under postmaster control. Yeah but that won't happen. I guess it could happen as part of opening up the postmaster-controlled process part to be configurable and able to run third-party stuff. Another thing to run under postmaster control would be pgqd. Also I think we may have a libevent dependency that we have to work out. You probably can't get much leaner than pgbouncer. Oh don't get me wrong. I love pgbouncer. It is my recommended pooler, but even it has limitations (such as auth). As pgbouncer is single-threaded and the main goal has been performance, there is not much enthusiasm about having _any_ auth method included which can't be completed in a few CPU cycles. It may be possible to add threads to wait for LDAP/Kerberos/... responses or do SSL handshakes, but I have not seen any interest from Marko to do it himself. Maybe there is a way to modularise the auth part of postmaster in a way that could be used from third-party products through some nice API which a postmaster-controlled pgbouncer could start using. 
-- Hannu Krosing http://www.2ndQuadrant.com PostgreSQL Scalability and Availability Services, Consulting and Training
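On Hannu's question about a portable way to pass sockets between processes: on Unix systems, file descriptors can be handed over a Unix-domain socket via SCM_RIGHTS ancillary data, which is the mechanism any postmaster/pooler handoff would presumably build on. A minimal, illustrative sketch (not PostgreSQL or pgbouncer code; all names here are made up):

```python
import array
import os
import socket

def send_fd(sock, fd):
    # One byte of real payload, plus the descriptor as SCM_RIGHTS data.
    sock.sendmsg([b"x"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS,
                           array.array("i", [fd]))])

def recv_fd(sock):
    # Receive the payload byte and extract the duplicated descriptor.
    fds = array.array("i")
    msg, ancdata, flags, addr = sock.recvmsg(1, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            fds.frombytes(data[:fds.itemsize])
            return fds[0]
    raise RuntimeError("no descriptor received")

# Demo: pass the read end of a pipe across a socketpair.
left, right = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
read_end, write_end = os.pipe()
send_fd(left, read_end)
received = recv_fd(right)
os.write(write_end, b"hello")
print(os.read(received, 5))  # the passed descriptor sees the pipe's data
```

In a real pooler the two ends of the socketpair would live in different processes, which SCM_RIGHTS supports equally well; getting the server connection *back*, as Hannu notes, would require a second handoff in the other direction.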
Re: [PERFORM] Pooling in Core WAS: Need help in performance tuning.
On Sat, 2010-07-24 at 14:36 +0800, Craig Ringer wrote: On 24/07/10 13:23, Greg Smith wrote: Joshua Tolley wrote: Relatively minor, but it would be convenient to avoid having to query $external_pooler to determine the client_addr of an incoming connection. You suggest this as a minor concern, but I consider it to be one of the most compelling arguments in favor of in-core pooling. A constant pain with external poolers is the need to then combine two sources of data in order to track connections fully, which is something that everyone runs into eventually and finds annoying. It's one of the few things that doesn't go away no matter how much fiddling you do with pgBouncer; it's always getting in the way a bit. And it seems to seriously bother systems administrators and developers, not just the DBAs. Putting a pooler in core won't inherently fix this, and won't remove the need to solve it for cases where the pooler can't be on the same machine. 9.0 has application_name to let apps identify themselves. Perhaps a pooled_client_ip, to be set by a pooler rather than the app, could be added to address this problem in a way that can be used by all poolers new and existing, not just any new in-core pooling system. If a privileged set of pooler functions was considered, as per my other recent mail, the pooler could use a management connection to set the client ip before handing the connection to the client, so the client couldn't change pooled_client_ip itself by accident or through malice. But even without that, it'd be awfully handy. 
Or maybe we can add some command extensions to the protocol for passing extra info, so that instead of sending just the (run_query:query) command over the socket we could send both the extra info and execute (set_params:(proxy_client_ip:a.b.c.d)(proxy_client_port:n)(something else))(run_query:query) in one packet (for performance) and have these things be available in logging and pg_stat_activity. I see no need to try to somehow restrict these if you can always be sure that they are set by the direct client. The proxy can decide to pass some of these from the real client, but it would be a decision made by the proxy, not mandated by some proxying rules. -- Hannu Krosing http://www.2ndQuadrant.com PostgreSQL Scalability and Availability Services, Consulting and Training
[PERFORM] Slow query using the Cube contrib module.
Hello, I have a simple table which has a cube column and a cube GiST index. The column contains 3-dimensional points (not cubes or intervals). The problem is that I'm getting very slow queries when I'm using the index. The table has about 130,000 rows and is full-vacuumed after any updates/inserts (it is updated only once every 24 hours). Table definition:

CREATE TABLE picof.photo_colors ( photo_id integer NOT NULL, color_percent real NOT NULL, lab_color picof.cube NOT NULL ) WITH ( OIDS=FALSE ); CREATE INDEX photo_colors_index ON picof.photo_colors USING gist (lab_color);

My query:

SELECT photo_id FROM photo_colors WHERE lab_color <@ cube_enlarge('0, 0, 0', 10, 3)

Explain analyze:

Bitmap Heap Scan on photo_colors  (cost=13.40..421.55 rows=135 width=4) (actual time=7.958..15.493 rows=14313 loops=1)
  Recheck Cond: (lab_color <@ '(-10, -10, -10),(10, 10, 10)'::cube)
  ->  Bitmap Index Scan on photo_colors_index  (cost=0.00..13.36 rows=135 width=0) (actual time=7.556..7.556 rows=14313 loops=1)
        Index Cond: (lab_color <@ '(-10, -10, -10),(10, 10, 10)'::cube)
Total runtime: 16.849 ms

(Executed in PostgreSQL 8.4.4 on Windows and CentOS - same query plan) Now, while it might not seem much, this is part of a bigger query in which several such subqueries are being joined. The cost really adds up. My question is: Why is it doing a Bitmap Heap Scan / Recheck Cond? I've executed dozens of such queries and not once did the rechecking remove any rows. Is there any way to disable it, or do you have any other suggestions for optimizations (because I'm all out of ideas)? Thank you in advance! --- Liviu Mirea
Re: [PERFORM] Pooling in Core WAS: Need help in performance tuning.
On Sat, Jul 24, 2010 at 2:23 AM, Craig Ringer cr...@postnewspapers.com.au wrote: On 24/07/10 01:28, Robert Haas wrote: Well, if we could change the backends so that they could fully reinitialize themselves (disconnect from a database to which they are bound, etc.), I don't see why we couldn't use the Apache approach. This would offer the bonus on the side that it'd be more practical to implement database changes for a connection, akin to MySQL's USE. Inefficient, sure, but possible. Yep. I don't care about that current limitation very much. I think anyone changing databases all the time probably has the wrong design and should be using schemas. I'm sure there are times it'd be good to be able to switch databases on one connection, though. I pretty much agree with this. I think this is merely slightly nice on its own, but I think it might be a building-block to other things that we might want to do down the road. Markus Wanner's Postgres-R replication uses worker processes; autovacuum does as well; and then there's parallel query. I can't help thinking that not needing to fork a new backend every time you want to connect to a new database has got to be useful. My question with all this remains: is it worth the effort when external poolers already solve the problem? Whether it's worth the effort is something anyone who is thinking about working on this will have to decide for themselves. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Re: [PERFORM] Testing Sandforce SSD
On Mon, 2010-07-26 at 14:34 -0400, Greg Smith wrote: Matthew Wakeling wrote: Yeb also made the point - there are far too many points on that graph to really tell what the average latency is. It'd be instructive to have a few figures, like only x% of requests took longer than y. Average latency is the inverse of TPS. So if the result is, say, 1200 TPS, that means the average latency is 1 / (1200 transactions/second) = 0.83 milliseconds/transaction. This is probably only true if you run all transactions sequentially in one connection? If you run 10 parallel threads and get 1200 TPS, the average transaction time (latency?) is probably closer to 8.3 ms? The average TPS figure is normally on a more useful scale as far as being able to compare them in ways that make sense to people. pgbench-tools derives average, worst-case, and 90th percentile figures for latency from the logs. I have 37MB worth of graphs from a system showing how all this typically works for regular hard drives I've been given permission to publish; I just need to find a place to host them internally and I'll make the whole stack available to the world. So far Yeb's data is showing that a single SSD is competitive with a small array on average, but with better worst-case behavior than I'm used to seeing. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support g...@2ndquadrant.com www.2ndQuadrant.us -- Hannu Krosing http://www.2ndQuadrant.com PostgreSQL Scalability and Availability Services, Consulting and Training
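The arithmetic Greg and Hannu are trading can be written down in one line: with C concurrent clients each waiting on its own transaction, average latency is roughly C / TPS. A quick sketch (the function name is made up for illustration):

```python
def avg_latency_ms(tps, clients=1):
    """Approximate average transaction latency, in milliseconds,
    for `clients` concurrent connections sustaining `tps` combined."""
    return clients / tps * 1000.0

# One sequential connection at 1200 TPS: about 0.83 ms per transaction.
print(round(avg_latency_ms(1200), 2))
# Ten parallel clients at the same aggregate 1200 TPS: about 8.33 ms each.
print(round(avg_latency_ms(1200, clients=10), 2))
```

This is why an aggregate TPS number alone, without the client count, says little about the latency an individual client observed.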
Re: [PERFORM] Big difference in time returned by EXPLAIN ANALYZE SELECT ... AND SELECT ...
Piotr Gasidło wrote: EXPLAIN ANALYZE SELECT ... Total runtime: 4.782 ms Time: 25,970 ms Vitalii Tymchyshyn wrote: Actually it's 20ms, so I suspect your point about planning time is correct. Craig Ringer wrote: Oh, a commas-as-fraction-separator locale. That makes sense. Thanks for the catch. Strangely, the runtime is shown with a period for the separator, though. -- Lew
Re: [PERFORM] how to handle a big table for data log
On Tue, Jul 20, 2010 at 9:51 PM, kuopo sp...@cs.nctu.edu.tw wrote: Let me make my problem clearer. Here is a requirement to log data from a set of objects consistently. For example, the object may be a mobile phone and it will report its location every 30s. To record its historical trace, I create a table like CREATE TABLE log_table ( id integer NOT NULL, data_type integer NOT NULL, data_value double precision, ts timestamp with time zone NOT NULL, CONSTRAINT log_table_pkey PRIMARY KEY (id, data_type, ts) ); In my location log example, the field data_type could be longitude or latitude. Many moons ago I witnessed GridSQL in action managing a massive database log table. From memory, the configuration was 4 database servers with a cumulative 500M+ records, and queries were running under 5ms. May be worth a look. http://www.enterprisedb.com/community/projects/gridsql.do Greg
Re: [PERFORM] Slow query using the Cube contrib module.
Liviu Mirea-Ghiban wrote: My question is: Why is it doing a Bitmap Heap Scan / Recheck Cond? I've executed dozens of such queries and not once did the rechecking remove any rows. Is there any way to disable it, or do you have any other suggestions for optimizations (because I'm all out of ideas)? It's probably because the index nodes store data values with a lossy compression, which means that the index scan returns more rows than wanted, and that in turn is filtered out by the rechecking. See the comments for the 'RECHECK' parameter of CREATE OPERATOR CLASS (http://www.postgresql.org/docs/8.4/static/sql-createopclass.html). It's unwise to alter this behaviour without taking a look at, or modifying, the underlying index implementation. The GiST index scan part could perhaps be made a bit faster by using a smaller blocksize, but I'm not sure if or how the recheck part can be improved. Maybe rewriting the top query to not do bitmap heap scans in subqueries or inner loops? regards, Yeb Havinga
Re: [PERFORM] Pooling in Core WAS: Need help in performance tuning.
Robert Haas robertmh...@gmail.com writes: On Thu, Jul 22, 2010 at 5:29 PM, Andres Freund and...@anarazel.de wrote: The problem is harder for us because a backend can't switch identities once it's been assigned to a database. I haven't heard an adequate explanation of why that couldn't be changed, though. Possibly it might decrease the performance significantly enough by reducing the cache locality (syscache, prepared plans)? Those things are backend-local. The worst case scenario is you've got to flush them all when you reinitialize, in which case you still save the overhead of creating a new process. Flushing them all is not zero-cost; it's not too hard to believe that it could actually be slower than forking a clean new backend. What's much worse, it's not zero-bug. We've got little bitty caches all over the backend, including (no doubt) some caching behavior in third-party code that wouldn't get the word about whatever API you invented to deal with this. regards, tom lane
Re: [PERFORM] how to handle a big table for data log
On 7/20/10 8:51 PM, kuopo wrote: Let me make my problem clearer. Here is a requirement to log data from a set of objects consistently. For example, the object maybe a mobile phone and it will report its location every 30s. To record its historical trace, I create a table like /CREATE TABLE log_table ( id integer NOT NULL, data_type integer NOT NULL, data_value double precision, ts timestamp with time zone NOT NULL, CONSTRAINT log_table_pkey PRIMARY KEY (id, data_type, ts) )/; In my location log example, the field data_type could be longitude or latitude. If what you have is longitude and latitude, why this brain-dead EAV table structure? You're making the table twice as large and half as useful for no particular reason. Use the point datatype instead of anonymizing the data. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Re: [PERFORM] Slow query using the Cube contrib module.
Yeb Havinga yebhavi...@gmail.com writes: Liviu Mirea-Ghiban wrote: My question is: Why is it doing a Bitmap Heap Scan / Recheck Cond? I've executed dozens of such queries and not once did the rechecking remove any rows. Is there any way to disable it, or do you have any other suggestions for optimizations (because I'm all out of ideas)? It's probably because the index nodes store data values with a lossy compression, which means that the index scan returns more rows than wanted, and that in turn is filtered out by the rescanning. The recheck expression is only executed if the index reports that it's not executed the search exactly. If you don't see any difference between the indexscan and bitmapscan output counts, it's probably because the index can handle the case exactly, so the recheck expression isn't really getting used. The planner has to include the expression in the plan anyway, because the decision about lossiness is not known until runtime. But it's not costing any runtime. The OP is mistaken to think there's anything wrong with this plan choice; more than likely, it's the best available plan. The reason there's a significant gap between the indexscan runtime and the bitmapscan runtime is that that's the cost of going and actually fetching all those rows from the table. The only way to fix that is to buy a faster disk or get more RAM so that more of the table can be held in memory. regards, tom lane
[PERFORM] Questions on query planner, join types, and work_mem
I have spent the last couple of weeks digging into a Postgres performance problem that ultimately boiled down to this: the planner was choosing to use hash joins on a set of join keys that were much larger than the configured work_mem. We found we could make the performance much better by either 1) increasing work_mem to 500MB or more, or 2) forcing the planner to choose index-backed nested loops by turning off hash and merge joins as well as bitmap and sequential scans. Now we are trying to decide which of these paths to choose, and asking why the planner doesn't handle this for us. Background: LabKey builds an open source platform for biomedical research data. The platform consists of a tomcat web application and a relational database. We support two databases, Postgres and SQL Server. We started with SQL Server because we were very familiar with it; two of our technical team came from the SQL Server development team. We chose Postgres because we assessed that it was the open source database most likely to be able to handle our application requirements for capacity and complex, nested, generated SQL handling. Postgres is now the default database for our platform and most of our key customers use it. In general we've been very satisfied with Postgres' performance and compatibility, but our customers are starting to hit situations where we really need to be able to understand why a particular operation is slow. We are currently recommending version 8.4 and using that ourselves. The core of the problem query was

SELECT * INTO snapshot_table FROM (SELECT ... FROM tableA A LEFT OUTER JOIN tableB B ON (A.lsid = B.lsid) and A.datasetid = ? ) query1

The join column, lsid, is a poor choice for a join column as it is a long varchar value (avg length 101 characters) that only becomes unique far out on the right-hand side. But we are stuck with this choice. 
I can post the SQL query and table definitions if it will help, but changes to either of those would be risky and difficult, whereas setting the work_mem value or forcing nested loop joins is less risky. The performance curve looks something like this:

Join Type    work_mem (MB)   time to populate snapshot (min)
----------   -------------   -------------------------------
Hash         50              85
Hash         200             38
Hash         400             21
Hash         500             12
Hash         1000            12
NestedLoop   50              15
NestedLoop   200             11
NestedLoop   400             11
NestedLoop   500             10
NestedLoop   1000            10

Table A contains about 3.5 million rows, and table B contains about 4.4 million rows. By looking at the EXPLAIN ANALYZE reports I concluded that the planner seemed to be accurately determining the approximate number of rows returned on each side of the join node. I also noticed that at the work_mem = 50 test, the hash join query execution was using over a GB of space in pgsql_tmp, space that grew and shrank slowly over the course of the test. Now for the questions: 1) If we tell the customer to set his work_mem value to 500MB or 1GB in postgresql.conf, what problems might they see? The documentation and the guidelines we received from Rupinder Singh in support suggest a much lower value, e.g. a max work_mem of 10MB. Other documentation such as the Guide to Posting Slow Query Questions suggests at least testing up to 1GB. What is a reasonable maximum to configure for all connections? 2) How is work_mem used by a query execution? For example, does each hash table in an execution get allocated a full work_mem's worth of memory? Is this memory released when the query is finished, or does it stay attached to the connection or some other object? 3) Is there a reason why the planner doesn't seem to recognize the condition when the hash table won't fit in the current work_mem, and choose a low-memory plan instead? Excuse the long-winded post; I was trying to give the facts and nothing but the facts. Thanks, Peter Hussey LabKey Software
Re: [PERFORM] Questions on query planner, join types, and work_mem
Hi, On Tue, Jul 27, 2010 at 04:08:16PM -0700, Peter Hussey wrote: Now for the questions: 1) If we tell the customer to set his work_mem value to 500MB or 1GB in postgresql.conf, what problems might they see? The documentation and the guidelines we received from Rupinder Singh in support suggest a much lower value, e.g. a max work_mem of 10MB. Other documentation such as the Guide to Posting Slow Query Questions suggests at least testing up to 1GB. What is a reasonable maximum to configure for all connections? Well, that depends on the amount of expected concurrency and available memory. Obviously you can set it much higher in an OLAPish, low-concurrency setting than in an OLTP environment. That setting is quite complex to estimate in my opinion. For one, the actual usage depends on the complexity of the queries; for another, to be halfway safe you have to use avail_mem/(max_connections * max_nodes_of_most_complex_query), which is often a very pessimistic and unusably low estimate. 2) How is work_mem used by a query execution? For example, does each hash table in an execution get allocated a full work_mem's worth of memory? Is this memory released when the query is finished, or does it stay attached to the connection or some other object? Each node of the query can use one work_mem worth of data (sometimes a bit more). The memory is released after the query finishes (or possibly earlier, depending on the structure of the query). The specific allocation pattern and implementation details (of malloc) influence how and when that memory is actually returned to the OS. 3) Is there a reason why the planner doesn't seem to recognize the condition when the hash table won't fit in the current work_mem, and choose a low-memory plan instead? Hard to say without more information. Bad estimates, maybe? Best show your query plan (EXPLAIN ANALYZE), the table definitions, and some details about the hardware (e.g. whether it has 1GB of memory or 256GB). 
Andres -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
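Andres's pessimistic formula is easy to work through with made-up numbers; here 16GB of memory budgeted for query workspace, 100 connections, and at most 5 sort/hash nodes in the worst query (all three figures purely illustrative):

```sql
-- work_mem ~ avail_mem / (max_connections * max_nodes_of_most_complex_query)
SELECT (16 * 1024) / (100 * 5) AS pessimistic_work_mem_mb;  -- about 32 (MB)
```

As he says, this is usually far too conservative, since few queries hit the worst-case node count concurrently.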
Re: [PERFORM] Questions on query planner, join types, and work_mem
Peter Hussey pe...@labkey.com writes: I have spent the last couple of weeks digging into a Postgres performance problem that ultimately boiled down to this: the planner was choosing to use hash joins on a set of join keys that were much larger than the configured work_mem. What Postgres version is this, exactly? (8.4 is not the answer I want.) The join column, lsid, is a poor choice for a join column as it is a long varchar value (avg length 101 characters) that only gets unique way out on the right hand side. Hm, but it is unique eventually? It's not necessarily bad for hashing as long as that's so. 1) If we tell the customer to set his work_mem value to 500MB or 1GB in postgresql.conf, what problems might they see? That would almost certainly be disastrous. If you have to follow the hack-work_mem path, I'd suggest increasing it locally in the session executing the problem query, and only for the duration of that query. Use SET, or even SET LOCAL. 2) How is work_mem used by a query execution? Well, the issue you're hitting is that the executor is dividing the query into batches to keep the size of the in-memory hash table below work_mem. The planner should expect that and estimate the cost of the hash technique appropriately, but seemingly it's failing to do so. Since you didn't provide EXPLAIN ANALYZE output, though, it's hard to be sure. 3) Is there a reason why the planner doesn't seem to recognize the condition when the hash table won't fit in the current work_mem, and choose a low-memory plan instead? That's the question, all right. I wonder if it's got something to do with the wide-varchar nature of the join key ... but again that's just speculation with no facts. Please show us EXPLAIN ANALYZE results for the hash plan with both small and large work_mem, as well as for the nestloop plan. 
regards, tom lane
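A sketch of the SET LOCAL approach Tom suggests, using the problem query from the original post (the query body is elided there as well; '500MB' is just one of the values Peter tested):

```sql
BEGIN;
SET LOCAL work_mem = '500MB';  -- reverts automatically at COMMIT or ROLLBACK
SELECT * INTO snapshot_table FROM
  (SELECT ... FROM tableA A LEFT OUTER JOIN tableB B
     ON (A.lsid = B.lsid) AND A.datasetid = ?) query1;
COMMIT;
```

Unlike a plain SET, SET LOCAL cannot leak the large setting into later work on the same connection.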
Re: [PERFORM] Pooling in Core WAS: Need help in performance tuning.
On Tue, Jul 27, 2010 at 4:40 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Thu, Jul 22, 2010 at 5:29 PM, Andres Freund and...@anarazel.de wrote: The problem is harder for us because a backend can't switch identities once it's been assigned to a database. I haven't heard an adequate explanation of why that couldn't be changed, though. Possibly it might decrease the performance significantly enough by reducing the cache locality (syscache, prepared plans)? Those things are backend-local. The worst case scenario is you've got to flush them all when you reinitialize, in which case you still save the overhead of creating a new process. Flushing them all is not zero-cost; it's not too hard to believe that it could actually be slower than forking a clean new backend. I'm not so sure I believe that. Is a sinval overrun slower than forking a clean new backend? Is DISCARD ALL slower than forking a clean new backend? How much white space is there between either of those and what would be needed here? I guess it could be slower, but I wouldn't want to assume that without evidence. What's much worse, it's not zero-bug. We've got little bitty caches all over the backend, including (no doubt) some caching behavior in third-party code that wouldn't get the word about whatever API you invented to deal with this. I think this is probably the biggest issue with the whole idea, and I agree there would be some pain involved. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Re: [PERFORM] Pooling in Core WAS: Need help in performance tuning.
Robert Haas robertmh...@gmail.com writes: On Tue, Jul 27, 2010 at 4:40 PM, Tom Lane t...@sss.pgh.pa.us wrote: Flushing them all is not zero-cost; it's not too hard to believe that it could actually be slower than forking a clean new backend. I'm not so sure I believe that. I'm not asserting it's true, just suggesting it's entirely possible. Other than the fork() cost itself and whatever authentication activity there might be, practically all the startup cost of a new backend can be seen as cache-populating activities. You'd have to redo all of that, *plus* pay the costs of getting rid of the previous cache entries. Maybe the latter costs less than a fork(), or maybe not. fork() is pretty cheap on modern Unixen. What's much worse, it's not zero-bug. I think this is probably the biggest issue with the whole idea, and I agree there would be some pain involved. Yeah, if it weren't for that I'd say sure let's try it. But I'm afraid we'd be introducing significant headaches in return for a gain that's quite speculative. regards, tom lane
Re: [PERFORM] Pooling in Core WAS: Need help in performance tuning.
On Tue, Jul 27, 2010 at 9:56 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Tue, Jul 27, 2010 at 4:40 PM, Tom Lane t...@sss.pgh.pa.us wrote: Flushing them all is not zero-cost; it's not too hard to believe that it could actually be slower than forking a clean new backend. I'm not so sure I believe that. I'm not asserting it's true, just suggesting it's entirely possible. Other than the fork() cost itself and whatever authentication activity there might be, practically all the startup cost of a new backend can be seen as cache-populating activities. You'd have to redo all of that, *plus* pay the costs of getting rid of the previous cache entries. Maybe the latter costs less than a fork(), or maybe not. fork() is pretty cheap on modern Unixen. What's much worse, it's not zero-bug. I think this is probably the biggest issue with the whole idea, and I agree there would be some pain involved. Yeah, if it weren't for that I'd say sure let's try it. But I'm afraid we'd be introducing significant headaches in return for a gain that's quite speculative. I agree that the gain is minimal in and of itself; after all, how often do you need to switch databases, and what's the big deal if the postmaster has to fork a new backend? Where I see it as a potentially big win is when it comes to things like parallel query. I can't help thinking that's going to be a lot less efficient if you're forever forking new backends. Perhaps the point here is that you'd actually sort of like to NOT flush all those caches unless it turns out that you're switching databases - many installations probably operate with essentially one big database, so chances are good that even if you distributed connections / parallel queries to backends round-robin, you'd potentially save quite a bit of overhead. Of course, for the guy who has TWO big databases, you might hope to be a little smarter, but that's another problem altogether. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Re: [PERFORM] Questions on query planner, join types, and work_mem
Excerpts from Tom Lane's message of mar jul 27 20:05:02 -0400 2010: Peter Hussey pe...@labkey.com writes: 2) How is work_mem used by a query execution? Well, the issue you're hitting is that the executor is dividing the query into batches to keep the size of the in-memory hash table below work_mem. The planner should expect that and estimate the cost of the hash technique appropriately, but seemingly it's failing to do so. Since you didn't provide EXPLAIN ANALYZE output, though, it's hard to be sure. Hmm, I wasn't aware that hash joins worked this way wrt work_mem. Is this visible in the explain output? If it's something subtle (like an increased total cost), may I suggest that it'd be a good idea to make it explicit somehow in the machine-readable outputs?
Re: [PERFORM] Questions on query planner, join types, and work_mem
Hello, the join column, lsid, is a poor choice for a join column as it is a long varchar value (avg length 101 characters) that only gets unique way out on the right hand side. Would a join on a substring of the 'way out on the right hand side' (did you mean 'rightmost characters' or 'only when we take almost all the 101 characters'?) together with a function-based index help? Regards, Jayadevan
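A sketch of what Jayadevan may be suggesting, under the purely hypothetical assumption that the last 20 characters are the part of lsid that distinguishes rows:

```sql
-- Hypothetical: index only the distinguishing tail of the long varchar
CREATE INDEX b_lsid_tail_idx
    ON tableB (substring(lsid from char_length(lsid) - 19));
-- A join could then compare tails first and fall back to the full value:
--   ... ON substring(A.lsid from char_length(A.lsid) - 19)
--        = substring(B.lsid from char_length(B.lsid) - 19)
--      AND A.lsid = B.lsid
```

This keeps both the index entries and the hashed/compared keys short, at the cost of rewriting the join condition.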
Re: [PERFORM] Questions on query planner, join types, and work_mem
Alvaro Herrera alvhe...@commandprompt.com writes: Excerpts from Tom Lane's message of mar jul 27 20:05:02 -0400 2010: Well, the issue you're hitting is that the executor is dividing the query into batches to keep the size of the in-memory hash table below work_mem. The planner should expect that and estimate the cost of the hash technique appropriately, but seemingly it's failing to do so. Hmm, I wasn't aware that hash joins worked this way wrt work_mem. Is this visible in the explain output? As of 9.0, any significant difference between Hash Batches and Original Hash Batches would be a cue that the planner blew the estimate. For Peter's problem, we're just going to have to look to see if the estimated cost changes in a sane way between the small-work_mem and large-work_mem cases. regards, tom lane
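For readers on 9.0, the cue Tom describes shows up in the text EXPLAIN ANALYZE output roughly like this (all numbers invented for illustration, and the exact formatting is approximate):

```sql
EXPLAIN (ANALYZE)
SELECT ... FROM tableA A LEFT OUTER JOIN tableB B ON (A.lsid = B.lsid);
-- Look at the Hash node's detail line in the output, e.g.:
--   ->  Hash  (cost=... rows=4400000 width=...) (actual time=... rows=...)
--         Buckets: 4096  Batches: 64 (originally 16)  Memory Usage: ...kB
-- "Batches" much larger than "originally" means the executor had to split
-- the hash table far more than the planner anticipated.
```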
Re: [PERFORM] Pooling in Core WAS: Need help in performance tuning.
Robert Haas robertmh...@gmail.com writes: On Tue, Jul 27, 2010 at 9:56 PM, Tom Lane t...@sss.pgh.pa.us wrote: Other than the fork() cost itself and whatever authentication activity there might be, practically all the startup cost of a new backend can be seen as cache-populating activities. You'd have to redo all of that, *plus* pay the costs of getting rid of the previous cache entries. Maybe the latter costs less than a fork(), or maybe not. fork() is pretty cheap on modern Unixen. I agree that the gain is minimal of itself; after all, how often do you need to switch databases, and what's the big deal if the postmaster has to fork a new backend? Where I see it as a potentially big win is when it comes to things like parallel query. I can't help thinking that's going to be a lot less efficient if you're forever forking new backends. Color me unconvinced. To do parallel queries with pre-existing worker processes, you'd need to hook up with a worker that was (at least) in your own database, and then somehow feed it the query plan that it needs to execute. I'm thinking fork() could easily be cheaper. But obviously this is all speculation (... and Windows is going to be a whole 'nother story in any case ...) regards, tom lane
Re: [PERFORM] Pooling in Core WAS: Need help in performance tuning.
On 28/07/10 04:40, Tom Lane wrote: Robert Haas robertmh...@gmail.com writes: On Thu, Jul 22, 2010 at 5:29 PM, Andres Freund and...@anarazel.de wrote: The problem is harder for us because a backend can't switch identities once it's been assigned to a database. I haven't heard an adequate explanation of why that couldn't be changed, though. Possibly it might decrease the performance significantly enough by reducing the cache locality (syscache, prepared plans)? Those things are backend-local. The worst case scenario is you've got to flush them all when you reinitialize, in which case you still save the overhead of creating a new process. Flushing them all is not zero-cost; it's not too hard to believe that it could actually be slower than forking a clean new backend. What's much worse, it's not zero-bug. We've got little bitty caches all over the backend, including (no doubt) some caching behavior in third-party code that wouldn't get the word about whatever API you invented to deal with this. In addition to caches, there may be some places where memory is just expected to leak. Since it's a one-off allocation, nobody really cares; why bother cleaning it up when it's quicker to just let the OS do it when the backend terminates? Being able to reset a backend for re-use would require that per-session memory use be as neatly managed as per-query and per-transaction memory, with no leaked stragglers left lying around. Such cleanup (and management) has its own costs. Plus, you have a potentially increasingly fragmented memory map to deal with the longer the backend lives. Overall, there are plenty of advantages to letting the OS clean it up. ... however, if the requirement is introduced that a given backend may only be re-used for connections to the same database, lots of things get simpler. You have to be able to change the current user (which would be a bonus anyway), reset GUCs, etc, but how much else is there to do? 
That way you can maintain per-database pools of idle workers (apache prefork style) with ageing-out of backends that're unused. Wouldn't this do the vast majority of what most pools are needed for anyway? And wouldn't it potentially save quite a bit of load by avoiding having backends constantly switching databases, flushing caches, etc? -- Craig Ringer Tech-related writing: http://soapyfrogs.blogspot.com/
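Much of the reset machinery Craig describes already exists at the SQL level, and it is what an external pooler leans on today; a sketch (app_user is a hypothetical role name):

```sql
-- Reset per-session state before handing the backend to a new client:
DISCARD ALL;   -- drops temp tables, deallocates prepared statements, resets GUCs
-- Change the effective user without reconnecting (requires superuser):
SET SESSION AUTHORIZATION app_user;
-- What core still cannot do is re-point an existing backend at a *different*
-- database; that is the gap the per-database pool idea works around.
```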