Re: [PERFORM] How to force Nested Loop plan?
Rob Nagler [EMAIL PROTECTED] writes: Are there plans for explicit hints to the planner? Personally, I'm philosophically opposed to planner hints; see previous discussions in the archives. regards, tom lane ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] How to force Nested Loop plan?
Ron Johnson [EMAIL PROTECTED] writes: How about (if you don't already do it) ranked (or approximately ranked) b-tree indexes, where each node also stores the (approximate) count of tuple pointers under it? This way, the planner would know whether or how skewed a tree is, and (approximately) how many tuples a given WHERE predicate resolves to. Why is that better than our existing implementation of column statistics? regards, tom lane ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [PERFORM] How to force Nested Loop plan?
Rob Nagler [EMAIL PROTECTED] writes: I'm trying to understand how I can get the planner to always do the right thing with this query: SELECT aa_t.min_date_time FROM aa_t , bb_t , cc_t WHERE bb_t.bb_id = aa_t.bb_id AND aa_t.realm_id = cc_t.realm_id AND aa_t.server_id = 21 ORDER BY aa_t.min_date_time desc LIMIT 1 OFFSET 674 - Index Scan Backward using aa_t20 on aa_t (cost=0.00..76738.77 rows=3454 width=46) (actual time=0.10..31.30 rows=676 loops=1) Filter: (server_id = 21::numeric) The reason the planner does not much like this plan is that it's estimating that quite a lot of rows will have to be hit in min_date_time order before it finds enough rows with server_id = 21. Thus the high cost estimate for the above step. I suspect that the reason you like this plan is that there's actually substantial correlation between server_id and min_date_time, such that the required rows are found quickly. Before trying to force the planner into what you consider an optimal plan, you had better ask yourself whether you can expect that correlation to hold up in the future. If not, your plan could become pessimal pretty quickly. I'd suggest creating a double-column index: create index aa_ti on aa_t(server_id, min_date_time); and altering the query to read ORDER BY aa_t.server_id DESC, aa_t.min_date_time DESC (you need this kluge to make sure the planner recognizes that the new index matches the ORDER BY request). Then you should get a plan with a much smaller cost coefficient for this step. regards, tom lane PS: does server_id really need to be NUMERIC? Why not integer, or at worst bigint? ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [PERFORM] How to force Nested Loop plan?
Tom Lane writes: The reason the planner does not much like this plan is that it's estimating that quite a lot of rows will have to be hit in min_date_time order before it finds enough rows with server_id = 21. Thus the high cost estimate for the above step. Thanks for the speedy and useful reply! More questions follow. :) Very interesting. How does it know quite a lot? Is there something I can do to get the planner to analyze the data better? I suspect that the reason you like this plan is that there's actually substantial correlation between server_id and min_date_time, such that the required rows are found quickly. Before trying to force the planner into what you consider an optimal plan, you had better ask yourself whether you can expect that correlation to hold up in the future. If not, your plan could become pessimal pretty quickly. The correlation holds. min_date_time increases over time as records are inserted. server_id is uniformly distributed over time. There's no randomness. There is at least one 21 record for every value of min_date_time. 21 is a special server_id containing aggregate (denormalized) data for the other servers. I thought about putting it in a separate table, but this would complicate the code as the data is identical to the non-aggregated case. Do you have any suggestions for organizing the data/query now that you know this? I'd suggest creating a double-column index: Thanks. I'll try this. I'm a very big fan of declarative programming. However, there's a danger in declarative programming when the interperter isn't smart enough. When I add this index, I will slow down inserts (about 20K/day) and increase data size (this is the second largest table in the database). Moreover, if the planner is improved, I've should fix my code, delete the index, etc. Is there a way of giving the planner direct hints as in Oracle? They can be ignored when the optimizer is improved, just as register is ignored by C compilers nowadays. Adding the extra index and ORDER BY is also not easy in our case. The query is dynamically generated. I can force the query ORDER BY to be whatever I want, but I would lose the ability to do interesting things, like the automatic generation of ORDER BY when someone clicks on a column header in the application. Indeed there are times when people want to sort on other columns in the query. I reduced the problem to the salient details for my post to this board. What if the ORDER BY was: ORDER BY aa_t.server_id DESC, cc_t.name ASC Would the planner do the right thing? PS: does server_id really need to be NUMERIC? Why not integer, or at worst bigint? It is a NUMERIC(18). It could be a bigint. What would be the change in performance of this query if we changed it to bigint? BTW, my customer is probably going to be switching to Oracle. This particular query has been one of the reasons. Maybe this change will help us stay with Postgres. Thanks, Rob ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] How to force Nested Loop plan?
On Sat, 2003-08-30 at 10:47, Rob Nagler wrote: Tom Lane writes: [snip] enough. When I add this index, I will slow down inserts (about 20K/day) and increase data size (this is the second largest table in [snip] Since I gather that this is a web-site, can we presume that they are clumped into an 8 hour range? 20,000/8 = 2,500/hour, which is 41.67/minute. If you can't do .69 inserts/second, something is wrong, and it ain't hardware, and it ain't Postgresql... PS: does server_id really need to be NUMERIC? Why not integer, or at worst bigint? It is a NUMERIC(18). It could be a bigint. What would be the change in performance of this query if we changed it to bigint? http://www.postgresql.org/docs/7.3/static/datatype.html#DATATYPE-INT http://www.postgresql.org/docs/7.3/static/datatype.html#DATATYPE-NUMERIC-DECIMAL Scalars are faster than arbitrary precision types. Small (32 bit) scalars are faster than bit (64 bit) scalars on x86 h/w. -- - Ron Johnson, Jr. [EMAIL PROTECTED] Jefferson, LA USA Adventure is a sign of incompetence Stephanson, great polar explorer ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [PERFORM] How to force Nested Loop plan?
Tom Lane writes: That doesn't really tell me anything. What's the proportion of 21 records out of the total table? Currently we have about 15 servers so 6% of the data is uniformly distributed with the value 21. create index fooi on foo (min_date_time) where server_id = 21; This reduces the cost of maintaining the index but also makes it useful *only* for id = 21 queries. On the plus side, you don't need to hack the ORDER BY clause to get your queries to use it. Your choice... I like that better, thanks. After testing I found the elbow was at 1610 records with this index, but this clause still yields better performance at 1654 records: AND aa_t.server_id IN (21, 0) This is independent of the existence of the new index. Interestingly, when I drop the aa_t5 index, the elbow goes up to 1729 with the IN (21, 0) query. You might ask: why do I have an index at all? That's from my Oracle experience. server_id is a foreign key into the server table. If you don't create an index on a foreign key, Oracle locks the entire foreign table when you modify the local table. With an index, it only locks a row in the foreign table. This causes a major bottleneck, but in this case the server table is static. Therefore, the index is superfluous, and since there are only 16 values, the index should be bitmap index (Oracle speak, sorry, don't know the PG term). Dropping the index probably won't change any of the other queries, I think. Without the aa_t5 index and after the elbow, the Index Scan is replaced with a Seq Scan, which is just about as fast, but still 50 times slower than before the elbow: Limit (cost=34071.30..34071.31 rows=1 width=84) (actual time=5111.14..5111.15 rows=1 loops=1) - Sort (cost=34066.98..34075.61 rows=3454 width=84) (actual time=5108.74..5109.96 rows=1733 loops=1) Sort Key: aa_t.min_date_time - Merge Join (cost=33801.26..33863.98 rows=3454 width=84) (actual time=4868.62..5020.58 rows=41879 loops=1) Merge Cond: (outer.realm_id = inner.realm_id) - Sort (cost=31.64..32.78 rows=455 width=19) (actual time=3.06..3.38 rows=415 loops=1) Sort Key: cc_t.realm_id - Seq Scan on cc_t (cost=0.00..11.55 rows=455 width=19) (actual time=0.05..0.99 rows=455 loops=1) - Sort (cost=33769.63..33778.26 rows=3454 width=65) (actual time=4865.20..4895.28 rows=41879 loops=1) Sort Key: aa_t.realm_id - Merge Join (cost=33296.79..33566.63 rows=3454 width=65) (actual time=4232.52..4541.24 rows=41879 loops=1) Merge Cond: (outer.bb_id = inner.bb_id) - Sort (cost=25216.97..25225.60 rows=3454 width=46) (actual time=3213.53..3243.65 rows=41879 loops=1) Sort Key: aa_t.bb_id - Seq Scan on aa_t (cost=0.00..25013.97 rows=3454 width=46) (actual time=20.07..2986.11 rows=41879 loops=1) Filter: (server_id = 21::numeric) - Sort (cost=8079.83..8184.53 rows=41879 width=19) (actual time=1018.95..1049.37 rows=41879 loops=1) Sort Key: bb_t.bb_id - Seq Scan on bb_t (cost=0.00..4864.79 rows=41879 width=19) (actual time=0.04..810.88 rows=41879 loops=1) Total runtime: 5141.22 msec What I'm not sure is why does it decide to switch modes so early, i.e., at about 5% of the table size or less? It seems that an Index Scan would give better mileage than a Seq Scan for possibly up to 50% of the table in this case. I clearly don't understand the internals, but the elbow seems rather sharp to me. What if the ORDER BY was: ORDER BY aa_t.server_id DESC, cc_t.name ASC Would the planner do the right thing? What do you consider the right thing? cc_t.name doesn't seem connected to this table at all --- or did I miss something? Sorry, this is a red herring. Please ignore. If you've been generically using NUMERIC(n) where you could be using integer or bigint, then I think you've probably paid a high price without knowing it. I don't know what Oracle's cost tradeoffs are for these datatypes, but I can tell you that Postgres's integer types are way faster (and more compact) than our NUMERIC. I'll try to figure out what the price is in our case. I think Oracle does a pretty good job on data compression for NUMERIC. I haven't dealt with a large Postgres database until this one, so I guess it's time to learn. :) We actually have been quite pleased with Postgres's performance without paying much attention to it before this. When we first set up Oracle, we got into all of its parameters pretty heavily. With Postgres, we just tried it and it worked. This is the first query where we ran out of ideas to try. BTW, everybody's help on this list is fantastic. Usually, I can
Re: [PERFORM] How to force Nested Loop plan?
On Sat, 2003-08-30 at 15:42, Rob Nagler wrote: [snip] We actually have been quite pleased with Postgres's performance without paying much attention to it before this. When we first set up Oracle, we got into all of its parameters pretty heavily. With Postgres, we just tried it and it worked. This is the first query where we ran out of ideas to try. Dumb question: given your out-of-the-box satisfaction, could it be that postgresql.conf hasn't been tweaked? -- - Ron Johnson, Jr. [EMAIL PROTECTED] Jefferson, LA USA 484,246 sq mi are needed for 6 billion people to live, 4 persons per lot, in lots that are 60'x150'. That is ~ California, Texas and Missouri. Alternatively, France, Spain and The United Kingdom. ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly