Re: [HACKERS] Evaluation of secondary sort key.
On Mon, Apr 18, 2011 at 6:25 AM, Jesper Krogh jes...@krogh.cc wrote: Getting the value for the first sortkey and carrying on a closure for the rest would mostly (very often) be optimal ? Well that might depend. The input data to the function might be much larger than the output. Consider the, quite common, idiom of: order by case when (complex expresssion) 1 when (complex expression) 2 else 3 It would also enable a select that has to sortkeys to utilize an index that only contains the primary sortkey, which is a huge negative effect of what's being done today. This is a separate problem entirely. It would be nice to have a strategy for ordering that can take advantage of partially ordered results. It's not hard to see how to do the executor side -- it could keep a tuplesort for each group and truncate it when the group changes. As usual the hard part is having the planner figure out *when* to use it. We have a hard enough time calculating ndistinct for individual columns -- this would require having an idea of how many values are present for each major key column. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Evaluation of secondary sort key.
On 2011-04-18 11:00, Greg Stark wrote: On Mon, Apr 18, 2011 at 6:25 AM, Jesper Kroghjes...@krogh.cc wrote: Getting the value for the first sortkey and carrying on a closure for the rest would mostly (very often) be optimal ? Well that might depend. The input data to the function might be much larger than the output. Consider the, quite common, idiom of: order by case when (complex expresssion) 1 when (complex expression) 2 else 3 How come that expression be relevant? There is only one sortkey and no limit, so no matter what it should clearly get the full resultset in all cases. It would also enable a select that has to sortkeys to utilize an index that only contains the primary sortkey, which is a huge negative effect of what's being done today. This is a separate problem entirely. It would be nice to have a strategy for ordering that can take advantage of partially ordered results. It's not hard to see how to do the executor side -- it could keep a tuplesort for each group and truncate it when the group changes. As usual the hard part is having the planner figure out *when* to use it. We have a hard enough time calculating ndistinct for individual columns -- this would require having an idea of how many values are present for each major key column. Yes, as with all other cases it would be hard to get the optimum, but there is also cases where it is straightforward, say when the secondary sort column has an ndistinct of -1 (or similar close to). The current standard assumption is that 2 columns are unrelated, that would also work here. (As good as is does similar places in PG). Jesper -- Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Evaluation of secondary sort key.
On Mon, Apr 18, 2011 at 5:38 PM, Jesper Krogh jes...@krogh.cc wrote: order by case when (complex expresssion) 1 when (complex expression) 2 else 3 How come that expression be relevant? There is only one sortkey and no limit, so no matter what it should clearly get the full resultset in all cases. Sure, imagine there are more order by clauses with this one as the last one. Yes, as with all other cases it would be hard to get the optimum, but there is also cases where it is straightforward, say when the secondary sort column has an ndistinct of -1 (or similar close to). The current standard assumption is that 2 columns are unrelated, that would also work here. (As good as is does similar places in PG). I'm not following what you mean with the secondary column having ndistinct of -1. Actually it seems to me a reasonable low-hanging fruit to reach for would be when the initial column has an ndistinct of -1 which is actually kind of common. A lot of SQL queries end up being written with GROUP BY primary_key, other_column, other_column, other_column just to get those other columns to be queryable. If we implemented the SQL standard dependent columns feature this would be unnecessary but we don't and even if we did people would still build schemas and queries that defeat the optimization. In these cases we probably do have ndistinct -1 for one of the columns and therefore that an index on that column alone would give us almost the right ordering and quite likely exactly the right ordering. If we buffered the outputs for any distinct value and output sorted them if there were multiple rows. It would probably somewhat worse if we guess wrong though. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Evaluation of secondary sort key.
Greg Stark gsst...@mit.edu writes: A lot of SQL queries end up being written with GROUP BY primary_key, other_column, other_column, other_column just to get those other columns to be queryable. If we implemented the SQL standard dependent columns feature this would be unnecessary but we don't We do as of 9.1 ... regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Evaluation of secondary sort key.
Excerpts from Greg Stark's message of lun abr 18 15:47:03 -0300 2011: A lot of SQL queries end up being written with GROUP BY primary_key, other_column, other_column, other_column just to get those other columns to be queryable. If we implemented the SQL standard dependent columns feature this would be unnecessary but we don't and even if we did people would still build schemas and queries that defeat the optimization. Actually we do have that in 9.1. It's a bit more restrictive than really required (there are some more cases we could handle), but AFAIR at least the primary key is handled now. -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Evaluation of secondary sort key.
On 2011-04-09 18:54, Tom Lane wrote: I think that would be a positive disimprovement. The current design guarantees that volatile sort expressions are evaluated exactly once, in the order the rows are read from the data source. There would be no guarantees at all, either as to the number of evaluations or the order in which they happen, if we tried to do evaluation only during the actual sort. If I could only evaluate the rows needed, then that would also be a huge win, below case shifts what definitely shouldn't be a seqscan into one due to a secondary sort key. Another small problem is that any such thing would require carrying along some kind of closure (ie, the expression and all its input values), not just the final sort key value, in tuples being sorted. The ensuing complexity, speed penalty, and increase in data volume to be sorted would be paid by everybody, making this probably a net performance loss when considered across all applications. Getting the value for the first sortkey and carrying on a closure for the rest would mostly (very often) be optimal ? It would also enable a select that has to sortkeys to utilize an index that only contains the primary sortkey, which is a huge negative effect of what's being done today. 2011-04-18 07:12:43.931 testdb=# explain select id from testdb.testtable order by id limit 500; QUERY PLAN Limit (cost=0.00..262.75 rows=500 width=4) - Index Scan using testtable_pkey on testtable (cost=0.00..15015567.84 rows=28573400 width=4) (2 rows) Time: 1.363 ms 2011-04-18 07:12:52.498 testdb=# explain select id from testdb.testtable order by id,name limit 500; QUERY PLAN --- Limit (cost=5165456.70..5165457.95 rows=500 width=35) - Sort (cost=5165456.70..5236890.20 rows=28573400 width=35) Sort Key: id, name - Seq Scan on testtable (cost=0.00..3741675.00 rows=28573400 width=35) (4 rows) Time: 1.420 ms Enabling any users to sort using multiple keys, without ending in Seq Scans somewhere down the line seems to require multi dimension indexes on all combinations of colums -- Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Evaluation of secondary sort key.
This seems like a place where there is room for improvement. 2011-04-09 15:18:08.016 testdb=# select id from test1 where id 3 order by id; id 1 2 (2 rows) Time: 0.328 ms 2011-04-09 15:18:11.936 testdb=# CREATE or Replace FUNCTION testsort(id integer) returns integer as $$ BEGIN perform pg_sleep(id); return id; END; $$ language plpgsql; CREATE FUNCTION Time: 12.349 ms 2011-04-09 15:18:22.138 testdb=# select id from test1 where id 3 order by id,testsort(id); id 1 2 (2 rows) Time: 3001.896 ms It seems strange that there is a need to evaluate testsort(id) at all in this case. -- Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Evaluation of secondary sort key.
On Sat, Apr 09, 2011 at 03:22:14PM +0200, Jesper Krogh wrote: This seems like a place where there is room for improvement. 2011-04-09 15:18:08.016 testdb=# select id from test1 where id 3 order by id; id 1 2 (2 rows) Time: 0.328 ms 2011-04-09 15:18:11.936 testdb=# CREATE or Replace FUNCTION testsort(id integer) returns integer as $$ BEGIN perform pg_sleep(id); return id; END; $$ language plpgsql; CREATE FUNCTION Time: 12.349 ms 2011-04-09 15:18:22.138 testdb=# select id from test1 where id 3 order by id,testsort(id); id 1 2 (2 rows) Time: 3001.896 ms It seems strange that there is a need to evaluate testsort(id) at all in this case. How would PostgreSQL know that sorting by id leaves no ambiguity for the next key to address? Cheers, David. -- David Fetter da...@fetter.org http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Evaluation of secondary sort key.
How would PostgreSQL know that sorting by id leaves no ambiguity for the next key to address? It wouldn't But it could postpone evaluation until ambiguity was actually met. Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Evaluation of secondary sort key.
On Sat, Apr 09, 2011 at 09:17:10AM -0700, David Fetter wrote: 2011-04-09 15:18:22.138 testdb=# select id from test1 where id 3 order by id,testsort(id); id 1 2 (2 rows) Time: 3001.896 ms It seems strange that there is a need to evaluate testsort(id) at all in this case. How would PostgreSQL know that sorting by id leaves no ambiguity for the next key to address? Well, it doesn't know that, but I guess the point is it could wait with evaluating the second key until it needs it. The reason ot works as it does now is that the ORDER BY fields are added as hidden fields to the query, like: select id, /*hidden*/ id, /*hidden*/ testsort(id) from test1 where id 3 order by 2, 3; Here it's obvious why the evaluation happens. To avoid this evaluation would require redoing the way sorts work (I think). Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ Patriotism is when love of your own people comes first; nationalism, when hate for people other than your own comes first. - Charles de Gaulle signature.asc Description: Digital signature
Re: [HACKERS] Evaluation of secondary sort key.
On 09.04.2011 19:17, David Fetter wrote: On Sat, Apr 09, 2011 at 03:22:14PM +0200, Jesper Krogh wrote: This seems like a place where there is room for improvement. 2011-04-09 15:18:08.016 testdb=# select id from test1 where id 3 order by id; id 1 2 (2 rows) Time: 0.328 ms 2011-04-09 15:18:11.936 testdb=# CREATE or Replace FUNCTION testsort(id integer) returns integer as $$ BEGIN perform pg_sleep(id); return id; END; $$ language plpgsql; CREATE FUNCTION Time: 12.349 ms 2011-04-09 15:18:22.138 testdb=# select id from test1 where id 3 order by id,testsort(id); id 1 2 (2 rows) Time: 3001.896 ms It seems strange that there is a need to evaluate testsort(id) at all in this case. How would PostgreSQL know that sorting by id leaves no ambiguity for the next key to address? Presumably there's a primary key constraint on id. This is one of those cases where we could optimize, but then again, there's no reason to write a query like that in the first place. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Evaluation of secondary sort key.
Martijn van Oosterhout klep...@svana.org writes: On Sat, Apr 09, 2011 at 09:17:10AM -0700, David Fetter wrote: It seems strange that there is a need to evaluate testsort(id) at all in this case. How would PostgreSQL know that sorting by id leaves no ambiguity for the next key to address? Well, it doesn't know that, but I guess the point is it could wait with evaluating the second key until it needs it. I think that would be a positive disimprovement. The current design guarantees that volatile sort expressions are evaluated exactly once, in the order the rows are read from the data source. There would be no guarantees at all, either as to the number of evaluations or the order in which they happen, if we tried to do evaluation only during the actual sort. Another small problem is that any such thing would require carrying along some kind of closure (ie, the expression and all its input values), not just the final sort key value, in tuples being sorted. The ensuing complexity, speed penalty, and increase in data volume to be sorted would be paid by everybody, making this probably a net performance loss when considered across all applications. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Evaluation of secondary sort key.
On Sat, Apr 09, 2011 at 07:24:15PM +0300, Heikki Linnakangas wrote: On 09.04.2011 19:17, David Fetter wrote: On Sat, Apr 09, 2011 at 03:22:14PM +0200, Jesper Krogh wrote: This seems like a place where there is room for improvement. 2011-04-09 15:18:08.016 testdb=# select id from test1 where id 3 order by id; id 1 2 (2 rows) Time: 0.328 ms 2011-04-09 15:18:11.936 testdb=# CREATE or Replace FUNCTION testsort(id integer) returns integer as $$ BEGIN perform pg_sleep(id); return id; END; $$ language plpgsql; CREATE FUNCTION Time: 12.349 ms 2011-04-09 15:18:22.138 testdb=# select id from test1 where id 3 order by id,testsort(id); id 1 2 (2 rows) Time: 3001.896 ms It seems strange that there is a need to evaluate testsort(id) at all in this case. How would PostgreSQL know that sorting by id leaves no ambiguity for the next key to address? Presumably there's a primary key constraint on id. This is one of those cases where we could optimize, but then again, there's no reason to write a query like that in the first place. Given the horrors query generators perpetrate, it might be worth dropping provably redundant ORDER BYs on the floor at planning time. Cheers, David. -- David Fetter da...@fetter.org http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Evaluation of secondary sort key.
On 2011-04-09 20:00, David Fetter wrote: Given the horrors query generators perpetrate, it might be worth dropping provably redundant ORDER BYs on the floor at planning time. Well, many people often add a secondary sort-key to their SQL for the only purpose of obtainting a consistent result in the corner-cases where the first sort key is ambiguios. If the first sort-key isn't planned to be supported by an index-scan, then you'll end up calculating the second sortkey for the entire dataset even if you end up doing a limit 100 at the end. You can only deem it redundant if there is a primary key in front. if you have a primary key in front, where as a fix may be really good in cases where you have a n_distinct at or near -1 in pg_stats for the column. Jesper -- Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Evaluation of secondary sort key.
On 2011-04-09 18:54, Tom Lane wrote: I think that would be a positive disimprovement. The current design guarantees that volatile sort expressions are evaluated exactly once, in the order the rows are read from the data source. There would be no guarantees at all, either as to the number of evaluations or the order in which they happen, if we tried to do evaluation only during the actual sort. Another small problem is that any such thing would require carrying along some kind of closure (ie, the expression and all its input values), not just the final sort key value, in tuples being sorted. The ensuing complexity, speed penalty, and increase in data volume to be sorted would be paid by everybody, making this probably a net performance loss when considered across all applications. The current approach gives that: select id from test1 where some clause that matches say 10% random by another index order by sortfunc1(id),sortfunc(2) limit 20; on a table with 100.000 elements will also end up applying both sortfunc1(id) and sortfunc2(id) to all 10.000 elements even though sortfunc2(id) might only brings value to a very few amount of tuples (the ones needed as secondary sortkeys for top20 within the dataset). It might be worth noting in the manual, that if at all possible you should stuff the sortfunc2(id) into the table as a column (perhaps computed by a before trigger), since it might actully be evaluated way more often than you anticipated. Thanks a lot for the insight. Jesper -- Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers