Re: [HACKERS] Evaluation of secondary sort key.

2011-04-18 Thread Greg Stark
On Mon, Apr 18, 2011 at 6:25 AM, Jesper Krogh jes...@krogh.cc wrote:
 Getting the value for the first sortkey and carrying on a closure
 for the rest would mostly (very often) be optimal ?

Well that might depend. The input data to the function might be much
larger than the output. Consider the, quite common, idiom of:

order by case when (complex expresssion) 1 when (complex expression) 2 else 3

 It would also enable a select that has to sortkeys to utilize an
 index that only contains the primary sortkey, which is a huge
 negative effect of what's being done today.

This is a separate problem entirely. It would be nice to have a
strategy for ordering that can take advantage of partially ordered
results. It's not hard to see how to do the executor side -- it could
keep a tuplesort for each group and truncate it when the group
changes. As usual the hard part is having the planner figure out
*when* to use it. We have a hard enough time calculating ndistinct for
individual columns -- this would require having an idea of how many
values are present for each major key column.




-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Evaluation of secondary sort key.

2011-04-18 Thread Jesper Krogh

On 2011-04-18 11:00, Greg Stark wrote:

On Mon, Apr 18, 2011 at 6:25 AM, Jesper Kroghjes...@krogh.cc  wrote:

Getting the value for the first sortkey and carrying on a closure
for the rest would mostly (very often) be optimal ?

Well that might depend. The input data to the function might be much
larger than the output. Consider the, quite common, idiom of:

order by case when (complex expresssion) 1 when (complex expression) 2 else 3


How come that expression be relevant? There is only one sortkey and no
limit, so no matter what it should clearly get the full resultset in all 
cases.



It would also enable a select that has to sortkeys to utilize an
index that only contains the primary sortkey, which is a huge
negative effect of what's being done today.

This is a separate problem entirely. It would be nice to have a
strategy for ordering that can take advantage of partially ordered
results. It's not hard to see how to do the executor side -- it could
keep a tuplesort for each group and truncate it when the group
changes. As usual the hard part is having the planner figure out
*when* to use it. We have a hard enough time calculating ndistinct for
individual columns -- this would require having an idea of how many
values are present for each major key column.


Yes, as with all other cases it would be hard to get the optimum, but
there is also cases where it is straightforward, say when the secondary
sort column has an ndistinct of -1 (or similar close to). The current 
standard
assumption is that 2 columns are unrelated, that would also work here. 
(As good as is

does similar places in PG).

Jesper

--
Jesper

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Evaluation of secondary sort key.

2011-04-18 Thread Greg Stark
On Mon, Apr 18, 2011 at 5:38 PM, Jesper Krogh jes...@krogh.cc wrote:
 order by case when (complex expresssion) 1 when (complex expression) 2
 else 3

 How come that expression be relevant? There is only one sortkey and no
 limit, so no matter what it should clearly get the full resultset in all
 cases.

Sure, imagine there are more order by clauses with this one as the last one.


 Yes, as with all other cases it would be hard to get the optimum, but
 there is also cases where it is straightforward, say when the secondary
 sort column has an ndistinct of -1 (or similar close to). The current
 standard
 assumption is that 2 columns are unrelated, that would also work here. (As
 good as is
 does similar places in PG).

I'm not following what you mean with the secondary column having
ndistinct of -1. Actually it seems to me a reasonable low-hanging
fruit to reach for would be when the initial column has an ndistinct
of -1 which is actually kind of common.

A lot of SQL queries end up being written with GROUP BY primary_key,
other_column, other_column, other_column just to get those other
columns to be queryable. If we implemented the SQL standard
dependent columns feature this would be unnecessary but we don't and
even if we did people would still build schemas and queries that
defeat the optimization.

In these cases we probably do have ndistinct -1 for one of the columns
and therefore that an index on that column alone would give us almost
the right ordering and quite likely exactly the right ordering. If we
buffered the outputs for any distinct value and output sorted them if
there were multiple rows. It would probably somewhat worse if we guess
wrong though.

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Evaluation of secondary sort key.

2011-04-18 Thread Tom Lane
Greg Stark gsst...@mit.edu writes:
 A lot of SQL queries end up being written with GROUP BY primary_key,
 other_column, other_column, other_column just to get those other
 columns to be queryable. If we implemented the SQL standard
 dependent columns feature this would be unnecessary but we don't

We do as of 9.1 ...

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Evaluation of secondary sort key.

2011-04-18 Thread Alvaro Herrera
Excerpts from Greg Stark's message of lun abr 18 15:47:03 -0300 2011:

 A lot of SQL queries end up being written with GROUP BY primary_key,
 other_column, other_column, other_column just to get those other
 columns to be queryable. If we implemented the SQL standard
 dependent columns feature this would be unnecessary but we don't and
 even if we did people would still build schemas and queries that
 defeat the optimization.

Actually we do have that in 9.1.  It's a bit more restrictive than
really required (there are some more cases we could handle), but AFAIR
at least the primary key is handled now.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Evaluation of secondary sort key.

2011-04-17 Thread Jesper Krogh

On 2011-04-09 18:54, Tom Lane wrote:

I think that would be a positive disimprovement.  The current design
guarantees that volatile sort expressions are evaluated exactly once,
in the order the rows are read from the data source.  There would be no
guarantees at all, either as to the number of evaluations or the order
in which they happen, if we tried to do evaluation only during the
actual sort.

If I could only evaluate the rows needed, then that would also
be a huge win, below case shifts what definitely shouldn't be a seqscan
into one due to a secondary sort key.


Another small problem is that any such thing would require carrying
along some kind of closure (ie, the expression and all its input
values), not just the final sort key value, in tuples being sorted.
The ensuing complexity, speed penalty, and increase in data volume
to be sorted would be paid by everybody, making this probably a net
performance loss when considered across all applications.

Getting the value for the first sortkey and carrying on a closure
for the rest would mostly (very often) be optimal ?

It would also enable a select that has to sortkeys to utilize an
index that only contains the primary sortkey, which is a huge
negative effect of what's being done today.

2011-04-18 07:12:43.931 testdb=# explain select id from testdb.testtable 
order by id limit 500;

 QUERY PLAN

 Limit  (cost=0.00..262.75 rows=500 width=4)
   -  Index Scan using testtable_pkey on testtable  
(cost=0.00..15015567.84 rows=28573400 width=4)

(2 rows)

Time: 1.363 ms
2011-04-18 07:12:52.498 testdb=# explain select id from testdb.testtable 
order by id,name limit 500;

QUERY PLAN
---
 Limit  (cost=5165456.70..5165457.95 rows=500 width=35)
   -  Sort  (cost=5165456.70..5236890.20 rows=28573400 width=35)
 Sort Key: id, name
 -  Seq Scan on testtable  (cost=0.00..3741675.00 
rows=28573400 width=35)

(4 rows)

Time: 1.420 ms

Enabling any users to sort using multiple keys, without ending in Seq 
Scans somewhere down
the line seems to require multi dimension indexes on all combinations of 
colums


--
Jesper

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Evaluation of secondary sort key.

2011-04-09 Thread Jesper Krogh

This seems like a place where there is room for improvement.

2011-04-09 15:18:08.016 testdb=# select id from test1 where id  3 order 
by id;

 id

  1
  2
(2 rows)

Time: 0.328 ms
2011-04-09 15:18:11.936 testdb=# CREATE or Replace FUNCTION testsort(id 
integer) returns integer as $$ BEGIN perform pg_sleep(id); return id; 
END; $$ language plpgsql;

CREATE FUNCTION
Time: 12.349 ms
2011-04-09 15:18:22.138 testdb=# select id from test1 where id  3 order 
by id,testsort(id);

 id

  1
  2
(2 rows)

Time: 3001.896 ms

It seems strange that there is a need to evaluate testsort(id) at all in 
this case.


--
Jesper

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Evaluation of secondary sort key.

2011-04-09 Thread David Fetter
On Sat, Apr 09, 2011 at 03:22:14PM +0200, Jesper Krogh wrote:
 This seems like a place where there is room for improvement.
 
 2011-04-09 15:18:08.016 testdb=# select id from test1 where id  3
 order by id;
  id
 
   1
   2
 (2 rows)
 
 Time: 0.328 ms
 2011-04-09 15:18:11.936 testdb=# CREATE or Replace FUNCTION
 testsort(id integer) returns integer as $$ BEGIN perform
 pg_sleep(id); return id; END; $$ language plpgsql;
 CREATE FUNCTION
 Time: 12.349 ms
 2011-04-09 15:18:22.138 testdb=# select id from test1 where id  3
 order by id,testsort(id);
  id
 
   1
   2
 (2 rows)
 
 Time: 3001.896 ms
 
 It seems strange that there is a need to evaluate testsort(id) at
 all in this case.

How would PostgreSQL know that sorting by id leaves no ambiguity for
the next key to address?

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Evaluation of secondary sort key.

2011-04-09 Thread Jesper Krogh

 
 How would PostgreSQL know that sorting by id leaves no ambiguity for
 the next key to address?

It wouldn't   But it could postpone evaluation until ambiguity was actually 
met.  

Jesper
 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Evaluation of secondary sort key.

2011-04-09 Thread Martijn van Oosterhout
On Sat, Apr 09, 2011 at 09:17:10AM -0700, David Fetter wrote:

  2011-04-09 15:18:22.138 testdb=# select id from test1 where id  3
  order by id,testsort(id);
   id
  
1
2
  (2 rows)
  
  Time: 3001.896 ms
  
  It seems strange that there is a need to evaluate testsort(id) at
  all in this case.
 
 How would PostgreSQL know that sorting by id leaves no ambiguity for
 the next key to address?

Well, it doesn't know that, but I guess the point is it could wait with
evaluating the second key until it needs it.

The reason ot works as it does now is that the ORDER BY fields are
added as hidden fields to the query, like:

select id, /*hidden*/ id, /*hidden*/ testsort(id) from test1 where id  3 order 
by 2, 3;

Here it's obvious why the evaluation happens. To avoid this evaluation
would require redoing the way sorts work (I think).

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 Patriotism is when love of your own people comes first; nationalism,
 when hate for people other than your own comes first. 
   - Charles de Gaulle


signature.asc
Description: Digital signature


Re: [HACKERS] Evaluation of secondary sort key.

2011-04-09 Thread Heikki Linnakangas

On 09.04.2011 19:17, David Fetter wrote:

On Sat, Apr 09, 2011 at 03:22:14PM +0200, Jesper Krogh wrote:

This seems like a place where there is room for improvement.

2011-04-09 15:18:08.016 testdb=# select id from test1 where id  3
order by id;
  id

   1
   2
(2 rows)

Time: 0.328 ms
2011-04-09 15:18:11.936 testdb=# CREATE or Replace FUNCTION
testsort(id integer) returns integer as $$ BEGIN perform
pg_sleep(id); return id; END; $$ language plpgsql;
CREATE FUNCTION
Time: 12.349 ms
2011-04-09 15:18:22.138 testdb=# select id from test1 where id  3
order by id,testsort(id);
  id

   1
   2
(2 rows)

Time: 3001.896 ms

It seems strange that there is a need to evaluate testsort(id) at
all in this case.


How would PostgreSQL know that sorting by id leaves no ambiguity for
the next key to address?


Presumably there's a primary key constraint on id. This is one of those 
cases where we could optimize, but then again, there's no reason to 
write a query like that in the first place.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Evaluation of secondary sort key.

2011-04-09 Thread Tom Lane
Martijn van Oosterhout klep...@svana.org writes:
 On Sat, Apr 09, 2011 at 09:17:10AM -0700, David Fetter wrote:
 It seems strange that there is a need to evaluate testsort(id) at
 all in this case.

 How would PostgreSQL know that sorting by id leaves no ambiguity for
 the next key to address?

 Well, it doesn't know that, but I guess the point is it could wait with
 evaluating the second key until it needs it.

I think that would be a positive disimprovement.  The current design
guarantees that volatile sort expressions are evaluated exactly once,
in the order the rows are read from the data source.  There would be no
guarantees at all, either as to the number of evaluations or the order
in which they happen, if we tried to do evaluation only during the
actual sort.

Another small problem is that any such thing would require carrying
along some kind of closure (ie, the expression and all its input
values), not just the final sort key value, in tuples being sorted.
The ensuing complexity, speed penalty, and increase in data volume
to be sorted would be paid by everybody, making this probably a net
performance loss when considered across all applications.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Evaluation of secondary sort key.

2011-04-09 Thread David Fetter
On Sat, Apr 09, 2011 at 07:24:15PM +0300, Heikki Linnakangas wrote:
 On 09.04.2011 19:17, David Fetter wrote:
 On Sat, Apr 09, 2011 at 03:22:14PM +0200, Jesper Krogh wrote:
 This seems like a place where there is room for improvement.
 
 2011-04-09 15:18:08.016 testdb=# select id from test1 where id  3
 order by id;
   id
 
1
2
 (2 rows)
 
 Time: 0.328 ms
 2011-04-09 15:18:11.936 testdb=# CREATE or Replace FUNCTION
 testsort(id integer) returns integer as $$ BEGIN perform
 pg_sleep(id); return id; END; $$ language plpgsql;
 CREATE FUNCTION
 Time: 12.349 ms
 2011-04-09 15:18:22.138 testdb=# select id from test1 where id  3
 order by id,testsort(id);
   id
 
1
2
 (2 rows)
 
 Time: 3001.896 ms
 
 It seems strange that there is a need to evaluate testsort(id) at
 all in this case.
 
 How would PostgreSQL know that sorting by id leaves no ambiguity
 for the next key to address?
 
 Presumably there's a primary key constraint on id. This is one of
 those cases where we could optimize, but then again, there's no
 reason to write a query like that in the first place.

Given the horrors query generators perpetrate, it might be worth
dropping provably redundant ORDER BYs on the floor at planning time.

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Evaluation of secondary sort key.

2011-04-09 Thread Jesper Krogh

On 2011-04-09 20:00, David Fetter wrote:

Given the horrors query generators perpetrate, it might be worth
dropping provably redundant ORDER BYs on the floor at planning time.

Well, many people often add a secondary sort-key to their SQL
for the only purpose of obtainting a consistent result in the
corner-cases where the first sort key is ambiguios.

If the first sort-key isn't planned to be supported by an index-scan,
then you'll end up calculating the second sortkey for the entire
dataset even if you end up doing a limit 100 at the end.

You can only deem it redundant if there is a primary key in front.
if you have a primary key in front, where as a fix may be really
good in cases where you have a n_distinct at or near -1 in pg_stats
for the column.

Jesper
--
Jesper

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Evaluation of secondary sort key.

2011-04-09 Thread Jesper Krogh

On 2011-04-09 18:54, Tom Lane wrote:

I think that would be a positive disimprovement. The current design
guarantees that volatile sort expressions are evaluated exactly once,
in the order the rows are read from the data source.  There would be no
guarantees at all, either as to the number of evaluations or the order
in which they happen, if we tried to do evaluation only during the
actual sort.

Another small problem is that any such thing would require carrying
along some kind of closure (ie, the expression and all its input
values), not just the final sort key value, in tuples being sorted.
The ensuing complexity, speed penalty, and increase in data volume
to be sorted would be paid by everybody, making this probably a net
performance loss when considered across all applications.

The current approach gives that:

select id from test1 where some clause that matches say 10% random by 
another index

order by sortfunc1(id),sortfunc(2) limit 20;

on a table with 100.000 elements will also end up applying
both sortfunc1(id) and sortfunc2(id) to all 10.000 elements
even though sortfunc2(id) might only brings value to a very few amount
of tuples (the ones needed as secondary sortkeys for top20 within
the dataset).

It might be worth noting in the manual, that if at all possible you should
stuff the sortfunc2(id) into the table as a column (perhaps computed by 
a before

trigger), since it might actully be evaluated way more often than
you anticipated.

Thanks a lot for the insight.

Jesper
--
Jesper

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers