Re: [HACKERS] DISTINCT vs. GROUP BY

2005-09-20 Thread Jim C. Nasby
On Mon, Sep 19, 2005 at 10:16:36PM -0400, Bruce Momjian wrote:
  
 Added to TODO:
 
   * Allow DISTINCT to use hashing like GROUP BY

3 lines above we have...
Consider using hash buckets to do DISTINCT, rather than sorting
This would be beneficial when there are few distinct values. 

Can you add
http://archives.postgresql.org/pgsql-hackers/2005-09/msg00810.php? All I
could find on the other TODO was
http://archives.postgresql.org/pgsql-committers/2004-09/msg00028.php,
which doesn't help much...
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] DISTINCT vs. GROUP BY

2005-09-20 Thread Bruce Momjian
Jim C. Nasby wrote:
 On Mon, Sep 19, 2005 at 10:16:36PM -0400, Bruce Momjian wrote:
   
  Added to TODO:
  
  * Allow DISTINCT to use hashing like GROUP BY
 
 3 lines above we have...
 Consider using hash buckets to do DISTINCT, rather than sorting
 This would be beneficial when there are few distinct values. 

OK, I have merged these items into one.

 
 Can you add
 http://archives.postgresql.org/pgsql-hackers/2005-09/msg00810.php? All I
 could find on the other TODO was
 http://archives.postgresql.org/pgsql-committers/2004-09/msg00028.php,
 which doesn't help much...

What do these URL's have that the current TODO does not?

* Consider using hash buckets to do DISTINCT, rather than sorting

  This would be beneficial when there are few distinct values.  This is
  already used by GROUP BY.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] DISTINCT vs. GROUP BY

2005-09-20 Thread Jim C. Nasby
On Tue, Sep 20, 2005 at 05:05:05PM -0400, Bruce Momjian wrote:
 Jim C. Nasby wrote:
  On Mon, Sep 19, 2005 at 10:16:36PM -0400, Bruce Momjian wrote:

   Added to TODO:
   
 * Allow DISTINCT to use hashing like GROUP BY
  
  3 lines above we have...
  Consider using hash buckets to do DISTINCT, rather than sorting
  This would be beneficial when there are few distinct values. 
 
 OK, I have merged these items into one.
 
  
  Can you add
  http://archives.postgresql.org/pgsql-hackers/2005-09/msg00810.php? All I
  could find on the other TODO was
  http://archives.postgresql.org/pgsql-committers/2004-09/msg00028.php,
  which doesn't help much...
 
 What do these URL's have that the current TODO does not?
 
 * Consider using hash buckets to do DISTINCT, rather than sorting
 
   This would be beneficial when there are few distinct values.  This is
   already used by GROUP BY.

Maybe it's just me, but the recent run-through of the TODO list
indicated that there's a fair number of items that people look at and
don't really knowh what they mean. Providing the context (ie: email
thread) that spawned an idea seems extremely valuable in being able to
explain the idea behind a TODO item. They also usually contain valuable
tips about how a TODO could be implemented. In this example, having
quick reference to the discussion about hashagg and first()/last() would
probably prove useful.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] DISTINCT vs. GROUP BY

2005-09-20 Thread Bruce Momjian
Jim C. Nasby wrote:
  What do these URL's have that the current TODO does not?
  
  * Consider using hash buckets to do DISTINCT, rather than sorting
  
This would be beneficial when there are few distinct values.  This is
already used by GROUP BY.
 
 Maybe it's just me, but the recent run-through of the TODO list
 indicated that there's a fair number of items that people look at and
 don't really knowh what they mean. Providing the context (ie: email
 thread) that spawned an idea seems extremely valuable in being able to
 explain the idea behind a TODO item. They also usually contain valuable
 tips about how a TODO could be implemented. In this example, having
 quick reference to the discussion about hashagg and first()/last() would
 probably prove useful.

True, but sometimes the thread winds all around and there isn't a
definative explaination of how to go at something.  I woul rather digest
the information to improve it, rather than require people to wade around
in an email thread.  Is there some detail the TODO is missing?

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] DISTINCT vs. GROUP BY

2005-09-20 Thread Jim C. Nasby
On Tue, Sep 20, 2005 at 06:45:07PM -0400, Bruce Momjian wrote:
 Jim C. Nasby wrote:
   What do these URL's have that the current TODO does not?
   
   * Consider using hash buckets to do DISTINCT, rather than sorting
   
 This would be beneficial when there are few distinct values.  This is
 already used by GROUP BY.
  
  Maybe it's just me, but the recent run-through of the TODO list
  indicated that there's a fair number of items that people look at and
  don't really knowh what they mean. Providing the context (ie: email
  thread) that spawned an idea seems extremely valuable in being able to
  explain the idea behind a TODO item. They also usually contain valuable
  tips about how a TODO could be implemented. In this example, having
  quick reference to the discussion about hashagg and first()/last() would
  probably prove useful.
 
 True, but sometimes the thread winds all around and there isn't a
 definative explaination of how to go at something.  I woul rather digest
 the information to improve it, rather than require people to wade around
 in an email thread.  Is there some detail the TODO is missing?

Sure, and I think the TODO items usually are digested fine. But if you
want to know any of the details behind the items (like what the
motivation is, or how it could be done), you have to manually go and
search the archives trying to find something.

In this example, it would have been useful to see if there had been any
discussion about if you could use hashagg to do this or not. But
searching the archive turned up nothing, so I guess we'll never know. If
there was a link to a thread in the TODO, we could see exactly what was
discussed.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


[HACKERS] DISTINCT vs. GROUP BY

2005-09-19 Thread Hans-Jürgen Schönig
I was wondering whether it is possible to teach the planner to handle 
DISTINCT in a more efficient way:


em=# explain select distinct lastname from import.testtest;
   QUERY PLAN

 Unique  (cost=2647377.45..2709467.70 rows=1 width=7)
   -  Sort  (cost=2647377.45..2678422.58 rows=12418051 width=7)
 Sort Key: lastname
 -  Seq Scan on testtest  (cost=0.00..370082.51 rows=12418051 
width=7)

(4 Zeilen)


Isn't it possible to perform the same operation using a HashAggregate?
We have seen that a GROUP BY workaround is usually a lot faster than 
sort-unique - at least when work_mem is large enough.


best regards,

hans


--
Cybertec Geschwinde  Schönig GmbH
Schöngrabern 134; A-2020 Hollabrunn
Tel: +43/1/205 10 35 / 340
www.postgresql.at, www.cybertec.at

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] DISTINCT vs. GROUP BY

2005-09-19 Thread Neil Conway
On Mon, 2005-19-09 at 16:27 +0200, Hans-Jürgen Schönig wrote:
 I was wondering whether it is possible to teach the planner to handle 
 DISTINCT in a more efficient way:
[...]
 Isn't it possible to perform the same operation using a
 HashAggregate? 

One problem is that DISTINCT ON is defined to return the first unique
row (according to the query's ORDER BY) for the set of DISTINCT ON
columns, which can't easily be done via hashing.

-Neil



---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] DISTINCT vs. GROUP BY

2005-09-19 Thread Tom Lane
=?ISO-8859-1?Q?Hans-J=FCrgen_Sch=F6nig?= [EMAIL PROTECTED] writes:
 I was wondering whether it is possible to teach the planner to handle 
 DISTINCT in a more efficient way:

Probably (although the interactions with ORDER BY might be tricky).
No one has touched that part of the planner in a very long time.

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] DISTINCT vs. GROUP BY

2005-09-19 Thread Greg Stark

Neil Conway [EMAIL PROTECTED] writes:

 On Mon, 2005-19-09 at 16:27 +0200, Hans-Jürgen Schönig wrote:
  I was wondering whether it is possible to teach the planner to handle 
  DISTINCT in a more efficient way:
 [...]
  Isn't it possible to perform the same operation using a
  HashAggregate? 
 
 One problem is that DISTINCT ON is defined to return the first unique
 row (according to the query's ORDER BY) for the set of DISTINCT ON
 columns, which can't easily be done via hashing.

Uhm. Sure it can.


DISTINCT is really just special a case of GROUP BY. Even DISTINCT ON is just
GROUP BY with a kind of first() aggregate function. What would be really
neat would be to teach GROUP BY about first() and last() and how it can skip
over some index entries and still satisfy the query. Then make DISTINCT and
DISTINCT ON be handled through the exact same code path.

For bonus points teach it that min() and max() can sometimes be treated the
same way if the path is presenting records sorted on that column.


-- 
greg


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] DISTINCT vs. GROUP BY

2005-09-19 Thread Tom Lane
Greg Stark [EMAIL PROTECTED] writes:
 DISTINCT is really just special a case of GROUP BY. Even DISTINCT ON is just
 GROUP BY with a kind of first() aggregate function. What would be really
 neat would be to teach GROUP BY about first() and last() and how it can skip
 over some index entries and still satisfy the query. Then make DISTINCT and
 DISTINCT ON be handled through the exact same code path.

You've missed the point entirely.

first() is not a substitute for sorting the input; it is only useful
if the input comes pre-sorted.  And if you are going to sort the input,
you might as well use the current implementation of DISTINCT ON and
skip the effort and memory-overflow-risk associated with a hashtable.

I do think hash aggregation is a plausible alternative implementation of
plain DISTINCT, but I don't see the case for using it for DISTINCT ON.

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] DISTINCT vs. GROUP BY

2005-09-19 Thread Greg Stark
Tom Lane [EMAIL PROTECTED] writes:

 I do think hash aggregation is a plausible alternative implementation of
 plain DISTINCT, but I don't see the case for using it for DISTINCT ON.

It could be done without presorting the input though not with a simple
first()-like function. It would have be a sort of two-argument min() function
that kept a state variable for the smallest value found so far of the sort
key.

My main motivation here is that it's odd to have two code paths for
implementing the two language constructs when one is really just a special
case of the other. It's a source of cases like this where the code to
implement a query path exists but isn't accessible due to the way the query is
written.

-- 
greg


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] DISTINCT vs. GROUP BY

2005-09-19 Thread Bruce Momjian
 
Added to TODO:

* Allow DISTINCT to use hashing like GROUP BY


---

Greg Stark wrote:
 
 Neil Conway [EMAIL PROTECTED] writes:
 
  On Mon, 2005-19-09 at 16:27 +0200, Hans-J?rgen Sch?nig wrote:
   I was wondering whether it is possible to teach the planner to handle 
   DISTINCT in a more efficient way:
  [...]
   Isn't it possible to perform the same operation using a
   HashAggregate? 
  
  One problem is that DISTINCT ON is defined to return the first unique
  row (according to the query's ORDER BY) for the set of DISTINCT ON
  columns, which can't easily be done via hashing.
 
 Uhm. Sure it can.
 
 
 DISTINCT is really just special a case of GROUP BY. Even DISTINCT ON is just
 GROUP BY with a kind of first() aggregate function. What would be really
 neat would be to teach GROUP BY about first() and last() and how it can skip
 over some index entries and still satisfy the query. Then make DISTINCT and
 DISTINCT ON be handled through the exact same code path.
 
 For bonus points teach it that min() and max() can sometimes be treated the
 same way if the path is presenting records sorted on that column.
 
 
 -- 
 greg
 
 
 ---(end of broadcast)---
 TIP 3: Have you checked our extensive FAQ?
 
http://www.postgresql.org/docs/faq
 

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings