Re: [PERFORM] Group by more efficient than distinct?

2008-04-22 Thread Matthew Wakeling
On Mon, 21 Apr 2008, Mark Mielke wrote: This surprises me - hash values are lossy, so it must still need to confirm against the real list of values, which at a minimum should require references to the rows to check against? Is PostgreSQL doing something beyond my imagination? :-) Not too

Re: [PERFORM] Group by more efficient than distinct?

2008-04-22 Thread Mark Mielke
Matthew Wakeling wrote: On Mon, 21 Apr 2008, Mark Mielke wrote: This surprises me - hash values are lossy, so it must still need to confirm against the real list of values, which at a minimum should require references to the rows to check against? Is PostgreSQL doing something beyond my

Re: [PERFORM] Group by more efficient than distinct?

2008-04-22 Thread Matthew Wakeling
On Tue, 22 Apr 2008, Mark Mielke wrote: The poster I responded to said that the memory required for a hash join was relative to the number of distinct values, not the number of rows. They gave an example of millions of rows, but only a few distinct values. Above, you agree with me that it it

Re: [PERFORM] Group by more efficient than distinct?

2008-04-22 Thread Mark Mielke
Matthew Wakeling wrote: On Tue, 22 Apr 2008, Mark Mielke wrote: The poster I responded to said that the memory required for a hash join was relative to the number of distinct values, not the number of rows. They gave an example of millions of rows, but only a few distinct values. Above, you

Re: [PERFORM] Group by more efficient than distinct?

2008-04-21 Thread PFC
On Sun, 20 Apr 2008 17:15:36 +0200, Francisco Reyes [EMAIL PROTECTED] wrote: PFC writes: - If you process up to some percentage of your RAM worth of data, hashing is going to be a lot faster Thanks for the excellent breakdown and explanation. I will try and get sizes of the tables in

Re: [PERFORM] Group by more efficient than distinct?

2008-04-21 Thread Mark Mielke
PFC wrote: Actually, the memory used by the hash depends on the number of distinct values, not the number of rows which are processed... Consider : SELECT a GROUP BY a SELECT a,count(*) GROUP BY a In both cases the hash only holds discinct values. So if you have 1 million rows to

Re: [PERFORM] Group by more efficient than distinct?

2008-04-21 Thread Mark Mielke
Mark Mielke wrote: PFC wrote: Actually, the memory used by the hash depends on the number of distinct values, not the number of rows which are processed... Consider : SELECT a GROUP BY a SELECT a,count(*) GROUP BY a In both cases the hash only holds discinct values. So if you

Re: [PERFORM] Group by more efficient than distinct?

2008-04-20 Thread Francisco Reyes
Gregory Stark writes: HashAggregate needs to store more values in memory at the same time so it's not a good plan if you have a lot of distinct values. So far the resulting number of rows where in the thousands and the source data were in there hundreds of thousands and the group by was

Re: [PERFORM] Group by more efficient than distinct?

2008-04-20 Thread Francisco Reyes
PFC writes: - If you process up to some percentage of your RAM worth of data, hashing is going to be a lot faster Thanks for the excellent breakdown and explanation. I will try and get sizes of the tables in question and how much memory the machines have. - If you need DISTINCT

Re: [PERFORM] Group by more efficient than distinct?

2008-04-20 Thread Luke Lonergan
Hi Francisco, Generally, PG sorting is much slower than hash aggregation for performing the distinct operation. There may be small sizes where this isn¹t true, but for large amounts of data (in-memory or not), hash agg (used most often, but not always by GROUP BY) is faster. We¹ve implemented a

Re: [PERFORM] Group by more efficient than distinct?

2008-04-18 Thread Thomas Pundt
On Freitag, 18. April 2008, Francisco Reyes wrote: | I am trying to get a distinct set of rows from 2 tables. | After looking at someone else's query I noticed they were doing a group by | to obtain the unique list. | | After comparing on multiple machines with several tables, it seems using |

Re: [PERFORM] Group by more efficient than distinct?

2008-04-18 Thread Gregory Stark
Francisco Reyes [EMAIL PROTECTED] writes: Is there any dissadvantage of using group by to obtain a unique list? On a small dataset the difference was about 20% percent. Group by HashAggregate (cost=369.61..381.12 rows=1151 width=8) (actual time=76.641..85.167 rows=2890 loops=1)

Re: [PERFORM] Group by more efficient than distinct?

2008-04-18 Thread PFC
On Fri, 18 Apr 2008 11:36:02 +0200, Gregory Stark [EMAIL PROTECTED] wrote: Francisco Reyes [EMAIL PROTECTED] writes: Is there any dissadvantage of using group by to obtain a unique list? On a small dataset the difference was about 20% percent. Group by HashAggregate (cost=369.61..381.12