Re: [PERFORM] Group by more efficient than distinct?

2008-04-22 Thread Matthew Wakeling

On Mon, 21 Apr 2008, Mark Mielke wrote:
This surprises me - hash values are lossy, so it must still need to confirm 
against the real list of values, which at a minimum should require references 
to the rows to check against?


Is PostgreSQL doing something beyond my imagination? :-)


Not too far beyond your imagination, I hope.

It's simply your assumption that the hash table is lossy. Sure, hash 
values are lossy, but a hash table isn't. Postgres stores in memory not 
only the hash values, but the rows they refer to as well, having checked 
them all on disc beforehand. That way, it doesn't need to look up anything 
on disc for that branch of the join again, and it has a rapid in-memory 
lookup for each row.
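
[Editor's sketch, with hypothetical table names; the plan shape below is typical EXPLAIN output, not copied from a real run. The hashed side is scanned once and held in memory while the other side streams past:]

    EXPLAIN SELECT big.val, small.label
    FROM big JOIN small ON big.small_id = small.id;
    --  Hash Join
    --    Hash Cond: (big.small_id = small.id)
    --    ->  Seq Scan on big          -- probes the hash, one row at a time
    --    ->  Hash                     -- in-memory hash table
    --          ->  Seq Scan on small  -- read once; its rows stay in the hash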


Matthew

--
X's book explains this very well, but, poor bloke, he did the Cambridge Maths 
Tripos...   -- Computer Science Lecturer


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Group by more efficient than distinct?

2008-04-22 Thread Mark Mielke

Matthew Wakeling wrote:

On Mon, 21 Apr 2008, Mark Mielke wrote:
This surprises me - hash values are lossy, so it must still need to 
confirm against the real list of values, which at a minimum should 
require references to the rows to check against?


Is PostgreSQL doing something beyond my imagination? :-)


Not too far beyond your imagination, I hope.

It's simply your assumption that the hash table is lossy. Sure, hash 
values are lossy, but a hash table isn't. Postgres stores in memory 
not only the hash values, but the rows they refer to as well, having 
checked them all on disc beforehand. That way, it doesn't need to look 
up anything on disc for that branch of the join again, and it has a 
rapid in-memory lookup for each row.


I said hash *values* are lossy. I did not say the hash *table* is lossy.

The poster I responded to said that the memory required for a hash join
was relative to the number of distinct values, not the number of rows.
They gave an example of millions of rows, but only a few distinct
values. Above, you agree with me that it would include the rows (or
at least references to the rows) as well. If it stores rows, or
references to rows, then memory *is* relative to the number of rows, and
millions of records would require millions of row references.


Cheers,
mark

--
Mark Mielke [EMAIL PROTECTED]




Re: [PERFORM] Group by more efficient than distinct?

2008-04-22 Thread Matthew Wakeling

On Tue, 22 Apr 2008, Mark Mielke wrote:
The poster I responded to said that the memory required for a hash join was
relative to the number of distinct values, not the number of rows. They gave
an example of millions of rows, but only a few distinct values. Above, you
agree with me that it would include the rows (or at least references to
the rows) as well. If it stores rows, or references to rows, then memory *is*
relative to the number of rows, and millions of records would require
millions of row references.


Yeah, I think we're talking at cross-purposes, due to hash tables being 
used in two completely different places in Postgres. Firstly, you have 
hash joins, where Postgres loads the references to the actual rows, and 
puts those in the hash table. For that situation, you want a small number 
of rows. Secondly, you have hash aggregates, where Postgres stores an 
entry for each group in the hash table, and does not store the actual 
rows. For that situation, you can have a bazillion individual rows, but 
only a small number of distinct groups.
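
[To make the two cases concrete, an editor's sketch with hypothetical tables; the plan nodes named are the ones expected, not actual output:]

    -- Hash join: the hash table holds the rows (or row references) of "small",
    -- so memory scales with the row count of the hashed side.
    EXPLAIN SELECT b.* FROM big b JOIN small s ON b.k = s.k;  -- expect Hash Join

    -- Hash aggregate: the hash table holds one entry per group,
    -- so memory scales with the number of distinct values of k.
    EXPLAIN SELECT k, count(*) FROM big GROUP BY k;           -- expect HashAggregate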


Matthew

--
First law of computing:  Anything can go wro
sig: Segmentation fault.  core dumped.



Re: [PERFORM] Group by more efficient than distinct?

2008-04-22 Thread Mark Mielke

Matthew Wakeling wrote:

On Tue, 22 Apr 2008, Mark Mielke wrote:
The poster I responded to said that the memory required for a hash
join was relative to the number of distinct values, not the number of
rows. They gave an example of millions of rows, but only a few
distinct values. Above, you agree with me that it would include
the rows (or at least references to the rows) as well. If it stores
rows, or references to rows, then memory *is* relative to the number
of rows, and millions of records would require millions of row
references.


Yeah, I think we're talking at cross-purposes, due to hash tables 
being used in two completely different places in Postgres. Firstly, 
you have hash joins, where Postgres loads the references to the actual 
rows, and puts those in the hash table. For that situation, you want a 
small number of rows. Secondly, you have hash aggregates, where 
Postgres stores an entry for each group in the hash table, and does 
not store the actual rows. For that situation, you can have a 
bazillion individual rows, but only a small number of distinct groups.


That makes sense with my reality. :-)

Thanks,
mark

--
Mark Mielke [EMAIL PROTECTED]




Re: [PERFORM] Group by more efficient than distinct?

2008-04-21 Thread PFC
On Sun, 20 Apr 2008 17:15:36 +0200, Francisco Reyes  
[EMAIL PROTECTED] wrote:



PFC writes:

- If you process up to some percentage of your RAM worth of data,  
hashing  is going to be a lot faster


Thanks for the excellent breakdown and explanation. I will try and get  
sizes of the tables in question and how much memory the machines have.


	Actually, the memory used by the hash depends on the number of distinct  
values, not the number of rows which are processed...

Consider:

SELECT a FROM t GROUP BY a
SELECT a, count(*) FROM t GROUP BY a

	In both cases the hash only holds distinct values. So if you have 1
million rows to process but only 10 distinct values of a, the hash will
only contain those 10 values (and the counts), so it will be very small
and fast; it will absorb a huge seq scan without problem. If, however, you
have (say) 100 million distinct values for a, using a hash would be a bad
idea. As usual, divide the size of your RAM by the number of concurrent
connections or something.
	Note that a could be a column, several columns, anything; the size of
the hash will be proportional to the number of distinct values, i.e. the
number of rows returned by the query, not the number of rows processed
(read) by the query. Same with hash joins etc.; that's why, when you join a
very small table to a large one, Postgres likes to use seq scan + hash join
on the small table.
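
[A quick demonstration of that proportionality; this is an editor's hypothetical example, easy to run anywhere:]

    -- One million rows, but only 10 distinct values of a.
    CREATE TEMP TABLE t AS
      SELECT (i % 10) AS a FROM generate_series(1, 1000000) AS s(i);
    ANALYZE t;
    EXPLAIN SELECT a, count(*) FROM t GROUP BY a;
    -- Expect a HashAggregate over a Seq Scan: the hash holds roughly 10
    -- entries (one per distinct value), no matter how many rows are read.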




- If you need DISTINCT ON, well, you're stuck with the Sort
- So, for the time being, you can replace DISTINCT with GROUP BY...


Have seen a few of those already in some code (new job..), so for those
is it a matter of having a good disk subsystem?


	Depends on your RAM; sorting in RAM is always faster than sorting on disk
of course, unless you eat all the RAM and trash the other processes.
Tradeoffs...






Re: [PERFORM] Group by more efficient than distinct?

2008-04-21 Thread Mark Mielke

PFC wrote:
Actually, the memory used by the hash depends on the number of 
distinct values, not the number of rows which are processed...

Consider:

SELECT a FROM t GROUP BY a
SELECT a, count(*) FROM t GROUP BY a

In both cases the hash only holds distinct values. So if you have
1 million rows to process but only 10 distinct values of a, the hash
will only contain those 10 values (and the counts), so it will be very
small and fast; it will absorb a huge seq scan without problem. If,
however, you have (say) 100 million distinct values for a, using a
hash would be a bad idea. As usual, divide the size of your RAM by the
number of concurrent connections or something.
Note that a could be a column, several columns, anything; the
size of the hash will be proportional to the number of distinct
values, i.e. the number of rows returned by the query, not the number
of rows processed (read) by the query. Same with hash joins etc.;
that's why, when you join a very small table to a large one, Postgres
likes to use seq scan + hash join on the small table.


This surprises me - hash values are lossy, so it must still need to 
confirm against the real list of values, which at a minimum should 
require references to the rows to check against?


Is PostgreSQL doing something beyond my imagination? :-)

Cheers,
mark

--
Mark Mielke [EMAIL PROTECTED]




Re: [PERFORM] Group by more efficient than distinct?

2008-04-21 Thread Mark Mielke

Mark Mielke wrote:

PFC wrote:
Actually, the memory used by the hash depends on the number of 
distinct values, not the number of rows which are processed...

Consider:

SELECT a FROM t GROUP BY a
SELECT a, count(*) FROM t GROUP BY a

In both cases the hash only holds distinct values. So if you have
1 million rows to process but only 10 distinct values of a, the
hash will only contain those 10 values (and the counts), so it will
be very small and fast; it will absorb a huge seq scan without
problem. If, however, you have (say) 100 million distinct values for
a, using a hash would be a bad idea. As usual, divide the size of
your RAM by the number of concurrent connections or something.
Note that a could be a column, several columns, anything; the
size of the hash will be proportional to the number of distinct
values, i.e. the number of rows returned by the query, not the number
of rows processed (read) by the query. Same with hash joins etc.;
that's why, when you join a very small table to a large one, Postgres
likes to use seq scan + hash join on the small table.


This surprises me - hash values are lossy, so it must still need to 
confirm against the real list of values, which at a minimum should 
require references to the rows to check against?


Is PostgreSQL doing something beyond my imagination? :-)


Hmmm... You did say distinct values, so I can see how that would work 
for distinct. What about seq scan + hash join, though? To complete the 
join, wouldn't it need to have a reference to each of the rows to join 
against? If there are 20 distinct values and 200 rows in the small table,
wouldn't it need to store 200 references?


Cheers,
mark

--
Mark Mielke [EMAIL PROTECTED]




Re: [PERFORM] Group by more efficient than distinct?

2008-04-20 Thread Francisco Reyes

Gregory Stark writes:


HashAggregate needs to store more values in memory at the same time so it's
not a good plan if you have a lot of distinct values.


So far the resulting number of rows was in the thousands, the source
data was in the hundreds of thousands, and the group by was faster.


When you say a lot of distinct values, do you mean unique values as part of
the result data set?


In other words, will the HashAggregate store the resulting rows in memory,
or will it be used to process the source rows?



But the planner knows that and so as long as your work_mem is set to a
reasonable size (keeping in mind each sort or other operation feels free to


If I make sure a table has been vacuum analyzed, will it be reasonable to
trust EXPLAIN to see whether distinct or group by is better?  I started
a new job and still don't have a good feeling for the sizes or distributions
of the data. Soon I will be given access to the test DB so I will be able to
do counts and explore the data without affecting production.
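
[A simple way to run that comparison; an editor's sketch with a hypothetical table name:]

    VACUUM ANALYZE orders;
    EXPLAIN ANALYZE SELECT DISTINCT customer_id FROM orders;
    EXPLAIN ANALYZE SELECT customer_id FROM orders GROUP BY customer_id;
    -- Compare the actual times, and note whether the GROUP BY plan uses
    -- HashAggregate while the DISTINCT plan uses Sort + Unique.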




Re: [PERFORM] Group by more efficient than distinct?

2008-04-20 Thread Francisco Reyes

PFC writes:

- If you process up to some percentage of your RAM worth of data, hashing  
is going to be a lot faster


Thanks for the excellent breakdown and explanation. I will try and get sizes 
of the tables in question and how much memory the machines have. 


- If you need DISTINCT ON, well, you're stuck with the Sort
- So, for the time being, you can replace DISTINCT with GROUP BY...


Have seen a few of those already in some code (new job..), so for those
is it a matter of having a good disk subsystem?




Re: [PERFORM] Group by more efficient than distinct?

2008-04-20 Thread Luke Lonergan
Hi Francisco,

Generally, PG sorting is much slower than hash aggregation for performing
the distinct operation.  There may be small sizes where this isn't true, but
for large amounts of data (in-memory or not), hash agg (used most often, but
not always, by GROUP BY) is faster.

We've implemented a special optimization to PG sorting that does the
distinct processing within the sort, instead of afterward, but it's limited
to some small-ish number (10,000) of distinct values due to its use of a
memory- and processing-intensive heap.

So, you're better off using GROUP BY and making sure that the planner is
using hash agg to do the work.
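
[One way to check that; statement illustrative, and the work_mem value is an arbitrary example:]

    EXPLAIN SELECT col FROM big_table GROUP BY col;
    -- Look for a HashAggregate node rather than Sort + GroupAggregate.
    -- If the planner picks a sort because the hash would not fit in
    -- work_mem, raising it for the session may flip the plan:
    SET work_mem = '128MB';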

- Luke 


On 4/17/08 8:46 PM, Francisco Reyes [EMAIL PROTECTED] wrote:

 I am trying to get a distinct set of rows from 2 tables.
 After looking at someone else's query I noticed they were doing a group by
 to obtain the unique list.
 
 After comparing on multiple machines with several tables, it seems using
 group by to obtain a distinct list is substantially faster than using
 select distinct.
 
 Is there any disadvantage of using group by to obtain a unique list?
 
 On a small dataset the difference was about 20 percent.
 
 Group by
  HashAggregate  (cost=369.61..381.12 rows=1151 width=8) (actual
 time=76.641..85.167 rows=2890 loops=1)
 
 Distinct
  Unique  (cost=1088.23..1174.53 rows=1151 width=8) (actual
 time=90.516..140.123 rows=2890 loops=1)
 
 Although I don't have the numbers here with me, a similar result was
 obtained against a query that would return 100,000 rows: a 20% or greater
 speed difference between group by and select distinct.
 



Re: [PERFORM] Group by more efficient than distinct?

2008-04-18 Thread Thomas Pundt
On Friday, 18 April 2008, Francisco Reyes wrote:
| I am trying to get a distinct set of rows from 2 tables.
| After looking at someone else's query I noticed they were doing a group by
| to obtain the unique list.
|
| After comparing on multiple machines with several tables, it seems using
| group by to obtain a distinct list is substantially faster than using
| select distinct.
|
| Is there any disadvantage of using group by to obtain a unique list?

Searching the archives suggests that the code related to group by is much
newer than the one related to distinct and thus might benefit from more
optimization paths.

Ciao,
Thomas

-- 
Thomas Pundt [EMAIL PROTECTED]  http://rp-online.de/ 



Re: [PERFORM] Group by more efficient than distinct?

2008-04-18 Thread Gregory Stark
Francisco Reyes [EMAIL PROTECTED] writes:

 Is there any disadvantage of using group by to obtain a unique list?

 On a small dataset the difference was about 20 percent.

 Group by
 HashAggregate  (cost=369.61..381.12 rows=1151 width=8) (actual
 time=76.641..85.167 rows=2890 loops=1)

HashAggregate needs to store more values in memory at the same time so it's
not a good plan if you have a lot of distinct values.

But the planner knows that and so as long as your work_mem is set to a
reasonable size (keeping in mind each sort or other operation feels free to
use that much itself even if there are several in the query) and the rows
estimate is reasonably accurate -- here it's mediocre but not dangerously bad
-- then if the planner is picking it, it's probably a good idea.
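
[That per-operation budget can be inspected and set per session; values illustrative:]

    SHOW work_mem;          -- the per-sort / per-hash limit, not a per-query limit
    SET work_mem = '64MB';  -- a plan with three sorts may use up to 3x this amount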

I'm not sure, but I think there are cases where the DISTINCT method wins too.
This is basically a bug; in an ideal world both queries would generate
precisely the same plans, since they're equivalent. It's just not a high
priority since we have so many more interesting improvements competing for
time.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com
  Ask me about EnterpriseDB's PostGIS support!



Re: [PERFORM] Group by more efficient than distinct?

2008-04-18 Thread PFC
On Fri, 18 Apr 2008 11:36:02 +0200, Gregory Stark [EMAIL PROTECTED]  
wrote:



Francisco Reyes [EMAIL PROTECTED] writes:


Is there any disadvantage of using group by to obtain a unique list?

On a small dataset the difference was about 20 percent.

Group by
HashAggregate  (cost=369.61..381.12 rows=1151 width=8) (actual
time=76.641..85.167 rows=2890 loops=1)


Basically:

	- If you process up to some percentage of your RAM worth of data, hashing  
is going to be a lot faster
	- If the size of the hash grows larger than your RAM, hashing will fail  
miserably and sorting will be much faster since PG's disksort is really  
good

- GROUP BY knows this and acts accordingly
- DISTINCT doesn't know this, it only knows sorting, so it sorts
	- If you need DISTINCT x ORDER BY x, sorting may be faster too (depending  
on the % of distinct rows)

- If you need DISTINCT ON, well, you're stuck with the Sort
- So, for the time being, you can replace DISTINCT with GROUP BY...
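
[The rewrite from that last point, for reference; table and columns hypothetical, and both queries return the same result set:]

    SELECT DISTINCT a, b FROM t;         -- as described above, this only knows sorting
    SELECT a, b FROM t GROUP BY a, b;    -- equivalent, and can use a HashAggregate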
