Re: [HACKERS] [PATCHES] Proposed patch: synchronized_scanningGUCvariable

2008-01-29 Thread Heikki Linnakangas

Jeff Davis wrote:

On Mon, 2008-01-28 at 23:13 +, Heikki Linnakangas wrote:
It's a good point that we don't want pg_dump to screw up the cluster 
order, but that's the only use case I've seen this far for disabling 
sync scans. Even that wouldn't matter much if our estimate for 
clusteredness didn't get screwed up by a table that looks like this: 
5 6 7 8 9 1 2 3 4


It doesn't seem like there is any reason for the estimate to get
confused, but it apparently does. I loaded a test table with a similar
distribution to your example, and it shows a correlation of about -0.5,
but it should be as good as something near -1 or +1.

I am not a statistics expert, but it seems like a better measurement
would be: what is the chance that, if the tuples are close together in
index order, the corresponding heap tuples are close together?.

The answer to that question in your example is very likely, so there
would be no problem.

Is there a reason we don't do this?


It has been discussed before, but no-one has come up with a good 
measurement for that.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [HACKERS] [PATCHES] Proposed patch: synchronized_scanningGUCvariable

2008-01-29 Thread Stephen Denne
Jeff Davis wrote
  Well clusteredness is used or could be used for a few 
 different heuristics,
  not all of which this would be quite as well satisfied as 
 readahead. But for
 
 Can you give an example? Treating a file as a circular structure does
 not impose any significant cost that I can see.

(Pure speculation follows... if you prefer facts, skip this noise)

The data used to create pg_stats.correlation is involved in estimating the cost 
of an index scan.
It could also be used in estimating the cost of a sequential scan, if the query 
includes a limit.

Consider:
select * from huge_table_clustered_by_A where Amost_As limit 1000

If the correlation for A is close to 1, a sequential scan should be cheaper 
than an index scan.

(If the query also included an order by clause, the sequential scan would have 
to read the entire table to ensure it had found the top 1000, instead of any 
old 1000 returned in order)

If A is a circular structure, you would have to know where it started, and 
include this info in the dump/restore (or lose A's correlation).

Stephen Denne.

Disclaimer:
At the Datamail Group we value team commitment, respect, achievement, customer 
focus, and courage. This email with any attachments is confidential and may be 
subject to legal privilege.  If it is not intended for you please advise by 
reply immediately, destroy it and do not copy, disclose or use it in any way.

__
  This email has been scanned by the DMZGlobal Business Quality 
  Electronic Messaging Suite.
Please see http://www.dmzglobal.com/services/bqem.htm for details.
__


---(end of broadcast)---
TIP 6: explain analyze is your friend