Re: [HACKERS] Default Stats Revisited

2004-03-12 Thread Simon Riggs
Josh Berkus
  But possible more error prone.  If you crank up the default
statistics
 to
  50, but the index default is still 25...  OTOH, you could always
have
 the
  setting of used for index default be whichever is greater... hmmm.
 
 Well, I'm not 100% opposed to a multiplier.   I'd like to take a poll
of
 DBAs
 to find out which they would find more accessable.   But since most
people
 seem to be ignoring this thread, I'm not sure we'll get much response
...

...very interesting thoughts overall. I note that your idea has a very
strong basis and is pretty much suggested to us, since it forms part of
the allowable specification of TPC-H: - so let's do it!
 
I enclose the following TPC-H text. Section 5.2.8, p.103, TPC Benchmark
H (Decision Support), Standard Specification, revision 2.1.0:
5.2.8 The gathering of statistics is part of the database load (see
Clause 4.3) but it also serves as an important configuration
vehicle, particularly for the query optimizer. In order to satisfy the
requirements of Clause 5.2.7, it is desirable
to collect the same quality of statistics for every column of every
table. However, in order to reduce processing
requirements, it is permissible to segment columns into distinct classes
and base the level of statistics collection for
a particular column on class membership. Class definitions must rely
solely on schema-related attributes of a column
and must be applied consistently across all tables. For example:
. membership in an index;
. leading or other position in an index;
. use in a constraint (including a primary or foreign key
relationships).
Statistics that operate in sets, such as distribution statistics, should
employ a fixed set appropriate to the scale factor
used. Knowledge of the cardinality, values or distribution of a non-key
column as specified in Clause 4 cannot be
used to tailor statistics gathering.

Based upon that, might we add slightly to your index stats suggestion
slightly and include constraint-member columns also?

Overall, the problem you highlighted is:
- if default stats is 10 and we don't think that is very useful, then
clearly that should change, yet we are performance constrained

I would note that DB2 uses default 20, as does Teradata. Oracle uses 75
histogram buckets as default. Clearly, 10 is not the accepted
view...but I'm sure I'll be shot down for such thinking.

You set me thinking about another possible solution:
My understanding is that the default for stats on PostgreSQL is
collect, whereas on other systems it is don't collect (until
instructed). i.e. we collect stats on all columns by default. That's
good, but the end result is that it is slower than the other default.
[That might be important because dynamic sampling is not yet
implemented, not sure]

If we had the option not to collect stats at all on most columns, then
that would speed things up, wouldn't it (just as TPC-H pretty much
says). Perhaps it might be better to offer an option to alter that
default? It seems that DEFAULT_STATISTICS_TARGET cannot be set to no
thanks. If you could turn off the collection of what we seem to be
agreeing is a relatively pointless collection of statistics, would that
not improve stats collection performance? If set to 0, only collect
number of nulls, number of distinct values, max and min.

Turning off stats-by-default and yet ramping up the collection target
where it is needed sounds good for TPC-H/DBT-3 environments - and will
likely even improve perf numbers on the actual tests! 

..this might also allow us to rename the two parameters, just as was
recently done with work_mem etc..

statistics_target_default   0  -- i.e. don't collect by default
statistics_target_index 100

I would also suggest a further class of columns for statistics
collection:

statistics_target_text

which would include all fields with length  32 (pick a limit...), since
these are very frequently all unique.
Setting these to -1 would mean they use the statistics_target_default
value, which would be their default setting.

You might even provide a mechanism for defining statistics collection
classes based upon their datatype...e.g.

CREATE STATISTICS CLASS CLASSNAME
STATISTICS COLLECTION TARGET 100
APPLIES TO (LEADING n) INDEX COLUMNS, CONSTRAINT COLUMNS;

CREATE STATISTICS CLASS CLASSNAME
 STATISTICS COLLECTION TARGET 0
APPLIES TO COMMAS SEPARATED LIST OF DATATYPES;

anything not specifically identified, would then use system parameter
default

That would be great, since we now have the ability to collects stats on
user defined datatypes (don't we?).

Back to the multiplier: Yes please, but only if it worked like this:
Normal and index stats are settable differently.
Index stats DEFAULT is a multiple of normal stats, unless specifically
set. (You may wish to set it down as well as up, remember).
That way, the default behaviour improves even when the index stats
parameter is not actually set, yet is still controllable when you do.

Best Regards, Simon 

Re: [HACKERS] Default Stats Revisited

2004-03-11 Thread Josh Berkus
Scott,

 I like it.  Would a multiplier be acceptable?  
 default_stats_index_multiplier = 10

Yeah, I thought about that, but a multiplier would be harder to manage for 
most people.I mean, what if your default_stats are at 25 and you want 
your index_stats at 40?   PITA.   Also, if you wanted to increase the default 
stats but *forgot* that the index_stats were a multiplier ...

I think a straight number is less confusing.

-- 
-Josh Berkus
 Aglio Database Solutions
 San Francisco


---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [HACKERS] Default Stats Revisited

2004-03-11 Thread scott.marlowe
On Thu, 11 Mar 2004, Josh Berkus wrote:

 Scott,
 
  I like it.  Would a multiplier be acceptable?  
  default_stats_index_multiplier = 10
 
 Yeah, I thought about that, but a multiplier would be harder to manage for 
 most people.I mean, what if your default_stats are at 25 and you want 
 your index_stats at 40?   PITA.   Also, if you wanted to increase the default 
 stats but *forgot* that the index_stats were a multiplier ...
 
 I think a straight number is less confusing.

But possible more error prone.  If you crank up the default statistics to 
50, but the index default is still 25...  OTOH, you could always have the 
setting of used for index default be whichever is greater... hmmm.


---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] Default Stats Revisited

2004-03-11 Thread Josh Berkus
Scott,

 But possible more error prone.  If you crank up the default statistics to 
 50, but the index default is still 25...  OTOH, you could always have the 
 setting of used for index default be whichever is greater... hmmm.

Well, I'm not 100% opposed to a multiplier.   I'd like to take a poll of DBAs 
to find out which they would find more accessable.   But since most people 
seem to be ignoring this thread, I'm not sure we'll get much response ...

-- 
-Josh Berkus
 Aglio Database Solutions
 San Francisco


---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] Default Stats Revisited

2004-03-11 Thread Tom Lane
Josh Berkus [EMAIL PROTECTED] writes:
 Well, I'm not 100% opposed to a multiplier.   I'd like to take a poll of DBAs
 to find out which they would find more accessable.   But since most people 
 seem to be ignoring this thread, I'm not sure we'll get much response ...

Maybe you should ask on -admin or -general.  Personally I thought there
wasn't anything to say until someone did some experiments to show
whether an indexed-column differential is really worthwhile and what a
plausible default value would be.  The idea sounds good in the abstract,
but will it really help or just be another useless setting?

regards, tom lane

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] Default Stats Revisited

2004-03-11 Thread Josh Berkus
Tom,

 Maybe you should ask on -admin or -general.  Personally I thought there
 wasn't anything to say until someone did some experiments to show
 whether an indexed-column differential is really worthwhile and what a
 plausible default value would be.  The idea sounds good in the abstract,
 but will it really help or just be another useless setting?

Yeah, that's our next step, a test.

On Monday,  I hope to have comparative stats for a difficult database on the 3 
solutions (leaving things as-is, raising the general default stats, and doing 
index_stats).

-- 
-Josh Berkus
 Aglio Database Solutions
 San Francisco


---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] Default Stats Revisited

2004-03-11 Thread Robert Treat
On Thursday 11 March 2004 14:17, Josh Berkus wrote:
 Tom,

  Maybe you should ask on -admin or -general.  Personally I thought there
  wasn't anything to say until someone did some experiments to show
  whether an indexed-column differential is really worthwhile and what a
  plausible default value would be.  The idea sounds good in the abstract,
  but will it really help or just be another useless setting?

 Yeah, that's our next step, a test.

 On Monday,  I hope to have comparative stats for a difficult database on
 the 3 solutions (leaving things as-is, raising the general default stats,
 and doing index_stats).

Do you plan on handeling primary key columns differently (since they are 
likely to be unique and indexed) ?

Also how will you handle column that are part of expressional indexes (where 
foo is true for example) ?

Final thought... I'm a DBA and I think the straight number is simpler, though 
could be convinced to go with whichever is higher... 

Robert Treat
-- 
Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [HACKERS] Default Stats Revisited

2004-03-11 Thread Josh Berkus
Robert,

 Do you plan on handeling primary key columns differently (since they are 
 likely to be unique and indexed) ?

The same as any other indexed column.

 Also how will you handle column that are part of expressional indexes (where 
 foo is true for example) ?

See my original proposal.  These columns will be ignored.  Expressions have 
their own stats.

 Final thought... I'm a DBA and I think the straight number is simpler, 
though 
 could be convinced to go with whichever is higher... 

Thanks.

-- 
-Josh Berkus
 Aglio Database Solutions
 San Francisco


---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] Default Stats Revisited

2004-03-11 Thread Tom Lane
Josh Berkus [EMAIL PROTECTED] writes:
 Also how will you handle column that are part of expressional indexes (where
 foo is true for example) ?

 See my original proposal.  These columns will be ignored.  Expressions have 
 their own stats.

Yeah, I see no particular need to increase the stats allocation for a
column merely because it is used in an expression index.  The mechanism
Josh is proposing should cause the default amount of stats collected
*for the expression index* to go up, though.


BTW, there's an as-yet-undone bit of business associated with collecting
stats for expression indexes.  The ANALYZE code will honor an explicit
attstattarget setting for an expressional index column, but there's no
clean way to get that setting into the system.  What works in CVS tip is

regression=# create table foo (f1 int, f2 int);
CREATE TABLE
regression=# create index fooi on foo ((f1 + f2));
CREATE INDEX
regression=# alter table fooi alter column pg_expression_1 set statistics 100;
ALTER TABLE

but I don't much care for this; the arbitrary names that are used for
expressional columns shouldn't be embedded into SQL commands, and doing
alter table on an index looks a bit funny as well.

I want to make pg_dump dump these settings, but I'm not feeling
comfortable with having it dump commands that look like the above.
That would nail down the current method of assigning expression column
names as something we could never change without breaking dump scripts.

Can anyone think of a better way?

regards, tom lane

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] Default Stats Revisited

2004-03-11 Thread Tom Lane
Bruce Momjian [EMAIL PROTECTED] writes:
 Another idea is whether a foreign key column should get extra
 statistics?

In practice, both ends of an FK relationship have to be indexed,
so I don't see that we need any extra special case for that.

regards, tom lane

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] Default Stats Revisited

2004-03-11 Thread Bruce Momjian
Tom Lane wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
  Another idea is whether a foreign key column should get extra
  statistics?
 
 In practice, both ends of an FK relationship have to be indexed,
 so I don't see that we need any extra special case for that.

Do they?  We don't create an index automatically when using REFERENCES.
We do create an index for PRIMARY KEY.

I was just wondering if the REFERENCES column is more sensitive to join
usage and would benefit from more accurate statistics even if it doesn't
have an index.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [HACKERS] Default Stats Revisited

2004-03-11 Thread Josh Berkus
Bruce,

 Do they?  We don't create an index automatically when using REFERENCES.
 We do create an index for PRIMARY KEY.
 
 I was just wondering if the REFERENCES column is more sensitive to join
 usage and would benefit from more accurate statistics even if it doesn't
 have an index.

I don't think so.   If one does not create an index on an FK column, this is 
usually because the column does not have enough of a range of values to 
deserve indexing (for example, 4 potential values across 100,000 records).  
In that case, we would not want to up the statistics either.

And we haven't made it a practice to hand-hold for database designers who 
don't know when to index, so I don't think we should start now.

-- 
-Josh Berkus
 Aglio Database Solutions
 San Francisco


---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly