RE: Statistical sampling and representative stats collection

2002-05-29 Thread Larry Elkins

 -Original Message-
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On Behalf Of John
 Kanagaraj
 Sent: Tuesday, May 28, 2002 1:49 PM
 To: Multiple recipients of list ORACLE-L
 Subject: RE: Statistical sampling and representative stats collection


 Btw, there is an option in 9i DBMS_STATS.AUTO_SAMPLE_SIZE. Has
 anyone tried
 this out? Or know how it works?

I would be interested in that as well.
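For anyone wanting to experiment, here is a minimal sketch of the 9i call
(owner and table names are placeholders; I haven't benchmarked it):

BEGIN
  -- Let 9i pick the sample size instead of hardcoding a percentage
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname          => 'SCOTT',
    tabname          => 'BIG_TABLE',
    estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE);
END;
/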

  for these stats on non-indexed columns? If so, I disagree on this point,
  even if it is general advice and not a rule. Stats on non-indexed columns
  can play a *large* role in CBO decisions.

 I agree with you, although I do have to contend that the effect is not very
 pronounced in all databases. This was revealed in some depth in Wolfgang's
 paper at IOUG where he was able to actually determine this in a 10053 trace,
 and it was an eye opener for me. The issue I have with this is that the
 default number of buckets is 2 and that is nowhere near what is needed. On
 the other hand, indiscriminately increasing the bucket size would leave you
 with a _Large_ number of histograms and result in the 'row cache lock' latch
 being taken out more than it should have been (as well as add to the cost
 of parsing).

I would only consider gathering this info, or customizing the stats
gathering process in general, on a case by case basis and only when needed.
Due to the nature of the data and queries, you might find the need to gather
the non-indexed column stats on a single table and 1 column, or maybe 2
tables, or maybe many tables and columns, or maybe none at all.

You already know, then, how stats on non-indexed columns can play a role.
But since I received a few back-channel emails asking for examples of how
stats on the non-indexed columns make a difference, I'll include the
following extreme example that illustrates the point. But by no means am I saying you
should always gather them, just on an as needed basis. And I haven't
included the 10053 traces since that could get very tedious ;-)

Say we have two tables, tables A and B, each consisting of two columns, A
and B. Column A is the PK on each, column B is not indexed. So:

Table A: 50,000 Rows
Table B: 50,000 Rows
Column A.A: (PK) Values 1 thru 50,000
Column B.A: (PK) Values 1 thru 50,000
Column A.B: Not indexed, values 1 thru 25000, each occurring twice – e.g.
1,1,2,2,3,3...
Column B.B: Not indexed, values 0 and 1, each occurring 25,000 times
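A rough sketch of how such a setup can be built, for anyone who wants to
recreate it (this is my guess at the DDL, not the exact script used):

CREATE TABLE a (a NUMBER PRIMARY KEY, b NUMBER);
CREATE TABLE b (a NUMBER PRIMARY KEY, b NUMBER);

-- A.B: values 1 thru 25000, each occurring twice
INSERT INTO a
  SELECT rownum, CEIL(rownum/2)
  FROM all_objects, all_objects
  WHERE rownum <= 50000;

-- B.B: values 0 and 1, each occurring 25,000 times
INSERT INTO b
  SELECT rownum, MOD(rownum, 2)
  FROM all_objects, all_objects
  WHERE rownum <= 50000;

COMMIT;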

Note that I am not really even dealing with any skewness at this point –
this is an even distribution of values. Here are 4 sample queries:

Sample Queries:

Select *
From A, B
WHERE A.A = B.A
  and A.B = 5 -- Filters down to 2 rows, I want to drive by A in an NL fashion

Select *
From A, B
WHERE A.A = B.A
  and A.B = 5 -- Filters down to 2 rows, I still want to drive by A in an NL fashion
  and B.B = 1 -- Filters down to 25,000 rows if treated standalone

Select *
From A, B
WHERE A.A = B.A
  and B.B = 1 -- Filters down to 25,000 rows, I want FTS's and HJ.

Select *
From A, B
WHERE A.A = B.A
  and B.B = 27000 -- no rows meet this, I want to drive the query by table B.

With a basic compute 'for table for all indexed columns', we get hash joins
for all four queries: the first two driving by table A, and the next two
driving by table B. There were no stats on the non-indexed column. This isn't
what I really wanted for all 4 queries.

Now, if I simply do an analyze table compute statistics, stats are gathered
for both columns A and B, and you get the default bucket and two rows for
each column over in DBA_TAB_HISTOGRAMS.

In the case of the first 2 queries, I get a nested loops driven by table A
using an FTS and an index lookup into table B. This is what I want because
of the really restrictive filtering criteria on table A’s non-indexed column
B, resulting in two rows being selected, and two indexed lookups into table
B. The stats on the non-indexed column helped the CBO make this decision.

In the case of queries 3 and 4, I would get a MERGE JOIN, using a full index
scan on A’s PK to get each row, for the purpose of feeding the data in
sorted order (I might have preferred an FTS and actually doing the sort!).
And an FTS on table B. Note that by increasing the HASH_AREA_SIZE, queries 3
and 4 simply went with FTS’s and an HJ, driving by table B. I would have
hoped that in the case of query 4 that the CBO would have chosen to drive by
table B, using a nested loops index approach into table A. Upon no rows
being returned from table B, table A would not have been accessed at all.

So, we go back and analyze specifying a SIZE for column B. Queries 1 and 2
use a nested loops driving by table A, and the restrictive filtering
criteria result in 2 indexed lookups into table B. This is what I want. In
the case of query 3, and having dropped the hash area size back down, it
does FTS’s and an HJ driving by table B. This is what I want.

And in the case of query 4, it uses a nested loops approach, driving by table
B into table A; since no rows come back from table B, table A is never
accessed. This is what I want.
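To make the three stages concrete, the statements involved would look
something like the following (the SIZE of 75 is only an illustration, not
the value I used):

-- Stage 1: table and indexed-column stats only; nothing on column B
ANALYZE TABLE a COMPUTE STATISTICS FOR TABLE FOR ALL INDEXED COLUMNS;

-- Stage 2: plain compute; both columns get the default 2-row histogram
ANALYZE TABLE a COMPUTE STATISTICS;

-- Stage 3: an explicit histogram on the non-indexed column
ANALYZE TABLE a COMPUTE STATISTICS FOR COLUMNS b SIZE 75;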

RE: Statistical sampling and representative stats collection

2002-05-29 Thread Larry Elkins


 Based on the scarcity of previous responses to emails on this list,
 it seems that histograms are not that widely used throughout the
 industry.  I'm not sure why.

I've used them in the past, but only in very specific instances and
certainly not for all tables/columns. Sometimes just 1 or 2, sometimes 15 or
20. And only in those cases where needed.


 Cherie Machler
 Oracle DBA
 Gelco Information Network


RE: Statistical sampling and representative stats collection

2002-05-29 Thread Terrian, Tom

Sorry, I did not use anything that sophisticated (nor accurate?):

Uniform distribution = 1/(number of unique values)
Max distribution = (the highest occurrence of a value) / (number of records in
the table)
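In SQL, those two numbers can be computed along these lines (a sketch, using
the F_tab/P_code names from my earlier test):

SELECT 1 / COUNT(*)        AS uniform_distribution, -- one row per unique value
       MAX(cnt) / SUM(cnt) AS max_distribution
FROM  (SELECT p_code, COUNT(*) cnt
       FROM   f_tab
       GROUP  BY p_code);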

Tom


-Original Message-
Sent: Tuesday, May 28, 2002 4:34 PM
To: Multiple recipients of list ORACLE-L

tom,

This is interesting. How did you determine max
distribution and uniform distribution? Did you use
standard deviation and variance?

regards,

jack silvey



RE: Statistical sampling and representative stats collection

2002-05-29 Thread Nicoll, Iain (Calanais)

Have used them on a 'for all indexed columns' basis and they make a massive
difference on heavily skewed data, particularly on a sort of 'waiting to be
processed' flag which has only about 5 distinct values, where the ones we want
to pick make up only about 0.01%.

Haven't used them on all columns as we don't often filter on non-indexed
columns.

Iain Nicoll




RE: Statistical sampling and representative stats collection

2002-05-29 Thread Cherie_Machler


Larry,

Thanks for taking the time to write this lengthy reply.   I've never seen
this particular information on histograms anywhere.

This has been very informative.

Cherie Machler
Oracle DBA
Gelco Information Network


   
   

RE: Statistical sampling and representative stats collection

2002-05-29 Thread Orr, Steve

We have a Y/N (yes/no) processing flag column on a very large table.
As an alternative to a bitmapped index, we changed the code to manipulate
Y/null values, because only about 0.1% of values are 'Y' and the others are
null. The resulting index is quite small, but we do rebuild it periodically.
This has worked quite well for us.
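A sketch of the idea (names invented): since a single-column B*tree index
does not store NULL keys, the index only ever contains the rare 'Y' rows.

-- Flag is 'Y' for the rows awaiting processing, NULL otherwise
CREATE INDEX big_tab_flag_idx ON big_tab (processed_flag);

-- The polling query can use the tiny index
SELECT id FROM big_tab WHERE processed_flag = 'Y';

-- Clearing the flag removes the entry from the index
UPDATE big_tab SET processed_flag = NULL WHERE id = :done_id;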

Steve Orr
Bozeman, MT






RE: Statistical sampling and representative stats collection

2002-05-29 Thread Orr, Steve

 I remember that Steve Orr of this list used Histograms and enjoyed huge
 increase in performance. 

Hi John, that was over a year ago, your memory is too good. :-)

Haven't worked with histograms lately but here are some before and after 
stats from that former histogram implementation:
- Rows scanned per second went from 300,000 to 500; 
- consistent reads per second went from 75,000 to 500;
- CPU usage went from 85-100% to 5%; 
- the physical I/O rate dropped significantly.

When histograms help the difference can be dramatic.  This was from a single
query that was executed via dbms_jobs every minute. After ID'ing the problem
query I went through a tuning exercise at length then found histograms to be
the magic bullet that fixed everything. So, rather than the shotgun approach
of calculating histograms on everything, I think the more surgical technique
of finding the bottlenecks (possibly using the wait interface) still applies.
Of course I'm preaching to the choir.  :-)

Steve Orr
Bozeman, MT





RE: Statistical sampling and representative stats collection

2002-05-29 Thread MacGregor, Ian A.

One problem I have with histograms is having to calculate the proper number of 
buckets.  Thanks to Steve Adams I have a program to do just that.  I keep such 
information in the dbms_stats_table  so it can be used by dbms_stats when the table 
is reanalyzed.  

The proper number of buckets may change over time, and it's also possible the
frequency distribution of the values for a column may change enough that
histograms no longer help. It is, however, an onerous task to recalculate them.
How often are people checking the frequency distribution and the bucket counts?
I try to do this every three months.
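For reference, the statistics-table plumbing looks roughly like this (object
names invented; my reading of the setup, not the actual program):

BEGIN
  -- One-time creation of the user statistics table
  DBMS_STATS.CREATE_STAT_TABLE(ownname => USER, stattab => 'MY_STATS');
  -- Save the current statistics for later reuse by dbms_stats
  DBMS_STATS.EXPORT_TABLE_STATS(ownname => USER, tabname => 'BIG_TABLE',
                                stattab => 'MY_STATS');
END;
/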

Ian MacGregor
Stanford Linear Accelerator Center
[EMAIL PROTECTED]


RE: Statistical sampling and representative stats collection

2002-05-29 Thread John Kanagaraj

Thanks Larry for this excellent and painstaking explanation. 

John



RE: Statistical sampling and representative stats collection

2002-05-28 Thread John Kanagaraj

Hi Larry,

I sure wasn't disappointed to receive this note from a CBO/SQL guru such as
yourself. Thanks for the ideas.

 My question here is would this database and your findings be applicable to
 other databases and the nature of their data? I wouldn't 

The nature of the data and the variances in between are so great I would
never try and generalize everything. However, the number (or percentage) of
generic cases is significant compared to the specific, and that is probably
what I was trying to get to. In many cases, the act and cost of collecting
the statistics is huge compared to how little of all that data is actually
used.

 since the characteristics can be so different, but, if I'm reading you
 correctly, you aren't saying that ESTIMATE is always the only 
 way. But, the

That is right. I think it was a case of moving on from carpet bombing
(either ESTIMATE only or COMPUTE only) to some sort of precision bombing.
There is still some collateral damage (as seen below)

 Sunday. Someone accidentally analyzed the schema at 30% on Monday and a lot
 of things went down the toilet. Going back to COMPUTE fixed things. Then
 again, maybe a 10% ESTIMATE would have fixed things. Jack and I both work
 with a guy who has talked about COMPUTE resulting in undesired plans, 10%
 did as well. They got the desired plans by going to 1%. So, even if one
 agrees that we don't necessarily have to COMPUTE, and in many (probably
 most?) cases we don't, there is still a lot of testing to be done to find
 the best estimate percent, and this could very well be different for
 various objects. And I think that's the battle we all face -- what is the
 best sampling percentage. And right now, it still seems to be done on a
 trial and error basis. I have some ideas on how one might attack this in an
 automated fashion, but it's still a *very* rough idea that I need to bounce
 off a few cohorts.

I think you summarized it very well. How does one figure out what is the
best percentage, given that we will NOT be able to complete a COMPUTE within
the period allowed? When does one stop experimenting (significant DBA cost)
and how does one make sure that the apple cart is not upset by new data
patterns? I would be very interested in estimating the value using some
automated fashion. I would be honored to be part of that bunch of cohorts! 
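Until then, the trial-and-error loop looks roughly like this (a sketch; the
percentages and names are only for illustration):

ANALYZE TABLE big_table DELETE STATISTICS;
ANALYZE TABLE big_table ESTIMATE STATISTICS SAMPLE 1 PERCENT;
-- record NUM_ROWS, AVG_ROW_LEN, etc. from DBA_TABLES and re-check plans,
-- then repeat at 10%, 30%, ... and compare against a full COMPUTE
ANALYZE TABLE big_table ESTIMATE STATISTICS SAMPLE 10 PERCENT;
ANALYZE TABLE big_table COMPUTE STATISTICS;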

Btw, there is an option in 9i DBMS_STATS.AUTO_SAMPLE_SIZE. Has anyone tried
this out? Or know how it works?
 
 for these stats on non-indexed columns? If so, I disagree on this point,
 even if it is general advice and not a rule. Stats on non-indexed columns
 can play a *large* role in CBO decisions. I'm not going to go into details
 and examples here illustrating that, but those stats can still help decide
 the driving table, the join methods between tables, etc. I built a sample
 case some time back to illustrate the importance of gathering these
 non-indexed column stats. Now, it might not be important for all systems,
 but if you are ever using indexed columns, and, still specifying criteria on
 non-indexed columns, the gathering of stats on the non-indexed columns could
 be *very* important. I can send you more details back-channel if you are
 interested.

I agree with you, although I do have to contend that the effect is not very
pronounced in all databases. This was revealed in some depth in Wolfgang's
paper at IOUG where he was able to actually determine this in a 10053 trace,
and it was an eye opener for me. The issue I have with this is that the
default number of buckets is 2 and that is nowhere near what is needed. On
the other hand, indiscriminately increasing the bucket size would leave you
with a _Large_ number of histograms and result in the 'row cache lock' latch
being taken out more than it should have been (as well as add to the cost of
parsing).

 And your approach very well could take care of most cases for 
 many people.
 It's an interesting idea and something certainly worth 
 playing around with.

Yep - and I did add a YMMV :)

I would love to see this thread grow.

John Kanagaraj
Oracle Applications DBA
DBSoft Inc
(W): 408-970-7002

The manuals for Oracle are here: http://tahiti.oracle.com
The manual for Life is here: http://www.gospelcom.net

** The opinions and statements above are entirely my own and not those of my
employer or clients **

RE: Statistical sampling and representative stats collection

2002-05-28 Thread John Kanagaraj

Ian,

 John are you saying to create histograms on all indexed 
 columns, or just the ones with distributions which are skewed 
 and also for ones which although symmetric in distribution  
 have some values much more prevalent than others?  

To keep this simplistic, I wouldn't use Histograms (or let it default to 2)
*unless* hardcoded values are known to be used, at least in 8i. The
situation becomes different in 9i as the CBO is able to peek into these
values even when bind variables are used. (I think there is a script out
there on Steve Adams' site called 'Histogram Helper' which can suggest this
for you). 

However, as Larry mentioned in a previous email, the CBO is influenced by
distributions in non-indexed columns. The issue here is that the number of
buckets really matter, and the default of 2 can influence incorrect
decisions (haven't we all seen 'em? :)  So what I am essentially saying is
this: Use COMPUTE and Histograms when you have to, but don't sweat over it
unless it pinches ya. 

And how do we determine it is pinching? V$SYSSTAT is a pretty good
indicator: (At the risk of being called a part of the 'ratios' group) Is the
ratio of 'table scan blocks gotten' to 'table scan rows gotten' acceptable?
Is the number of table scans acceptable? Is the number of 'db block gets'
too much - as compared to 'physical reads'?
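Something like this pulls those counters (statistic names as they appear in
8i):

SELECT name, value
FROM   v$sysstat
WHERE  name IN ('table scan blocks gotten', 'table scan rows gotten',
                'table scans (short tables)', 'table scans (long tables)',
                'db block gets', 'physical reads');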

I am in the process of determining the overheads of having 'too many'
histograms - I am observing some 'row cache lock' latch waits and think that
this could have been the result of too many histograms. Hope to post some
info back to the list soon.

John Kanagaraj
Oracle Applications DBA
DBSoft Inc
(W): 408-970-7002

The manuals for Oracle are here: http://tahiti.oracle.com
The manual for Life is here: http://www.gospelcom.net

** The opinions and statements above are entirely my own and not those of my
employer or clients **



RE: Statistical sampling and representative stats collection

2002-05-28 Thread Terrian, Tom

John,

I know in a previous job, we determined that histograms were not worth it.  The
following is from a test that we performed:

***

Table - F_tab      Uniform Distribution    Max Distribution
Field - P_code     0.65%                   18%

Therefore, from the above numbers, the field should be a good candidate for
histograms, so I did the following tests.  Based on the following combinations
of statistics and histograms, I timed how fast a sample query ran:

P_Code        w/o stats    w/ stats         w/ stats      w/ stats
                           no histograms    100 buckets   50 buckets
-----------   ----------   --------------   -----------   ----------
0101          342 secs.    428              385           500
0101          406          416              326           340
0101          391          390              327           359
6501          458          490              337           342
6501          475          380              358           490
6501          518          395              326           354
-----------   ----------   --------------   -----------   ----------
Total Secs.   1730         1629             1348          2085
(w/o high
 and low
 values)
Avg time      7Min 12Sec   6Min 47Sec       5Min 37Sec    5Min 51Sec
 per run

However, to create the histogram it takes 1hr42min.  Too long for the
benefit that we gain.

***

Tom




RE: Statistical sampling and representative stats collection

2002-05-28 Thread Cherie_Machler


I have had some really good experiences with using histograms.
They didn't always produce the improvements that I expected
but in many cases, I saw 10 times, 100 times, or even 1,000 times faster
execution after adding histograms.

I don't have the specifics, but these were cases where the data
was very heavily skewed and the column that the histogram
was created on was included in the WHERE clause of the SELECT
statement and set to some specified value with an equal sign
(bind variables were not used).

In my experience, histograms seem a bit hit or miss but in the
cases where they've worked, the performance improvement
has been good or even fantastic.   In the cases where they haven't
helped, I've simply removed them.

Based on the scarcity of previous responses to emails on this list,
it seems that histograms are not that widely used throughout the
industry.  I'm not sure why.

Cherie Machler
Oracle DBA
Gelco Information Network


   
   

RE: Statistical sampling and representative stats collection

2002-05-28 Thread Jack Silvey

tom,

This is interesting. How did you determine max
distribution and uniform distribution? Did you use
standard deviation and variance?

regards,

jack silvey




RE: Statistical sampling and representative stats collection

2002-05-28 Thread John Kanagaraj

Cherie,

 In my experience, histograms seem a bit hit or miss but in the
 cases where they've worked, the performance improvement
 has been good or even fantastic.   In the cases where they haven't
 helped, I've simply removed them.

As I observed before, histograms help only when literal predicates are used
(until 9i). Steve Adams has this to say about Histograms:

 Quote 
Histograms enable the optimizer to more accurately estimate the selectivity
of literal predicates against columns with skewed data distributions. This
can help the optimizer to choose a better access path and possibly join
order for certain queries than might otherwise be the case. However,
redundant histograms on columns with uniformly distributed data, and unduly
large histograms on columns with skewed data distributions just increase the
CPU cost of parsing and waste space in the shared pool. Therefore histograms
should only be created where they are beneficial, and should not be any
larger than necessary. Please note that histograms on columns that are not
indexed can nevertheless be beneficial because they inform the optimizer of
the cardinality of their table for the query and can thus influence the join
order. 
 Unquote 
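For reference, a targeted histogram is created along these lines (the column
and SIZE are illustrative only; pick them per the advice above):

ANALYZE TABLE orders COMPUTE STATISTICS FOR COLUMNS status SIZE 10;

-- or, with DBMS_STATS:
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname    => USER,
    tabname    => 'ORDERS',
    method_opt => 'FOR COLUMNS status SIZE 10');
END;
/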
 
 Based on the scarcity of previous responses to emails on this list,
 it seems that histograms are not that widely used throughout the
 industry.  I'm not sure why.

I remember that Steve Orr of this list used Histograms and enjoyed a huge
increase in performance. The reason the industry hasn't used Histograms as
much as it should have is a combination of lack of knowledge, FUD, and just
plain lethargy. On the other hand, overuse also has its downsides.

Btw, has anyone tracked V$ROWCACHE, which provides a fair idea of the row
cache (or DD) portion of the Shared pool? The figures against
dc_histogram_data and dc_histogram_defs may provide some clues about what's
going on within.
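Something like the following would watch those two entries (a sketch):

SELECT parameter, count, usage, gets, getmisses
FROM   v$rowcache
WHERE  parameter IN ('dc_histogram_defs', 'dc_histogram_data');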

John Kanagaraj
Oracle Applications DBA
DBSoft Inc
(W): 408-970-7002

The manuals for Oracle are here: http://tahiti.oracle.com
The manual for Life is here: http://www.gospelcom.net

** The opinions and statements above are entirely my own and not those of my
employer or clients **



Re: Statistical sampling and representative stats collection

2002-05-24 Thread Tim Gorman

Fantastic research, John!

- Original Message -
To: Multiple recipients of list ORACLE-L [EMAIL PROTECTED]
Sent: Thursday, May 23, 2002 7:13 PM


 Jack,

 I conducted some tests of ANALYZE ESTIMATE vs COMPUTE for my IOUG paper and
 arrived at the following: (cut-and-paste of relevant parts of the paper)

 --- Begin Quote ---
 MYTH:  COMPUTE IS BETTER THAN ESTIMATE
 This one generates an endless debate actually, so we will not take a firm
 stand either way. Rather, we will present some figures that throw some light
 on the issue and allow us to step back and look at the situation. The
 problem with COMPUTE is that it has to scan the entire table, sort it and
 figure out the exact data distribution. On the other hand, ESTIMATE steps
 through samples of the data, sorts and analyzes only a portion of the data.

 In a recent test for the effectiveness of COMPUTE versus ESTIMATE on a
 static clone of a reasonably large Oracle Apps database, the statistics were
 generated and stored for both COMPUTE and ESTIMATE. The database consisted
 of about 3,300 tables and 6,000 indexes and occupied approximately 120 Gb.
 The ESTIMATE percentage was defaulted to 10% and no activity other than
 ANALYZE was allowed on this clone during the entire period. Table statistics
 including row count, average row length and blocks occupied were analyzed.
 This showed that there were some differences in row count and average row
 length on 321 of these tables. Row count differences ranged from 53 rows
 less in the ESTIMATE of a table containing 205,743 rows (0.025%) all the way
 up to a count difference of 101,704 in 13,311,090 rows (0.76%). Even
 assuming a difference of a maximum of 5% in these scenarios, you are not far
 off the goal. Further analysis showed that a smaller average row length
 coupled with a small table produced larger variations than was usually seen.

 The differences, however, were far more pronounced in indexes - differences
 of up to 300% were noticed. Further analysis showed that this was related to
 the percentage of deleted leaf rows in the index. If this percentage is
 high, the possibility of ESTIMATE going wrong was also high, as the
 deletions are not factored in correctly. This was especially true if the
 deletions occurred in leaf blocks that were probably not involved in the
 ESTIMATE. When the count of deleted leaf rows was low or even nil within the
 index, the percentage difference was much lower, in the range of 4 to 5%.

 The real myth killer is the cost of COMPUTE versus ESTIMATE - COMPUTE
 required 66,553,308 reads versus 38,951,158 reads for ESTIMATE - almost 70%
 more reads for COMPUTE. The sorting involved in determining the averages and
 data distribution was a clincher - COMPUTE processed 4,263,724,259 rows in
 sorting operations while ESTIMATE sorted just 18,025,069 - i.e. about 235
 times as many rows were sorted for the COMPUTE operation. The last nail in
 the coffin was the time taken to COMPUTE statistics - about 36 hours against
 the time to ESTIMATE of just 12 hours.

 While the figures speak for themselves, we will offer some general advice to
 the cautious: ESTIMATE on tables and COMPUTE on indexes. Columns are
 analyzed by default, but serve no useful purpose other than showing data
 spread. Hence, you could analyze only tables and indexed columns. An
 identified list of 'small' tables could also be COMPUTED rather than
 ESTIMATED. This advice is given because ESTIMATE on a table comes close as
 far as row count goes, while COMPUTE on indexes generates a more accurate
 picture of both data distribution and object size statistics. Testing the
 effectiveness of COMPUTE versus ESTIMATE is simple and provides you with
 figures that you can use to decide the strategy for your situation.

 Before we move to the next topic, keep in mind that an ANALYZE/ESTIMATE with
 a sample size greater than or equal to 50% will result in COMPUTE.

 --- End Quote ---
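 Translated into commands, that advice amounts to something like the
 following (object names invented):

 ANALYZE TABLE big_table ESTIMATE STATISTICS
   SAMPLE 10 PERCENT FOR TABLE FOR ALL INDEXED COLUMNS;
 ANALYZE INDEX big_table_pk COMPUTE STATISTICS;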

 The problem is that this simple mathematical model looks only at object
 sizes and does not look at column spread and sensitivity. However, I believe
 that the combination of ESTIMATE on tables and COMPUTE on indexes would
 catch most of it.

 As always, YMMV!

 John Kanagaraj
 Oracle Applications DBA
 DBSoft Inc
 (W): 408-970-7002

 The manuals for Oracle are here: http://tahiti.oracle.com
 The manual for Life is here: http://www.gospelcom.net

 ** The opinions and statements above are entirely my own and not those of
my
 employer or clients **


  -Original Message-
  From: Jack Silvey [mailto:[EMAIL PROTECTED]]
  Sent: Tuesday, May 21, 2002 2:44 PM
  To: Multiple recipients of list ORACLE-L
  Subject: Statistical sampling and representative stats collection
 
 
  Hi all,
 
  Did some investigation about statistical sampling this
  weekend since we are going to optimize our analyze
  process soon, and would like some input from all you
  orabrains on this one.
 
  I opened a TAR with Oracle asking about the sampling

RE: Statistical sampling and representative stats collection

2002-05-24 Thread Jamadagni, Rajendra

John,

Thanks for a very good explanation of COMPUTE versus ESTIMATE. I am
changing the scripts to use DBMS_STATS instead of ANALYZE on our 9012
database, but on 8i I'll keep this in mind.

Thanks again
Raj
__
Rajendra Jamadagni  MIS, ESPN Inc.
Rajendra dot Jamadagni at ESPN dot com
Any opinion expressed here is personal and doesn't reflect that of ESPN Inc.

QOTD: Any clod can have facts, but having an opinion is an art!





RE: Statistical sampling and representative stats collection

2002-05-24 Thread John Kanagaraj

Hi Jack,

 One question - you mention that an index analyze
 provides better data distribution. Could you discuss
 what you found in more detail?

What I meant was that the Histograms that are created during an
ANALYZE/COMPUTE on Indexes will provide an almost perfect picture of the
data distribution in such columns. Under _some_ circumstances, the CBO will
be able to use this information to decide the best path (FTS or Indexed
read). On the other hand, and simply stated, when bind variables are used in
a cursor, this information about data distribution is not used since the
value of the bind variable is not used during the parse prior to 9i. In
other words, the access plan is built without considering the value of bind
variables that would have otherwise influenced the plan when histograms (and
thus information about data distribution) are available. However, 9i kinda
rectifies this, and I quote from the Fine Manual (Oracle9i Database
Performance Guide and Reference):

Cursor Sharing Enhancements
The CBO now peeks at the values of user-defined bind variables on the first
invocation of a cursor. This lets the optimizer determine the selectivity of
any WHERE clause condition, as well as if literals had been used instead of
bind variables. When bind variables are used in a statement, it is assumed
that cursor sharing is intended and that different invocations are supposed
to use the same execution plan.

This gives us the best of both worlds (in some cases). Of course, it all
depends on the number of buckets defined for Histograms and the width of the
data spread (and that is why I emphasized _some_ and 'simply stated' above).
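
For anyone who wants to experiment, here is a minimal sketch -- the table,
column and bucket count are all made up for illustration:

-- Old-style ANALYZE with an explicit bucket count
ANALYZE TABLE orders COMPUTE STATISTICS
  FOR COLUMNS status SIZE 75;

-- Roughly equivalent DBMS_STATS call
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname    => USER,
    tabname    => 'ORDERS',
    method_opt => 'FOR COLUMNS STATUS SIZE 75');
END;
/

-- Inspect the resulting buckets
SELECT column_name, endpoint_number, endpoint_value
  FROM user_tab_histograms
 WHERE table_name = 'ORDERS';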

I haven't tested this extensively, and I would appreciate any further inputs
from the Gurus! (For the rest of us: a 10053 trace should show what's
happening. There was a _great_ presentation from Wolfgang Breitling on this
topic at the recent IOUG).

Btw: Searching for 'bucket' in the 8i SQL reference came up with the NTILE
function (new in 8i), and I said Wow! because I was looking for just such a
function. Goes to show that we need to read the fine manuals more than we
normally do!
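
As a quick illustration (using the familiar SCOTT.EMP demo table), NTILE
deals the rows into N equal-height buckets, which is exactly the histogram
idea:

SELECT empno, sal,
       NTILE(4) OVER (ORDER BY sal) AS quartile
  FROM scott.emp;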

John Kanagaraj
Oracle Applications DBA
DBSoft Inc
(W): 408-970-7002

The manuals for Oracle are here: http://tahiti.oracle.com
The manual for Life is here: http://www.gospelcom.net

** The opinions and statements above are entirely my own and not those of my
employer or clients **





RE: Statistical sampling and representative stats collection

2002-05-24 Thread Larry Elkins

John,

Great work. I do have some comments and questions, but don't take them as
criticism. Your work on this is greatly appreciated. My comments/questions
are in-line, and only asked so that the list can bounce around some more
ideas and thoughts.

 --- Begin Quote ---
 MYTH:  COMPUTE IS BETTER THAN ESTIMATE
 This one generates an endless debate actually, so we will not take a firm
 stand either way. Rather, we will present some figures that throw
 some light
 on the issue and allow us to step back and look at the situation. The
 problem with COMPUTE is that it has to scan the entire table, sort it and
 figure out the exact data distribution. On the other hand, ESTIMATE steps
 through samples of the data, sorts and analyzes only a portion of
 the data.

True, if we can get good stats that result in effective plans by using
ESTIMATE, by all means go that route since a COMPUTE is much more expensive.


 In a recent test for the effectiveness of COMPUTE versus ESTIMATE on a
 static clone of a reasonably large Oracle Apps database, the
 statistics were
 generated and stored for both COMPUTE and ESTIMATE. The Database consisted
 of about 3,300 tables and 6,000 indexes and occupied approximately 120 Gb.

My question here is: would this database and your findings be applicable to
other databases and the nature of their data? I wouldn't think so, since the
characteristics can be so different. But, if I'm reading you correctly, you
aren't saying that ESTIMATE is always the only way. Still, the title says
Myth: Compute is better than Estimate. Well, we have all seen cases where
COMPUTE worked out better (or 10% or 1% might have worked better than the
30% specified). So maybe it's just a minor quibble (no Jack Silvey, not a
Tribble), and if the title included ...is ALWAYS better..., then I would
totally agree with it being a myth.

We have all been there, and I worked with a group recently who had the
luxury of doing full computes every Sunday. Someone accidentally analyzed
the schema at 30% on Monday and a lot of things went down the toilet. Going
back to COMPUTE fixed things. Then again, maybe a 10% ESTIMATE would have
fixed things. Jack and I both work with a guy who has seen COMPUTE result in
undesired plans when 10% did as well; they got the desired plans by going to
1%.

So, even if one agrees that we don't necessarily have to COMPUTE, and in
many (probably most?) cases we don't, there is still a lot of testing to be
done to find the best estimate percent, and this could very well be
different for various objects. And I think that's the battle we all face --
what is the best sampling percentage? Right now, it still seems to be done
on a trial-and-error basis. I have some ideas on how one might attack this
in an automated fashion, but it's still a *very* rough idea that I need to
bounce off a few cohorts.

SNIP
 While the figures speak for themselves, we will offer some general advice
 to the cautious: ESTIMATE on tables and COMPUTE on Indexes. Columns are
 analyzed by default, but the stats serve no useful purpose other than
 showing data spread. Hence, you could ANALYZE Tables and Indexed columns
 alone.

If I read this correctly, are you saying we only need to gather table stats
and the column stats for indexed columns, and that there is no practical use
for these stats on non-indexed columns? If so, I disagree on this point,
even if it is general advice and not a rule. Stats on non-indexed columns
can play a *large* role in CBO decisions. I'm not going to go into details
and examples here illustrating that, but those stats can still help decide
the driving table, the join methods between tables, etc. I built a sample
case some time back to illustrate the importance of gathering these
non-indexed column stats. Now, it might not be important for all systems,
but if your queries use indexed columns while still specifying criteria on
non-indexed columns, the gathering of stats on the non-indexed columns could
be *very* important. I can send you more details back-channel if you are
interested.
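
For what it's worth, a minimal sketch of the kind of thing I mean --
gathering column stats beyond just the indexed ones (object names invented):

-- ANALYZE form: table stats, every column, and the indexes in one pass
ANALYZE TABLE orders COMPUTE STATISTICS
  FOR TABLE
  FOR ALL COLUMNS SIZE 1
  FOR ALL INDEXES;

-- DBMS_STATS form
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname    => USER,
    tabname    => 'ORDERS',
    method_opt => 'FOR ALL COLUMNS SIZE 1',
    cascade    => TRUE);
END;
/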

 An identified list of 'small' tables could also be COMPUTED rather than
 ESTIMATED. This advice is given because ESTIMATE on a table comes close as
 far as row count goes, while COMPUTE on Indexes generates a more accurate
 picture of both data distribution and object size statistics. Testing the
 effectiveness of COMPUTE versus ESTIMATE is simple and provides you with
 figures that you can use to decide the strategy for your situation.

Ok, it sounds like you aren't saying one size fits all.


 Before we move to the next topic, keep in mind that an ANALYZE/ESTIMATE
 with a sample size greater than or equal to 50% will result in COMPUTE.

 --- End Quote ---

 The problem is that this simple mathematical model looks only at object
 sizes and does not look at Column spread and sensitivity. However, I
 believe that the combination of ESTIMATE on Tables and COMPUTE on Indexes
 would catch most of it.

 As 

RE: Statistical sampling and representative stats collection

2002-05-24 Thread MacGregor, Ian A.

John, are you saying to create histograms on all indexed columns, or just the
ones with skewed distributions, and also for ones which, although symmetric
in distribution, have some values much more prevalent than others?

Ian MacGregor
Stanford Linear Accelerator Center
[EMAIL PROTECTED]

-Original Message-
Sent: Friday, May 24, 2002 3:01 PM
To: Multiple recipients of list ORACLE-L


Hi Jack,

 One question - you mention that an index analyze
 provides better data distribution. Could you discuss
 what you found in more detail?

What I meant was that the Histograms that are created during an
ANALYZE/COMPUTE on Indexes will provide an almost perfect picture of the
data distribution in such columns. Under _some_ circumstances, the CBO will
be able to use this information to decide the best path (FTS or Indexed
read). On the other hand, and simply stated, when bind variables are used in
a cursor, this information about data distribution is not used since the
value of the bind variable is not used during the parse prior to 9i. In
other words, the access plan is built without considering the value of bind
variables that would have otherwise influenced the plan when histograms (and
thus information about data distribution) are available. However, 9i kinda
rectifies this, and I quote from the Fine Manual (Oracle9i Database
Performance Guide and Reference):

Cursor Sharing Enhancements
The CBO now peeks at the values of user-defined bind variables on the first
invocation of a cursor. This lets the optimizer determine the selectivity of
any WHERE clause condition, as well as if literals had been used instead of
bind variables. When bind variables are used in a statement, it is assumed
that cursor sharing is intended and that different invocations are supposed
to use the same execution plan.

This gives us the best of both worlds (in some cases). Of course, it all
depends on the number of buckets defined for Histograms and the width of the
data spread (and that is why I emphasized _some_ and 'simply stated' above).

I haven't tested this extensively, and I would appreciate any further inputs
from the Gurus! (For the rest of us: A 10053 trace should show up what's
happening. There was a _great_ presentation from Wolfgang Breitling on this
topic at the recent IOUG).

Btw: Searching for 'bucket' in the 8i SQL reference came up with the NTILE
function (new in 8i), and I said Wow! because I was looking for just such a
function. Goes to show that we need to read the fine manuals more than we
normally do!

John Kanagaraj
Oracle Applications DBA
DBSoft Inc
(W): 408-970-7002

The manuals for Oracle are here: http://tahiti.oracle.com
The manual for Life is here: http://www.gospelcom.net

** The opinions and statements above are entirely my own and not those of my
employer or clients **





RE: Statistical sampling and representative stats collection

2002-05-24 Thread Larry Elkins

 Hi Jack,

  One question - you mention that an index analyze
  provides better data distribution. Could you discuss
  what you found in more detail?

 What I meant was that the Histograms that are created during an
 ANALYZE/COMPUTE on Indexes will provide an almost perfect picture of the
 data distribution in such columns. Under _some_ circumstances,
 the CBO will
 be able to use this information to decide the best path (FTS or Indexed
 read).

And stats on the non-indexed columns can also play a large role in deciding
driving table order and join methods. Ok, touched on that in an earlier
email ;-)

 On the other hand, and simply stated, when bind variables
 are used in
 a cursor, this information about data distribution is not used since the
 value of the bind variable is not used during the parse prior to 9i.

In my case, and Jack's (I'm now doing some work with a DB where Jack is
dealing with the analyze strategies), the bind thing isn't an issue.
Everything is ad-hoc, and literals *are* used. But there really isn't much
of an opportunity for sharing SQL even if binds were used. One user might
specify 5 values for one column, 3 values for another, and 2 values for five
other columns. Given the combinations of the criteria specified, the number
of values specified for each of those columns, not to mention the tables
specified, very few, if any, of the SQL statements could be shared even if
using binds. Plus, in this case, with histograms being very valuable, one
could live with less cursor sharing even if there were some statements that
could be shared when using binds. In this case, the literals are needed and
their use is not causing any shared pool or library cache contention.


 Btw: Searching for 'bucket' in the 8i SQL reference came up with the NTILE
 function (new in 8i), and I said Wow! because I was looking for just such a
 function. Goes to show that we need to read the fine manuals more than we
 normally do!

The analytic functions are great. They first came about in 8.1.6, with a few
more added in 8.1.7, and they were taken even further in 9i. A lot of the
traditional ways we might have done things, often involving self joins or
procedural code, are thrown out the window. I've found all kinds of uses for
them that (1) improve performance over the old approaches, and (2) are
simpler to understand. Then again, some of the analytic function examples
leave my head spinning. I'm still working through a lot of them for better
understanding. But yeah, analytic functions like NTILE are very, very nice.
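
A hypothetical example of the kind of rewrite I mean (table and columns
invented) -- latest order per customer, the old correlated-subquery way
versus one pass with ROW_NUMBER:

-- Traditional: scan plus a correlated subquery per candidate row
SELECT o.*
  FROM orders o
 WHERE o.order_date = (SELECT MAX(o2.order_date)
                         FROM orders o2
                        WHERE o2.cust_id = o.cust_id);

-- Analytic: a single pass, then keep the top row per customer
SELECT *
  FROM (SELECT o.*,
               ROW_NUMBER() OVER (PARTITION BY cust_id
                                  ORDER BY order_date DESC) rn
          FROM orders o)
 WHERE rn = 1;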


 John Kanagaraj
 Oracle Applications DBA
 DBSoft Inc
 (W): 408-970-7002




RE: Statistical sampling and representative stats collection

2002-05-23 Thread Jared . Still

Well, if the temp had any brains, the job should take less than an hour.

Maybe your temp employee was hoping for a promotion to damagement,
in which case the job  would take days.

Jared






Jamadagni, Rajendra [EMAIL PROTECTED]
Sent by: [EMAIL PROTECTED]
05/22/2002 08:53 AM
Please respond to ORACLE-L

 
To: Multiple recipients of list ORACLE-L [EMAIL PROTECTED]
cc: 
Subject:RE: Statistical sampling and representative stats collection


I remember, in 96/97, I was working on an Oracle 7 database and none of
this sample() stuff was available. My task was to select x people out of z
number of people randomly. We hired a temp, who took the Dallas/Fort Worth
directory (I was working in that area) and counted (yes, counted) the number
of people whose last names started with A and B and so on.

We put this information into a table and then found out the percentage of
last names starting with A compared to the whole list for each letter. This
percentage was used as a weightage for each letter when choosing random
samples from a list of people. Then I used this weightage and dbms_random to
pick the required number of names from our database. It was fun to
implement, but I feel bad for the temp who had to count the names.

So, compared to that, 8i and 9i built-in features rock !!
Raj
__
Rajendra Jamadagni   MIS, ESPN Inc.
Rajendra dot Jamadagni at ESPN dot com
Any opinion expressed here is personal and doesn't reflect that of ESPN 
Inc.

QOTD: Any clod can have facts, but having an opinion is an art!






RE: Statistical sampling and representative stats collection

2002-05-23 Thread John Kanagaraj

Jack,

I conducted some tests of ANALYZE ESTIMATE vs COMPUTE for my IOUG paper and
arrived at the following: (cut-and-paste of relevant parts of the paper)

--- Begin Quote ---
MYTH:  COMPUTE IS BETTER THAN ESTIMATE
This one generates an endless debate actually, so we will not take a firm
stand either way. Rather, we will present some figures that throw some light
on the issue and allow us to step back and look at the situation. The
problem with COMPUTE is that it has to scan the entire table, sort it and
figure out the exact data distribution. On the other hand, ESTIMATE steps
through samples of the data, sorts and analyzes only a portion of the data.

In a recent test for the effectiveness of COMPUTE versus ESTIMATE on a
static clone of a reasonably large Oracle Apps database, the statistics were
generated and stored for both COMPUTE and ESTIMATE. The Database consisted
of about 3,300 tables and 6,000 indexes and occupied approximately 120 Gb.
The ESTIMATE percentage was defaulted to 10% and no activity other than
ANALYZE was allowed on this clone during the entire period. Table statistics
including row count, average row length and blocks occupied were analyzed.
This showed that there were some differences in row count and average row
length on 321 of these tables. Row count differences ranged from a value of
53 rows less in the ESTIMATE of a table containing 205,743 rows (0.025%) all
the way up to a count difference of 101,704 in 13,311,090 rows (0.76%). Even
assuming a difference of a maximum of 5% in these scenarios, you are not far
off the goal. Further analysis showed that a smaller average row length
coupled with a small table produced larger variations than was usually seen.


The differences, however, were far more pronounced in Indexes - differences
of up to 300% were noticed. Further analysis showed that this was related to
the percentage of deleted leaf rows in the index. If this percentage is
high, the possibility of ESTIMATE going wrong was also high, as the
deletions are not factored in correctly. This was especially true if the
deletions occurred in leaf blocks that were probably not involved in the
ESTIMATE. When the count of deleted leaf rows was low or even nil within the
index, the percentage difference was much lower, in the range of 4 to 5%.

The real myth killer is the cost of COMPUTE versus ESTIMATE - COMPUTE
required 66,553,308 reads versus 38,951,158 reads for ESTIMATE - almost 70%
more reads for COMPUTE. The sorting involved in determining the averages and
data distribution was the clincher - COMPUTE processed 4,263,724,259 rows in
sorting operations while ESTIMATE sorted just 18,025,069 - i.e. roughly 235
times as many rows were sorted for the COMPUTE operation. The last nail in
the coffin
was the time taken to COMPUTE statistics - about 36 hours against the time
to ESTIMATE of just 12 hours.

While the figures speak for themselves, we will offer some general advice to
the cautious: ESTIMATE on tables and COMPUTE on Indexes. Columns are
analyzed by default, but the stats serve no useful purpose other than
showing data spread. Hence, you could ANALYZE Tables and Indexed columns
alone. An identified list of 'small' tables could also be COMPUTED rather
than ESTIMATED. This advice is given because ESTIMATE on a table comes close
as far as row count goes, while COMPUTE on Indexes generates a more accurate
picture of both data distribution and object size statistics. Testing the
effectiveness of COMPUTE versus ESTIMATE is simple and provides you with
figures that you can use to decide the strategy for your situation.
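
In ANALYZE terms, that advice boils down to something like the following
sketch (object names invented; the 10% mirrors the test above):

ANALYZE TABLE big_table ESTIMATE STATISTICS SAMPLE 10 PERCENT;
ANALYZE INDEX big_table_pk COMPUTE STATISTICS;

-- and for the identified list of 'small' tables:
ANALYZE TABLE small_lookup COMPUTE STATISTICS;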

Before we move to the next topic, keep in mind that an ANALYZE/ESTIMATE with
a sample size greater than or equal to 50% will result in COMPUTE.

--- End Quote ---

The problem is that this simple mathematical model looks only at object
sizes and does not look at Column spread and sensitivity. However, I believe
that the combination of ESTIMATE on Tables and COMPUTE on Indexes would
catch most of it.

As always, YMMV!

John Kanagaraj
Oracle Applications DBA
DBSoft Inc
(W): 408-970-7002

The manuals for Oracle are here: http://tahiti.oracle.com
The manual for Life is here: http://www.gospelcom.net

** The opinions and statements above are entirely my own and not those of my
employer or clients **


 -Original Message-
 From: Jack Silvey [mailto:[EMAIL PROTECTED]]
 Sent: Tuesday, May 21, 2002 2:44 PM
 To: Multiple recipients of list ORACLE-L
 Subject: Statistical sampling and representative stats collection
 
 
 Hi all,
 
 Did some investigation about statistical sampling this
 weekend since we are going to optimize our analyze
 process soon, and would like some input from all you
 orabrains on this one.
 
 I opened a TAR with Oracle asking about the sampling
 algorithm of stats collection, and they assured me it
 was random. 
 
 The goal of analyze...estimate is to collect stats
 that are representative of the data population as a
 whole using a given sample set. Since analyzing 

RE: Statistical sampling and representative stats collection

2002-05-23 Thread Jack Silvey

John,

Thanks for the great input. I think at this time we
are working our way towards an invalidate/load/rebuild
compute index strategy, so estimate is probably not in
the picture for our indexes. Since the data is ordered
in the index and does not have to be sorted during an
analyze, a compute should not be that expensive
relative to a table compute; plus, doing it during the
rebuild adds less overhead to the analyze, or so I
have read in meat link.

One question - you mention that an index analyze
provides better data distribution. Could you discuss
what you found in more detail?

thx,

/jack


--- John Kanagaraj [EMAIL PROTECTED] wrote:
 Jack,
 
 I conducted some tests of ANALYZE ESTIMATE vs
 COMPUTE for my IOUG paper and
 arrived at the following: (cut-and-paste of relevant
 parts of the paper)
 
 --- Begin Quote ---
 MYTH:  COMPUTE IS BETTER THAN ESTIMATE
 This one generates an endless debate actually, so we
 will not take a firm
 stand either way. Rather, we will present some
 figures that throw some light
 on the issue and allow us to step back and look at
 the situation. The
 problem with COMPUTE is that it has to scan the
 entire table, sort it and
 figure out the exact data distribution. On the other
 hand, ESTIMATE steps
 through samples of the data, sorts and analyzes only
 a portion of the data.
 
 In a recent test for the effectiveness of COMPUTE
 versus ESTIMATE on a
 static clone of a reasonably large Oracle Apps
 database, the statistics were
 generated and stored for both COMPUTE and ESTIMATE.
 The Database consisted
 of about 3,300 tables and 6,000 indexes and occupied
 approximately 120 Gb.
 The ESTIMATE percentage was defaulted to 10% and no
 activity other than
 ANALYZE was allowed on this clone during the entire
 period. Table statistics
 including row count, average row length and blocks
 occupied were analyzed.
 This showed that there were some differences in row
 count and average row
 length on 321 of these tables. Row count differences
 ranged from a value of
 53 rows less in the ESTIMATE of a table containing
 205,743 rows (0.025%) all
 the way up to a count difference of 101,704 in
 13,311,090 rows (0.76%). Even
 assuming a difference of a maximum of 5% in these
 scenarios, you are not far
 off the goal. Further analysis showed that a smaller
 average row length
 coupled with a small table produced larger
 variations than was usually seen.
 
 
 The differences however, were far more pronounced in
 Indexes - differences
 of up to 300% were noticed. Further analysis showed
 that this was related to
 the percentage of deleted leaf rows in the index. If
 this percentage is
 high, the possibility of ESTIMATE going wrong was
 also high, as the
 deletions are not factored in correctly. This was
 especially true if the
 deletions occurred in leaf blocks that were probably
 not involved in the
 ESTIMATE. When the count of deleted leaf rows was low or even
 nil within the index,
 the percentage difference was much lower, in the
 range of 4 to 5%.
 
 The real myth killer is the cost of COMPUTE versus
 ESTIMATE - COMPUTE
 required 66,553,308 reads versus 38,951,158 reads
 for ESTIMATE - almost 70%
 more reads for COMPUTE. The sorting involved in
 determining the averages and
 data distribution was a clincher - COMPUTE processed
 4,263,724,259 rows in
 sorting operations while ESTIMATE sorted just
 18,025,069 - i.e. roughly 235 times as many
 rows were sorted for the COMPUTE operation. The
 last nail in the coffin
 was the time taken to COMPUTE statistics - about 36
 hours against the time
 to ESTIMATE of just 12 hours.
 
 While the figures speak for themselves, we will
 offer some general advice to
 the cautious: ESTIMATE on tables and COMPUTE on
 Indexes. Columns are
 analyzed by default, but serve no useful purpose
 other than showing data
 spread. Hence, you could ANALYZE only Tables and
 Indexed columns alone. An
 identified list of 'small' tables could also be
 COMPUTED rather than
 ESTIMATED. This advice is given because ESTIMATE on a
 table comes close as
 far as row count goes, while COMPUTE on Indexes
 generates a more accurate
 picture of both data distribution as well as object
 size statistics. Testing
 the effectiveness of COMPUTE versus ESTIMATE is
 simple and provides you with
 figures that you can use to decide the strategy for
 your situation. 
 
 Before we move to the next topic, keep in mind that
 an ANALYZE/ESTIMATE with
 a sample size greater than or equal to 50% will
 result in COMPUTE.
 
 --- End Quote ---
 
 The problem is that this simple mathematical model
 looks only at object
 sizes and does not look at Column spread and
 sensitivity. However, I believe
 that the combination of ESTIMATE on Tables and
 COMPUTE on Indexes would
 catch most of it.
 
 As always, YMMV!
 
 John Kanagaraj
 Oracle Applications DBA
 DBSoft Inc
 (W): 408-970-7002
 
 The manuals for Oracle are here:
 http://tahiti.oracle.com
 The manual for Life is here:
 http://www.gospelcom.net
 
 ** The opinions and 

Re: Statistical sampling and representative stats collection

2002-05-22 Thread Cherie_Machler


Jack,

What version are you on? Are you able to utilize the gather_stale option?
That way you would not only be optimizing the amount you are estimating but
also the interval between analyzes.

Of course, that option of DBMS_STATS is not available on older versions.
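
For the archives, a rough sketch of the gather_stale route on 8i and up --
schema and table names invented, and the table must have monitoring switched
on first:

ALTER TABLE orders MONITORING;

BEGIN
  DBMS_STATS.GATHER_SCHEMA_STATS(
    ownname          => 'APPS',
    options          => 'GATHER STALE',
    estimate_percent => 10,
    cascade          => TRUE);
END;
/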

Cherie Machler
Oracle DBA
Gelco Information Network


   
  
Jack Silvey jack_silvey@yahoo.com
Sent by: [EMAIL PROTECTED]
05/21/02 06:19 PM
Please respond to ORACLE-L

To: Multiple recipients of list ORACLE-L [EMAIL PROTECTED]
Subject: Re: Statistical sampling and representative stats collection




Hi Rafiq,

We have been using 35 percent on our warehouse, even
on our fact partitions. Now that I have thought about
it for a while, that seems like a lot given the volume
of data. If a representative sample can be gathered
with 10,000 or 50,000 or 100,000 rows, and our fact
partitions have millions of rows each, seems like we
could go 1% on our analyze and it would be within
acceptable tolerances.

/jack


--- Mohammad Rafiq [EMAIL PROTECTED] wrote:
 Most of the list members agree on estimate with
 30%

 Regards
 Rafiq




 Reply-To: [EMAIL PROTECTED]
 To: Multiple recipients of list ORACLE-L
 [EMAIL PROTECTED]
 Date: Tue, 21 May 2002 13:43:33 -0800

 Hi all,

 Did some investigation about statistical sampling
 this
 weekend since we are going to optimize our analyze
 process soon, and would like some input from all you
 orabrains on this one.

 I opened a TAR with Oracle asking about the sampling
 algorithm of stats collection, and they assured me
 it
 was random.

 The goal of analyze...estimate is to collect stats
 that are representative of the data population as a
 whole using a given sample set. Since analyzing
 tables
 takes up resources (does sorts to order the data for
 investigation) the fewer rows you use in estimate,
 the
 less system resources you use and the faster the
 analyze will go.

 Since our goal is to get as small a sample as
 possible
 and still have stats that are representative, my
 contention is that we could start by finding what
 the
 margin of error will be for each sample size and
 gauge
 our tolerance for it.

 One standard way to calculate margin of error for a
 given sample is by using this formula:

 M = 1/SQRT(N)

 where:
 M = margin of error
 N=sample size

 So, if we can tolerate stats that have a 1% margin
 of error (will deviate from representative of the
 whole population by 1%), our sample size should be
 10,000 rows.

 Also, a corollary (not a toyota corollary, though)
 to
 this would be that the more rows you add to your
 sample, the closer to representative your sample
 will
 be. So, in order to test whether your sample is
 representative enough, you could analyze using
 either
 estimate 49% or compute, take a snapshot of the
 stats,
 and then compare the stats from a 10,000 row
 estimate
 to those. Then, add rows to your estimate until you
 are satisfied with the stats.

 This of course is a pie in the sky mathematical
 model,
 but seems like a reasonable place to start with
 testing.

 Input? Input? Buhler? Buhler?


 /jack silvey




Re: Statistical sampling and representative stats collection

2002-05-22 Thread Cherie_Machler


Jack,

Isn't the problem with this concept that it doesn't take into consideration
how skewed the data is? Statistical significance would be relevant to
perfectly distributed data, but wouldn't you need a higher percentage of the
data for significance in more highly skewed data?

Just something to consider.

Cherie Machler
Oracle DBA
Gelco Information Network


   
  
Jack Silvey jack_silvey@yahoo.com
Sent by: [EMAIL PROTECTED]
05/21/02 06:19 PM
Please respond to ORACLE-L

To: Multiple recipients of list ORACLE-L [EMAIL PROTECTED]
Subject: Re: Statistical sampling and representative stats collection




Hi Rafiq,

We have been using 35 percent on our warehouse, even
on our fact partitions. Now that I have thought about
it for a while, that seems like a lot given the volume
of data. If a representative sample can be gathered
with 10,000 or 50,000 or 100,000 rows, and our fact
partitions have millions of rows each, seems like we
could go 1% on our analyze and it would be within
acceptable tolerances.

/jack


--- Mohammad Rafiq [EMAIL PROTECTED] wrote:
 Most of the list members agree on estimate with
 30%

 Regards
 Rafiq




 Reply-To: [EMAIL PROTECTED]
 To: Multiple recipients of list ORACLE-L
 [EMAIL PROTECTED]
 Date: Tue, 21 May 2002 13:43:33 -0800

 Hi all,

 Did some investigation about statistical sampling
 this
 weekend since we are going to optimize our analyze
 process soon, and would like some input from all you
 orabrains on this one.

 I opened a TAR with Oracle asking about the sampling
 algorithm of stats collection, and they assured me
 it
 was random.

 The goal of analyze...estimate is to collect stats
 that are representative of the data population as a
 whole using a given sample set. Since analyzing
 tables
 takes up resources (does sorts to order the data for
 investigation) the fewer rows you use in estimate,
 the
 less system resources you use and the faster the
 analyze will go.

 Since our goal is to get as small a sample as
 possible
 and still have stats that are representative, my
 contention is that we could start by finding what
 the
 margin of error will be for each sample size and
 gauge
 our tolerance for it.

 One standard way to calculate margin of error for a
 given sample is by using this formula:

 M = 1/SQRT(N)

 where:
 M = margin of error
 N=sample size

  So, if we can tolerate stats that have a 1%
 of error (will deviate from representative of the
 whole population by 1%), our sample size should be
 10,000 rows.

 Also, a corollary (not a toyota corollary, though)
 to
 this would be that the more rows you add to your
 sample, the closer to representative your sample
 will
 be. So, in order to test whether your sample is
 representative enough, you could analyze using
 either
 estimate 49% or compute, take a snapshot of the
 stats,
 and then compare the stats from a 10,000 row
 estimate
 to those. Then, add rows to your estimate until you
 are satisfied with the stats.

 This of course is a pie in the sky mathematical
 model,
 but seems like a reasonable place to start with
 testing.

 Input? Input? Buhler? Buhler?


 /jack silvey



RE: Statistical sampling and representative stats collection

2002-05-22 Thread Jamadagni, Rajendra

Jack,

Nielsen Ratings (the TV Rating company) monitors about 5000 people (and
their TV watching habits) to supply ratings for all the shows on most of the
networks for the whole United States. So, as long as you have a working and
proven statistical model, and a good sample, it works. How do I know? Ever
seen anyone challenge Nielsen Ratings for a show? I haven't.

Raj
__
Rajendra Jamadagni  MIS, ESPN Inc.
Rajendra dot Jamadagni at ESPN dot com
Any opinion expressed here is personal and doesn't reflect that of ESPN Inc.

QOTD: Any clod can have facts, but having an opinion is an art!





RE: Statistical sampling and representative stats collection

2002-05-22 Thread DENNIS WILLIAMS

Jack, Raj
I agree. The main point I feel is that if you follow statistics
theory, which a good part of our modern technology relies on, you will test
a fixed number of samples, rather than a percentage of the table rows. 
For a small table, you may have to sample the entire table to get
results that work. As a real-life example, you wouldn't sample 10 US
Senators and expect those results to be accurate. No, you would simply
survey each Senator (and hope they don't change their mind). Similarly,
don't just sample 30 percent of a 1,000-row table. Sample all 1,000 rows.
For a large table, sampling a percentage would oversample and be a
wasted effort. If you have a million-row table and a hundred-million-row
table, the same sample size will produce results nearly as accurate for
both. That is why you don't see nearly as many state political polls. It is
nearly as much expense to accurately sample the citizens in a state as it is
to sample all the citizens in the US
Someone asked about skewed data. Well, that is the reason you
perform a RANDOM sample. That is the key point, and what produces many
real-life statistical failures. A classic example is the Truman-Dewey
presidential race in 1948. The pioneer pollsters used random samples of
phone numbers and confidently predicted Dewey's victory. What they neglected
was that wealthier people had telephones in greater proportion than poor
people. So their sample was skewed, which produced bad results. Here, we're
betting on Oracle's statement that the sample is truly random.
Now, if you want a more accurate result, you will sample more. But
you aren't increasing the sample size because the table is larger, but to
increase the accuracy. And to compensate for any other inaccuracies. 
Just a thought, if you're responsible for a data warehouse, you may
want to consider studying some basic statistics. Unfortunately most computer
science curriculums don't require a class in statistics. In fact, since
polls form a lot of our political discussion, it wouldn't hurt to require
all citizens to have some statistical training. It might make it harder for
politicians to misconstrue statistical results. However, it is hard enough
to get people just to vote, so I suppose that one isn't going to fly.

Dennis Williams
DBA
Lifetouch, Inc.
[EMAIL PROTECTED]


-Original Message-
Sent: Wednesday, May 22, 2002 9:39 AM
To: Multiple recipients of list ORACLE-L


Jack,

Nielsen Ratings (the TV Rating company) monitors about 5000 people (and
their TV watching habits) to supply ratings for all the shows on most of the
networks for the whole United States. So, as long as you have a working and
proven statistical model, and a good sample, it works. How do I know? Ever
seen anyone challenge Nielsen Ratings for a show? I haven't.

Raj
__
Rajendra Jamadagni  MIS, ESPN Inc.
Rajendra dot Jamadagni at ESPN dot com
Any opinion expressed here is personal and doesn't reflect that of ESPN Inc.

QOTD: Any clod can have facts, but having an opinion is an art!



Re: Statistical sampling and representative stats collection

2002-05-22 Thread Jack Silvey

Cherie,

Had some problems with DBMS_STATS, but we are making
our way towards it. Just have to be careful, since it
can do quirky things like analyze the dictionary. We
are using dba_tab_modifications to monitor our tables.
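
For the curious, that looks roughly like this (8i and up; table name
invented, and the counts only appear after the server flushes monitoring
info from memory):

ALTER TABLE orders MONITORING;

SELECT table_name, inserts, updates, deletes, timestamp
  FROM dba_tab_modifications
 WHERE table_owner = USER;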

Thanks for the input.

Jack


--- [EMAIL PROTECTED] wrote:
 
 Jack,
 
 What version are you on?   Are you able to utilize
 the gather_stale option.
 That way you would not only be optimizing the amount
 you are estimating but
 the interval between analyzing.
 
 Of course, that option of DBMS_STATS is not
 available on older versions.
 
 Cherie Machler
 Oracle DBA
 Gelco Information Network
 
 
 
 

 Jack Silvey jack_silvey@yahoo.com
 Sent by: [EMAIL PROTECTED]
 05/21/02 06:19 PM
 Please respond to ORACLE-L

 To: Multiple recipients of list ORACLE-L [EMAIL PROTECTED]
 Subject: Re: Statistical sampling and representative stats collection
 
 Hi Rafiq,
 
 We have been using 35 percent on our warehouse, even
 on our fact partitions. Now that I have thought
 about
 it for a while, that seems like a lot given the
 volume
 of data. If a representative sample can be gathered
 with 10,000 or 50,000 or 100,000 rows, and our fact
 partitions have millions of rows each, seems like we
 could go 1% on our analyze and it would be within
 acceptable tolerances.
 
 /jack
 
 
 --- Mohammad Rafiq [EMAIL PROTECTED] wrote:
  Most of the list members agree on estimate
 with
  30%
 
  Regards
  Rafiq
 
 
 
 
  Reply-To: [EMAIL PROTECTED]
  To: Multiple recipients of list ORACLE-L
  [EMAIL PROTECTED]
  Date: Tue, 21 May 2002 13:43:33 -0800
 
  Hi all,
 
  Did some investigation about statistical sampling
  this
  weekend since we are going to optimize our analyze
  process soon, and would like some input from all
 you
  orabrains on this one.
 
  I opened a TAR with Oracle asking about the
 sampling
  algorithm of stats collection, and they assured me
  it
  was random.
 
  The goal of analyze...estimate is to collect stats
  that are representative of the data population as
 a
  whole using a given sample set. Since analyzing
  tables
  takes up resources (does sorts to order the data
 for
  investigation) the fewer rows you use in estimate,
  the
  less system resources you use and the faster the
  analyze will go.
 
  Since our goal is to get as small a sample as
  possible
  and still have stats that are representative, my
  contention is that we could start by finding what
  the
  margin of error will be for each sample size and
  gauge
  our tolerance for it.
 
  One standard way to calculate margin of error for
 a
  given sample is by using this formula:
 
  M = 1/SQRT(N)
 
  where:
  M = margin of error
  N=sample size
 
  So, if we can tolerate stats that have a 1%
 margin
  of error (will deviate from representative of the
  whole population by 1%), our sample size should be
  10,000 rows.
 
  Also, a corollary (not a toyota corollary, though)
  to
  this would be that the more rows you add to your
  sample, the closer to representative your sample
  will
  be. So, in order to test whether your sample is
  representative enough, you could analyze using
  either
  estimate 49% or compute, take a snapshot of the
  stats,
  and then compare the stats from a 10,000 row
  estimate
  to those. Then, add rows to your estimate until
 you
  are satisfied with the stats.
 
  This of course is a pie in the sky mathematical
  model,
  but seems like a reasonable place to start with
  testing.
 
  Input? Input? Buhler? 

RE: Statistical sampling and representative stats collection

2002-05-22 Thread Jamadagni, Rajendra

I remember, in 96/97, I was working on an Oracle 7 database and none of
this sample() stuff was available. My task was to select x people out of z
number of people randomly. We hired a temp, who took the Dallas/Fort Worth
directory (I was working in that area) and counted (yes, counted) the number
of people whose last names started with A and B and so on.

We put this information into a table and then found out the percentage of
last names starting with A compared to the whole list for each letter. This
percentage was used as a weightage for each letter when choosing random
samples from a list of people. Then I used this weightage and dbms_random to
pick the required number of names from our database. It was fun to
implement, but I feel bad for the temp who had to count the names.

So, compared to that, 8i and 9i built-in features rock !!
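
For instance, the whole exercise now collapses to something like this sketch
(table name invented; DBMS_RANDOM.VALUE needs 8.1.6 or later):

-- N random rows, no manual weightage required
SELECT *
  FROM (SELECT p.* FROM people p ORDER BY DBMS_RANDOM.VALUE)
 WHERE ROWNUM <= 100;

-- or let the kernel sample roughly 1% of the rows (8i SAMPLE clause)
SELECT COUNT(*) FROM people SAMPLE (1);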
Raj
__
Rajendra Jamadagni  MIS, ESPN Inc.
Rajendra dot Jamadagni at ESPN dot com
Any opinion expressed here is personal and doesn't reflect that of ESPN Inc.

QOTD: Any clod can have facts, but having an opinion is an art!





Re: Statistical sampling and representative stats collection

2002-05-22 Thread Jack Silvey

That is a very good question. Anyone else know the
answer?

My approach to skewness right now is to compare the
mean and the median of the entire dataset, since if
you have two measures of centrality that are not the
same, you have skewness. 
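
In 9i you can eyeball that straight from SQL, with PERCENTILE_CONT standing
in for the median (a sketch -- table and column invented; a big gap between
the two numbers suggests skew):

SELECT AVG(amount) AS mean,
       PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) AS median
  FROM orders;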

/jack

--- [EMAIL PROTECTED] wrote:
 
 Jack,
 
 Isn't the problem with this concept that it doesn't take into
 consideration how skewed the data is? Statistical significance would be
 relevant to perfectly distributed data, but wouldn't you need a higher
 percentage of the data for significance in more highly skewed data?
 
 Just something to consider.
 
 Cherie Machler
 Oracle DBA
 Gelco Information Network
 
 
 
 

 Jack Silvey jack_silvey@yahoo.com
 Sent by: [EMAIL PROTECTED]
 05/21/02 06:19 PM
 Please respond to ORACLE-L

 To: Multiple recipients of list ORACLE-L [EMAIL PROTECTED]
 Subject: Re: Statistical sampling and representative stats collection
 
 Hi Rafiq,
 
 We have been using 35 percent on our warehouse, even
 on our fact partitions. Now that I have thought
 about
 it for a while, that seems like a lot given the
 volume
 of data. If a representative sample can be gathered
 with 10,000 or 50,000 or 100,000 rows, and our fact
 partitions have millions of rows each, seems like we
 could go 1% on our analyze and it would be within
 acceptable tolerances.
 
 /jack
 
 
 --- Mohammad Rafiq [EMAIL PROTECTED] wrote:
  Most of the list members agree on estimate
 with
  30%
 
  Regards
  Rafiq
 
 
 
 
  Reply-To: [EMAIL PROTECTED]
  To: Multiple recipients of list ORACLE-L
  [EMAIL PROTECTED]
  Date: Tue, 21 May 2002 13:43:33 -0800
 
  Hi all,
 
  Did some investigation about statistical sampling
  this
  weekend since we are going to optimize our analyze
  process soon, and would like some input from all
 you
  orabrains on this one.
 
  I opened a TAR with Oracle asking about the
 sampling
  algorithm of stats collection, and they assured me
  it
  was random.
 
  The goal of analyze...estimate is to collect stats
  that are representative of the data population as
 a
  whole using a given sample set. Since analyzing
  tables
  takes up resources (does sorts to order the data
 for
  investigation) the fewer rows you use in estimate,
  the
  less system resources you use and the faster the
  analyze will go.
 
  Since our goal is to get as small a sample as
  possible
  and still have stats that are representative, my
  contention is that we could start by finding what
  the
  margin of error will be for each sample size and
  gauge
  our tolerance for it.
 
  One standard way to calculate margin of error for
 a
  given sample is by using this formula:
 
  M = 1/SQRT(N)
 
  where:
  M = margin of error
  N=sample size
 
  So, if we can tolerate stats that have a 1%
 margin
  of error (will deviate from representative of the
  whole population by 1%), our sample size should be
  10,000 rows.
 
  Also, a corollary (not a toyota corollary, though)
  to
  this would be that the more rows you add to your
  sample, the closer to representative your sample
  will
  be. So, in order to test whether your sample is
  representative enough, you could analyze using
  either
  estimate 49% or compute, take a snapshot of the
  stats,
  and then compare the stats from a 10,000 row
  estimate
  to those. Then, add rows to your estimate until
 you
  are satisfied with the stats.
 
  This of course is a pie in the sky mathematical
  model,
  but seems like a reasonable place to start 

Re: Statistical sampling and representative stats collection

2002-05-22 Thread Mohammad Rafiq

Jack
You are right, keeping in mind your application and the variation of data in
different tables. I just mentioned 30% as the most agreed-upon percentage for
estimate. I think Dennis gave you a more practical answer yesterday.

Regards
Rafiq





Reply-To: [EMAIL PROTECTED]
To: Multiple recipients of list ORACLE-L [EMAIL PROTECTED]
Date: Tue, 21 May 2002 15:19:31 -0800

Hi Rafiq,

We have been using 35 percent on our warehouse, even
on our fact partitions. Now that I have thought about
it for a while, that seems like a lot given the volume
of data. If a representative sample can be gathered
with 10,000 or 50,000 or 100,000 rows, and our fact
partitions have millions of rows each, seems like we
could go 1% on our analyze and it would be within
acceptable tolerances.

/jack


--- Mohammad Rafiq [EMAIL PROTECTED] wrote:
  Most of the list members agree on estimate with
  30%
 
  Regards
  Rafiq
 
 
 
 
  Reply-To: [EMAIL PROTECTED]
  To: Multiple recipients of list ORACLE-L
  [EMAIL PROTECTED]
  Date: Tue, 21 May 2002 13:43:33 -0800
 
  Hi all,
 
  Did some investigation about statistical sampling
  this
  weekend since we are going to optimize our analyze
  process soon, and would like some input from all you
  orabrains on this one.
 
  I opened a TAR with Oracle asking about the sampling
  algorithm of stats collection, and they assured me
  it
  was random.
 
  The goal of analyze...estimate is to collect stats
  that are representative of the data population as a
  whole using a given sample set. Since analyzing
  tables
  takes up resources (does sorts to order the data for
  investigation) the fewer rows you use in estimate,
  the
  less system resources you use and the faster the
  analyze will go.
 
  Since our goal is to get as small a sample as
  possible
  and still have stats that are representative, my
  contention is that we could start by finding what
  the
  margin of error will be for each sample size and
  gauge
  our tolerance for it.
 
  One standard way to calculate margin of error for a
  given sample is by using this formula:
 
  M = 1/SQRT(N)
 
  where:
  M = margin of error
  N=sample size
 
  So, if we can tolerate stats that have a 1% margin
  of error (will deviate from representative of the
  whole population by 1%), our sample size should be
  10,000 rows.
 
  Also, a corollary (not a toyota corollary, though)
  to
  this would be that the more rows you add to your
  sample, the closer to representative your sample
  will
  be. So, in order to test whether your sample is
  representative enough, you could analyze using
  either
  estimate 49% or compute, take a snapshot of the
  stats,
  and then compare the stats from a 10,000 row
  estimate
  to those. Then, add rows to your estimate until you
  are satisfied with the stats.
 
  This of course is a pie in the sky mathematical
  model,
  but seems like a reasonable place to start with
  testing.
 
  Input? Input? Buhler? Buhler?
 
 
  /jack silvey
 
 
 
 
 
 
  MOHAMMAD RAFIQ
 
 
 
_
  Join the world’s largest e-mail service with MSN
  Hotmail.
  http://www.hotmail.com
 
  --
  Please see the official ORACLE-L FAQ:
  http://www.orafaq.com
  --
  Author: Mohammad Rafiq
INET: [EMAIL PROTECTED]
 
  Fat City Network Services-- (858) 538-5051  FAX:
  (858) 538-5051
  San Diego, California-- Public Internet
  access / Mailing Lists
 

  To REMOVE yourself from this mailing list, send an
  E-Mail message
  to: [EMAIL PROTECTED] (note EXACT spelling of
  'ListGuru') and in
  the message BODY, include a line containing: UNSUB
  ORACLE-L
  (or the name of mailing list you want to be removed
  from).  You may
  also send the HELP command for other information
  (like subscribing).



RE: Statistical sampling and representative stats collection

2002-05-22 Thread Chaim . Katz


Raj,
Would the temp have gotten the same result by counting the frequencies of
the letters in the game of Scrabble?
Chaim




Jamadagni, Rajendra [EMAIL PROTECTED]@fatcity.com on
05/22/2002 11:53:22 AM

Please respond to [EMAIL PROTECTED]

Sent by:[EMAIL PROTECTED]


To:Multiple recipients of list ORACLE-L [EMAIL PROTECTED]
cc:



I remember, in 96/97, I was working on an Oracle 7 database and none of
this sample() stuff was available. My task was to select x people out of z
number of people randomly. We hired a temp, who took the Dallas/Fort Worth
directory (I was working in that area) and counted (yes, counted) the number
of people whose last names started with A and B and so on.

We put this information into a table and then found out the percentage of
last names starting with A compared to the whole list for each letter. This
percentage was used as a weightage for each letter when choosing random
samples from a list of people. Then I used this weightage and dbms_random to
pick the required number of names from our database. It was fun to
implement, but I feel bad for the temp who had to count the names.

So, compared to that, 8i and 9i built-in features rock !!
Raj
__
Rajendra JamadagniMIS, ESPN Inc.
Rajendra dot Jamadagni at ESPN dot com
Any opinion expressed here is personal and doesn't reflect that of ESPN
Inc.

QOTD: Any clod can have facts, but having an opinion is an art!



RE: Statistical sampling and representative stats collection

2002-05-22 Thread MacGregor, Ian A.

We use dbms_stats.gather_schema_stats and do not pass it sys as a schema
owner. Under 8i this method did not work for gather empty, but did fine with
gather stale. This has been fixed in 9i. However, it still abends if one of
the schemas owns an IOT overflow table.
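
There is also a LIST STALE mode if you want to see what gather_stale would
touch before running it -- a sketch, schema name invented (8i and up):

SET SERVEROUTPUT ON

DECLARE
  l_list DBMS_STATS.OBJECTTAB;
BEGIN
  DBMS_STATS.GATHER_SCHEMA_STATS(
    ownname => 'APPS',
    options => 'LIST STALE',
    objlist => l_list);
  FOR i IN 1 .. l_list.COUNT LOOP
    DBMS_OUTPUT.PUT_LINE(l_list(i).objname);
  END LOOP;
END;
/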

Ian 


-Original Message-
Sent: Wednesday, May 22, 2002 8:29 AM
To: Multiple recipients of list ORACLE-L


Cherie,

Had some problems with DBMS_STATS, but we are making
our way towards it. Just have to be careful, since it
can do quirky things like analyze the dictionary. We
are using dba_tab_modifications to monitor our tables.

Thanks for the input.

Jack


--- [EMAIL PROTECTED] wrote:
 
 Jack,
 
 What version are you on?   Are you able to utilize
 the gather_stale option.
 That way you would not only be optimizing the amount
 you are estimating but
 the interval between analyzing.
 
 Of course, that option of DBMS_STATS is not
 available on older versions.
 
 Cherie Machler
 Oracle DBA
 Gelco Information Network
 
 
 
 

 Jack Silvey jack_silvey@yahoo.com
 Sent by: [EMAIL PROTECTED]
 05/21/02 06:19 PM
 Please respond to ORACLE-L

 To: Multiple recipients of list ORACLE-L [EMAIL PROTECTED]
 Subject: Re: Statistical sampling and representative stats collection
 
 Hi Rafiq,
 
 We have been using 35 percent on our warehouse, even
 on our fact partitions. Now that I have thought
 about
 it for a while, that seems like a lot given the
 volume
 of data. If a representative sample can be gathered
 with 10,000 or 50,000 or 100,000 rows, and our fact
 partitions have millions of rows each, seems like we
 could go 1% on our analyze and it would be within
 acceptable tolerances.
 
 /jack
 
 
 --- Mohammad Rafiq [EMAIL PROTECTED] wrote:
  Most of the list members agree on estimate
 with
  30%
 
  Regards
  Rafiq
 
 
 
 
  Reply-To: [EMAIL PROTECTED]
  To: Multiple recipients of list ORACLE-L
  [EMAIL PROTECTED]
  Date: Tue, 21 May 2002 13:43:33 -0800
 
  Hi all,
 
  Did some investigation about statistical sampling
  this
  weekend since we are going to optimize our analyze
  process soon, and would like some input from all
 you
  orabrains on this one.
 
  I opened a TAR with Oracle asking about the
 sampling
  algorithm of stats collection, and they assured me
  it
  was random.
 
  The goal of analyze...estimate is to collect stats
  that are representative of the data population as
 a
  whole using a given sample set. Since analyzing
  tables
  takes up resources (does sorts to order the data
 for
  investigation) the fewer rows you use in estimate,
  the
  less system resources you use and the faster the
  analyze will go.
 
  Since our goal is to get as small a sample as
  possible
  and still have stats that are representative, my
  contention is that we could start by finding what
  the
  margin of error will be for each sample size and
  gauge
  our tolerance for it.
 
  One standard way to calculate margin of error for
 a
  given sample is by using this formula:
 
  M = 1/SQRT(N)
 
  where:
  M = margin of error
  N=sample size
 
  So, if we can tolerate stats that have a 1% a
 margin
  of error (will deviate from representative of the
  whole population by 1%), our sample size should be
  10,000 rows.
 
  Also, a corollary (not a toyota corollary, though)
  to
  this would be that the more rows you add to your
  sample, the closer to representative your sample
  will
  be. So, in order to test whether your sample is
  representative 

Re: Statistical sampling and representative stats collection

2002-05-21 Thread Mohammad Rafiq

Most of the list members agree on estimate with 30%.

Regards
Rafiq




Reply-To: [EMAIL PROTECTED]
To: Multiple recipients of list ORACLE-L [EMAIL PROTECTED]
Date: Tue, 21 May 2002 13:43:33 -0800

Hi all,

Did some investigation about statistical sampling this
weekend since we are going to optimize our analyze
process soon, and would like some input from all you
orabrains on this one.

I opened a TAR with Oracle asking about the sampling
algorithm of stats collection, and they assured me it
was random.

The goal of analyze...estimate is to collect stats
that are representative of the data population as a
whole using a given sample set. Since analyzing tables
takes up resources (it does sorts to order the data for
investigation), the fewer rows you use in the estimate,
the less system resources you use and the faster the
analyze will go.

Since our goal is to get as small a sample as possible
and still have stats that are representative, my
contention is that we could start by finding what the
margin of error will be for each sample size and gauge
our tolerance for it.

One standard way to calculate margin of error for a
given sample is by using this formula:

M = 1/SQRT(N)

where:
M = margin of error
N=sample size

So, if we can tolerate stats that have a 1% margin
of error (they will deviate from representative of the
whole population by 1%), our sample size should be
10,000 rows.
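As a sanity check of the arithmetic, M = 1/SQRT(N) rearranges to
N = 1/M^2 (the usual 95%-confidence approximation for proportions);
a throwaway query:

-- 1% margin => 1/(0.01 * 0.01) = 10,000 rows
SELECT 1/(0.01 * 0.01) AS sample_size,
       1/SQRT(10000)   AS margin_of_error
FROM dual;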

Also, a corollary (not a toyota corollary, though) to
this would be that the more rows you add to your
sample, the closer to representative your sample will
be. So, in order to test whether your sample is
representative enough, you could analyze using either
estimate 49% or compute, take a snapshot of the stats,
and then compare the stats from a 10,000 row estimate
to those. Then, add rows to your estimate until you
are satisfied with the stats.
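The comparison itself can be as simple as snapshotting a few dictionary
columns after each analyze and diffing them (owner and table names below
are placeholders):

-- Run after the compute, then after the 10,000-row estimate, and compare:
SELECT table_name, num_rows, blocks, avg_row_len
FROM dba_tables
WHERE owner = 'APPS' AND table_name = 'ORDERS';

SELECT column_name, num_distinct, density
FROM dba_tab_columns
WHERE owner = 'APPS' AND table_name = 'ORDERS';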

This of course is a pie in the sky mathematical model,
but seems like a reasonable place to start with
testing.

Input? Input? Bueller? Bueller?


/jack silvey




MOHAMMAD RAFIQ




RE: Statistical sampling and representative stats collection

2002-05-21 Thread DENNIS WILLIAMS

Jack - I believe that you are on the right track. If you think about it,
political pollsters randomly assess national populations with only a few
thousand samples, and it works pretty well as long as it isn't biased by
bad questions, lying respondents, etc. And you don't have those issues here.
A more important issue, if you wish to reduce the system resources
devoted to analyzing tables, is to consider how volatile each table is.
Some tables rarely change, while others change frequently.
Then another issue is the overall objective you are trying to
achieve. Is it to ensure query optimization? Is it a large population of
ad-hoc queries? You may want to consider some of the new stored execution
plans for queries where you don't want the optimizer to suddenly start
making bad decisions.
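For concreteness, a minimal sketch of a stored outline (8i plan
stability), with made-up object names:

-- Freeze the current plan for one critical query:
CREATE OR REPLACE OUTLINE ord_by_cust
FOR CATEGORY protected
ON SELECT * FROM orders WHERE cust_id = 42;

-- Have sessions pick up outlines from that category:
ALTER SESSION SET use_stored_outlines = protected;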
Dennis Williams
DBA
Lifetouch, Inc.
[EMAIL PROTECTED]





Re: Statistical sampling and representative stats collection

2002-05-21 Thread Jack Silvey

Hi Rafiq,

We have been using 35 percent on our warehouse, even
on our fact partitions. Now that I have thought about
it for a while, that seems like a lot given the volume
of data. If a representative sample can be gathered
with 10,000 or 50,000 or 100,000 rows, and our fact
partitions have millions of rows each, seems like we
could go 1% on our analyze and it would be within
acceptable tolerances.
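If we go that route, the per-partition call would look something like
this (all names and the percentage are placeholders, not our real ones):

BEGIN
  -- 1% sample of a single fact partition (hypothetical names):
  dbms_stats.gather_table_stats(
    ownname          => 'DW',
    tabname          => 'SALES_FACT',
    partname         => 'SALES_200205',
    estimate_percent => 1,
    granularity      => 'PARTITION',
    cascade          => TRUE);
END;
/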

/jack


--- Mohammad Rafiq [EMAIL PROTECTED] wrote:
 Most of the list members agree on estimate with 30%.
 
 Regards
 Rafiq
 
 
 
 

