Re: [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-23 Thread Greg Stark

Josh Berkus <josh@agliodbs.com> writes:

 ... I think the problem is in our heuristic sampling code.  I'm not the first 
 person to have this kind of a problem.  Will be following up with tests ...

I looked into this a while back when we were talking about changing the
sampling method. The conclusions were discouraging. Fundamentally, using
constant-sized samples of data for n_distinct is bogus. Constant-sized
samples only work for things like histograms, which can be analyzed with
standard population-sampling statistics that depend on the law of large
numbers.

n_distinct requires you to estimate how frequently very rare things occur. You
can't apply the law of large numbers because even a single instance of a value
out of a large pool alters the results disproportionately.

To get a valid estimate for n_distinct you would need to sample a fixed
percentage of the table. Not a fixed size sample. That just isn't practical.
Moreover, I think the percentage would have to be quite large. Even if you
sampled half the table I think the confidence interval would be quite wide.

The situation is worsened because it's unclear how to interpolate values for
subsets of the table. If the histogram says you have a million records and
you're adding a clause that has a selectivity of 50% then half a million is a
good guess. But if what you care about is n_distinct and you start with a
million records with 1,000 distinct values and then apply a clause that
filters 50% of them, how do you estimate the resulting n_distinct? 500 is
clearly wrong, but 1,000 is wrong too. You could end up with anywhere from 0
to 1,000 and you have no good way to figure out where the truth lies.

So I fear this is fundamentally a hopeless situation. It's going to be
difficult to consistently get good plans for any queries that depend on good
estimates for n_distinct.
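If you want to see how far off the estimate is for one of your own columns,
a rough check looks something like this (table and column names are
placeholders):

ANALYZE mytable;

-- the planner's estimate (negative values mean a fraction of the row count)
SELECT n_distinct FROM pg_stats
 WHERE tablename = 'mytable' AND attname = 'mycol';

-- the real number (scans the whole table, so it can be slow)
SELECT count(DISTINCT mycol) FROM mytable;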

-- 
greg




Re: [PERFORM] Joel's Performance Issues WAS : Opteron vs Xeon

2005-04-23 Thread Joel Fradkin
> I would very, very strongly encourage you to run multi-user tests before
> deciding on mysql. Mysql is nowhere near as capable when it comes to
> concurrent operations as PostgreSQL is. From what others have said, it
> doesn't take many concurrent operations for it to just fall over. I
> can't speak from experience because I avoid mysql like the plague,
> though. :)

I am just testing the water, so to speak; if it can't handle single-user
tests, then multi-user tests are kind of a waste of time.

I am trying to de-normalize my view into a table to see if I can get my app
to work. It is a good idea anyway, but it raises a ton of questions about
dealing with the data after a case is closed, etc., and about multiple child
relationships like merchandise and payments.

I did do a test of all three (MSSQL, MySQL, and Postgres) in Aqua Studio,
all with the servers running on the same machine, and found Postgres beat
out MySQL. But like any other test, it may have been an issue with Aqua
Studio and MySQL. In any case, I have not made a decision to use MySQL; I am
still researching fixes for Postgres.

I am waiting to hear back from Josh on using cursors and trying to flatten
long-running views.

I am a little disappointed that I have not understood enough to get the
planner to use the proper plan: we had to set enable_seqscan off to get the
select from response_line to run fast, and I had to turn off merge joins to
get the assoc list to run fast. Once I am up and running I can try to learn
more about it. I am so glad there are so many folks here willing to take the
time to educate us newbies.
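
For reference, the settings I mean are just session-level switches, roughly
like this (the query is only a stand-in for the real view query):

SET enable_seqscan = off;      -- affects only the current session
SET enable_mergejoin = off;
EXPLAIN ANALYZE SELECT * FROM response_line;
RESET enable_seqscan;
RESET enable_mergejoin;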





Re: [PERFORM] Joel's Performance Issues WAS : Opteron vs Xeon

2005-04-23 Thread Joshua D. Drake
Joel Fradkin wrote:
>> I would very, very strongly encourage you to run multi-user tests before
>> deciding on mysql. Mysql is nowhere near as capable when it comes to
>> concurrent operations as PostgreSQL is. From what others have said, it
>> doesn't take many concurrent operations for it to just fall over. I
>> can't speak from experience because I avoid mysql like the plague,
>> though. :)
>
> I am just testing the water, so to speak; if it can't handle single-user
> tests, then multi-user tests are kind of a waste of time.
Joel, I think you are missing the point of the above comment. The comment,
as I read it, is: O.K., you are having problems with PostgreSQL, BUT MySQL
isn't going to help you, and you will see that in multi-user tests.

MySQL is known to work very well on small databases without a lot of
concurrent sessions. I don't think anybody here would argue with that.

Where MySQL runs into trouble is larger databases with lots of 
concurrent connections.

> I am a little disappointed that I have not understood enough to get the
> planner to use the proper plan: we had to set enable_seqscan off to get
> the select from response_line to run fast, and I had to turn off merge
> joins to get the assoc list to run fast. Once I am up and running I can
> try to learn more about it. I am so glad there are so many folks here
> willing to take the time to educate us newbies.

Sincerely,
Joshua D. Drake
Command Prompt, Inc.




Re: [PERFORM] Joel's Performance Issues WAS : Opteron vs Xeon

2005-04-23 Thread brew


 I am just testing the water, so to speak; if it can't handle single-user
 tests, then multi-user tests are kind of a waste of time.

At the risk of being even more pedantic, let me point out that if you are
going to be running your application with multiple users, the reverse is
even more true: 'If it can't handle multi-user tests, then single-user
tests are kind of a waste of time.'

brew

 ==
  Strange Brew   ([EMAIL PROTECTED])
  Check out my Stock Option Covered Call website  http://www.callpix.com
 and my Musician's Online Database Exchange http://www.TheMode.com
 ==




Re: [PERFORM] Disk filling, CPU filling, renegade inserts and deletes?

2005-04-23 Thread Richard Plotkin
If anybody has additional advice on this problem, I would really, 
really appreciate it...

I updated postgres to 8.0.2, am running vacuumdb -faz every 3 hours, 
and 50 minutes after a vacuum the CPU usage still skyrocketed, and the 
disk started filling.  This time, there is only a single file that is 
spanning multiple GB, but running oid2name again returns no result on 
the oid or filenode.

With the increased vacuuming, fixed temp tables, etc., I really am at a 
loss for what's happening, and could really use some additional help.
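
In case it suggests anything, a rough way to see which relations are biggest
(relpages is only as current as the last vacuum/analyze, so it lags) is:

SELECT relname, relkind, relpages
  FROM pg_class
 ORDER BY relpages DESC
 LIMIT 10;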

Thank you,
Richard


Re: [PERFORM] Disk filling, CPU filling, renegade inserts and deletes?

2005-04-23 Thread Richard Plotkin
I also forgot to mention, vacuumdb fails on the command line now with 
the following error:
vacuumdb: could not connect to database smt: FATAL:  sorry, too many 
clients already
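
In case it matters, here is a rough way to see how close the server is to
the limit, once a connection can be had at all (8.0 column names):

SHOW max_connections;
SELECT count(*) AS open_backends FROM pg_stat_activity;

If vacuumdb connects as a superuser, raising superuser_reserved_connections
in postgresql.conf should at least keep a slot free for it.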

On Apr 23, 2005, at 9:57 AM, Richard Plotkin wrote:
If anybody has additional advice on this problem, I would really, 
really appreciate it...

I updated postgres to 8.0.2, am running vacuumdb -faz every 3 hours, 
and 50 minutes after a vacuum the CPU usage still skyrocketed, and the 
disk started filling.  This time, there is only a single file that is 
spanning multiple GB, but running oid2name again returns no result on 
the oid or filenode.

With the increased vacuuming, fixed temp tables, etc., I really am at 
a loss for what's happening, and could really use some additional 
help.

Thank you,
Richard



Re: [PERFORM] Disk filling, CPU filling, renegade inserts and deletes?

2005-04-23 Thread Tom Lane
Richard Plotkin [EMAIL PROTECTED] writes:
 I updated postgres to 8.0.2, am running vacuumdb -faz every 3 hours, 
 and 50 minutes after a vacuum the CPU usage still skyrocketed, and the 
 disk started filling.  This time, there is only a single file that is 
 spanning multiple GB, but running oid2name again returns no result on 
 the oid or filenode.

What is the filename exactly (full path)?

regards, tom lane



Re: [PERFORM] Disk filling, CPU filling, renegade inserts and deletes?

2005-04-23 Thread Richard Plotkin
/usr/local/pgsql/data/base/17234/42791
/usr/local/pgsql/data/base/17234/42791.1
/usr/local/pgsql/data/base/17234/42791.2
/usr/local/pgsql/data/base/17234/42791.3
/usr/local/pgsql/data/base/17234/42791.4
/usr/local/pgsql/data/base/17234/42791.5
/usr/local/pgsql/data/base/17234/42791.6
/usr/local/pgsql/data/base/17234/42791.7
/usr/local/pgsql/data/base/17234/42791.8
/usr/local/pgsql/data/base/17234/42791.9
/usr/local/pgsql/data/base/17234/42791.10
/usr/local/pgsql/data/base/17234/42791.11
On Apr 23, 2005, at 11:06 AM, Tom Lane wrote:
Richard Plotkin [EMAIL PROTECTED] writes:
I updated postgres to 8.0.2, am running vacuumdb -faz every 3 hours,
and 50 minutes after a vacuum the CPU usage still skyrocketed, and the
disk started filling.  This time, there is only a single file that is
spanning multiple GB, but running oid2name again returns no result on
the oid or filenode.
What is the filename exactly (full path)?
regards, tom lane



Re: [PERFORM] Disk filling, CPU filling, renegade inserts and deletes?

2005-04-23 Thread Tom Lane
Richard Plotkin [EMAIL PROTECTED] writes:
 /usr/local/pgsql/data/base/17234/42791
 /usr/local/pgsql/data/base/17234/42791.1
 /usr/local/pgsql/data/base/17234/42791.2
 /usr/local/pgsql/data/base/17234/42791.3
 ...

Well, that is certainly a table or index of some kind.

Go into database 17234 --- if you are not certain which one that is, see
select datname from pg_database where oid = 17234
and do
select relname from pg_class where relfilenode = 42791

The only way I could see for this to not find the table is if the table
creation has not been committed yet.  Do you have any apps that create
and fill a table in a single transaction?
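
A rough way to check for that is to look for a backend sitting in an open
transaction (with stats_command_string = on, such a backend shows
'<IDLE> in transaction' as its query text):

	select procpid, usename, current_query from pg_stat_activity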

regards, tom lane



Re: [PERFORM] Joel's Performance Issues WAS : Opteron vs Xeon

2005-04-23 Thread Christopher Browne
Centuries ago, Nostradamus foresaw when [EMAIL PROTECTED] (Joel Fradkin) 
would write:
 I am just testing the water, so to speak; if it can't handle single-user
 tests, then multi-user tests are kind of a waste of time.

I would suggest that if multi-user functionality is needed, then
starting with single user tests is a similar waste of time.

There's good reason to look at it this way...  

It is all too common for people to try to start building things with
primitive functionality, and then try to evolve the system into what
they need.  It is possible for that to work, if the base covers
enough of the necessary functionality.

In practice, we have watched Windows evolve in such a fashion with
respect to multiuser support, and, in effect, it has never really
gotten it.  Microsoft started by hacking something on top of MS-DOS,
and by the time enough applications had enough dependencies on the way
that worked, it had essentially become impossible for them to migrate
properly to a multiuser model, since applications are normally designed
with the myopic "this is MY computer!" model of the world.

You may not need _total_ functionality in the beginning, but,
particularly for multiuser support, which has deep implications for
applications, it needs to be there In The Beginning.
-- 
output = reverse(moc.liamg @ enworbbc)
http://linuxdatabases.info/info/lisp.html
A CONS is an object which cares.  -- Bernie Greenberg.




Re: [PERFORM] Disk filling, CPU filling, renegade inserts and deletes?

2005-04-23 Thread Richard Plotkin
Hi Tom,
Thanks for your responses this morning.  I did the select relname, and 
it returned 0 rows.  I do have one function that creates a temp table 
and fills it within the same transaction.  I'm pasting it below.  
Perhaps the ON COMMIT DROP is causing problems, and I need to drop 
the table at the end of the function instead of using ON COMMIT DROP?

--
-- Name: crumbs(integer, text, boolean); Type: FUNCTION; Schema: public
--
CREATE FUNCTION crumbs(integer, text, boolean) RETURNS text
AS $_$DECLARE
starting_page ALIAS FOR $1;

current_page integer;

delimiter text DEFAULT ': ';

withLinkTags BOOLEAN DEFAULT FALSE;

page_id_temp INTEGER;

page_name_temp TEXT;

current_nOrder INTEGER := 1;

page_results record;

path TEXT DEFAULT '';

BEGIN
IF starting_page IS NULL
THEN
RETURN NULL;
END IF;
current_page := starting_page;

IF $2 IS NOT NULL
THEN
delimiter := $2;
END IF;

IF $3 IS NOT NULL
THEN
withLinkTags := $3;
END IF;

--Create a table consisting of three columns: nOrder, page_id, name

CREATE TEMPORARY TABLE results
(nOrder integer,
page_id integer,
name text)
ON COMMIT DROP;
--Select the current page into the results table

SELECT INTO
page_id_temp,
page_name_temp

p.page_id,
CASE WHEN p.title_abbr IS NOT NULL
THEN p.title_abbr
ELSE p.title
END as name

FROM page p

WHERE p.page_id = starting_page;

IF FOUND
THEN
EXECUTE 'INSERT INTO results (nOrder, page_id, name)
VALUES ('   || current_nOrder || ','
|| page_id_temp || ','
|| quote_literal(page_name_temp)
|| ')';

current_nOrder := current_nOrder + 1;
END IF;

--Loop through results for page parents

LOOP

SELECT INTO
page_id_temp,
page_name_temp
			parent.page_id as parent_id,
			CASE WHEN parent.title_abbr IS NOT NULL
THEN parent.title_abbr
ELSE parent.title
			END as name
		
		FROM page AS child
		
		INNER JOIN page AS parent
			ON child.subcat_id = parent.page_id
			
		WHERE child.page_id = current_page;
		
		IF FOUND
		THEN
		
			EXECUTE 'INSERT INTO results (nOrder, page_id, name)
			VALUES ('	|| current_nOrder || ','
		|| page_id_temp || ','
		|| quote_literal(page_name_temp)
			|| ')';
		
			current_page = page_id_temp;
			
			current_nOrder := current_nOrder + 1;
			
		ELSE
		
			EXIT;
		
		END IF;
	
	END LOOP;
	
	
	SELECT INTO
		page_id_temp,
		page_name_temp
	
		c.default_page as parent_id,
		c.name
			
	FROM page p
		
	INNER JOIN category c
		ON c.cat_id = p.cat_id
			
	WHERE page_id = starting_page;
	
	IF FOUND
	THEN
		
		EXECUTE 'INSERT INTO results (nOrder, page_id, name)
		VALUES ('	|| current_nOrder || ','
	|| page_id_temp || ','
	|| quote_literal(page_name_temp)
		|| ')';
	
	END IF;
	
	FOR page_results IN EXECUTE 'SELECT * FROM results ORDER BY nOrder 
DESC' LOOP
		
		IF path = ''
		THEN
			IF withLinkTags IS TRUE
			THEN
path := '<a href="index.php?pid=' || page_results.page_id || '">';
path := path || page_results.name;
path := path || '</a>';
			ELSE
path := page_results.name;
			END IF;
		ELSE
			IF withLinkTags IS TRUE
			THEN
path := path || delimiter;
path := path || '<a href="index.php?pid=' || page_results.page_id || '">';
path := path || page_results.name;
path := path || '</a>';
			ELSE
path := path || delimiter || page_results.name;
			END IF;
		END IF;
		
	END LOOP;
	
	RETURN path;

END;$_$
LANGUAGE plpgsql;
On Apr 23, 2005, at 11:17 AM, Tom Lane wrote:
Richard Plotkin [EMAIL PROTECTED] writes:
/usr/local/pgsql/data/base/17234/42791
/usr/local/pgsql/data/base/17234/42791.1
/usr/local/pgsql/data/base/17234/42791.2
/usr/local/pgsql/data/base/17234/42791.3
...
Well, that is certainly a table or index of some kind.
Go into database 17234 --- if you are not certain which one that is, 
see
	select datname from pg_database where oid = 17234
and do
	select relname from pg_class where relfilenode = 42791

The only way I could see for this to not find the table is if the table
creation has not been committed yet.  Do you have any apps that create
and fill a table in a single transaction?
regards, tom lane


Re: [PERFORM] Disk filling, CPU filling, renegade inserts and deletes?

2005-04-23 Thread Tom Lane
Richard Plotkin [EMAIL PROTECTED] writes:
 Thanks for your responses this morning.  I did the select relname, and 
 it returned 0 rows.  I do have one function that creates a temp table 
 and fills it within the same transaction.  I'm pasting it below.  
 Perhaps the ON COMMIT DROP is causing problems, and I need to drop 
 the table at the end of the function instead of using ON COMMIT DROP?

Well, I think we can conclude that the function is pushing way more
data into the temp table than you expect.  I am wondering if that loop
in the middle of the function is turning into an infinite loop --- could
it be finding some sort of cycle in your page data?  You might want to
add some RAISE NOTICE commands to the loop so you can track what it's
doing.
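
Something along these lines, fitted to the loop in your function (just a
sketch; the INSERT into results is left out, and COALESCE is shorthand for
the CASE in the original):

LOOP
    SELECT INTO page_id_temp, page_name_temp
           parent.page_id,
           COALESCE(parent.title_abbr, parent.title)
      FROM page AS child
      JOIN page AS parent ON child.subcat_id = parent.page_id
     WHERE child.page_id = current_page;

    IF FOUND THEN
        -- a page_id repeating here means the parent chain has a cycle
        RAISE NOTICE 'step %: page % -> parent %',
                     current_nOrder, current_page, page_id_temp;
        current_page := page_id_temp;
        current_nOrder := current_nOrder + 1;
    ELSE
        EXIT;
    END IF;
END LOOP;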

regards, tom lane



[PERFORM] flattening the file might work for me; here is the analyze.

2005-04-23 Thread Joel Fradkin
Index Scan using ix_tblviwauditcube_clientnum on tblviwauditcube
(cost=0.00..35895.75 rows=303982 width=708) (actual time=0.145..1320.432
rows=316490 loops=1)

 Index Cond: ((clientnum)::text = 'MSI'::text)

Total runtime: 1501.028 ms



Joel Fradkin
Wazagua, Inc.
2520 Trailmate Dr
Sarasota, Florida 34243
Tel. 941-753-7111 ext 305
[EMAIL PROTECTED]
www.wazagua.com

Re: [PERFORM] two queries and dual cpu (perplexed)

2005-04-23 Thread Tom Lane
John A Meinel [EMAIL PROTECTED] writes:
 Actually, you probably don't want enable_seqscan=off, you should try:
 SET enable_nestloop TO off.
 The problem is that it is estimating there will only be 44 rows, but in
 reality there are 13M rows. It almost definitely should be doing a
 seqscan with a sort and merge join.

Not nestloops anyway.

 I don't understand how postgres could get the number of rows that wrong.

No stats, or out-of-date stats is the most likely bet.

 I can't figure out exactly what is where from the formatting, but the query 
 that seems misestimated is:
 -  Index Scan using IX_ClimateId on ClimateChangeModel40  
 (cost=0.00..1063711.75 rows=265528 width=20) (actual time=28.311..17212.703 
 rows=13276368 loops=1)
 Index Cond: (outer.ClimateId = ClimateChangeModel40.ClimateId)

Yeah, that's what jumped out at me too.  It's not the full explanation
for the join number being so far off, but this one at least you have a
chance to fix by updating the stats on ClimateChangeModel40.
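
Concretely, something like this (quoting the mixed-case names; raising the
statistics target first is optional, but it often helps a join column):

ALTER TABLE "ClimateChangeModel40"
    ALTER COLUMN "ClimateId" SET STATISTICS 100;
ANALYZE "ClimateChangeModel40";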

regards, tom lane



Re: [PERFORM] pgbench Comparison of 7.4.7 to 8.0.2

2005-04-23 Thread Thomas F . O'Connell
Steve,
Per your and Tom's recommendations, I significantly increased the 
number of transactions used for testing. See my last post.

The database will have pretty heavy mixed use, i.e., both reads and 
writes.

I performed 32 iterations per scenario this go-round.
I'll look into OSDB for further benchmarking. Thanks for the tip.
Since pgbench is part of the postgres distribution and I had it at hand
and it seems to be somewhat widely referenced, I figured I'd go ahead and
post preliminary results from it.

-tfo
--
Thomas F. O'Connell
Co-Founder, Information Architect
Sitening, LLC
Strategic Open Source: Open Your i
http://www.sitening.com/
110 30th Avenue North, Suite 6
Nashville, TN 37203-6320
615-260-0005
On Apr 15, 2005, at 4:24 PM, Steve Poe wrote:
Tom,
People's opinions on pgbench may vary, so take what I say with a grain 
of salt. Here are my thoughts:

1) Test with no less than 200 transactions per client. I've heard with 
less than this, your results will vary too much with the direction of 
the wind blowing. A high enough value will help rule out some noise 
factor. If I am wrong, please let me know.

2) How is the database going to be used?  What percentage will be
read vs. write, if you had to guess?  Pgbench is like a TPC-B test, which
will help gauge the potential throughput in tps. However, it may not stress
the server enough to help you make key performance changes. Then again,
benchmarks are like statistics... full of lies <g>.

3) Run not just a couple pgbench runs, but *many* (I do between 20-40
runs) so you can rule out noise and gauge improvement on median
results.

4) Find something with which you can test OLTP-type transactions. I used
OSDB since it is simple to implement and use, although OSDL's OLTP testing
will be closer to reality.

Steve Poe



Re: [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-23 Thread Josh Berkus
Greg,

 I looked into this a while back when we were talking about changing the
 sampling method. The conclusions were discouraging. Fundamentally, using
 constant sized samples of data for n_distinct is bogus. Constant sized
 samples only work for things like the histograms that can be analyzed
 through standard statistics population sampling which depends on the law of
 large numbers.

Well, unusual distributions are certainly tough.  But I think the problem 
exists even for relatively well-distributed populations.  Part of it is, I 
believe, the formula we are using:

n*d / (n - f1 + f1*n/N)

This is an estimation formula from Haas and Stokes in IBM Research Report RJ 
10025, and is called the DUJ1 formula. 
(http://www.almaden.ibm.com/cs/people/peterh/jasa3rj.pdf)  It appears to 
suck.   For example, take my table:

rows: 26 million (N)
distinct values: 3.4 million

I took a random sample of 1000 rows (n) from that table.   It contained:
968 values that occurred only once (f1)
981 distinct values (d)

Any human being looking at that sample would assume a large number of distinct 
values; after all, 96.8% of the values occurred only once.   But the formula 
gives us:

1000*981 / ( 1000 - 968 + ( 968 * 1000/26000000 ) ) = 30620

This is obviously dramatically wrong, by a factor of 100.  The math gets worse 
as the sample size goes down:

Sample 250, 248 distinct values, 246 unique values:

250*248 / ( 250 - 246 + ( 246 * 250 / 26000000 ) ) = 15490

Even in a case with an overwhelming majority of unique values, the formula gets 
it wrong:

999 unique values in sample
998 appearing only once:

1000*999 / ( 1000 - 998 + ( 998 * 1000 / 26000000 ) ) = 490093

This means that, with a sample size of 1000, a table of 26 million rows can
never be estimated by this formula to have more than half a million distinct
values, unless the column is a unique column.

Overall, our formula is inherently conservative of n_distinct.   That is, I 
believe that it is actually computing the *smallest* number of distinct 
values which would reasonably produce the given sample, rather than the 
*median* one.  This is contrary to the notes in analyze.c, which seem to 
think that we're *overestimating* n_distinct.  

This formula appears broken everywhere:

Table: 969000 rows
Distinct values: 374000
Sample Size: 1000
Unique values in sample: 938
Values appearing only once: 918

1000*938 / ( 1000 - 918 + ( 918 * 1000 / 969000 ) ) = 11308

Again, too small, this time by a factor of more than 30.

This is so broken, in fact, that I'm wondering if we've read the paper right?  
I've perused the paper on almaden, and the DUJ1 formula appears considerably 
more complex than the formula we're using.  

Can someone whose math is more recent than calculus in 1989 take a look at 
that paper, and look at the formula toward the bottom of page 10, and see if 
we are correctly interpreting it?  I'm particularly confused as to what q 
and d-sub-n represent.  Thanks!
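
(For anyone who wants to run the same check on their own tables, the inputs
d and f1 can be pulled with something like the query below; table and column
names are stand-ins, and ORDER BY random() is just a crude way to grab a
sample:)

SELECT count(*) AS d,
       sum(CASE WHEN cnt = 1 THEN 1 ELSE 0 END) AS f1
  FROM (SELECT mycol, count(*) AS cnt
          FROM (SELECT mycol FROM mytable
                 ORDER BY random() LIMIT 1000) AS sample
         GROUP BY mycol) AS grouped;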

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco



Re: [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-23 Thread Josh Berkus
People,

 Can someone whose math is more recent than calculus in 1989 take a look at
 that paper, and look at the formula toward the bottom of page 10, and see
 if we are correctly interpreting it?    I'm particularly confused as to
 what q and d-sub-n represent.  Thanks!

Actually, I managed to solve for these and it appears we are using the formula 
correctly.  It's just a bad formula.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco



Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-23 Thread Andrew Dunstan
Josh Berkus said:


 Well, unusual distributions are certainly tough.  But I think the
 problem  exists even for relatively well-distributed populations.
 Part of it is, I  believe, the formula we are using:

 n*d / (n - f1 + f1*n/N)

[snip]

 This is so broken, in fact, that I'm wondering if we've read the paper
 right?   I've perused the paper on almaden, and the DUJ1 formula
 appears considerably  more complex than the formula we're using.

 Can someone whose math is more recent than calculus in 1989 take a look
 at  that paper, and look at the formula toward the bottom of page 10,
 and see if  we are correctly interpreting it?I'm particularly
 confused as to what q  and d-sub-n represent.  Thanks!


Math not too recent ...

I quickly read the paper and independently came up with the same formula you
say above we are applying. The formula is on the page that is numbered 6,
although it's the tenth page in the PDF.

q = n/N  = ratio of sample size to population size
d_sub_n = d = number of distinct classes in sample
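
Written out (transcribing from memory, so check me against the paper), the
estimator on that page is

\hat{D}_{Duj1} = \frac{d_n}{1 - (1 - q)\, f_1 / n}

and substituting q = n/N, then multiplying top and bottom by n, gives exactly
the n*d / (n - f1 + f1*n/N) form quoted above, so the two agree.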

cheers

andrew







Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-23 Thread Tom Lane
Josh Berkus <josh@agliodbs.com> writes:
 Overall, our formula is inherently conservative of n_distinct.   That is, I 
 believe that it is actually computing the *smallest* number of distinct 
 values which would reasonably produce the given sample, rather than the 
 *median* one.  This is contrary to the notes in analyze.c, which seem to 
 think that we're *overestimating* n_distinct.  

Well, the notes are there because the early tests I ran on that formula
did show it overestimating n_distinct more often than not.  Greg is
correct that this is inherently a hard problem :-(

I have nothing against adopting a different formula, if you can find
something with a comparable amount of math behind it ... but I fear
it'd only shift the failure cases around.

regards, tom lane
