Re: [PERFORM] Final decision

2005-04-28 Thread Dave Page
 

 -Original Message-
 From: Josh Berkus [mailto:[EMAIL PROTECTED] 
 Sent: 28 April 2005 04:09
 To: Dave Page
 Cc: Joshua D. Drake; Joel Fradkin; PostgreSQL Perform
 Subject: Re: [PERFORM] Final decision
 
 Dave, folks,
 
  Err, yes. But that's not quite the same as core telling us the current
  driver is being replaced.
 
 Sorry, I spoke off the cuff. I also was unaware that work on the current
 driver had resumed. Us Core people are not omniscient, believe it or don't.

I was under the impression that you and Bruce negotiated the developer
time! Certainly you and I chatted about it on IRC once... Ahh, well.
Never mind.

 Mind you, having 2 different teams working on two different ODBC drivers
 is a problem for another list ...

Absolutely.

Regards, Dave.



Re: [PERFORM] Why is this system swapping?

2005-04-28 Thread Jeff
On Apr 27, 2005, at 7:46 PM, Greg Stark wrote:
In fact I think it's generally superior to having a layer like pgpool
having to hand off all your database communication. Having to do an extra
context switch to handle every database communication is crazy.

I suppose this depends on how many machines / how much traffic you have.
In one setup I run here I get away with 32 * 4 db connections instead of
500 * 4. Pretty simple to see the savings on the db machine. (Yes, it is a
bad design, as you said, where static and dynamic content are served from
the same box. However, it also saves money since I don't need machines
sitting around serving up pixel.gif vs. myBigApplication.cgi.)
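
Spelled out, that is 32 * 4 = 128 persistent backends instead of
500 * 4 = 2000, roughly a 15x reduction in the number of connections the
database has to service.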

--
Jeff Trout [EMAIL PROTECTED]
http://www.jefftrout.com/
http://www.stuarthamm.net/


Re: [PERFORM] Suggestions for a data-warehouse migration routine

2005-04-28 Thread Mischa Sandberg
Quoting Richard Rowell [EMAIL PROTECTED]:

 I've ported enough of my company's database to Postgres to make
 warehousing on PG a real possibility.  I thought I would toss my data
 migration architecture ideas out for the list to shoot apart...
 
[...]
Not much feedback required.

Yes, dropping the entire database is faster and simpler.
If your database is small enough that you can rebuild it from scratch
every time, go for it.

Yes, vacuum analyze required; creating indexes alone does not create
statistics.

Beyond that, I'd dump an extract of pg_stat[io_]user_(tables|indexes)
to see how index usage and table load change over time.
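
A minimal sketch of such an extract, using the standard per-table and
per-index statistics views (snapshot these periodically and diff the
results):

  -- Table-level activity: sequential vs. index scans, and write volume.
  SELECT relname, seq_scan, idx_scan, n_tup_ins, n_tup_upd, n_tup_del
  FROM pg_stat_user_tables
  ORDER BY relname;

  -- Per-index usage: indexes with idx_scan stuck at zero are candidates
  -- for dropping.
  SELECT relname, indexrelname, idx_scan, idx_tup_fetch
  FROM pg_stat_user_indexes
  ORDER BY relname, indexrelname;

  -- Buffer-level I/O per table, from the pg_statio_* views.
  SELECT relname, heap_blks_read, heap_blks_hit, idx_blks_read, idx_blks_hit
  FROM pg_statio_user_tables
  ORDER BY relname;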
-- 
Dreams come true, not free. -- S.Sondheim, ITW 




Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-28 Thread Mischa Sandberg
Quoting Josh Berkus josh@agliodbs.com:

  Perhaps I can save you some time (yes, I have a degree in Math). If I
  understand correctly, you're trying to extrapolate from the correlation
  between a tiny sample and a larger sample. Introducing the tiny sample
  into any decision can only produce a less accurate result than just
  taking the larger sample on its own; GIGO. Whether they are consistent
  with one another has no relationship to whether the larger sample
  correlates with the whole population. You can think of the tiny sample
  like anecdotal evidence for wonder drugs.

 Actually, it's more to characterize how large of a sample we need.  For
 example, if we sample 0.005 of disk pages and get an estimate, and then
 sample another 0.005 of disk pages and get an estimate which is not even
 close to the first estimate, then we have an idea that this is a table
 which defies analysis based on small samples.  Whereas if the two
 estimates are less than 1.0 stdev apart, we can have good confidence that
 the table is easily estimated.  Note that this doesn't require
 progressively larger samples; any two samples would work.
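
To make that concrete, a rough SQL illustration of the two-sample check
(row-level sampling with random() rather than the block-level sampling
ANALYZE really performs; big_table and some_col are hypothetical names):

  -- Two independent ~0.5% samples of the same table.
  SELECT count(DISTINCT some_col) AS estimate_1
  FROM big_table WHERE random() < 0.005;

  SELECT count(DISTINCT some_col) AS estimate_2
  FROM big_table WHERE random() < 0.005;

  -- If estimate_1 and estimate_2 differ wildly, the column defies
  -- estimation from small samples; if they roughly agree, small samples
  -- are probably adequate for it.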

We're sort of wandering away from the area where words are a good way
to describe the problem. Lacking a common scratchpad to work with,
could I suggest you talk to someone you consider to have a background in
stats, and have them draw for you why this doesn't work?

About all you can get out of it is: if the two samples are more than a
stddev apart, yes, you've demonstrated that the union of the two
populations has a larger stddev than either of them; but your two stddevs
are less information than the stddev of the whole. Breaking your sample
into two (or three, or four, ...) arbitrary pieces and looking at their
stddevs just doesn't tell you any more than what you started with.

-- 
Dreams come true, not free. -- S.Sondheim, ITW 




Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-28 Thread Marko Ristola
First I will comment on my original idea.
Second I will give another, improved suggestion (an idea).
I hope they will be useful for you.
(I don't know whether the first one was useful at all, because it mostly
showed that I, and some others of us, are not very good with statistics :( )
I haven't looked at the PostgreSQL code, so I don't know what is possible
now and what is not. I do know, without having looked at the code, that a
full table scan followed by incremental statistics updates would be a very
big change.


What I meant was the following idea:
- Compare two equally sized samples. Then redo the same thing with
double-sized samples. That does a lot of unnecessary work. Check the
correlation of the two samples to try to guess the distribution.

So I was trying to give you an idea, not a full answer to the whole problem.

I did read parts of the attached PDFs. They convinced me that the
heuristics for the hard cases would, in many cases, actually read almost
the whole table. I addressed the too-small-sample problem by saying that
the user should be able to give a minimum sample size; that way you would
avoid sampling too little. My purpose was not to achieve estimates that
are at most 5% wrong, but to reduce the 2000% wrong estimates that are
sometimes seen now.

Conclusions:
- No heuristic or similar trick based on small samples will give excellent
results.
- If you need excellent estimates, you need to process the whole table!
- Some special cases, such as primary keys, unique indexes, and certain
column types, do give easy ways to make estimates. For example, whether a
boolean column has zero, one or two distinct values hardly matters, does
it? Hashing seems the right choice for all of them.

If I have understood correctly, full table scans are out of the question
for large tables at this time.
The percentage idea of taking 10% samples seems good.
So here is another suggestion:
1. Scan the chosen percentage in one piece, starting at an arbitrary
position. If the user's data is not homogeneous this hurts accuracy, but
it is faster this way. During that scan, try to identify all the columns
that have at most 100 distinct values.

Of course you cannot reach 100% accuracy with this, but if a full table
scan is out of the question right now, it is better if the estimate is,
for example, at most ten times wrong.

You could also improve accuracy by doing 20 pieces of 0.5 percent partial
table scans instead of one 10% partial table scan: this would improve
accuracy a bit but keep the speed almost the same as a single partial
table scan.

Here are questions for the statisticians about the distinct-values
calculation:
- If we want at most 1000% tolerance, how large a percentage of one column
of the table must be processed?
- If we want at most 500% tolerance, how large a percentage must be
processed?
- If we want at most 250% tolerance, how large a percentage must be
processed?
Feel free to assume that there are at most 100 distinct values per column,
if that helps the calculations.

If we try to get as much as possible out of one discontinuous partial
table scan (a 0.1-10% sample), here is the information we can gather:
1. A histogram of up to 100 distinct values for every column.
2. The variance and average, and the number of rows, for those 100
distinct values.
3. The number of rows that did not match those 100 distinct values and
were therefore left out of the histogram.
4. A minimum and a maximum value for each column.

=> We could get exact information about the sample, for many columns, in
one 0.1-10% pass.
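
A sketch in SQL of the per-column summary in points 1-4, assuming the
sampled rows have been copied into a hypothetical table sample_rows with a
column some_col (a real implementation would do this inside ANALYZE over
the sampled blocks rather than with queries):

  -- 4. Minimum, maximum, and total sampled row count for one column.
  SELECT min(some_col), max(some_col), count(*) FROM sample_rows;

  -- 1./2. Histogram of the (at most) 100 most common values in the
  -- sample, with the row count for each.
  SELECT some_col, count(*) AS n_rows
  FROM sample_rows
  GROUP BY some_col
  ORDER BY n_rows DESC
  LIMIT 100;

  -- 3. Rows whose value fell outside those top-100 values.
  SELECT count(*)
  FROM sample_rows
  WHERE some_col NOT IN (SELECT some_col
                         FROM sample_rows
                         GROUP BY some_col
                         ORDER BY count(*) DESC
                         LIMIT 100);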

What can you statisticians deduce from these values?
My idea combines a programmatic approach with statistics:
+ Performance: scan, for example, 100 chunks of 100 MB each, because disk
  I/O is much faster that way.
+ Enables a larger table percentage; I hope that helps with the statistics
  formula. For more robust statistics, take those chunks at random,
  non-overlapping positions to reduce the effect of hitting statistically
  bad parts of the table.
+ Fewer table scan passes: scan all columns with limited hashing in the
  first pass.
+ All easy columns are handled there in one pass.
+- Harder columns each need a pass of their own, but after that first pass
  we at least have some preliminary knowledge of them from the sample
  (minimum and maximum values and the histogram of the 100 distinct
  values).

Marko Ristola
Greg Stark wrote:

 Dave Held [EMAIL PROTECTED] writes:

  Actually, it's more to characterize how large of a sample
  we need.  For example, if we sample 0.005 of disk pages, and
  get an estimate, and then sample another 0.005 of disk pages
  and get an estimate which is not even close to the first
  estimate, then we have an idea that this is a table which
  defies analysis based on small samples.

 I buy that.

 Better yet is to use the entire sample you've gathered of .01 and

[PERFORM] index on different types

2005-04-28 Thread Enrico Weigelt

Hi folks,


there's often talk that indices cannot be used if datatypes
don't match.

On a larger (and long-grown) application I tend to use OID
for references in new tables, while the old stuff is using integer.
Is the planner smart enough to see both as compatible datatypes,
or is manual casting required?


thx
-- 
-
 Enrico Weigelt==   metux IT service
  phone: +49 36207 519931 www:   http://www.metux.de/
  fax:   +49 36207 519932 email: [EMAIL PROTECTED]
-
  Realtime Forex/Stock Exchange trading powered by postgresSQL :))
http://www.fxignal.net/
-



Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-28 Thread Andrew Dunstan

Mischa Sandberg wrote:
 Perhaps I can save you some time (yes, I have a degree in Math). If I
 understand correctly, you're trying to extrapolate from the correlation
 between a tiny sample and a larger sample. Introducing the tiny sample
 into any decision can only produce a less accurate result than just
 taking the larger sample on its own; GIGO. Whether they are consistent
 with one another has no relationship to whether the larger sample
 correlates with the whole population. You can think of the tiny sample
 like anecdotal evidence for wonder drugs.

Ok, good point.
I'm with Tom though in being very wary of solutions that require even 
one-off whole table scans. Maybe we need an additional per-table 
statistics setting which could specify the sample size, either as an 
absolute number or as a percentage of the table. It certainly seems that 
where D/N ~ 0.3, the estimates on very large tables at least are way way 
out.
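
For what it's worth, the closest existing knob is the per-column
statistics target, which scales how much ANALYZE samples for a given
column (though not as an absolute row count or a percentage of the
table), e.g. with hypothetical names:

  -- Raise the target for one troublesome column, then re-analyze so the
  -- new setting takes effect.
  ALTER TABLE big_table ALTER COLUMN some_col SET STATISTICS 1000;
  ANALYZE big_table;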

Or maybe we need to support more than one estimation method.
Or both ;-)
cheers
andrew



Re: [PERFORM] index on different types

2005-04-28 Thread Michael Fuhr
On Fri, Apr 29, 2005 at 04:35:13AM +0200, Enrico Weigelt wrote:
 
 there's often talk that indices cannot be used if datatypes
 don't match.

PostgreSQL 8.0 is smarter than previous versions in this respect.
It'll use an index if possible even when the types don't match.

 On a larger (and long-grown) application I tend to use OID
 for references in new tables, while the old stuff is using integer.

If you're using OIDs as primary keys then you might wish to reconsider.
See the caveats in the documentation and in the FAQ:

http://www.postgresql.org/docs/8.0/interactive/datatype-oid.html
http://www.postgresql.org/docs/faqs.FAQ.html#4.12

 Is the planner smart enough to see both as compatible datatypes,
 or is manual casting required?

You can use EXPLAIN to see what the planner will do, but be aware
that the planner won't always use an index even if it could: if it
thinks a sequential scan would be faster then it won't use an index.
To see if using an index is possible, you could set enable_seqscan
to off before executing EXPLAIN.  In any case, a foreign key column
probably ought to have the same type as the column it references --
is there a reason for making them different?
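
For example, with a hypothetical integer-keyed table old_stuff referenced
by an oid column in new_stuff:

  -- Make an index look as attractive as possible, then inspect the plan.
  SET enable_seqscan = off;
  EXPLAIN SELECT *
  FROM new_stuff n
  JOIN old_stuff o ON n.old_ref = o.id;  -- n.old_ref is oid, o.id is integer
  RESET enable_seqscan;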

-- 
Michael Fuhr
http://www.fuhr.org/~mfuhr/



Re: [PERFORM] Distinct-Sampling (Gibbons paper) for Postgres

2005-04-28 Thread a3a18850
Well, this guy has it nailed. He cites Flajolet and Martin, which was (I
thought) as good as you could get with only a reasonable amount of memory
per statistic. Unfortunately, their hash table is a one-shot deal; there's
no way to maintain it once the table changes. His incremental update
doesn't degrade as the table changes. If there isn't the same patent
wrangle as with the ARC algorithm, and if the existing stats collector
process can stand the extra traffic, then this one is a winner.
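
For readers without the paper at hand, the core of the Flajolet-Martin
estimator, sketched from memory (so treat the constant as approximate):
hash each value, record the position of the lowest-order 1-bit of each
hash in a small bitmap, and let R be the position of the lowest bit still
zero after the pass. Then

    \hat{n} \approx 2^{R} / \varphi, \qquad \varphi \approx 0.77351,

and averaging over several independent hash functions tightens the
estimate. The bitmap costs only O(log n) bits per statistic, which is what
makes it cheap; but it handles insertions only, not deletions, which is
the maintenance problem mentioned above.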
 
Many thanks to the person who posted this reference in the first place; so 
sorry I canned your posting and can't recall your name. 
 
Now, if we can come up with something better than the ARC algorithm ... 




Re: [PERFORM] Distinct-Sampling (Gibbons paper) for Postgres

2005-04-28 Thread Josh Berkus

 Now, if we can come up with something better than the ARC algorithm ...

Tom already did.  His clock-sweep patch is already in the 8.1 source.

-- 
Josh Berkus
Aglio Database Solutions
San Francisco
