Re: [HACKERS] session id and global storage

2006-06-01 Thread David Hoksza
It seems MyProcID is what I was searching for...

David Hoksza


DH Something like this would maybe be possible, but this select can
DH return more rows when the user is connected with multiple instances...

DH David Hoksza

DH 

 Hi, I can't find any function which tells me something like a session
 id. Is there something like that? I need it in my AM, because I need
 to know whether something I wrote to a file was written in the
 current session or previously.

AL How about

AL select procpid||' '||backend_start from pg_stat_activity;

AL Yours,
AL Laurenz Albe






Re: [HACKERS] copy with compression progress n

2006-06-01 Thread Hannu Krosing
On Wed, 2006-05-31 at 17:31, Andreas Pflug wrote:
 Tom Lane wrote:
  Andreas Pflug [EMAIL PROTECTED] writes:
  
 The attached patch implements COPY ... WITH [BINARY] COMPRESSION 
 (compression implies BINARY). The copy data uses bit 17 of the flag 
 field to identify compressed data.
  
  
  I think this is a pretty horrid idea, because it changes pg_lzcompress
  from an unimportant implementation detail into a backup file format
  that we have to support till the end of time.  What happens if, say,
  we need to abandon pg_lzcompress because we find out it has patent
  problems?
  
  It *might* be tolerable if we used gzip instead,
 
 I used pg_lzcompress because it's present in the backend. I'm fine with 
 every other good compression algorithm.
 
   but I really don't see
  the argument for doing this inside the server at all: piping to gzip
  seems like a perfectly acceptable solution,
 
 As I said, this hits only if it is possible to pipe the result into gzip 
 in a performant way. The issue already arises if psql or any other COPY 
 client (slony, pg_dump) is not on the same machine: Network bandwidth 
 will limit throughput.

Maybe make up a way to pipe COPY result through some external process
(like gzip) on the server side without having shell access there.

To make it secure, the external process should probably be run from a
hardwired directory via chroot.
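
To illustrate the idea (purely hypothetical syntax; nothing like this
exists today), the user-visible side could look roughly like:

    COPY mytable TO EXTERNAL 'gzip';

where 'gzip' is resolved against the fixed server-side directory and
exec()ed directly, never passed through a shell.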

-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com





Re: [HACKERS] session id and global storage

2006-06-01 Thread Andrew Dunstan
Hannu Krosing said:
 On Thu, 2006-06-01 at 10:10, David Hoksza wrote:
 It seems MyProcID is what I was searching for...


 On a busy server with lots of connects, procID will repeat quite often.


log_line_prefix has a session ID gadget:

Session ID: A unique identifier for each session. It is two 4-byte hexadecimal
numbers (without leading zeros) separated by a dot. The numbers are the
session start time and the process ID.

Assuming PIDs are not reused within one second that will be unique.
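
For the archives, roughly the same identifier can be computed in SQL
(untested sketch; column names as of 8.1):

    SELECT to_hex(extract(epoch FROM backend_start)::integer)
           || '.' || to_hex(procpid)
    FROM pg_stat_activity
    WHERE procpid = pg_backend_pid();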

cheers

andrew








Re: [HACKERS] Possible TODO item: copy to/from pipe

2006-06-01 Thread Mark Woodward
 After re-reading what I just wrote to Andreas about how compression of
 COPY data would be better done outside the backend than inside, it
 struck me that we are missing a feature that's fairly common in Unix
 programs.  Perhaps COPY ought to have the ability to pipe its output
 to a shell command, or read input from a shell command.  Maybe something
 like

   COPY mytable TO '| gzip >/home/tgl/mytable.dump.gz';

 (I'm not wedded to the above syntax, it's just an off-the-cuff thought.)

 Of course psql would need the same capability, since the server-side
 copy would still be restricted to superusers.

 You can accomplish COPY piping now through psql, but it's a bit awkward:

   psql -c "COPY mytable TO stdout" mydb | gzip ...

 Thoughts?  Is this worth doing, or is the psql -c approach good enough?


To be honest, I don't see much benefit in it. You can already accomplish
what you want to accomplish easily enough.

If you want to muck with COPY, I'd like to see it accept a query as:

psql -c "COPY select * from mytable where foo='bar' TO stdout" mydb | gzip
...
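
For what it's worth, if COPY itself grew query support, I'd picture the
SQL side looking something like this (hypothetical syntax, nothing more):

    COPY (SELECT * FROM mytable WHERE foo = 'bar') TO stdout;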






[HACKERS] CTID issues and a soc student in need of help

2006-06-01 Thread Tzahi Fadida
Hi,
I am a Google soc student and in need of some help 
with PostgreSQL internals:

My C function can run (and is already running)
for a very, very long time on some
inputs, and it reiterates over relations using SPI.
Basically, I open portals and cursors to relations. 
Also note that I always open the
relations in READ ONLY mode using SPI.
A big no-no would be for some data in the relations
to change, since that would just ruin everything.

I have a great need to identify a tuple uniquely so
my prototype uses the CTID field for that purpose.
Btw, this identification is required by the algorithm
and primary keys are just wasteful and will slow the process
considerably (not to mention some relations don't have primary keys).

The question is: can the CTID field change throughout
the run of my function due to some other processes working
on the relation? Or is it, because of command boundaries,
pretty much secured inside an implicit transaction?
The problem wouldn't be so great if I didn't want to exploit
indices on the relations (but I do, and it does), since
after you issue a SELECT that uses indices, all you can rely on
to uniquely identify a tuple is the CTID.

The other solution is to temporarily duplicate the relations but
this is less appealing (plus it's already working great with CTIDs).
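
To make the usage concrete, the kind of thing I rely on is (simplified):

    SELECT ctid, * FROM relation1;  -- ctid remembered as the tuple's identifier
    -- later: is that stored ctid still guaranteed to denote the same
    -- tuple for the rest of the run?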

-- 
Tzahi Fadida
Blog: http://tzahi.blogsite.org | Home Site: http://tzahi.webhop.info
WARNING TO SPAMMERS:  see at
http://members.lycos.co.uk/my2nis/spamwarning.html




Re: [HACKERS] CTID issues and a soc student in need of help

2006-06-01 Thread Martijn van Oosterhout
On Thu, Jun 01, 2006 at 03:33:50PM +0300, Tzahi Fadida wrote:
 The question is, can the CTID field change throughout
 the run of my function due to some other processes working
 on the relation? Or because of command boundaries it is
 pretty much secured inside an implicit transaction?
 The problem wasn't so great if I didn't want to exploit
 indices in the relations (but I do and does), since
 after you issue a SELECT that uses indices, all you can rely on
 is the CTID to uniquely identify a tuple.

The CTID is the location on disk of the tuple, so no, it doesn't change
while you are running.

However, if you're running in isolation mode read committed, then
someone else could update the tuple while you're looking at it. In this
case the tuple will appear to vanish and the updated version will
appear elsewhere with a new CTID. Whether this is a problem or not
depends on what you're using it for.

Hope this helps,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.




Re: [HACKERS] CTID issues and a soc student in need of help

2006-06-01 Thread Tzahi Fadida
I am using CTID for the concept of a tuple set.
For example, the set of t1 from relation1, t1 from relation2, t10 from
relation3 will be represented in my function as a list
of (TableID:CTID) pairs.
For example {(1:1),(2:1),(3:10)}
I then save these in bytea arrays in a tuplestore.
This is essential for the full disjunction algorithm because
you do not want to recompute joins for every such set using the actual
attribute.

From the documentation I see that
Dirty Read, Nonrepeatable Read and Phantom Read
are all unacceptable.
Moreover, when I said "long time" I meant that full disjunctions
can run for hours on an extremely difficult input 
(not that most people will want to do that, but for
the general case), so it is not
realistic to lock the tables for that period.

So I have two follow-up questions:
1) If I execute the function at the serializable isolation level
and the function is read-only with respect to the tables, is it possible
for the function to fail, or for other updating transactions to
either fail or wait for hours (starvation)?

2) Inside the function I open and reopen cursors and portals to
   relations; how can I set, once, for all those openings within
   the function, a serializable isolation level?
   I suspect that because of command boundaries, just running
   SET TRANSACTION SERIALIZABLE using SPI at the start should be enough.
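   That is, something like this as the first thing the function runs
   (sketch only; I have not verified it is sufficient):

       SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;

   Though as far as I understand, SET TRANSACTION only has an effect if
   it is issued before the first query of the transaction, so perhaps it
   has to be done by the caller rather than inside the function.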

On Thu, 2006-06-01 at 15:30 +0200, Martijn van Oosterhout wrote:
 
 The CTID is the location on disk of the tuple, so no, it doesn't change
 while you are running.
 
 However, if you're running in isolation mode read committed, then
 someone else could update the tuple while you're looking at it. In this
 case the tuple will appear to vanish and the updated version will
 appear elsewhere with a new CTID. Whether this is a problem or not
 depends on what you're using it for.
 
 Hope this helps,




Re: [HACKERS] CTID issues and a soc student in need of help

2006-06-01 Thread Tom Lane
Tzahi Fadida [EMAIL PROTECTED] writes:
 I am using CTID for the concept of a tuple set.
 For example, the set of t1 from relation1, t1 from relation2, t10 from
 relation3 will be represented in my function as a list
 of (TableID:CTID) pairs.
 For example {(1:1),(2:1),(3:10)}
 I then save these in bytea arrays in a tuplestore.
 This is essential for the full disjunction algorithm because
 you do not want to recompute joins for every such set using the actual
 attribute.

I think this is OK within the context of a single SQL command, since
tuple visibility should not change for that command.  If you were trying
to use the CTID info across multiple statements then it'd get worrisome.

regards, tom lane



Re: [HACKERS] CTID issues and a soc student in need of help

2006-06-01 Thread Tzahi Fadida
I am not sure about the definition of a context of a single SQL command.

Example of a run:

A - SELECT getfdr('Relation1,Relation2,Relation3');
to get the result schema (takes a few milliseconds).
SELECT * FROM FullDisjunctions('Relation1,Relation2,Relation3') AS
RECORD A;
Can take a long time.

Inside C-language FullDisjunctions() function i repeatedly call, using
SPI:
SELECT * FROM Relation1;
SELECT * FROM Relation2;
SELECT * FROM Relation1 WHERE...;
SELECT * FROM Relation3;


So I make one SQL call to FullDisjunctions, with subsequent SQL calls
inside it. I want the context of the overall SELECT FullDisjunctions
to be perfectly unchanged for all subsequent SQL calls inside it,
and especially the CTIDs.

p.s.: In a different version of the function I create a temporary
relation and insert tuples into it, but it is exclusively used and
destroyed by the specific instance of that function. I hope it does not
affect anything in the general sense of a read-only transaction.

10x!
Regards,
tzahi.

On Thu, 2006-06-01 at 11:28 -0400, Tom Lane wrote:
 
 I think this is OK within the context of a single SQL command, since
 tuple visibility should not change for that command.  If you were trying
 to use the CTID info across multiple statements then it'd get worrisome.
 
   regards, tom lane




Re: [HACKERS] CTID issues and a soc student in need of help

2006-06-01 Thread Tom Lane
Tzahi Fadida [EMAIL PROTECTED] writes:
 I am not sure about the definition of a context of a single SQL command.

Well, AFAICS selecting a disjunction ought to qualify as a single SQL
command using a single snapshot.  It's not that different from a JOIN
or UNION operation, no?

 Inside C-language FullDisjunctions() function i repeatedly call, using
 SPI:
 SELECT * FROM Relation1;
 SELECT * FROM Relation2;
 SELECT * FROM Relation1 WHERE...;
 SELECT * FROM Relation3;
 

You would need to force all these operations to be done with the same
snapshot; should be possible with SPI_execute_snapshot.  But really the
above sounds like a toy prototype implementation to me.  Why aren't you
building this as executor plan-tree machinery?

 p.s.: In a different version of the function i create a temporary
 relation and insert tuples in it, but it is exclusively used and
 destroyed by the specific instance of that function.

Why?  You could use a tuplestore for transient data.

regards, tom lane



Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-01 Thread Josh Berkus

Tom,

As you know, this is something I think about a bit too, though not 
nearly as deeply as you.



In general it seems to me that for CPU-bound databases, the default values
of the cpu_xxx_cost variables are too low.  I am tempted to raise the
default value of cpu_index_tuple_cost to 0.005, which would be half of
cpu_tuple_cost and twice cpu_operator_cost; that seems more in line with
my feel for the relative costs of things.  For a CPU-bound database all
three would need to go up more than that.  But rather than telling people
to manipulate all three of these variables individually, I think it might
also be a good idea to provide a new GUC variable named something like
cpu_speed_scale that would just be a multiplier for the other variables.


Yes, please!  I've done a bit of playing around with the cpu_* 
variables, and frankly have always manipulated them by the same multiple.
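For instance (multiplier made up, and assuming the stock defaults of
0.01 / 0.001 / 0.0025):

    SET cpu_tuple_cost = 0.03;
    SET cpu_index_tuple_cost = 0.003;
    SET cpu_operator_cost = 0.0075;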


Also, my testing has shown that on *large* databases (20x RAM +) with 
fast processors (Opteron) the cpu_* values are too high.  But a lot of 
that is offsetting estimates which are too high elsewhere ... as is 
random_page_cost.




I've desisted from touching this mainly because the obvious sanity
adjustments, such as adding something for tree descent and charging
random_page_cost not 1.0 for leaf page touches, would increase the
estimated cost of index usage, which seemed like not the direction we
wanted to go in.  So I figured that the errors were more or less
cancelling each other.  But the addition of bitmap index scans is now
exposing the weaknesses, so we need to face up to the problem.


Yeah.  I've refrained from proposing changes because it's a game of 
pick-up-sticks.  If we start modifying the model, we need to fix 
*everything*, not just one item.  And then educate our users that they 
need to use the GUC variables in a different way.  Here are the issues I see:


1. n-distinct estimation is bad, as previously discussed;

2. our current heuristic sampling methods prevent us from sampling more 
than 0.5% of any reasonably large table, causing all statistics on those 
tables to be off for any table with an irregular distribution of values;


3. We don't have any method to analyze inter-column correlation within a 
table;


4. We don't keep statistics on foreign key correlation;

5. random_page_cost (as previously discussed) is actually a function of 
relatively immutable hardware statistics, and as such should not need to 
exist as a GUC once the cost model is fixed.


6. We haven't added any way to estimate rows returned from SRFs.


I am inclined to think that a reasonable model is to continue to estimate
the number of index leaf pages touched as above (pro-rating against the
total number of index entries), to charge random_page_cost per leaf page
touched, and not to count anything for metapage or tree descent.  I
justify the latter on the grounds that upper tree levels will tend to stay
in cache because they're touched so often.  random_page_cost per leaf page
may be an overestimate, since sometimes logically adjacent leaf pages will
be physically adjacent too, but not charging for tree descent should help
to cancel that out.  With this model, the disk cost to fetch a single
index entry will be estimated as random_page_cost (default 4.0) rather
than the current fixed 2.0.  This shouldn't hurt things too much for
simple indexscans --- especially since anyone running with a reduced
random_page_cost won't see as much change.  And it will increase the costs
for bitmap scans that cross many index pages, which is what we need in
light of Philippe's numbers.


Per above, I think anything involving random_page_cost is the wrong way 
to go.  RPC is a GUC assigned by the DBA on pure guesswork.  We should 
be decreasing its use or eliminating it entirely, not increasing the 
amount we use it.


For btrees, we should be able to accurately estimate the cost of the 
index access given:

a) the depth of the tree;
b) the likelihood that requested values are adjacent;
c) whether the index and tables are already in shared_mem, or else (d) 
and (e) below:
d) the probability that the index pages are cached in memory, which is 
composed of:

(1) the frequency with which that index is accessed, modified by
(2) whether the index is a fraction of available ram, or larger than ram
e) the probability that the relevant table pages are cached in memory, 
determined by the same two factors.


*None* of this should involve a userset parameter (random_page_cost) 
since as you pointed out months ago, the ratio of seq/seek access has 
remained relatively constant over the past 6 years of HDD and I/O 
engineering.


Sadly, I don't have the math for all of the above but I'm hoping someone 
(either here or on ACM) does.


Also, this would require us to have different *formulas* and not merely 
different cost factors for different kinds of indexes.  I believe that 
this is something that Jie is already struggling with.  Jie?



The big difficulty in modeling cache effects from multiple scans is that
we usually don't know how many times the index will be scanned.

I think we can work around this by figuring out whether the index has 
already been scanned in the plan, something we *can* know.  So the first 
scan will be full cost and remaining scans will be fractional cost.

Re: [HACKERS] CTID issues and a soc student in need of help

2006-06-01 Thread Tzahi Fadida
On Thu, 2006-06-01 at 12:45 -0400, Tom Lane wrote:
 Tzahi Fadida [EMAIL PROTECTED] writes:
  I am not sure about the definition of a context of a single SQL command.
 
 Well, AFAICS selecting a disjunction ought to qualify as a single SQL
 command using a single snapshot.  It's not that different from a JOIN
 or UNION operation, no?

Yes, it is (at least in the current version I am implementing) a one-shot
computation. It is computed top-down and not bottom-up like regular
joins. For example, A natural join B natural join C can be broken down
into a left-deep plan tree. Full disjunctions cannot be broken into such a
thing (in this version), and FD('A,B,C') directly returns a set of
results.

 
  Inside C-language FullDisjunctions() function i repeatedly call, using
  SPI:
  SELECT * FROM Relation1;
  SELECT * FROM Relation2;
  SELECT * FROM Relation1 WHERE...;
  SELECT * FROM Relation3;
  
 
 You would need to force all these operations to be done with the same
 snapshot; should be possible with SPI_execute_snapshot.  But really the
 above sounds like a toy prototype implementation to me.  Why aren't you
 building this as executor plan-tree machinery?

I actually use cursors because I reiterate over the
SELECT * FROM Relation1 queries using the FETCH_ALL technique.
Hopefully cursors use something similar to SPI_execute_snapshot?
(Maybe via the READ_ONLY flag that I use; I see it uses something called
ActiveSnapshot.)
(But the WHERE queries that are intended to exploit indices on
the relations I must execute repeatedly.)

The reason is twofold:
- At this time I don't see any big advantage (aside from the schema) 
in putting it in the parser and subsequently the executor.
- I want to work within the time frame of the SoC.

I think that I should first have a stable contrib module that looks
acceptable before I continue to something more problematic to maintain. 

We have a new paper that was accepted to VLDB yesterday that breaks down
the problem into smaller ones + iterators + polynomial delay, which
is suited for streaming, hence the possibility of implementing it in
the planner, but that's too complex for the SoC. Let's have a stable something
first.

 
  p.s.: In a different version of the function i create a temporary
  relation and insert tuples in it, but it is exclusively used and
  destroyed by the specific instance of that function.
 
 Why?  You could use a tuplestore for transient data.

I do use a tuplestore, but the other version needs an index, and you can't
put an index on a tuplestore. Unless you can give me a hint on how to
create a btree/hash index without a relation, but that can still be stored on
disk like a tuplestore; i.e., all the data is stored in the index. The key is
the whole tuple (the array of CTIDs) anyway.

 
   regards, tom lane




Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-01 Thread Greg Stark

Josh Berkus josh@agliodbs.com writes:

 1. n-distinct estimation is bad, as previously discussed;
 
 2. our current heuristics sampling methods prevent us from sampling more than
 0.5% of any reasonably large table, causing all statistics on those tables to
 be off for any table with irregular distribution of values;

I'm convinced these two are more connected than you believe. For distributions
of values in a range, using small samples is on a solid statistical basis. It's
the same reason poll results based on only a few hundred people can accurately
predict elections in a country with hundreds of millions of voters.

However for things like estimating discrete values there's absolutely no solid
foundation for these small samples. Your accuracy is going to be simply
proportional to the sample size, meaning you can't get anything trustworthy
without sampling large enough portions of the table that a full sequential
scan would be just as fast.

I might be interested in implementing that algorithm that was posted a while
back that involved generating good unbiased samples of discrete values. The
algorithm was quite clever and well described, and the paper claimed impressively
good results.

However it will only make sense if people are willing to accept that analyze
will need a full table scan -- at least for tables where the DBA knows that
good n_distinct estimates are necessary.

 3. We don't have any method to analyze inter-column correlation within a 
 table;
 
 4. We don't keep statistics on foreign key correlation;

Gosh these would be nice but they sound like hard problems. Has anybody even
made any headway in brainstorming how to tackle them?

 5. random_page_cost (as previously discussed) is actually a function of
 relatively immutable hardware statistics, and as such should not need to exist
 as a GUC once the cost model is fixed.

I don't think that's true at all. Not all hardware is the same.

Certainly the need to twiddle this GUC should be drastically reduced if the
cache effects are modelled properly and the only excuses left are legitimate
hardware differences.

-- 
greg




[HACKERS] stable snapshot looks outdated

2006-06-01 Thread Robert Treat
Looking at http://www.postgresql.org/ftp/stable_snapshot/  surely we have 
achieved stability at least once since 2005-11-26... :-)  Can we get that 
fixed? 

-- 
Robert Treat
Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL



Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-01 Thread Josh Berkus
Greg,

 I'm convinced these two are more connected than you believe.

Actually, I think they are inseparable.

 I might be interested in implementing that algorithm that was posted a
 while back that involved generating good unbiased samples of discrete
 values. The algorithm was quite clever and well described and paper
 claimed impressively good results.

 However it will only make sense if people are willing to accept that
 analyze will need a full table scan -- at least for tables where the DBA
 knows that good n_distinct estimates are necessary.

What about block-based sampling?   Sampling 1 in 20 disk pages, rather than 
1 in 20 rows, should require significantly less scanning, and yet give us 
a large enough sample for reasonable accuracy.

  3. We don't have any method to analyze inter-column correlation within
  a table;
 
  4. We don't keep statistics on foreign key correlation;

 Gosh these would be nice but they sound like hard problems. Has anybody
 even made any headway in brainstorming how to tackle them?

There's no time like the present!

Actually, these  both seem like fairly straightforward problems 
storage-wise.  The issue is deriving the statistics for tables with many 
columns or FKs.  

  5. random_page_cost (as previously discussed) is actually a function
  of relatively immutable hardware statistics, and as such should not
  need to exist as a GUC once the cost model is fixed.

 I don't think that's true at all. Not all hardware is the same.

 Certainly the need to twiddle this GUC should be drastically reduced if
 the cache effects are modelled properly and the only excuses left are
 legitimate hardware differences.

OK ... but still, it should become a little knob rather than the big 
knob it is currently.


-- 
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco



Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-01 Thread Tom Lane
Josh Berkus josh@agliodbs.com writes:
 Yeah.  I've refrained from proposing changes because it's a game of 
 pick-up-sticks.  If we start modifying the model, we need to fix 
 *everything*, not just one item.  And then educate our users that they 
 need to use the GUC variables in a different way.  Here are the issues I see:

 [ various deficiencies in our statistics gathering ]

Those are all valid points, but pretty much orthogonal to what I'm on
about today.  The problems I'm thinking about are seen even when the
planner's rowcount estimates are dead on (as indeed they mostly were
in Philippe's example).

 5. random_page_cost (as previously discussed) is actually a function of 
 relatively immutable hardware statistics, and as such should not need to 
 exist as a GUC once the cost model is fixed.

Well, it'll still exist as a GUC for the same reasons the other ones are
GUCs.  But I think the main reasons people have needed to twiddle it are
(1) their database is completely RAM-resident (where RPC
*should* logically be 1.0), or
(2) they're trying to compensate for the overestimation of
nestloop indexscans.
The changes I'm proposing should help with (2).

 For btrees, we should be able to accurately estimate the cost of the 
 index access given:
 a) the depth of the tree;

Although logically the tree descent *ought* to be a factor, I haven't
seen any evidence that we need to take it into account; in fact all the
evidence points to the conclusion that we're better off ignoring it.
My theory about why is that the upper levels of the tree stay in cache.
I have to admit that's just handwaving, but counting additional disk
hits to fetch the first index tuple is clearly not the direction the
cost model needs to go in.  Look again at the examples in Philippe's
output: an indexscan fetching 1700 tuples (about 5 leaf pages probably)
took 32x longer than one fetching 7 tuples (presumably all on the same
leaf page).  There's no way that a model that counts an additional
couple of page fetches for tree descent is going to arrive at those
numbers.  And we see this over and over again: incredibly small times
to fetch a single index tuple, at least on the average when the index
is being hit many times in one query.

 c) whether the index and tables are already in shared_mem, or else (d) 
 and (e) below:
 d) the probability that the index pages are cached in memory, which is 
 composed of:
   (1) the frequency with which that index is accessed, modified by
   (2) whether the index is a fraction of available ram, or larger than ram
 e) the probability that the relevant table pages are cached in memory, 
 determined by the same two factors.

These would all be nice things to know, but I'm afraid it's pie in the
sky.  We have no reasonable way to get those numbers.  (And if we could
get them, there would be another set of problems, namely plan instability:
the planner's choices would become very difficult to reproduce.)

 The big difficulty in modeling cache effects from multiple scans is that
 we usually don't know how many times the index will be scanned.

 I think we can work around this by figuring out whether the index has 
 already been scanned in the plan, something we *can* know.  So the first 
 scan will be full cost and remaining scans will be fractional cost. 

Uh, no, because what we're normally thinking about is independent probes
into the index with different keys.  For a small number of probes you
have to figure that those all hit different leaf pages and aren't going
to amortize well.  As the number of probes goes up, you can start
charging fractional costs because it's more likely you're hitting a leaf
page you hit recently.  The Mackert/Lohman equations do this reasonably
well --- it's possible we can find something better, but those seem like
a good place to start.  The immediate problem is to get an
infrastructure in place that gives us some numbers to apply equations to.

regards, tom lane



Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-01 Thread Greg Stark

Josh Berkus josh@agliodbs.com writes:

  However it will only make sense if people are willing to accept that
  analyze will need a full table scan -- at least for tables where the DBA
  knows that good n_distinct estimates are necessary.
 
 What about block-based sampling?   Sampling 1 in 20 disk pages, rather than 
 1 in 20 rows, should require siginificantly less scanning, and yet give us 
 a large enough sample for reasonable accuracy.

a) We already use block based sampling to reduce overhead. If you're talking
   about using the entire block and not just randomly sampled tuples from
   within those blocks then your sample will be biased.

b) It *still* wouldn't give you reasonable accuracy for n_distinct estimates.
   Your probability of being accurate for discrete processes like n_distinct
   is going to be more or less proportional to your sample. So sampling 5% of
   the table gives you only a 5% chance of an accurate prediction. More
   or less.

 Actually, these  both seem like fairly straightforward problems 
 storage-wise.  The issue is deriving the statistics, for tables with many 
 columns or FKs.  

Well there are problems on many levels:

1) You have n^2 possible two-column combinations. That's a lot of processing
   and storage.

2) It isn't even clear what data you're exactly looking for. Certainly
   "correlation" is just shorthand here and isn't what you're actually looking
   for. You need something like a complete histogram for the dependent column
   for every possible value of the forcing column. That would be an enormous
   amount of storage.

 OK ... but still, it should become a little knob rather than the big 
 knob it is currently.

Sure, the more accurately you model the cache the less need there would be to
adjust random_page_cost. I'm not sure how accurate Postgres can really get
unless it switches philosophies towards using O_DIRECT and doing its own
caching and scheduling. But it could certainly be much better than today.
Tom's comment makes it look like there's hope for pretty accurate cache
modelling for nested loop plans.

-- 
greg




Re: [HACKERS] Generalized concept of modules

2006-06-01 Thread Martijn van Oosterhout
On Wed, May 31, 2006 at 05:33:44PM -0400, Tom Lane wrote:
 Martijn van Oosterhout kleptog@svana.org writes:
  While you do have a good point about non-binary modules, our module
  handling needs some help IMHO. For example, the current hack for CREATE
  LANGUAGE to fix things caused by old pg_dumps. I think that's the
  totally wrong approach long term, I think the pg_dump shouldn't be
  including the CREATE LANGUAGE statement at all, but should be saying
  something like "INSTALL plpgsql" and pg_restore works out what is
  needed for that module.
 
 There's a lot to be said for this, but I keep having the nagging
 feeling that people are equating "module" with "shared library", which
 seems far from sufficiently general.  I'd like to see "module" mean
 an arbitrary collection of SQL objects. 

I agree that "module" is often used interchangeably with "shared library".
We need to handle the other case too. It would be a lot easier if we
had an example of an SQL-only module, since contrib doesn't appear to
have one (at first glance anyway).

 So I think the raw definition
 sought by your INSTALL would always be a SQL script, and any shared
 libs that might come along with that are secondary.  The idea of using
 pg_depend to manage UNINSTALL is an excellent one.

Well, in that case I'd like to give some concrete suggestions:

1. The $libdir in future may be used to find SQL scripts as well as
shared libraries. They'll have different extensions so no possibility
of conflict.

2. Create something like BEGIN MODULE xxx which starts a transaction
and marks any objects created within it as owned by module xxx. I
think it should be tied to a transaction level to avoid half installed
things, but maybe people would prefer it to work more like schemas.

 pg_module system catalog.  You'd need this anyway since there has to be
 some representation of the module object in the catalogs for its
 component objects to have pg_depend dependencies on.

Ack. "Owned by" in the above sense means that the object depends on the
module. You could do it the other way round (module depends on object),
but that makes it harder to change things manually. DROP MODULE would
work more easily too.

 Let's see, I guess pg_dump would have to be taught to ignore any objects
 that it can see are directly dependent on a module object.  What about
 indirect dependencies though?  The exact semantics don't seem clear to me.

At a base level, you could definitely drop the functions. Dropping types
is harder because columns might be using them. Normally we use CASCADE
to specify that.

 Also, this seems to be getting into territory that Oracle has already
 trod --- someone should study exactly what they do for PL/SQL modules
 and whether we want to be compatible or not.  Perhaps there's even
 something in SQL2003 about it?

Probably a good idea...

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.




Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-01 Thread Josh Berkus
Greg, Tom,

 a) We already use block based sampling to reduce overhead. If you're
 talking about using the entire block and not just randomly sampled
 tuples from within those blocks then your sample will be biased.

There are actually some really good equations to work with this, estimating 
both row correlation and n-distinct based on sampling complete blocks.  
See prior discussions on this list on N-distinct.

 1) You have n^2 possible two-column combinations. That's a lot of
 processing and storage.

Yes, that's the hard problem to solve.  Actually, btw, it's n!, not n^2.

 2) It isn't even clear what data you're exactly looking for. Certainly
correlation is just shorthand here and isn't what you're actually
 looking for. 

Actually, I'd think that a correlation number estimate (0 = completely 
uncorrelated, 1 = completely correlated) would be sufficient to improve 
row count estimation significantly, without incurring the vast overhead of 
histogram-by-histogram manipulation.

 Those are all valid points, but pretty much orthogonal to what I'm on
 about today.  The problems I'm thinking about are seen even when the
 planner's rowcount estimates are dead on (as indeed they mostly were
 in Philippe's example).

OK, I was afraid they were interdependent.  You would know better than me.

 Well, it'll still exist as a GUC for the same reasons the other ones are
 GUCs.  But I think the main reasons people have needed to twiddle it are
   (1) their database is completely RAM-resident (where RPC
   *should* logically be 1.0), or
   (2) they're trying to compensate for the overestimation of
   nestloop indexscans.
 The changes I'm proposing should help with (2).

Right.  What I'm saying is that (1) should be derived from 
estimated_cache_size and DBSIZE, not by setting an additional GUC.

  (1) the frequency with which that index is accessed, modified by
  (2) whether the index is a fraction of available ram, or larger than
  ram e) the probability that the relevant table pages are cached in
  memory, determined by the same two factors.

 These would all be nice things to know, but I'm afraid it's pie in the
 sky.  We have no reasonable way to get those numbers.  (And if we could
 get them, there would be another set of problems, namely plan
 instability: the planner's choices would become very difficult to
 reproduce.)

Hmmm ... I think those numbers are possible to infer, at least on a gross 
level.  Whether the cost of calculating them outweighs the benefit of 
having them is another question, resolvable only through some 
experimentation.

 a good place to start.  The immediate problem is to get an
 infrastructure in place that gives us some numbers to apply equations
 to.

No arguments, of course.

-- 
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco



Re: [HACKERS] Generalized concept of modules

2006-06-01 Thread Tom Lane
Martijn van Oosterhout kleptog@svana.org writes:
 Well, in that case I'd like to give some concrete suggestions:

 1. The $libdir in future may be used to find SQL scripts as well as
 shared libraries. They'll have different extensions so no possibility
 of conflict.

No, it needs to be a separate directory, and the reason is that SQL
scripts are architecture-independent and belong under the share/
directory not the lib/ directory.  This is a minor point, but the
packagers will scream at us if we get it wrong.

 2. Create something like BEGIN MODULE xxx which starts a transaction
 and marks any objects created within it as owned by module xxx. I
 think it should be tied to a transaction level to avoid half installed
 things, but maybe people would prefer it to work more like schemas.

I think I'd sooner keep it decoupled from transactions, which means you
need both a BEGIN MODULE xxx and an END MODULE xxx.  Also, a module
developer might want to go back and add more stuff to an existing
module.  So there should be separate commands:
CREATE MODULE xxx;
BEGIN MODULE xxx;
... anything created here belongs to the module
END MODULE xxx;
An alternative possibility is to make it work like schemas: you set
a GUC variable to indicate that created objects should be associated
with that module.  This might be the best choice since it avoids having
two very different ways of doing very similar things.
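
Under the GUC approach, usage might look roughly like this (where
module_name is a hypothetical GUC, by analogy with search_path):

    CREATE MODULE foo;
    SET module_name = 'foo';
    CREATE FUNCTION foo_add(int, int) RETURNS int
        AS 'SELECT $1 + $2' LANGUAGE sql;    -- becomes owned by module foo
    RESET module_name;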

(Come to think of it, what is the relation between modules and schemas
anyway?  Should a module's objects be required to be in schemas also
owned by the module?  If not, what happens when you try to INSTALL a
module whose objects lived in a schema you don't have?  This gets back
to the fact that we don't have a very nice answer for installing
existing contrib modules into user-chosen schemas.)

 Let's see, I guess pg_dump would have to be taught to ignore any objects
 that it can see are directly dependent on a module object.  What about
 indirect dependencies though?  The exact semantics don't seem clear to me.

 At a base level, you could definitly drop the functions. Dropping types
 is harder because columns might be using them. Normally we use CASCADE
 to specify that.

I think we're talking at cross purposes.  It sounds like you're thinking
about "how do I remove one object in a module?"  AFAICS you just drop it.
What I was wondering about was pg_dump's behavior.  ISTM we want two
modes:

* normal behavior: dump module objects as "INSTALL foo" commands.  Do
not dump any objects that are owned by a module; assume they will be
created by the INSTALL.  Use the dependency chains to make sure that
INSTALL commands are ordered properly relative to everything else.

* "dump module foo": dump the module object as a CREATE MODULE command,
and then dump creation commands for all the objects that are owned by
it.  Ignore all else.  This is an easy way to generate an updated module
definition script.
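
Schematically, and using the hypothetical commands sketched above:

    -- normal dump: the module's member objects are not dumped individually
    INSTALL foo;

    -- "dump module foo": a regenerated definition script, roughly
    CREATE MODULE foo;
    BEGIN MODULE foo;
    CREATE FUNCTION foo_add(int, int) RETURNS int
        AS 'SELECT $1 + $2' LANGUAGE sql;
    END MODULE foo;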

Something that's not at all clear to me is object ownership.  Should
objects belonging to a module be marked as being owned by the person
executing INSTALL, or should module dump/install try to preserve
original ownership?  I think in most scenarios trying to preserve the
original ownership is wrong, since commonly INSTALL will be used to
transfer objects to new databases where the original owner might not
exist at all.  But it's a bit non-orthogonal.

Also, what privileges are needed to execute either CREATE MODULE or
INSTALL?  Conservative design would say superuser-only, but I can't put
my finger on any specific reason for that, at least if none of the
contained objects need superuser privs to create.

regards, tom lane



Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-01 Thread Josh Berkus
Greg,

  1) You have n^2 possible two-column combinations. That's a lot of
  processing and storage.

 Yes, that's the hard problem to solve.  Actually, btw, it's n!, not n^2.

Oops, bad math.  Andrew pointed out it's actually n*(n-1)/2, not n!.
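That is, the number of unordered column pairs is

    $\binom{n}{2} = \frac{n(n-1)}{2}$

so e.g. a 50-column table has 50*49/2 = 1225 candidate pairs.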

Also, we could omit columns unlikely to correlate, such as large text 
columns, bytea, and numerics with high precision. Also, we probably don't 
need to correlate UNIQUE columns inside ... I think.

-- 
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco



[HACKERS] CVS-Unknown buildfarm failures?

2006-06-01 Thread Tom Lane
meerkat and snake both have persistent CVS-Unknown failures in some
but not all branches.  I can't see any evidence of an actual failure
in their logs though.  What I do see is ? entries about files that
shouldn't be there --- for instance, meerkat apparently needs a make
distclean.  If that's what's causing the failure report, could we
get the buildfarm to show a more useful status message?  I'd always
assumed that CVS-Unknown suggested a transient problem such as
connection loss, and there wasn't any need for human intervention.

A more radical answer is to have the script go ahead and delete the
offending files itself, but I can see where that might not have good
fail-soft behavior ...

regards, tom lane



Re: [HACKERS] CVS-Unknown buildfarm failures?

2006-06-01 Thread Joshua D. Drake

Tom Lane wrote:

meerkat and snake both have persistent CVS-Unknown failures in some
but not all branches.  I can't see any evidence of an actual failure
in their logs though.  What I do see is ? entries about files that
shouldn't be there --- for instance, meerkat apparently needs a make
distclean.  If that's what's causing the failure report, could we
get the buildfarm to show a more useful status message?  I'd always
assumed that CVS-Unknown suggested a transient problem such as
connection loss, and there wasn't any need for human intervention.

A more radical answer is to have the script go ahead and delete the
offending files itself, but I can see where that might not have good
fail-soft behavior ...


I have manually run a distclean on meerkat for 8_0 and 8_1 and am 
rerunning the builds now.


Joshua D. Drake



regards, tom lane





--

   === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
   Providing the most comprehensive  PostgreSQL solutions since 1997
 http://www.commandprompt.com/





Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-01 Thread Jim C. Nasby
On Thu, Jun 01, 2006 at 02:25:56PM -0400, Greg Stark wrote:
 
 Josh Berkus josh@agliodbs.com writes:
 
  1. n-distinct estimation is bad, as previously discussed;
  
  2. our current heuristics sampling methods prevent us from sampling more than
  0.5% of any reasonably large table, causing all statistics on those tables
  to be off for any table with irregular distribution of values;
 
 I'm convinced these two are more connected than you believe. For distributions
 of values in a range using small samples is on solid statistical basis. It's
 the same reason poll results based on only a few hundred people can accurately
 predict elections in a country with hundreds of millions of voters.
 
 However for things like estimating discrete values there's absolutely no solid
 foundation for these small samples. Your accuracy is going to be simply
 proportional to the sample size, meaning you can't get anything trustworthy
 without sampling large enough portions of the table that a full sequential
 scan would be just as fast.
 
 I might be interested in implementing that algorithm that was posted a while
 back that involved generating good unbiased samples of discrete values. The
 algorithm was quite clever and well described and paper claimed impressively
 good results.
 
 However it will only make sense if people are willing to accept that analyze
 will need a full table scan -- at least for tables where the DBA knows that
 good n_distinct estimates are necessary.
 
There *might* be an alternative...

Suppose that the executor could keep track of what values it's seeing
from a table as it's actually running a query. These could be used to
build statistical information without actually running analyze.

Of course, the problem is that you could end up with some seriously
skewed statistics, depending on what your query patterns are. But in
some ways that's not bad; you're optimizing for the queries you most
often run.

One possible way to handle this would be to keep track of how many times
each value has been seen, as well as when it was last seen, so you have
some idea of the quality of the data. Another idea is to somehow blend
these stats with the traditional method.

  3. We don't have any method to analyze inter-column correlation within a 
  table;
  
  4. We don't keep statistics on foreign key correlation;
 
 Gosh these would be nice but they sound like hard problems. Has anybody even
 made any headway in brainstorming how to tackle them?
 
A *huge* improvement would be gathering statistics on all multi-column
indexes. Some of the stats might not make sense in this context, but
others (such as correlation) would.
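For example (table and index names invented for illustration), given

    CREATE INDEX orders_cust_date_idx ON orders (customer_id, order_date);

per-index statistics over the (customer_id, order_date) pairs could feed
the estimate for

    SELECT * FROM orders WHERE customer_id = 42 AND order_date = '2006-06-01';

instead of multiplying the two single-column selectivities as if they
were independent.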
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-01 Thread Jim C. Nasby
On Thu, Jun 01, 2006 at 03:15:09PM -0400, Tom Lane wrote:
 These would all be nice things to know, but I'm afraid it's pie in the
 sky.  We have no reasonable way to get those numbers.  (And if we could
 get them, there would be another set of problems, namely plan instability:
 the planner's choices would become very difficult to reproduce.)

Speaking of plan instability, something that's badly needed is the
ability to steer away from query plans that *might* be the most optimal,
but also will fail horribly should the cost estimates be wrong. People
generally don't care about getting the absolutely most optimal plan;
they do care about NOT getting a plan that's horribly bad. One possible
way to do this would be to have the estimator calculate a worst-case
cost to go along with the best case cost, or maybe even 3 numbers:
ideal, what we think will happen, and worst-case. I know that index
scans already have that information, at least in the form of cost for a
plan with 0 correlation and one with perfect correlation.

Does anyone have any experience developing genetic algorithms? I think
coming up with cost estimators might be an ideal use for that
technology...
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-01 Thread Greg Stark

Josh Berkus josh@agliodbs.com writes:

 Greg, Tom,
 
  a) We already use block based sampling to reduce overhead. If you're
  talking about using the entire block and not just randomly sampled
  tuples from within those blocks then your sample will be biased.
 
 There are actually some really good equations to work with this, estimating 
 both row correlation and n-distinct based on sampling complete blocks.  
 See prior discussions on this list on N-distinct.

In the prior discussions someone posted the paper with the algorithm I
mentioned. That paper mentions that previous work showed poor results at
estimating n_distinct even with sample sizes as large as 50% or more.

  1) You have n^2 possible two-column combinations. That's a lot of
  processing and storage.
 
 Yes, that's the hard problem to solve.  Actually, btw, it's n!, not n^2.

You have O(n^2) possible *two-column* combinations and O(n!) n-column
combinations. I was hoping two-column combinations would be enough information
to extrapolate from for larger combinations.

  2) It isn't even clear what data you're exactly looking for. Certainly
 correlation is just shorthand here and isn't what you're actually
  looking for. 
 
 Actually, I'd think that a correlation number estimate (0 = complete 
 uncorrelated, 1 = completely correlated) would be sufficient to improve 
 row count estimation significantly, without incurring the vast overhead of 
 histogramXhistorgram manipulation.

No, correlation is actually quite uninteresting here. Consider for example two
columns "state" and "area code". The area code is pretty much 100% dependent
on state, but will show very little correlation. Similarly things like
"product name" and "catalog number" or even "author" and "genre" would be
problem cases but have little correlation. And you can easily imagine queries
that specify restrictive clauses on two such columns for which the existing
statistics will overestimate the selectivity because it assumes no
interdependency.

Hopefully you're right that you don't need complete histograms. Perhaps
there's some statistics concept they don't teach in stats 101 that would cover
this need. What we're looking for is a function f(a,b,x) that gives the net
selectivity given a and b, the selectivity of two clauses based on two
columns, and x some simple value that can be easily calculated by looking at
the data in advance.
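
Just to pin down the shape of the problem: today the planner effectively
assumes independence, and one could imagine x interpolating toward full
dependence. A sketch only, not a claim that this particular form is right:

    $s_{a \wedge b} = s_a s_b$ (current assumption)
    $f(s_a, s_b, x) = (1 - x)\, s_a s_b + x \min(s_a, s_b)$, with $x \in [0, 1]$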

-- 
greg




Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-01 Thread David Fetter
On Thu, Jun 01, 2006 at 08:36:16PM -0400, Greg Stark wrote:
 
 Josh Berkus josh@agliodbs.com writes:
 
  Greg, Tom,
  
   a) We already use block based sampling to reduce overhead.  If
   you're talking about using the entire block and not just
   randomly sampled tuples from within those blocks then your
   sample will be biased.
  
  There are actually some really good equations to work with this,
  estimating both row correlation and n-distinct based on sampling
  complete blocks.  See prior discussions on this list on
  N-distinct.
 
 In the prior discussions someone posted the paper with the algorithm
 I mentioned.  That paper mentions that previous work showed poor
 results at estimating n_distinct even with sample sizes as large as
 50% or more.

Which paper?  People have referenced several different ones.

   1) You have n^2 possible two-column combinations.  That's a lot
   of processing and storage.
  
  Yes, that's the hard problem to solve.  Actually, btw, it's n!,
  not n^2.
 
 You have O(n^2) possible *two-column* combinations and O(n!)
 n-column combinations.  I was hoping two-column combinations would
 be enough information to extrapolate from for larger combinations.

The math nerd in me says that this is bad math, but it might work
well enough by ass-u-ming a lack of higher-order correlations.

   2) It isn't even clear what data you're exactly looking for.
   Certainly correlation is just shorthand here and isn't what
   you're actually looking for. 
  
  Actually, I'd think that a correlation number estimate (0 =
  complete uncorrelated, 1 = completely correlated) would be
  sufficient to improve row count estimation significantly, without
  incurring the vast overhead of histogramXhistorgram manipulation.
 
 No, correlation is actually quite uninteresting here.  Consider for
 example two columns state and area code.  The area code is
 pretty much 100% dependent on state, but will show very little
 correlation.

There are quantitative tests of independence available.  I'm not sure
whether any of these have been applied even theoretically in
DBMS-land.

 Hopefully you're right that you don't need complete histograms.
 Perhaps there's some statistics concept they don't teach in stats
 101 that would cover this need.  What we're looking for is a
 function f(a,b,x) that gives the net selectivity given a and b, the
 selectivity of two clauses based on two columns, and x some simple
 value that can be easily calculated by looking at the data in
 advance.

That would be neat :)

Cheers,
D
-- 
David Fetter [EMAIL PROTECTED] http://fetter.org/
phone: +1 415 235 3778AIM: dfetter666
  Skype: davidfetter

Remember to vote!



Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-01 Thread Mark Kirkwood

Tom Lane wrote:


Another thing that's bothering me is that the index access cost computation
(in genericcostestimate) is looking sillier and sillier:

/*
 * Estimate the number of index pages that will be retrieved.
 *
 * For all currently-supported index types, the first page of the index is
 * a metadata page, and we should figure on fetching that plus a pro-rated
 * fraction of the remaining pages.
 */
if (index->pages > 1 && index->tuples > 0)
{
    numIndexPages = (numIndexTuples / index->tuples) * (index->pages - 1);
    numIndexPages += 1;        /* count the metapage too */
    numIndexPages = ceil(numIndexPages);
}
else
    numIndexPages = 1.0;

/*
 * Compute the index access cost.
 *
 * Disk cost: our generic assumption is that the index pages will be read
 * sequentially, so they have cost 1.0 each, not random_page_cost.
 */
*indexTotalCost = numIndexPages;

There are several things wrong with this, at least in its application to
btrees.  It's not counting descent of the btree (ie, touches of the root
page and intermediate-level pages).  On the other hand it's surely
overcharging for metapage touches.  As of CVS tip we cache the metapage in
the relcache, so it's probably fair to disregard that cost altogether.
And on the third hand, when we need to retrieve multiple leaf pages it's
over-optimistic to assume that those'll be purely sequential fetches.
(That would be approximately true in a freshly-built btree, but not in one
that's suffered any amount of page splitting or recycling.)


I am inclined to think that a reasonable model is to continue to estimate
the number of index leaf pages touched as above (pro-rating against the
total number of index entries), to charge random_page_cost per leaf page
touched, and not to count anything for metapage or tree descent.  I
justify the latter on the grounds that upper tree levels will tend to stay
in cache because they're touched so often.  random_page_cost per leaf page
may be an overestimate, since sometimes logically adjacent leaf pages will
be physically adjacent too, but not charging for tree descent should help
to cancel that out.  With this model, the disk cost to fetch a single
index entry will be estimated as random_page_cost (default 4.0) rather
than the current fixed 2.0.  This shouldn't hurt things too much for
simple indexscans --- especially since anyone running with a reduced
random_page_cost won't see as much change.  And it will increase the costs
for bitmap scans that cross many index pages, which is what we need in
light of Philippe's numbers.
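
Roughly, and reusing the variables from the snippet quoted above plus the
random_page_cost setting, the proposed model might look like this (a sketch
only, not a patch):

/*
 * Pro-rate leaf pages as before, drop the metapage and tree-descent
 * charges, and cost each leaf page at random_page_cost instead of 1.0.
 */
if (index->pages > 1 && index->tuples > 0)
    numIndexPages = ceil((numIndexTuples / index->tuples) *
                         (index->pages - 1));
else
    numIndexPages = 1.0;

*indexTotalCost = numIndexPages * random_page_cost;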



This sounds good to me, although the 2.0 -> 4.0 cost jump may cause 
(more) cases of people seeing their index scans in pre-8.2 versions 
becoming some other type of access in 8.2. I guess a comment about 
testing existing applications could be placed in the 8.2 release notes?



Now we have seen a lot of cases in which indexscans were being drastically
overestimated, so increasing the cost estimate even more may seem like a
horrid idea.  But I believe that most or all of these cases were ones
where the same index was being visited many times, and so the real
estimation failure is not accounting for cache effects across multiple
indexscans.  Rather than being afraid to raise the cost estimate for an
indexscan in isolation, I think it's time to bite the bullet and do
something about that.

The big difficulty in modeling cache effects from multiple scans is that
we usually don't know how many times the index will be scanned.  If we did
have that number, I think it'd be reasonable to use the Mackert and Lohman
approximation already used in cost_index to estimate the total number of
index leaf pages fetched over N scans, and then divide by N to get our
per-scan estimated cost.  But because the planner builds cost estimates
bottom-up, in most scenarios we don't have any idea whether an indexscan
plan node will be iterated once or many times in the finished plan.
However there is one case where we do have some information: when
computing a best_inner_indexscan() for a nestloop join, it'd be reasonable
to use the estimated size of the outer relation as the loop count.
And this is exactly the case where we are falling down in practice:
the complaints we've seen are mostly about overestimating the cost of a
nestloop-with-inner-index-scan.  So I think handling just this case would
be a big step forward, even if it's not a theoretically pure general
solution.
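
As a sketch of that idea (variable names made up, not a patch), the same
Mackert-Lohman approximation could be applied to the index's leaf pages
across all N scans and then amortized over the loop count:

/*
 * Mackert-Lohman estimate of distinct pages fetched:
 *   Ns = total tuples fetched across all scans,
 *   T  = total leaf pages in the index,
 *   b  = effective cache size in pages.
 */
static double
pages_fetched_ml(double Ns, double T, double b)
{
    double      pages;

    if (Ns <= 0.0)
        return 0.0;

    if (T <= b)
    {
        pages = (2.0 * T * Ns) / (2.0 * T + Ns);
        if (pages > T)
            pages = T;
    }
    else
    {
        double      lim = (2.0 * T * b) / (2.0 * T - b);

        if (Ns <= lim)
            pages = (2.0 * T * Ns) / (2.0 * T + Ns);
        else
            pages = b + (Ns - lim) * (T - b) / T;
    }
    return pages;
}

static double
per_scan_leaf_pages(double tuples_per_scan, double loop_count,
                    double leaf_pages, double cache_pages)
{
    /* total leaf pages over all scans, divided back down to one scan */
    return pages_fetched_ml(tuples_per_scan * loop_count,
                            leaf_pages, cache_pages) / loop_count;
}

The best_inner_indexscan() case would simply pass the estimated outer row
count as loop_count.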




If I understand correctly, this is the case where we normally see folks 
needing to add several 'set enable_xxx=off' statements to get the nested 
loop plan that *actually* works best :-). So, also looks good!


Cheers

Mark


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-01 Thread Tom Lane
Mark Kirkwood [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 With this model, the disk cost to fetch a single
 index entry will be estimated as random_page_cost (default 4.0) rather
 than the current fixed 2.0.  This shouldn't hurt things too much for
 simple indexscans --- especially since anyone running with a reduced
 random_page_cost won't see as much change.  And it will increase the costs
 for bitmap scans that cross many index pages, which is what we need in
 light of Philippe's numbers.

 This sounds good to me, although the 2.0 -> 4.0 cost jump may cause 
 (more) cases of people seeing their index scans in pre-8.2 versions 
 becoming some other type of access in 8.2. I guess a comment about 
 testing existing applications could be placed in the 8.2 release notes?

Yeah, that comes with the territory.  One point to note is that with
this model, setting random_page_cost below 2.0 will actually make small
indexscans look *cheaper* than they do now.  So it'll certainly be
possible to make the thing jump in that direction if you need to.
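
For example, following that arithmetic: with random_page_cost = 1.5, a probe
touching a single leaf page would be charged 1.5 under the proposed model,
versus the current fixed 2.0 (one leaf page plus the metapage at 1.0 each),
so it really does come out cheaper.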

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] 'CVS-Unknown' buildfarm failures?

2006-06-01 Thread Andrew Dunstan
Tom Lane said:
 meerkat and snake both have persistent CVS-Unknown failures in some
 but not all branches.  I can't see any evidence of an actual failure in
their logs though.  What I do see is "?" entries about files that
 shouldn't be there --- for instance, meerkat apparently needs a make
 distclean.  If that's what's causing the failure report, could we get
 the buildfarm to show a more useful status message?  I'd always assumed
 that CVS-Unknown suggested a transient problem such as
 connection loss, and there wasn't any need for human intervention.

 A more radical answer is to have the script go ahead and delete the
 offending files itself, but I can see where that might not have good
 fail-soft behavior ...



cvs-unknown means there are unknown files in the repo:

   my $unknown_files = grep {/^\?/ } @cvslog;
...
   send_result('CVS-Unknown',$unknown_files,[EMAIL PROTECTED])
 if ($unknown_files);

This is almost always a case of operator error. buildfarm only ever builds
in a copy of the repo, not in the permanent repo itself, so there should
NEVER be any file there which does not come from CVS. I have repeatedly
advised buildfarm member owners not to build by hand in the buildfarm repos.
 Not everybody listens, apparently.

All this is intended to ensure that we are actually working on a faithful
reflection of the postgresql.org repo, and not something that has been
mangled somehow.

I can call it CVS-Unknown-Files if that will make it clearer.

cheers

andrew





---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] 'CVS-Unknown' buildfarm failures?

2006-06-01 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes:
 Tom Lane said:
 meerkat and snake both have persistent CVS-Unknown failures in some
 but not all branches.  I can't see any evidence of an actual failure in
 their logs though.

 cvs-unknown means there are unknown files in the repo:

Oh.  Well, it needs renaming then ;-).  Per our message style guidelines,
calling an error "unknown" is seldom a good idea.

 I can call it CVS-Unknown-Files if that will make it clearer.

Maybe CVS-Extraneous-Files?

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


[HACKERS] Going for all green buildfarm results

2006-06-01 Thread Tom Lane
I've been making another pass over getting rid of buildfarm failures.
The remaining ones I see at the moment are:

firefly HEAD: intermittent failures in the stats test.  We seem to have
fixed every other platform back in January, but not this one.

kudu HEAD: one-time failure 6/1/06 in statement_timeout test, never seen
before.  Is it possible the system was under enough load that the 1-second
timeout fired before control reached the exception block?

tapir HEAD: pilot error, insufficient SysV shmem settings

carp various: carp seems to have *serious* hardware problems, as it
has been failing randomly in all branches for a long time.  I suggest
putting that poor machine out to pasture.

penguin 8.0: fails in tsearch2.  Previous investigation says that the
failure is unfixable without initdb, which we are not going to force
for 8.0 branch.  I suggest retiring penguin from checking 8.0, as
there's not much point in continuing to see a failure there.  Or is
it worth improving buildfarm to be able to skip specific tests?

penguin 7.4: fails in initdb, with what seems to be a variant of the
alignment issue that kills tsearch2 in 8.0.  We won't fix this either,
so again might as well stop tracking this branch on this machine.

cobra, stoat, sponge 7.4: pilot error.  Either install Tk or configure
--without-tk.

firefly 7.4: dblink test fails, with what looks like an rpath problem.
Another one that we fixed awhile ago, and the fix worked on every
platform but this one.

firefly 7.3: trivial regression diffs; we could install variant
comparison files if anyone cared.

cobra, stoat, caribou 7.3: same Tk configuration error as in 7.4 branch

Firefly is obviously the outlier here.  I dunno if anyone cares enough
about SCO to spend time investigating it (I don't).  Most of the others
just need a little bit of attention from the machine owner.

regards, tom lane

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster