Re: [PERFORM] Caching of Queries

2004-09-27 Thread Jeff
[ discussion of server side result caching ]
And let's not forget the major complication PG throws into things: MVCC.
The results of query A may hold true for txn 1, but not for txn 2, and so on.
That would have to be taken into account as well, and would greatly
complicate things.

It is always possible to do a poor man's query cache with triggers,
which would just leave you with basically a materialized view.
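A sketch of that trigger-maintained "materialized view" idea. For illustration this uses SQLite via Python's stdlib, and the orders/orders_summary schema is invented; a real PG version would use PL/pgSQL trigger functions, but the shape is the same:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL);
-- the "materialized view": a one-row summary kept current by triggers
CREATE TABLE orders_summary (total REAL NOT NULL);
INSERT INTO orders_summary VALUES (0);
CREATE TRIGGER orders_ins AFTER INSERT ON orders BEGIN
    UPDATE orders_summary SET total = total + NEW.amount;
END;
CREATE TRIGGER orders_del AFTER DELETE ON orders BEGIN
    UPDATE orders_summary SET total = total - OLD.amount;
END;
""")
conn.executemany("INSERT INTO orders(amount) VALUES (?)", [(10.0,), (5.0,)])
conn.execute("DELETE FROM orders WHERE amount = 5.0")
total, = conn.execute("SELECT total FROM orders_summary").fetchone()
print(total)  # the cached aggregate; reading it never scans orders
```

The appeal over a periodically rebuilt cache is that the summary row is always exactly as fresh as the base table.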

--
Jeff Trout [EMAIL PROTECTED]
http://www.jefftrout.com/
http://www.stuarthamm.net/
---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [PERFORM] Caching of Queries

2004-09-27 Thread Mitch Pirtle
On Mon, 27 Sep 2004 18:20:48 +0100, Matt Clark [EMAIL PROTECTED] wrote:
 This is very true.  Client side caching is an enormous win for apps, but it
 requires quite a lot of logic, triggers to update last-modified fields on
 relevant tables, etc etc.  Moving some of this logic to the DB would perhaps
 not usually be quite as efficient as a bespoke client caching solution, but
 it will above all be a lot easier for the application developer!

In the world of PHP it is trivial thanks to PEAR's Cache_Lite.  The
project lead for Mambo implemented page-level caching in a day, and
had all the triggers for clearing the cache included in the content
management interface - not difficult at all.

Basically you set a default in seconds for the HTML results to be
cached, and then have triggers set that force the cache to regenerate
(whenever CRUD happens to the content, for example).
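A rough Python analogue of the scheme described above (Cache_Lite's actual API differs; the class and method names here are invented): a TTL on the cached HTML, plus an explicit invalidation hook for the CRUD path.

```python
import time

class PageCache:
    """TTL page cache with explicit invalidation (Cache_Lite-style sketch)."""
    def __init__(self, lifetime=300):
        self.lifetime = lifetime          # default cache time, in seconds
        self.store = {}                   # key -> (expires_at, html)

    def get(self, key):
        entry = self.store.get(key)
        if entry and entry[0] > time.monotonic():
            return entry[1]               # fresh hit
        return None                       # miss or expired: regenerate the page

    def put(self, key, html):
        self.store[key] = (time.monotonic() + self.lifetime, html)

    def invalidate(self, key):
        # called from the CRUD path, like the triggers in the CMS above
        self.store.pop(key, None)

cache = PageCache(lifetime=300)
cache.put("/page/1", "<html>v1</html>")
print(cache.get("/page/1"))               # hit within the TTL
cache.invalidate("/page/1")               # content changed: force regeneration
print(cache.get("/page/1"))               # None: next request rebuilds the page
```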

Can't speak for Perl/Python/Ruby/.Net/Java, but Cache_Lite sure made a
believer out of me!

-- Mitch



Re: [PERFORM] best statistic target for boolean columns

2004-09-27 Thread Gregory Stark

 Gaetano,
 
  don't you think the best statistic target for a boolean
  column is something like 2?  Or in general isn't it useless to
  have a statistics target > the data type cardinality?
 
 It depends, really, on the proportionality of the boolean values; if they're
 about equal, I certainly wouldn't raise Stats from the default of 10.   If,
 however, it's very disproportionate -- like 2% true and 98% false -- then it
 may pay to have better statistics so that the planner doesn't assume 50%
 hits, which it otherwise might.

No, actually the stats table keeps the n most common values and their
frequency (usually in percentage). So really a target of 2 ought to be enough
for boolean values. In fact that's all I see in pg_statistic; I'm assuming
there's a full histogram somewhere but I don't see it. Where would it be?

However the target also dictates how large a sample of the table to take. A
target of two represents a very small sample. So the estimations could be
quite far off.

I ran the experiment and for a table with 2036 false rows out of 204,624 the
estimate was 1720. Not bad. But then I did vacuum full analyze and got an
estimate of 688. Which isn't so good.
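A quick simulation of why the sample size matters, using the same proportions as the experiment above (2,036 false rows out of 204,624). The 300-rows-per-target-unit figure is the "magic 300" heuristic, but treat it as an approximation:

```python
import random

random.seed(0)
# Same proportions as the experiment: 2,036 False rows out of 204,624
table = [False] * 2036 + [True] * (204624 - 2036)

def estimate_false(sample_size):
    """Estimate the total number of False rows from a random sample."""
    sample = random.sample(table, sample_size)
    return round(sample.count(False) / sample_size * len(table))

# ANALYZE looks at roughly 300 rows per unit of statistics target,
# so target 2 samples ~600 rows and target 100 samples ~30,000.
est_target_2 = estimate_false(300 * 2)
est_target_100 = estimate_false(300 * 100)
print(est_target_2, est_target_100)  # the larger sample is usually far closer to 2036
```

With only ~600 rows sampled, the rare value shows up a handful of times at best, so the scaled-up estimate swings wildly from run to run; the 30,000-row sample pins it down tightly.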

-- 
greg




Re: [PERFORM] Caching of Queries

2004-09-27 Thread Jim C. Nasby
On Thu, Sep 23, 2004 at 08:29:25AM -0700, Mr Pink wrote:
 Not knowing anything about the internals of pg, I don't know how this
 relates, but in theory, query plan caching is not just about saving time
 re-planning queries, it's about scalability.  Optimizing queries requires
 shared locks on the database metadata, which, as I understand it, causes
 contention and serialization, which kills scalability.

One of the gurus can correct me if I'm wrong here, but AFAIK metadata
lookups use essentially the same access methods as normal queries. This
means MVCC is used and no locking is required. Even if locks were
required, they would be shared read locks which wouldn't block each
other.
-- 
Jim C. Nasby, Database Consultant   [EMAIL PROTECTED] 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: Where do you want to go today?
Linux: Where do you want to go tomorrow?
FreeBSD: Are you guys coming, or what?



Re: [PERFORM] best statistic target for boolean columns

2004-09-27 Thread Josh Berkus
Gaetano,

 don't you think the best statistic target for a boolean
 column is something like 2?  Or in general isn't it useless to
 have a statistics target > the data type cardinality?

It depends, really, on the proportionality of the boolean values; if they're 
about equal, I certainly wouldn't raise Stats from the default of 10.   If, 
however, it's very disproportionate -- like 2% true and 98% false -- then it 
may pay to have better statistics so that the planner doesn't assume 50% 
hits, which it otherwise might.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco



Re: [PERFORM] best statistic target for boolean columns

2004-09-27 Thread Tom Lane
Gregory Stark [EMAIL PROTECTED] writes:
 No, actually the stats table keeps the n most common values and their
 frequency (usually in percentage). So really a target of 2 ought to be enough
 for boolean values. In fact that's all I see in pg_statistic; I'm assuming
 there's a full histogram somewhere but I don't see it. Where would it be?

It's not going to be there.  The histogram only covers values that are
not in the most-frequent-values list, and therefore it won't exist for a
column that is completely describable by most-frequent-values.

 However the target also dictates how large a sample of the table to take. A
 target of two represents a very small sample. So the estimations could be
 quite far off.

Right.  The real point of stats target for such columns is that it
determines how many rows to sample, and thereby indirectly implies
the accuracy of the statistics.  For a heavily skewed boolean column
you'd want a high target so that the number of occurrences of the
infrequent value would be estimated accurately.

It's also worth noting that the number of rows sampled is driven by the
largest per-column stats target in the table, and so reducing stats
target to 2 for a boolean column will save *zero* effort unless all the
columns in the table are booleans.

regards, tom lane



Re: [PERFORM] Caching of Queries

2004-09-27 Thread Matt Clark

 Basically you set a default in seconds for the HTML results to be
 cached, and then have triggers set that force the cache to regenerate
 (whenever CRUD happens to the content, for example).

 Can't speak for Perl/Python/Ruby/.Net/Java, but Cache_Lite sure made a
 believer out of me!

Nice to have it in a library, but if you want to be that simplistic then
it's easy in any language.  What if a process on server B modifies an
important value that server A has cached though?  Coherency (albeit that
the client may choose to not use it) is a must for a general solution.



Re: [PERFORM] Caching of Queries

2004-09-27 Thread Rod Taylor
 More to the point though, I think this is a feature that really really 
 should be in the DB, because then it's trivial for people to use.  

How does putting it into PGPool make it any less trivial for people to
use?




Re: [PERFORM] best statistic target for boolean columns

2004-09-27 Thread Gaetano Mendola
Josh Berkus wrote:
| Gaetano,
|
|
|don't you think the best statistic target for a boolean
|column is something like 2?  Or in general isn't it useless to
|have a statistics target > the data type cardinality?
|
|
| It depends, really, on the proportionality of the boolean values; if they're
| about equal, I certainly wouldn't raise Stats from the default of 10.   If,
| however, it's very disproportionate -- like 2% true and 98% false -- then it
| may pay to have better statistics so that the planner doesn't assume 50%
| hits, which it otherwise might.
So, I hadn't understood how the statistics histogram works.
I'm going to take a look at analyze.c.
Regards
Gaetano Mendola


Re: [PERFORM] Caching of Queries

2004-09-27 Thread Steve Atkins
On Mon, Sep 27, 2004 at 09:19:12PM +0100, Matt Clark wrote:

  Basically you set a default in seconds for the HTML results to be
  cached, and then have triggers set that force the cache to regenerate
  (whenever CRUD happens to the content, for example).
 
  Can't speak for Perl/Python/Ruby/.Net/Java, but Cache_Lite sure made a
  believer out of me!
 
 Nice to have it in a library, but if you want to be that simplistic then
 it's easy in any language.  What if a process on server B modifies an
 important value that server A has cached though?  Coherency (albeit that
 the client may choose to not use it) is a must for a general solution.

memcached is one solution designed for that situation. Easy to use
from most languages.  Works. Lets you use memory on systems where you
have it, rather than using up valuable database server RAM that's
better spent caching disk sectors.
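The usual memcached pattern is cache-aside: the application, not the database, owns the cache. A minimal in-process stand-in (a real deployment would talk to memcached over the network; the names here are invented):

```python
# Cache-aside with a shared cache: the application checks the cache first,
# falls back to the database on a miss, then populates the cache so other
# app servers can reuse the result. The dict stands in for a networked
# memcached instance; expensive_query is a placeholder for a real DB hit.
shared_cache = {}

def expensive_query(key):
    return "result-for-" + key            # pretend this cost a database round-trip

def get_cached(key):
    value = shared_cache.get(key)
    if value is not None:
        return value                      # served from cache, DB untouched
    value = expensive_query(key)
    shared_cache[key] = value             # future requests (any server) hit the cache
    return value

print(get_cached("report:42"))            # miss: computed, then cached
print(get_cached("report:42"))            # hit: no "database" work
```

Note this sketch sidesteps the coherency question raised above: a write on server B must still invalidate or overwrite the shared entry.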

Any competently written application where caching results would be a
suitable performance boost can already implement application or
middleware caching fairly easily, and increase performance much more
than putting result caching into the database would.

I don't see caching results in the database as much of a win for most
well written applications.  Toy benchmarks, sure, but for real apps it
seems it would add a lot of complexity, and violate the whole point of
using an ACID database.

(Caching parse trees or query plans, though? It'd be interesting to
 model what effect that'd have.)

Cheers,
  Steve




Re: [PERFORM] Caching of Queries

2004-09-27 Thread Matt Clark

  More to the point though, I think this is a feature that really really
  should be in the DB, because then it's trivial for people to use.

 How does putting it into PGPool make it any less trivial for people to
 use?


The answers are at
http://www2b.biglobe.ne.jp/~caco/pgpool/index-e.html . Specifically:
it's a separate application that needs configuration; the homepage has
no real discussion of the potential pitfalls of pooling and what this
implementation does to get around them; you get the idea. I'm sure
it's great software, but it doesn't come as part of the DB server, so
95% of people who would benefit from query caching being implemented in
it never will. If it shipped with, and was turned on by default in, SUSE
or RedHat, that would be a different matter. Which I realise makes me
look like one of those people who doesn't appreciate code unless it's
'popular', but I hope I'm not *that* bad...

Oh OK, I'll say it, this is a perfect example of why My*** has so much
more mindshare. It's not better, but it sure makes the average Joe
_feel_ better. Sorry, I've got my corporate hat on today, I'm sure
I'll feel a little less cynical tomorrow.

M




Re: [PERFORM] Caching of Queries

2004-09-27 Thread Matt Clark

 Any competently written application where caching results would be a
 suitable performance boost can already implement application or
 middleware caching fairly easily, and increase performance much more
 than putting result caching into the database would.


I guess the performance increase is that you can spend $10,000 on a 
developer, or $10,000 on hardware, and for the most part get a more 
reliable result the second way.  MemcacheD is fine(ish), but it's not a 
panacea, and it's more than easy to shoot yourself in the foot with it.  
Caching is hard enough that lots of people do it badly - I'd rather use 
an implementation from the PG team than almost anywhere else.

 I don't see caching results in the database as much of a win for most
 well written applications.  Toy benchmarks, sure, but for real apps it
 seems it would add a lot of complexity, and violate the whole point of
 using an ACID database.


Well the point surely is to _remove_ complexity from the application, 
which is written by God Knows Who, and put it in the DB, which is 
written by God And You.  And you can still have ACID (cached data is not 
the same as stale data, although once you have the former, the latter 
can begin to look tempting sometimes).

M


Re: [PERFORM] best statistic target for boolean columns

2004-09-27 Thread Gaetano Mendola
Tom Lane wrote:
 Gregory Stark [EMAIL PROTECTED] writes:
  No, actually the stats table keeps the n most common values and their
  frequency (usually in percentage). So really a target of 2 ought to be enough
  for boolean values. In fact that's all I see in pg_statistic; I'm assuming
  there's a full histogram somewhere but I don't see it. Where would it be?

 It's not going to be there.  The histogram only covers values that are
 not in the most-frequent-values list, and therefore it won't exist for a
 column that is completely describable by most-frequent-values.

  However the target also dictates how large a sample of the table to take. A
  target of two represents a very small sample. So the estimations could be
  quite far off.

 Right.  The real point of stats target for such columns is that it
 determines how many rows to sample, and thereby indirectly implies
 the accuracy of the statistics.  For a heavily skewed boolean column
 you'd want a high target so that the number of occurrences of the
 infrequent value would be estimated accurately.

 It's also worth noting that the number of rows sampled is driven by the
 largest per-column stats target in the table, and so reducing stats
 target to 2 for a boolean column will save *zero* effort unless all the
 columns in the table are booleans.
Thank you all; now it's clearer to me how it works.
BTW, I was wondering: why, during an EXPLAIN ANALYZE, can we not use the
information about the actually extracted rows vs. the estimated rows?
Now I'm reading an article, written by the same author that inspired the magic 300
in analyze.c, about "Self-tuning Histograms". If this were implemented, I understand
we could get rid of VACUUM ANALYZE for keeping the statistics up to date.
Does anyone have plans to implement it?
After all, the idea is simple: during normal SELECTs, compare the estimated rows with
the actually extracted rows, then use this free information to refine the histograms.
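The feedback loop described here can be sketched as a simple running correction: each execution nudges the stored selectivity toward what was actually observed. This is only a toy illustration of the idea, not how any real self-tuning histogram is implemented:

```python
def refine_selectivity(estimate, observed, alpha=0.2):
    """Nudge a stored selectivity toward the fraction actually observed."""
    return estimate + alpha * (observed - estimate)

# The planner starts out assuming 50% of rows match; every execution
# reveals that only 2% actually did, and the estimate is refined for free.
sel = 0.50
for _ in range(10):
    sel = refine_selectivity(sel, 0.02)
print(round(sel, 3))                      # converging toward the observed 0.02
```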

Regards
Gaetano Mendola



Re: [PERFORM] best statistic target for boolean columns

2004-09-27 Thread Neil Conway
On Tue, 2004-09-28 at 08:42, Gaetano Mendola wrote:
 Now I'm reading an article, written by the same author that inspired the magic 300
 in analyze.c, about "Self-tuning Histograms". If this were implemented, I understand
 we could get rid of VACUUM ANALYZE for keeping the statistics up to date.
 Does anyone have plans to implement it?

http://www.mail-archive.com/[EMAIL PROTECTED]/msg17477.html

Tom's reply is salient. I still think self-tuning histograms would be
worth looking at for the multi-dimensional case.

-Neil





Re: [PERFORM] Caching of Queries

2004-09-27 Thread Tom Lane
Iain [EMAIL PROTECTED] writes:
 I can only tell you (roughly) how it works with Oracle,

Which unfortunately has little to do with how it works with Postgres.
This "latches" stuff is irrelevant to us.

In practice, any repetitive planning in PG is going to be consulting
catalog rows that it draws from the backend's local catalog caches.
After the first read of a given catalog row, the backend won't need
to re-read it unless the associated table has a schema update.  (There
are some other cases, like a VACUUM FULL of the catalog the rows came
from, but in practice catalog cache entries don't change often in most
scenarios.)  We need to place only one lock per table referenced in order
to interlock against schema updates; not one per catalog row used.

The upshot of all this is that any sort of shared plan cache is going to
create substantially more contention than exists now --- and that's not
even counting the costs of managing the cache, ie deciding when to throw
away entries.

A backend-local plan cache would avoid the contention issues, but would
of course not allow amortizing planning costs across multiple backends.

I'm personally dubious that sharing planning costs is a big deal.
Simple queries generally don't take that long to plan.  Complicated
queries do, but I think the reusability odds go down with increasing
query complexity.

regards, tom lane



Re: [PERFORM] Caching of Queries

2004-09-27 Thread Iain
Hi Tom,

 This "latches" stuff is irrelevant to us.

Well, that's good to know anyway, thanks for setting me straight. Maybe
Oracle could take a leaf out of PG's book instead of the other way around. I
recall that you mentioned the caching of the schema before, so even though I
assumed PG was latching the metadata, I had begun to wonder if it was
actually necessary.

While it's obviously not as critical as I thought, I think there may still
be some potential for query caching by PG. It would be nice to have the
option anyway, as different applications have different needs.

I think that re-use of SQL in applications (ie controlling the proliferation
of SQL statements that are minor variants of each other) is a good goal for
maintainability, even if it doesn't have a major impact on performance as it
seems you are suggesting in the case of pg. Even complex queries that must
be constructed dynamically typically only have a finite number of options
and can still use bind variables, so in a well tuned system, they should
still be viable candidates for caching (ie, if they aren't being bumped out
of the cache by thousands of little queries not using binds).
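The bind-variable reuse described above, sketched with Python's stdlib sqlite3 purely for illustration (any DB-API driver looks much the same; the table and values are invented): one parameterized statement serves all the minor variants.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, status TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(1, "open"), (2, "closed"), (3, "open")])

# One parameterized statement reused with different bind values, instead of
# many textual variants like "... WHERE status = 'open'".
query = "SELECT count(*) FROM t WHERE status = ?"
for status in ("open", "closed"):
    n, = conn.execute(query, (status,)).fetchone()
    print(status, n)
```

On a DBMS with a shared statement cache, every execution of `query` matches the same cached text, so the plan can be reused rather than recomputed per variant.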

I'll just finish by saying that, developing applications in a way that would
take advantage of any query caching still seems like good practice to me,
even if the target DBMS has no query caching. For now, that's what I plan to
do with future PG/Oracle/Hypersonic (my 3 favourite DBMSs) application
development anyway.

Regards
Iain

