Re: [PERFORM] Caching of Queries
[ discussion of server-side result caching ]

And let's not forget the major fork PG throws into things: MVCC. The results of query A may hold true for txn 1 but not for txn 2, and so on. That would have to be taken into account as well, and would greatly complicate things.

It is always possible to do a poor man's query cache with triggers, which would just leave you with basically a materialized view.

-- Jeff Trout [EMAIL PROTECTED] http://www.jefftrout.com/ http://www.stuarthamm.net/

---(end of broadcast)--- TIP 8: explain analyze is your friend
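The trigger-maintained summary table Jeff describes can be sketched briefly. This is an illustrative example only: it uses Python's stdlib sqlite3 (so it is self-contained and runnable), and all table and trigger names are made up; in PostgreSQL the same idea would be written as PL/pgSQL trigger functions.

```python
import sqlite3

# "Poor man's query cache" via triggers: a summary table kept current
# by triggers, i.e. a hand-rolled materialized view. sqlite3 is used
# here only so the sketch is self-contained; the names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT);
CREATE TABLE order_counts (customer TEXT PRIMARY KEY, n INTEGER);

CREATE TRIGGER orders_ins AFTER INSERT ON orders BEGIN
  INSERT OR IGNORE INTO order_counts (customer, n) VALUES (NEW.customer, 0);
  UPDATE order_counts SET n = n + 1 WHERE customer = NEW.customer;
END;

CREATE TRIGGER orders_del AFTER DELETE ON orders BEGIN
  UPDATE order_counts SET n = n - 1 WHERE customer = OLD.customer;
END;
""")
conn.executemany("INSERT INTO orders (customer) VALUES (?)",
                 [("alice",), ("alice",), ("bob",)])

# The expensive GROUP BY is now a cheap lookup against the summary table.
print(dict(conn.execute("SELECT customer, n FROM order_counts")))
```

Note that, exactly as the MVCC point above implies, the summary row a transaction sees is governed by its own snapshot; the triggers only keep the table current, they don't change visibility rules.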
Re: [PERFORM] Caching of Queries
On Mon, 27 Sep 2004 18:20:48 +0100, Matt Clark [EMAIL PROTECTED] wrote:

> This is very true. Client side caching is an enormous win for apps, but it
> requires quite a lot of logic, triggers to update last-modified fields on
> relevant tables, etc etc. Moving some of this logic to the DB would
> perhaps not usually be quite as efficient as a bespoke client caching
> solution, but it will above all be a lot easier for the application
> developer!

In the world of PHP it is trivial thanks to PEAR's Cache_Lite. The project lead for Mambo implemented page-level caching in a day, and had all the triggers for clearing the cache included in the content management interface - not difficult at all.

Basically you set a default in seconds for the HTML results to be cached, and then have triggers set that force the cache to regenerate (whenever CRUD happens to the content, for example).

Can't speak for Perl/Python/Ruby/.Net/Java, but Cache_Lite sure made a believer out of me!

-- Mitch
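The pattern Mitch describes — a default TTL in seconds plus explicit invalidation hooks fired on CRUD — is easy to sketch in any language. The class and method names below are purely illustrative (this is not Cache_Lite's API), shown here in Python so the sketch is self-contained:

```python
import time

# Minimal sketch of the described pattern: results cached with a default
# lifetime in seconds, plus explicit invalidation called from the CRUD
# path (like Mambo's cache-clearing hooks). Names are illustrative.
class PageCache:
    def __init__(self, default_ttl=300):
        self.default_ttl = default_ttl
        self._store = {}          # key -> (expires_at, html)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None or entry[0] < time.monotonic():
            return None           # miss or expired: caller regenerates
        return entry[1]

    def set(self, key, html, ttl=None):
        ttl = self.default_ttl if ttl is None else ttl
        self._store[key] = (time.monotonic() + ttl, html)

    def invalidate(self, key):
        # Called whenever CRUD happens to the underlying content.
        self._store.pop(key, None)

cache = PageCache(default_ttl=60)
cache.set("/front-page", "<html>cached page</html>")
assert cache.get("/front-page") is not None
cache.invalidate("/front-page")   # an edit to the content clears it
assert cache.get("/front-page") is None
```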
Re: [PERFORM] best statistic target for boolean columns
> > don't you think the best statistic target for a boolean column is
> > something like 2? Or, in general, is it useless to have a statistics
> > target greater than the data type's cardinality?
>
> It depends, really, on the proportionality of the boolean values; if
> they're about equal, I certainly wouldn't raise Stats from the default of
> 10. If, however, it's very disproportionate -- like 2% true and 98% false
> -- then it may pay to have better statistics so that the planner doesn't
> assume 50% hits, which it otherwise might.

No, actually the stats table keeps the n most common values and their frequency (usually as a percentage). So really a target of 2 ought to be enough for boolean values. In fact that's all I see in pg_statistic; I'm assuming there's a full histogram somewhere but I don't see it. Where would it be?

However, the target also dictates how large a sample of the table to take. A target of two represents a very small sample, so the estimates could be quite far off. I ran the experiment, and for a table with 2036 false rows out of 204,624 the estimate was 1720. Not bad. But then I did VACUUM FULL ANALYZE and got an estimate of 688, which isn't so good.

-- greg
Re: [PERFORM] Caching of Queries
On Thu, Sep 23, 2004 at 08:29:25AM -0700, Mr Pink wrote:

> Not knowing anything about the internals of pg, I don't know how this
> relates, but in theory, query plan caching is not just about saving time
> re-planning queries, it's about scalability. Optimizing queries requires
> shared locks on the database metadata, which, as I understand it, causes
> contention and serialization, which kills scalability.

One of the gurus can correct me if I'm wrong here, but AFAIK metadata lookups use essentially the same access methods as normal queries. This means MVCC is used and no locking is required. Even if locks were required, they would be shared read locks, which wouldn't block each other.

-- Jim C. Nasby, Database Consultant [EMAIL PROTECTED]
Re: [PERFORM] best statistic target for boolean columns
Gaetano,

> don't you think the best statistic target for a boolean column is
> something like 2? Or, in general, is it useless to have a statistics
> target greater than the data type's cardinality?

It depends, really, on the proportionality of the boolean values; if they're about equal, I certainly wouldn't raise Stats from the default of 10. If, however, it's very disproportionate -- like 2% true and 98% false -- then it may pay to have better statistics so that the planner doesn't assume 50% hits, which it otherwise might.

-- Josh Berkus, Aglio Database Solutions, San Francisco
Re: [PERFORM] best statistic target for boolean columns
Gregory Stark [EMAIL PROTECTED] writes:

> No, actually the stats table keeps the n most common values and their
> frequency (usually as a percentage). So really a target of 2 ought to be
> enough for boolean values. In fact that's all I see in pg_statistic; I'm
> assuming there's a full histogram somewhere but I don't see it. Where
> would it be?

It's not going to be there. The histogram only covers values that are not in the most-frequent-values list, and therefore it won't exist for a column that is completely describable by most-frequent-values.

> However, the target also dictates how large a sample of the table to
> take. A target of two represents a very small sample, so the estimates
> could be quite far off.

Right. The real point of the stats target for such columns is that it determines how many rows to sample, and thereby indirectly implies the accuracy of the statistics. For a heavily skewed boolean column you'd want a high target so that the number of occurrences of the infrequent value would be estimated accurately.

It's also worth noting that the number of rows sampled is driven by the largest per-column stats target in the table, and so reducing the stats target to 2 for a boolean column will save *zero* effort unless all the columns in the table are booleans.

regards, tom lane
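Tom's point about sample size and skewed booleans is easy to see with a toy simulation. This is not PostgreSQL's sampling code, just a hedged illustration using Greg's numbers (roughly 1% of rows matching): a tiny sample gives a noisy estimate of the rare value's frequency, a larger one converges.

```python
import random

# Illustration (not PostgreSQL internals): for a heavily skewed boolean
# column, the accuracy of the estimated frequency of the rare value is
# governed by how many rows are sampled, not by how many distinct values
# the stats target can record. Sample sizes below are arbitrary stand-ins
# for what small vs large stats targets imply.
random.seed(42)
table = [True] * 2000 + [False] * 198000   # ~1% true, heavily skewed

def estimate_true_fraction(sample_size):
    sample = random.sample(table, sample_size)
    return sum(sample) / sample_size

for n in (60, 6000):
    est = estimate_true_fraction(n)
    print(f"sample={n:5d}  estimated true fraction={est:.4f}")
```

With the large sample the estimate lands close to the real 1%; with 60 rows it can easily miss the rare value entirely, which is the planner-misestimation risk Tom describes.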
Re: [PERFORM] Caching of Queries
> Basically you set a default in seconds for the HTML results to be cached,
> and then have triggers set that force the cache to regenerate (whenever
> CRUD happens to the content, for example). Can't speak for
> Perl/Python/Ruby/.Net/Java, but Cache_Lite sure made a believer out of me!

Nice to have it in a library, but if you want to be that simplistic then it's easy in any language. What if a process on server B modifies an important value that server A has cached, though? Coherency (albeit that the client may choose not to use it) is a must for a general solution.
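One common way to address the cross-server staleness problem raised here is to tag cached entries with a version number held in a store all servers can see, and treat an entry as invalid once the shared version has moved on. The sketch below is purely illustrative (a plain dict stands in for the shared store, which in practice might be the database itself or memcached):

```python
# Illustrative coherency sketch: cached values carry the version they
# were cached at; any server's write bumps the shared version, which
# invalidates every other server's copy. All names are made up.
shared_versions = {}          # key -> version, visible to all servers

class CoherentCache:
    def __init__(self):
        self._store = {}      # key -> (version, value), per-server

    def get(self, key):
        entry = self._store.get(key)
        current = shared_versions.get(key, 0)
        if entry is None or entry[0] != current:
            return None       # never cached, or another server wrote
        return entry[1]

    def put(self, key, value):
        self._store[key] = (shared_versions.get(key, 0), value)

def write(key):
    # Any server performing CRUD bumps the shared version.
    shared_versions[key] = shared_versions.get(key, 0) + 1

server_a = CoherentCache()
server_a.put("price:42", 100)
assert server_a.get("price:42") == 100
write("price:42")             # server B modifies the value
assert server_a.get("price:42") is None   # A's copy is now stale
```

The trade-off is a version lookup on every cache read, which is why a purely local TTL cache stays attractive when mild staleness is acceptable.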
Re: [PERFORM] Caching of Queries
> More to the point though, I think this is a feature that really really
> should be in the DB, because then it's trivial for people to use.

How does putting it into PGPool make it any less trivial for people to use?
Re: [PERFORM] best statistic target for boolean columns
Josh Berkus wrote:

> Gaetano,
>
> > don't you think the best statistic target for a boolean column is
> > something like 2? Or, in general, is it useless to have a statistics
> > target greater than the data type's cardinality?
>
> It depends, really, on the proportionality of the boolean values; if
> they're about equal, I certainly wouldn't raise Stats from the default of
> 10. If, however, it's very disproportionate -- like 2% true and 98% false
> -- then it may pay to have better statistics so that the planner doesn't
> assume 50% hits, which it otherwise might.

So I hadn't understood how the statistics histogram works. I'm going to take a look at analyze.c.

Regards, Gaetano Mendola
Re: [PERFORM] Caching of Queries
On Mon, Sep 27, 2004 at 09:19:12PM +0100, Matt Clark wrote:

> > Basically you set a default in seconds for the HTML results to be
> > cached, and then have triggers set that force the cache to regenerate
> > (whenever CRUD happens to the content, for example). Can't speak for
> > Perl/Python/Ruby/.Net/Java, but Cache_Lite sure made a believer out of
> > me!
>
> Nice to have it in a library, but if you want to be that simplistic then
> it's easy in any language. What if a process on server B modifies an
> important value that server A has cached, though? Coherency (albeit that
> the client may choose not to use it) is a must for a general solution.

memcached is one solution designed for that situation. Easy to use from most languages. Works. Lets you use memory on systems where you have it, rather than using up valuable database server RAM that's better spent caching disk sectors.

Any competently written application where caching results would be a suitable performance boost can already implement application or middleware caching fairly easily, and increase performance much more than putting result caching into the database would.

I don't see caching results in the database as much of a win for most well-written applications. Toy benchmarks, sure, but for real apps it seems it would add a lot of complexity and violate the whole point of using an ACID database.

(Caching parse trees or query plans, though? It'd be interesting to model what effect that'd have.)

Cheers, Steve
Re: [PERFORM] Caching of Queries
> > More to the point though, I think this is a feature that really really
> > should be in the DB, because then it's trivial for people to use.
>
> How does putting it into PGPool make it any less trivial for people to
> use?

The answers are at http://www2b.biglobe.ne.jp/~caco/pgpool/index-e.html . Specifically: it's a separate application that needs configuration, the homepage has no real discussion of the potential pitfalls of pooling and what this implementation does to get around them... you get the idea. I'm sure it's great software, but it doesn't come as part of the DB server, so 95% of people who would benefit from query caching being implemented in it never will. If it shipped with, and was turned on by default in, SUSE or Red Hat, that would be a different matter.

Which I realise makes me look like one of those people who doesn't appreciate code unless it's 'popular', but I hope I'm not *that* bad...

Oh OK, I'll say it: this is a perfect example of why My*** has so much more mindshare. It's not better, but it sure makes the average Joe _feel_ better. Sorry, I've got my corporate hat on today; I'm sure I'll feel a little less cynical tomorrow.

M
Re: [PERFORM] Caching of Queries
> Any competently written application where caching results would be a
> suitable performance boost can already implement application or
> middleware caching fairly easily, and increase performance much more than
> putting result caching into the database would.

I guess the point is that you can spend $10,000 on a developer, or $10,000 on hardware, and for the most part get a more reliable result the second way. memcached is fine(ish), but it's not a panacea, and it's more than easy to shoot yourself in the foot with it. Caching is hard enough that lots of people do it badly - I'd rather use an implementation from the PG team than from almost anywhere else.

> I don't see caching results in the database as much of a win for most
> well-written applications. Toy benchmarks, sure, but for real apps it
> seems it would add a lot of complexity and violate the whole point of
> using an ACID database.

Well, the point surely is to _remove_ complexity from the application, which is written by God Knows Who, and put it in the DB, which is written by God And You. And you can still have ACID (cached data is not the same as stale data, although once you have the former, the latter can begin to look tempting sometimes).

M
Re: [PERFORM] best statistic target for boolean columns
Tom Lane wrote:

> Gregory Stark [EMAIL PROTECTED] writes:
>
> > No, actually the stats table keeps the n most common values and their
> > frequency (usually as a percentage). So really a target of 2 ought to
> > be enough for boolean values. In fact that's all I see in pg_statistic;
> > I'm assuming there's a full histogram somewhere but I don't see it.
> > Where would it be?
>
> It's not going to be there. The histogram only covers values that are not
> in the most-frequent-values list, and therefore it won't exist for a
> column that is completely describable by most-frequent-values.
>
> > However, the target also dictates how large a sample of the table to
> > take. A target of two represents a very small sample, so the estimates
> > could be quite far off.
>
> Right. The real point of the stats target for such columns is that it
> determines how many rows to sample, and thereby indirectly implies the
> accuracy of the statistics. For a heavily skewed boolean column you'd
> want a high target so that the number of occurrences of the infrequent
> value would be estimated accurately.
>
> It's also worth noting that the number of rows sampled is driven by the
> largest per-column stats target in the table, and so reducing the stats
> target to 2 for a boolean column will save *zero* effort unless all the
> columns in the table are booleans.

Thank you all; now it's clearer to me how this works.

BTW, I was wondering: why, during an EXPLAIN ANALYZE, can we not use the information about the actually extracted rows vs. the estimated rows? I'm now reading an article about self-tuning histograms, written by the same author that inspired the magic 300 in analyze.c. If this were implemented, I understand we could get rid of VACUUM ANALYZE for keeping the statistics up to date. Does someone have plans to implement it? After all, the idea is simple: during normal SELECTs, compare the estimated rows with the actually extracted rows, then use this free information to refine the histograms.
Regards, Gaetano Mendola
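The feedback loop Gaetano sketches — compare estimated vs. actual rows and refine the stored statistics — can be illustrated in miniature. This is not the self-tuning-histograms algorithm from the paper (which refines histogram buckets); it is a deliberately simplified stand-in that nudges a single selectivity estimate toward observations with an exponential moving average. All names are illustrative, and the row counts reuse Greg's experiment (2036 matching rows out of 204,624):

```python
# Simplified sketch of query-feedback tuning: after each query, compare
# the estimated row count with the actual count and nudge the stored
# selectivity toward the observation. The real proposal refines whole
# histograms; this single-fraction version only shows the feedback idea.
class SelfTuningEstimate:
    def __init__(self, initial_fraction, alpha=0.5):
        self.fraction = initial_fraction   # current selectivity guess
        self.alpha = alpha                 # how fast feedback is absorbed

    def estimate(self, table_rows):
        return self.fraction * table_rows

    def feedback(self, table_rows, actual_rows):
        observed = actual_rows / table_rows
        self.fraction += self.alpha * (observed - self.fraction)

est = SelfTuningEstimate(initial_fraction=0.5)   # naive 50% assumption
print(round(est.estimate(204624)))               # far from reality
est.feedback(204624, 2036)                       # ~1% actually matched
est.feedback(204624, 2036)
print(round(est.estimate(204624)))               # pulled toward ~1%
```

The information really is "free" in the sense that the executor already counts the rows; the hard parts (which Tom's linked reply gets into) are concurrency, plan-dependence of the feedback, and only seeing the values queries happen to touch.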
Re: [PERFORM] best statistic target for boolean columns
On Tue, 2004-09-28 at 08:42, Gaetano Mendola wrote:

> I'm now reading an article about self-tuning histograms, written by the
> same author that inspired the magic 300 in analyze.c. If this were
> implemented, I understand we could get rid of VACUUM ANALYZE for keeping
> the statistics up to date. Does someone have plans to implement it?

http://www.mail-archive.com/[EMAIL PROTECTED]/msg17477.html

Tom's reply is salient. I still think self-tuning histograms would be worth looking at for the multi-dimensional case.

-Neil
Re: [PERFORM] Caching of Queries
Iain [EMAIL PROTECTED] writes:

> I can only tell you (roughly) how it works with Oracle,

Which unfortunately has little to do with how it works with Postgres. This "latches" stuff is irrelevant to us.

In practice, any repetitive planning in PG is going to be consulting catalog rows that it draws from the backend's local catalog caches. After the first read of a given catalog row, the backend won't need to re-read it unless the associated table has a schema update. (There are some other cases, like a VACUUM FULL of the catalog the rows came from, but in practice catalog cache entries don't change often in most scenarios.) We need place only one lock per table referenced, in order to interlock against schema updates, not one per catalog row used.

The upshot of all this is that any sort of shared plan cache is going to create substantially more contention than exists now --- and that's not even counting the costs of managing the cache, i.e. deciding when to throw away entries.

A backend-local plan cache would avoid the contention issues, but would of course not allow amortizing planning costs across multiple backends.

I'm personally dubious that sharing planning costs is a big deal. Simple queries generally don't take that long to plan. Complicated queries do, but I think the reusability odds go down with increasing query complexity.

regards, tom lane
Re: [PERFORM] Caching of Queries
Hi Tom,

> This "latches" stuff is irrelevant to us.

Well, that's good to know anyway; thanks for setting me straight. Maybe Oracle could take a leaf out of PG's book instead of the other way around. I recall that you mentioned the caching of the schema before, so even though I assumed PG was latching the metadata, I had begun to wonder whether it was actually necessary.

While it's obviously not as critical as I thought, I think there may still be some potential for query caching by PG. It would be nice to have the option anyway, as different applications have different needs.

I think that re-use of SQL in applications (i.e. controlling the proliferation of SQL statements that are minor variants of each other) is a good goal for maintainability, even if it doesn't have a major impact on performance, as you seem to be suggesting in the case of PG. Even complex queries that must be constructed dynamically typically have only a finite number of variations and can still use bind variables, so in a well-tuned system they should still be viable candidates for caching (i.e., if they aren't being bumped out of the cache by thousands of little queries not using binds).

I'll just finish by saying that developing applications in a way that would take advantage of any query caching still seems like good practice to me, even if the target DBMS has no query caching. For now, that's what I plan to do with future PG/Oracle/Hypersonic (my 3 favourite DBMSs) application development anyway.

Regards, Iain