Re: [HACKERS] cache estimates, cache access cost
2011/5/19 Robert Haas robertmh...@gmail.com:
> On Tue, May 17, 2011 at 6:11 PM, Cédric Villemain
> cedric.villemain.deb...@gmail.com wrote:
>> The point is to get the ratio in cache, not the distribution of the
>> data in cache (pgfincore also allows you to see this information).  I
>> don't see how a stable system (a server in production) can have its
>> ratio moving up and down so fast without a known pattern.
>
> Really?  It doesn't seem that hard to me.  For example, your nightly
> reports might use a different set of tables than are active during the day.

Yes, this is a known pattern; I believe we can work with it.

>> PS: there is a very good blocker for the pg_class changes: what happens
>> on a standby?  Maybe it just opens the door on how to unlock that, or we
>> find another option to get the information per table but distinct per
>> server.  (Or we don't care, at least for a first implementation, as for
>> other parameters.)
>
> That's a good point, too.

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
Re: [HACKERS] cache estimates, cache access cost
On Thu, May 19, 2011 at 8:19 AM, Cédric Villemain
cedric.villemain.deb...@gmail.com wrote:
> 2011/5/19 Robert Haas robertmh...@gmail.com:
>> On Tue, May 17, 2011 at 6:11 PM, Cédric Villemain
>> cedric.villemain.deb...@gmail.com wrote:
>>> The point is to get the ratio in cache, not the distribution of the
>>> data in cache (pgfincore also allows you to see this information).  I
>>> don't see how a stable system (a server in production) can have its
>>> ratio moving up and down so fast without a known pattern.
>>
>> Really?  It doesn't seem that hard to me.  For example, your nightly
>> reports might use a different set of tables than are active during the day.
>
> Yes, this is a known pattern; I believe we can work with it.

I guess the case where I agree that this would be relatively static is
on something like a busy OLTP system.  If different users access
different portions of the main tables, which parts of each relation are
hot might move around, but overall the percentage of that relation in
cache probably won't move around a ton, except perhaps just after
running a one-off reporting query, or when the system is first starting
up.

But that's not everybody's workload.  Imagine a system that is
relatively lightly used.  Every once in a while someone comes along and
runs a big reporting query.  Well, the contents of the buffer caches
might vary considerably depending on *which* big reporting queries ran
most recently.

Also, even if we knew what was going to be in cache at the start of the
query, the execution of the query might change things greatly as it
runs.  For example, imagine a join between some table and itself.  If
we estimate that none of the data is in cache, we will almost certainly
be wrong, because it's likely both sides of the join are going to
access some of the same pages.  Exactly how many depends on the details
of the join condition and whether we choose to implement it by merging,
sorting, or hashing.  But it's likely going to be more than zero.  This
problem can also arise in other contexts - for example, if a query
accesses a bunch of large tables, the tables that are accessed later in
the computation might be less cached than the ones accessed earlier in
the computation, because the earlier accesses pushed parts of the
tables accessed later out of cache.  Or, if a query requires a large
sort, and the value of work_mem is very high (say 1GB), the sort might
evict data from cache.

Now maybe none of this matters a bit in practice, but it's something to
think about.  There was an interesting report on a problem along these
lines from Kevin Grittner a while back.  He found he needed to set
seq_page_cost and random_page_cost differently for the database user
that ran the nightly reports, precisely because the degree of caching
was very different than it was for the daily activity, and he got bad
plans otherwise.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] cache estimates, cache access cost
2011/5/19 Robert Haas robertmh...@gmail.com:
> On Thu, May 19, 2011 at 8:19 AM, Cédric Villemain
> cedric.villemain.deb...@gmail.com wrote:
>> 2011/5/19 Robert Haas robertmh...@gmail.com:
>>> On Tue, May 17, 2011 at 6:11 PM, Cédric Villemain
>>> cedric.villemain.deb...@gmail.com wrote:
>>>> The point is to get the ratio in cache, not the distribution of the
>>>> data in cache (pgfincore also allows you to see this information).  I
>>>> don't see how a stable system (a server in production) can have its
>>>> ratio moving up and down so fast without a known pattern.
>>>
>>> Really?  It doesn't seem that hard to me.  For example, your nightly
>>> reports might use a different set of tables than are active during the day.
>>
>> Yes, this is a known pattern; I believe we can work with it.
>
> I guess the case where I agree that this would be relatively static is
> on something like a busy OLTP system.  If different users access
> different portions of the main tables, which parts of each relation are
> hot might move around, but overall the percentage of that relation in
> cache probably won't move around a ton, except perhaps just after
> running a one-off reporting query, or when the system is first starting
> up.

Yes.

> But that's not everybody's workload.  Imagine a system that is
> relatively lightly used.  Every once in a while someone comes along and
> runs a big reporting query.  Well, the contents of the buffer caches
> might vary considerably depending on *which* big reporting queries ran
> most recently.

Yes, I agree.  This is the scenario where oscache_percent and
pgcache_percent are subject to change, I guess.  We need to define
1/ whether the values can/need to be changed and 2/ when to update the
values.  For 2/, database usage may help to trigger an ANALYZE when
required.  But to be honest I'd like to hear more about the strategy
suggested by Greg here.  Those scenarios are good to keep in mind to
build good indicators, both for the plugin doing the ANALYZE and to
solve 2/.

> Also, even if we knew what was going to be in cache at the start of the
> query, the execution of the query might change things greatly as it
> runs.  For example, imagine a join between some table and itself.  If
> we estimate that none of the data is in cache, we will almost certainly
> be wrong, because it's likely both sides of the join are going to
> access some of the same pages.  Exactly how many depends on the details
> of the join condition and whether we choose to implement it by merging,
> sorting, or hashing.  But it's likely going to be more than zero.  This
> problem can also arise in other contexts - for example, if a query
> accesses a bunch of large tables, the tables that are accessed later in
> the computation might be less cached than the ones accessed earlier in
> the computation, because the earlier accesses pushed parts of the
> tables accessed later out of cache.

Yes, I believe the Mackert and Lohman formula has been good so far, and
I didn't suggest at any moment removing it.  It will need some
rewriting to handle the new GUC and the new pg_class columns, but the
code is already in place for that.

> Or, if a query requires a large sort, and the value of work_mem is very
> high (say 1GB), the sort might evict data from cache.  Now maybe none
> of this matters a bit in practice, but it's something to think about.

Yes, I agree again.

> There was an interesting report on a problem along these lines from
> Kevin Grittner a while back.  He found he needed to set seq_page_cost
> and random_page_cost differently for the database user that ran the
> nightly reports, precisely because the degree of caching was very
> different than it was for the daily activity, and he got bad plans
> otherwise.

This is in fact a very interesting use case.  I believe the same
strategy can be applied, updating cache_page_cost and pg_class.  But I
would really like it if it closes this use case: seq_page_cost,
random_page_cost and cache_page_cost should not need to be changed;
they should be more 'hardware dependent'.  What will need to be changed
in such a case is in fact the frequency of ANALYZE CACHE (or
arbitrarily set values).  That should allow the planner and the
costsize functions to have accurate values and provide the best plan
(again, the cache estimation coming from the running query remains in
the hands of Mackert and Lohman).  OK, maybe the user will have to
write an ANALYZE CACHE; between some queries in his scenarios.

Maybe a good scenario to add to the performance farm?  (As others, but
this one has the great value of being a production case.)

I'll write those scenarios up on a wiki page so they can be used to
review corner cases and possible issues (not now, it is late here).

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
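For reference, the Mackert and Lohman estimate that both messages lean
on is the one implemented by index_pages_fetched() in costsize.c.  The
following is only a condensed, paraphrased sketch of its shape (the
standalone signature is simplified here; it is not the exact source):

#include <math.h>

/*
 * Condensed sketch of the Mackert and Lohman page-fetch estimate as
 * used by index_pages_fetched() in costsize.c.  T is the number of
 * pages in the table, b the number of buffer pages assumed available
 * to it, and tuples_fetched the expected number of tuple fetches.
 */
static double
ml_pages_fetched(double tuples_fetched, double T, double b)
{
	double		pages_fetched;

	if (T <= b)
	{
		/* whole table fits in cache: asymptote is the table size */
		pages_fetched = (2.0 * T * tuples_fetched) / (2.0 * T + tuples_fetched);
		if (pages_fetched > T)
			pages_fetched = T;
	}
	else
	{
		/* table exceeds cache: beyond lim, additional fetches mostly miss */
		double		lim = (2.0 * T * b) / (2.0 * T - b);

		if (tuples_fetched <= lim)
			pages_fetched = (2.0 * T * tuples_fetched) / (2.0 * T + tuples_fetched);
		else
			pages_fetched = b + (tuples_fetched - lim) * (T - b) / T;
	}
	return ceil(pages_fetched);
}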
Re: [HACKERS] cache estimates, cache access cost
On Sun, May 15, 2011 at 11:52 PM, Greg Smith g...@2ndquadrant.com wrote:
> Cédric Villemain wrote:
>> http://git.postgresql.org/gitweb?p=users/c2main/postgres.git;a=shortlog;h=refs/heads/analyze_cache
>
> This rebases easily to make Cedric's changes move to the end; I just
> pushed a version with that change to
> https://github.com/greg2ndQuadrant/postgres/tree/analyze_cache if
> anyone wants a cleaner one to browse.  I've attached a patch too if
> that's more your thing.

Thank you.  I don't much like sucking in other people's git repos - it
tends to take a lot longer than just opening a patch file, and if I add
the repo as a remote then my git repo ends up bloated.  :-(

> The more important question is how to store the data collected and then
> use it for optimizing queries.

Agreed, but unless I'm missing something, this patch does nothing about
that.  I think the first step needs to be to update all the formulas
that are based on random_page_cost and seq_page_cost to properly take
cache_page_cost into account - and in some cases it may be a bit
debatable what the right mathematics are.

For what it's worth, I don't believe for a minute that an analyze
process that may only run on a given table every six months has a
chance of producing useful statistics about the likelihood that a table
will be cached.  The buffer cache can turn over completely in under a
minute, and a minute is a lot less than a month.  Now, if we measured
this information periodically for a long period of time and averaged
it, that might be a believable basis for setting an optimizer
parameter.  But I think we should take the approach recently discussed
on performance: allow it to be manually set by the administrator on a
per-relation basis, with some reasonable default (maybe based on the
size of the relation relative to effective_cache_size) if the
administrator doesn't intervene.

I don't want to be excessively negative about the approach of examining
the actual behavior of the system and using that to guide system
behavior - indeed, I think there are quite a few places where we would
do well to incorporate that approach to a greater degree than we do
currently.  But I think that it's going to take a lot of research, and
a lot of work, and a lot of performance testing, to convince ourselves
that we've come up with an appropriate feedback mechanism that will
actually deliver better performance across a large variety of
workloads.  It would be much better, IMHO, to *first* get a
cached_page_cost parameter added, even if the mechanism by which
caching percentages are set is initially quite crude - that will give
us a clear-cut benefit that people can begin enjoying immediately.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
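A default along the lines Robert sketches could be as simple as clamping
the relation's share of effective_cache_size.  This is a purely
illustrative sketch; the helper name and the exact heuristic are
assumptions, not something in the patch:

#include "postgres.h"

#include "optimizer/cost.h"		/* effective_cache_size GUC */
#include "storage/block.h"		/* BlockNumber */

/*
 * Illustrative default for the fraction of a relation assumed cached
 * when the administrator has not set one explicitly: a relation much
 * smaller than effective_cache_size is assumed resident, a much larger
 * one only partially so.
 */
static double
default_cached_fraction(BlockNumber rel_pages)
{
	if (rel_pages == 0)
		return 1.0;				/* empty relation: call it fully cached */

	return Min(1.0, (double) effective_cache_size / (double) rel_pages);
}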
Re: [HACKERS] cache estimates, cache access cost
2011/5/17 Robert Haas robertmh...@gmail.com:
> On Sun, May 15, 2011 at 11:52 PM, Greg Smith g...@2ndquadrant.com wrote:
>> Cédric Villemain wrote:
>>> http://git.postgresql.org/gitweb?p=users/c2main/postgres.git;a=shortlog;h=refs/heads/analyze_cache
>>
>> This rebases easily to make Cedric's changes move to the end; I just
>> pushed a version with that change to
>> https://github.com/greg2ndQuadrant/postgres/tree/analyze_cache if
>> anyone wants a cleaner one to browse.  I've attached a patch too if
>> that's more your thing.
>
> Thank you.  I don't much like sucking in other people's git repos - it
> tends to take a lot longer than just opening a patch file, and if I add
> the repo as a remote then my git repo ends up bloated.  :-(
>
>> The more important question is how to store the data collected and then
>> use it for optimizing queries.
>
> Agreed, but unless I'm missing something, this patch does nothing about
> that.  I think the first step needs to be to update all the formulas
> that are based on random_page_cost and seq_page_cost to properly take
> cache_page_cost into account - and in some cases it may be a bit
> debatable what the right mathematics are.

Yes, I provided the branch only in case someone wants to hack on the
costsize code, and to close the problem of getting the stats.

> For what it's worth, I don't believe for a minute that an analyze
> process that may only run on a given table every six months has a
> chance of producing useful statistics about the likelihood that a table
> will be cached.  The buffer cache can turn over completely in under a
> minute, and a minute is a lot less than a month.  Now, if we measured
> this information periodically for a long period of time and averaged
> it, that might be a believable basis for setting an optimizer
> parameter.

The point is to get the ratio in cache, not the distribution of the
data in cache (pgfincore also allows you to see this information).  I
don't see how a stable system (a server in production) can have its
ratio moving up and down so fast without a known pattern.  Maybe it is
a data warehouse, so data moves a lot; then just update your
per-relation stats before starting your queries, as suggested in other
threads.  Maybe it is just a matter of the frequency of stats updates,
or of an explicit request like we *used to do* (ANALYZE foo;), to
handle those situations.

> But I think we should take the approach recently discussed on
> performance: allow it to be manually set by the administrator on a
> per-relation basis, with some reasonable default (maybe based on the
> size of the relation relative to effective_cache_size) if the
> administrator doesn't intervene.
>
> I don't want to be excessively negative about the approach of examining
> the actual behavior of the system and using that to guide system
> behavior - indeed, I think there are quite a few places where we would
> do well to incorporate that approach to a greater degree than we do
> currently.  But I think that it's going to take a lot of research, and
> a lot of work, and a lot of performance testing, to convince ourselves
> that we've come up with an appropriate feedback mechanism that will
> actually deliver better performance across a large variety of
> workloads.  It would be much better, IMHO, to *first* get a
> cached_page_cost parameter added, even if the mechanism by which
> caching percentages are set is initially quite crude - that will give
> us a clear-cut benefit that people can begin enjoying immediately.

The plugin I provided is just to be able to do a first analysis of how
the OS cache size moves.  You can either use pgfincore to monitor that
per table, or use the patch and monitor the *cache column values.

I took the hooks approach because it lets you do what you want :)  You
can set up a hook where you set the values you want to see; it allows,
for example, fixing cold-start values, or permanent values set by the
DBA, or ... whatever you want here.

The topic is: do we need more parameters to improve our planner?
1/ cache_page_cost
2/ cache information, arbitrarily set or not.

Starting with 1/ is OK for me; I'd prefer to try both at once if
possible, to avoid the pain of hacking costsize.c twice.

Several items are to be discussed after that: formulas to handle
'small' tables, data distribution usage (this one hits an old topic
about auto-partitioning, while we are here), cold state, hot state, ...

PS: there is a very good blocker for the pg_class changes: what happens
on a standby?  Maybe it just opens the door on how to unlock that, or
we find another option to get the information per table but distinct
per server.  (Or we don't care, at least for a first implementation, as
for other parameters.)

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
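As a rough illustration of that "set the values you want" idea, a
minimal plugin could install a hook that returns fixed, DBA-chosen
numbers instead of probing the OS.  This sketch assumes the
OSCache_hook and the oscache() signature from the contrib/oscache
module quoted later in the thread; the constants are arbitrary:

#include "postgres.h"

#include "storage/bufmgr.h"		/* OSCache_hook, as added by the patch */
#include "utils/rel.h"			/* Relation */

PG_MODULE_MAGIC;

void		_PG_init(void);
static float4 fixed_oscache(Relation relation, ForkNumber forkNum);

/*
 * Install the hook at load time, as contrib/oscache does.
 */
void
_PG_init(void)
{
	OSCache_hook = fixed_oscache;
}

/*
 * Always report that 90% of the main fork is in the OS cache and that
 * other forks are cold, e.g. to pin "cold start" values chosen by the
 * DBA.  Purely illustrative numbers.
 */
static float4
fixed_oscache(Relation relation, ForkNumber forkNum)
{
	return (forkNum == MAIN_FORKNUM) ? 0.9 : 0.0;
}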
[HACKERS] cache estimates, cache access cost
Hello,

Cache estimation and cache access cost are currently not accounted for
explicitly: they have a cost associated with them, but no constants
(other than effective_cache_size, which has a very limited usage).
Every I/O cost is built from a derivation of seq_page_cost,
random_page_cost and the number of pages.  Formulas are used in some
places to adjust the cost, to take caching and data alignment into
account.

There are:

* estimations of the pages we will find in the PostgreSQL buffer cache
* estimations of the pages we will find in the operating system cache

and they can be computed for:

* the first access
* repeated accesses

We currently don't make a distinction between the two cache areas
(there are more cache areas, but we don't care here) and we 'prefer' to
estimate repeated accesses rather than the first one.

There is also a point related to cost estimations: they are abrupt.
For example, once a sort goes over work_mem, its cost jumps because
page accesses start being counted.

The current cost estimations are already very good: most of our queries
run well without those famous 'HINT's, and the planner provides the
best plan in most cases.  But I believe that we now need more tools to
improve cost estimation even further.  I would like to propose some
ideas (not my ideas in all cases; the topic has been in the air for a
long time, and probably everything has already been said, at least
around a beer or a Pepsi).

Adding a new GUC cache_page_cost:
- allows costing a page access when the page is estimated to be in cache
- allows costing a sort which exceeds work_mem but should not hit disk
- allows using random_page_cost for what it should be.
(I was tempted by a GUC write_page_cost, but I am unsure about this one
at this stage.)

Adding 2 columns to pg_class, oscache_percent and pgcache_percent (or
similar names): they allow storing stats about the percentage of a
relation in each cache.
- The usage should be to estimate the cost of the first access to
pages, then use the Mackert and Lohman formula for subsequent accesses.
The latter only provides a way to estimate the cost of re-reading.

It is hard to advocate here with real expected performance gains, other
than: we will have more options for more precise planner decisions, and
we may reduce the number of reports of bad planning.  (It is also on
the TODO list to improve cache estimation.)

--

I've already hacked the core a bit for that, and added the 2 new
columns with hooks to update them.  ANALYZE OSCACHE updates one of
them, and a plugin can be used to provide the estimate (so how it's
filled is not important; most OSes have solutions to estimate it
accurately, if someone wonders).
It is as-is for a POC, probably not clean enough to go to a commitfest,
and not expected to go there before some consensus is reached.

http://git.postgresql.org/gitweb?p=users/c2main/postgres.git;a=shortlog;h=refs/heads/analyze_cache

--

Hacking costsize is ... dangerous, I would say.  Breaking something
which already works so well is easy.  Changing only one cost function
is not enough to keep a good balance.  The performance farm should help
here ... and the full cycle for 9.2 too.

Comments ?

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
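To make the intended use of cache_page_cost and the two pg_class
fractions concrete, the first-access cost of a page could be blended
roughly as below.  This is only a sketch with assumed names
(first_access_page_cost, the *_frac arguments) and an assumed
weighting; it is not what the branch implements:

#include "postgres.h"

#include "optimizer/cost.h"		/* Cost, random_page_cost */

/* proposed GUC; not in core today */
extern Cost cache_page_cost;

/*
 * Sketch of a blended cost for the *first* access to a page of a
 * relation, given the fractions of the relation believed to be in
 * shared_buffers (pgcache_frac) and in the OS cache (oscache_frac).
 * Re-reads would still go through the Mackert and Lohman machinery.
 */
static Cost
first_access_page_cost(double oscache_frac, double pgcache_frac)
{
	double		in_shared = pgcache_frac;
	double		in_os_only = Max(0.0, oscache_frac - pgcache_frac);
	double		on_disk = Max(0.0, 1.0 - in_shared - in_os_only);

	/* pages already in shared_buffers are treated as free here */
	return in_os_only * cache_page_cost + on_disk * random_page_cost;
}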
Re: [HACKERS] cache estimates, cache access cost
Cédric Villemain wrote:
> http://git.postgresql.org/gitweb?p=users/c2main/postgres.git;a=shortlog;h=refs/heads/analyze_cache

This rebases easily to make Cedric's changes move to the end; I just
pushed a version with that change to
https://github.com/greg2ndQuadrant/postgres/tree/analyze_cache if
anyone wants a cleaner one to browse.  I've attached a patch too if
that's more your thing.

I'd recommend not getting too stuck on the particular hook Cédric has
added here to compute the cache estimate, which uses mmap and mincore
to figure it out.  It's possible to compute similar numbers, albeit
less accurate, using an approach similar to how pg_buffercache inspects
things.  And I even once wrote a background writer extension that
collected this sort of data as it was running the LRU scan anyway.
Discussions of this idea seem to focus on how the "what's in the
cache?" data is collected, which as far as I'm concerned is the least
important part.  There are multiple options, some work better than
others, and there's no reason that can't be swapped out later.  The
more important question is how to store the data collected and then use
it for optimizing queries.

--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us

diff --git a/contrib/Makefile b/contrib/Makefile
index 6967767..47652d5 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -27,6 +27,7 @@ SUBDIRS = \
 		lo \
 		ltree \
 		oid2name \
+		oscache \
 		pageinspect \
 		passwordcheck \
 		pg_archivecleanup \
diff --git a/contrib/oscache/Makefile b/contrib/oscache/Makefile
new file mode 100644
index 000..8d8dcc5
--- /dev/null
+++ b/contrib/oscache/Makefile
@@ -0,0 +1,15 @@
+# contrib/oscache/Makefile
+
+MODULE_big = oscache
+OBJS = oscache.o
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/oscache
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/oscache/oscache.c b/contrib/oscache/oscache.c
new file mode 100644
index 000..1ad7dc2
--- /dev/null
+++ b/contrib/oscache/oscache.c
@@ -0,0 +1,151 @@
+/*-------------------------------------------------------------------------
+ *
+ * oscache.c
+ *
+ *
+ * Copyright (c) 2011, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  contrib/oscache/oscache.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/* { POSIX stuff */
+#include <stdlib.h>		/* exit, calloc, free */
+#include <sys/stat.h>		/* stat, fstat */
+#include <sys/types.h>		/* size_t, mincore */
+#include <unistd.h>		/* sysconf, close */
+#include <sys/mman.h>		/* mmap, mincore */
+/* } */
+
+/* { PostgreSQL stuff */
+#include "postgres.h"		/* general Postgres declarations */
+#include "utils/rel.h"		/* Relation */
+#include "storage/bufmgr.h"
+#include "catalog/catalog.h"	/* relpath */
+/* } */
+
+PG_MODULE_MAGIC;
+
+void	_PG_init(void);
+
+float4	oscache(Relation, ForkNumber);
+
+/*
+ * Module load callback
+ */
+void
+_PG_init(void)
+{
+	/* Install hook. */
+	OSCache_hook = oscache;
+}
+
+/*
+ * oscache process the os cache inspection for the relation.
+ * It returns the percentage of blocks in OS cache.
+ */
+float4
+oscache(Relation relation, ForkNumber forkNum)
+{
+	int		segment = 0;
+	char	   *relationpath;
+	char		filename[MAXPGPATH];
+	int		fd;
+	int64		total_block_disk = 0;
+	int64		total_block_mem = 0;
+
+	/* OS things */
+	int64		pageSize = sysconf(_SC_PAGESIZE);	/* Page size */
+	register int64 pageIndex;
+
+	relationpath = relpathperm(relation->rd_node, forkNum);
+
+	/*
+	 * For each segment of the relation
+	 */
+	snprintf(filename, MAXPGPATH, "%s", relationpath);
+	while ((fd = open(filename, O_RDONLY)) != -1)
+	{
+		/* for stat file */
+		struct stat st;
+		/* for mmap file */
+		void	   *pa = (char *) 0;
+		/* for calloc file */
+		unsigned char *vec = (unsigned char *) 0;
+		int64		block_disk = 0;
+		int64		block_mem = 0;
+
+		if (fstat(fd, &st) == -1)
+		{
+			close(fd);
+			elog(ERROR, "Can not stat object file : %s",
+				 filename);
+			return 0;
+		}
+
+		/*
+		 * if file ok
+		 * then process
+		 */
+		if (st.st_size != 0)
+		{
+			/* number of block in the current file */
+			block_disk = st.st_size / pageSize;
+
+			/* TODO We need to split mmap size to be sure (?) to be able to mmap */
+			pa = mmap(NULL, st.st_size, PROT_NONE, MAP_SHARED, fd, 0);
+			if (pa == MAP_FAILED)
+			{
+				close(fd);
+				elog(ERROR, "Can not mmap object file : %s, errno = %i,%s\nThis error can happen if there is not enought space in memory to do the projection. Please mail ced...@2ndquadrant.fr with '[oscache] ENOMEM' as subject.",
+					 filename, errno, strerror(errno));
+				return 0;
+			}
+
+			/* Prepare our vector containing all blocks information */
+			vec = calloc(1, (st.st_size + pageSize - 1) / pageSize);
+			if ((void *) 0 == vec)
+			{
+				munmap(pa, st.st_size);