Re: [HACKERS] cache estimates, cache access cost

2011-05-19 Thread Cédric Villemain
2011/5/19 Robert Haas robertmh...@gmail.com:
 On Tue, May 17, 2011 at 6:11 PM, Cédric Villemain
 cedric.villemain.deb...@gmail.com wrote:
 The point is to get ratio in cache, not the distribution of the data
 in cache (pgfincore also allows you to see this information).
 I don't see how a stable (a server in production) system can have its
 ratio moving up and down so fast without known pattern.

 Really?  It doesn't seem that hard to me.  For example, your nightly
 reports might use a different set of tables than are active during the
 day

yes, this is a known pattern, I believe we can work with it.


 PS: there is a very good blocker for the pg_class changes: what
 happens on a standby?  Maybe it just opens the door on how to unlock
 that, or to find another option to get the information per table but
 distinct per server. (Or we don't care, at least for a first
 implementation, as for other parameters.)

 That's a good point, too.

 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company




-- 
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support



Re: [HACKERS] cache estimates, cache access cost

2011-05-19 Thread Robert Haas
On Thu, May 19, 2011 at 8:19 AM, Cédric Villemain
cedric.villemain.deb...@gmail.com wrote:
 2011/5/19 Robert Haas robertmh...@gmail.com:
 On Tue, May 17, 2011 at 6:11 PM, Cédric Villemain
 cedric.villemain.deb...@gmail.com wrote:
 The point is to get ratio in cache, not the distribution of the data
 in cache (pgfincore also allows you to see this information).
 I don't see how a stable (a server in production) system can have its
 ratio moving up and down so fast without known pattern.

 Really?  It doesn't seem that hard to me.  For example, your nightly
 reports might use a different set of tables than are active during the
 day

 yes, this is a known pattern, I believe we can work with it.

I guess the case where I agree that this would be relatively static is
on something like a busy OLTP system.  If different users access
different portions of the main tables, which parts of each relation
are hot might move around, but overall the percentage of that relation
in cache probably won't move around a ton, except perhaps just after
running a one-off reporting query, or when the system is first
starting up.

But that's not everybody's workload.  Imagine a system that is
relatively lightly used.  Every once in a while someone comes along
and runs a big reporting query.  Well, the contents of the buffer
caches might vary considerably depending on *which* big reporting
queries ran most recently.

Also, even if we knew what was going to be in cache at the start of
the query, the execution of the query might change things greatly as
it runs.  For example, imagine a join between some table and itself.
If we estimate that none of the data is in cache, we will almost
certainly be wrong, because it's likely both sides of the join are
going to access some of the same pages.  Exactly how many depends on
the details of the join condition and whether we choose to implement
it by merging, sorting, or hashing.  But it's likely going to be more
than zero.  This problem can also arise in other contexts - for
example, if a query accesses a bunch of large tables, the tables that
are accessed later in the computation might be less cached than the
ones accessed earlier in the computation, because the earlier accesses
pushed parts of the tables accessed later out of cache.  Or, if a
query requires a large sort, and the value of work_mem is very high
(say 1GB), the sort might evict data from cache.  Now maybe none of
this matters a bit in practice, but it's something to think about.

There was an interesting report on a problem along these lines from
Kevin Grittner a while back.  He found he needed to set seq_page_cost
and random_page_cost differently for the database user that ran the
nightly reports, precisely because the degree of caching was very
different than it was for the daily activity, and he got bad plans
otherwise.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] cache estimates, cache access cost

2011-05-19 Thread Cédric Villemain
2011/5/19 Robert Haas robertmh...@gmail.com:
 On Thu, May 19, 2011 at 8:19 AM, Cédric Villemain
 cedric.villemain.deb...@gmail.com wrote:
 2011/5/19 Robert Haas robertmh...@gmail.com:
 On Tue, May 17, 2011 at 6:11 PM, Cédric Villemain
 cedric.villemain.deb...@gmail.com wrote:
 The point is to get ratio in cache, not the distribution of the data
 in cache (pgfincore also allows you to see this information).
 I don't see how a stable (a server in production) system can have its
 ratio moving up and down so fast without known pattern.

 Really?  It doesn't seem that hard to me.  For example, your nightly
 reports might use a different set of tables than are active during the
 day

 yes, this is a known pattern, I believe we can work with it.

 I guess the case where I agree that this would be relatively static is
 on something like a busy OLTP system.  If different users access
 different portions of the main tables, which parts of each relation
 are hot might move around, but overall the percentage of that relation
 in cache probably won't move around a ton, except perhaps just after
 running a one-off reporting query, or when the system is first
 starting up.

yes.


 But that's not everybody's workload.  Imagine a system that is
 relatively lightly used.  Every once in a while someone comes along
 and runs a big reporting query.  Well, the contents of the buffer
 caches might vary considerably depending on *which* big reporting
 queries ran most recently.

Yes, I agree. This scenario is the case where oscache_percent and
pgcache_percent are subject to change, I guess. We need to define 1/
whether the values can/need to be changed and 2/ when to update the
values. For 2/, the database usage may help to trigger an ANALYZE when
required. But to be honest I'd like to hear more of the strategy
suggested by Greg here.

Those scenarios are good to keep in mind to build good indicators,
both for the plugin doing the ANALYZE and to solve 2/.


 Also, even if we knew what was going to be in cache at the start of
 the query, the execution of the query might change things greatly as
 it runs.  For example, imagine a join between some table and itself.
 If we estimate that none of the data is in cache, we will almost
 certainly be wrong, because it's likely both sides of the join are
 going to access some of the same pages.  Exactly how many depends on
 the details of the join condition and whether we choose to implement
 it by merging, sorting, or hashing.  But it's likely going to be more
 than zero.  This problem can also arise in other contexts - for
 example, if a query accesses a bunch of large tables, the tables that
 are accessed later in the computation might be less cached than the
 ones accessed earlier in the computation, because the earlier accesses
 pushed parts of the tables accessed later out of cache.

Yes, I believe the Mackert and Lohman formula has been good so far,
and I didn't suggest at any moment removing it.
It will need some rewriting to work with the new GUC and the new
pg_class columns, but the code is already in place for that.

 Or, if a
 query requires a large sort, and the value of work_mem is very high
 (say 1GB), the sort might evict data from cache.  Now maybe none of
 this matters a bit in practice, but it's something to think about.

Yes I agree again.


 There was an interesting report on a problem along these lines from
 Kevin Grittner a while back.  He found he needed to set seq_page_cost
 and random_page_cost differently for the database user that ran the
 nightly reports, precisely because the degree of caching was very
 different than it was for the daily activity, and he got bad plans
 otherwise.

This is in fact a very interesting use case.  I believe the same
strategy can be applied to update cache_page_cost and pg_class.
But I would really like it if it closes this use case: seq_page_cost,
random_page_cost and cache_page_cost should not need to be changed;
they should be more 'hardware dependent'. What will need to change is
in fact the frequency of ANALYZE CACHE in such a case (or arbitrarily
set values). That should allow the planner and the costsize functions
to have accurate values and provide the best plan (again, the cache
estimation for the running query remains in the hands of the Mackert
and Lohman formula).
OK, maybe the user will have to write an ANALYZE CACHE; between some
queries in his scenarios.

Maybe a good scenario to add to the performance farm?  (Like others,
but this one has the great value of being a production case.)

I'll write those scenarios up on a wiki page so they can be used to
review corner cases and possible issues (not now, it is late here).


 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company




-- 
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support


Re: [HACKERS] cache estimates, cache access cost

2011-05-17 Thread Robert Haas
On Sun, May 15, 2011 at 11:52 PM, Greg Smith g...@2ndquadrant.com wrote:
 Cédric Villemain wrote:

 http://git.postgresql.org/gitweb?p=users/c2main/postgres.git;a=shortlog;h=refs/heads/analyze_cache

 This rebases easily to make Cedric's changes move to the end; I just pushed
 a version with that change to
 https://github.com/greg2ndQuadrant/postgres/tree/analyze_cache if anyone
 wants a cleaner one to browse.  I've attached a patch too if that's more
 your thing.

Thank you.  I don't much like sucking in other people's git repos - it
tends to take a lot longer than just opening a patch file, and if I
add the repo as a remote then my git repo ends up bloated.  :-(

 The more important question is how to store the data collected and
 then use it for optimizing queries.

Agreed, but unless I'm missing something, this patch does nothing
about that.  I think the first step needs to be to update all the
formulas that are based on random_page_cost and seq_page_cost to
properly take cache_page_cost into account - and in some cases it may
be a bit debatable what the right mathematics are.
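
For illustration only (this is not from the posted patch, and the name
and shape here are just an assumption), one simple and certainly
debatable form for that math would be to weight the per-page cost by
the fraction of the relation estimated to be cached:

/*
 * Hypothetical sketch: blend the proposed cache_page_cost GUC with an
 * existing per-page cost, weighted by the fraction of the relation
 * estimated to be in cache (e.g. the proposed pg_class columns).
 */
static double
blended_page_cost(double uncached_page_cost,	/* seq_ or random_page_cost */
				  double cache_page_cost,		/* proposed GUC */
				  double cached_fraction)		/* 0.0 .. 1.0, per relation */
{
	return cached_fraction * cache_page_cost +
		(1.0 - cached_fraction) * uncached_page_cost;
}

Whether the weighting should be linear, and whether it should apply to
seq_page_cost and random_page_cost alike, is exactly the debatable part.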

For what it's worth, I don't believe for a minute that an analyze
process that may only run on a given table every six months has a
chance of producing useful statistics about the likelihood that a
table will be cached.  The buffer cache can turn over completely in
under a minute, and a minute is a lot less than a month.  Now, if we
measured this information periodically for a long period of time and
averaged it, that might be a believable basis for setting an optimizer
parameter.  But I think we should take the approach recently discussed
on performance: allow it to be manually set by the administrator on a
per-relation basis, with some reasonable default (maybe based on the
size of the relation relative to effective_cache_size) if the
administrator doesn't intervene.  I don't want to be excessively
negative about the approach of examining the actual behavior of the
system and using that to guide system behavior - indeed, I think there
are quite a few places where we would do well to incorporate that
approach to a greater degree than we do currently.  But I think that
it's going to take a lot of research, and a lot of work, and a lot of
performance testing, to convince ourselves that we've come up with an
appropriate feedback mechanism that will actually deliver better
performance across a large variety of workloads.  It would be much
better, IMHO, to *first* get a cached_page_cost parameter added, even
if the mechanism by which caching percentages are set is initially
quite crude - that will give us a clear-cut benefit that people can
begin enjoying immediately.
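
As a strawman for that default (purely illustrative, with both sizes
expressed in blocks; none of this is in any posted patch):

/*
 * Hypothetical default for the assumed-cached fraction of a relation
 * when the administrator has not set one: scale by the relation's
 * size relative to effective_cache_size.
 */
static double
default_cached_fraction(double rel_pages, double effective_cache_pages)
{
	if (rel_pages <= 0.0)
		return 1.0;				/* empty relation: assume fully cached */
	if (rel_pages <= effective_cache_pages)
		return 1.0;				/* assumed to fit entirely in cache */
	return effective_cache_pages / rel_pages;
}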

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] cache estimates, cache access cost

2011-05-17 Thread Cédric Villemain
2011/5/17 Robert Haas robertmh...@gmail.com:
 On Sun, May 15, 2011 at 11:52 PM, Greg Smith g...@2ndquadrant.com wrote:
 Cédric Villemain wrote:

 http://git.postgresql.org/gitweb?p=users/c2main/postgres.git;a=shortlog;h=refs/heads/analyze_cache

 This rebases easily to make Cedric's changes move to the end; I just pushed
 a version with that change to
 https://github.com/greg2ndQuadrant/postgres/tree/analyze_cache if anyone
 wants a cleaner one to browse.  I've attached a patch too if that's more
 your thing.

 Thank you.  I don't much like sucking in other people's git repos - it
 tends to take a lot longer than just opening a patch file, and if I
 add the repo as a remote then my git repo ends up bloated.  :-(

 The more important question is how to store the data collected and
 then use it for optimizing queries.

 Agreed, but unless I'm missing something, this patch does nothing
 about that.  I think the first step needs to be to update all the
 formulas that are based on random_page_cost and seq_page_cost to
 properly take cache_page_cost into account - and in some cases it may
 be a bit debatable what the right mathematics are.

Yes, I provided the branch only in case someone wants to hack on
costsize.c, and to close the problem of getting the stats.


 For what it's worth, I don't believe for a minute that an analyze
 process that may only run on a given table every six months has a
 chance of producing useful statistics about the likelihood that a
 table will be cached.  The buffer cache can turn over completely in
 under a minute, and a minute is a lot less than a month.  Now, if we
 measured this information periodically for a long period of time and
 averaged it, that might be a believable basis for setting an optimizer

The point is to get ratio in cache, not the distribution of the data
in cache (pgfincore also allows you to see this information).
I don't see how a stable (a server in production) system can have its
ratio moving up and down so fast without known pattern.
Maybe it is a data warehouse, so data moves a lot; then just update
your per-relation stats before starting your queries, as suggested in
other threads. Maybe it is just a matter of the frequency of stats
updates, or of an explicit request like we *used to do* (ANALYZE foo;)
to handle those situations.

 parameter.  But I think we should take the approach recently discussed
 on performance: allow it to be manually set by the administrator on a
 per-relation basis, with some reasonable default (maybe based on the
 size of the relation relative to effective_cache_size) if the
 administrator doesn't intervene.  I don't want to be excessively
 negative about the approach of examining the actual behavior of the
 system and using that to guide system behavior - indeed, I think there
 are quite a few places where we would do well to incorporate that
 approach to a greater degree than we do currently.  But I think that
 it's going to take a lot of research, and a lot of work, and a lot of
 performance testing, to convince ourselves that we've come up with an
 appropriate feedback mechanism that will actually deliver better
 performance across a large variety of workloads.  It would be much
 better, IMHO, to *first* get a cached_page_cost parameter added, even
 if the mechanism by which caching percentages are set is initially
 quite crude - that will give us a clear-cut benefit that people can
 begin enjoying immediately.

The plugin I provided is just to be able to do a first analysis of how
the OS cache size moves. You can either use pgfincore to monitor that
per table, or use the patch and monitor the *cache column values.

I took the hooks approach because it allows you to do what you want :)
You can set up a hook where you set the values you want to see; it
allows, for example, fixing cold-start values, or permanent values set
by the DBA, or ... do what you want here.
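
As a minimal illustration (assuming the OSCache_hook that the core
patch adds, the same hook the contrib/oscache module quoted later in
this thread installs), a plugin that pins the estimate to a DBA-chosen
constant could look roughly like this:

/* Hypothetical plugin: pin the OS-cache estimate to a fixed value. */
#include "postgres.h"

#include "fmgr.h"				/* PG_MODULE_MAGIC */
#include "storage/bufmgr.h"		/* OSCache_hook, per the core patch */
#include "utils/rel.h"			/* Relation, ForkNumber */

PG_MODULE_MAGIC;

void		_PG_init(void);

static float4
fixed_oscache(Relation relation, ForkNumber forkNum)
{
	/* Pretend 90% of every relation is in the OS cache. */
	return 0.9;
}

void
_PG_init(void)
{
	OSCache_hook = fixed_oscache;
}

Cold-start or per-server overrides would just be variations on what
this function returns.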

The topic is: do we need more parameters to increase the value of our planner?
1/ cache_page_cost
2/ cache information, arbitrarily set or not.

Starting with 1/ is OK for me, but I'd prefer to try both at once if
possible, to avoid the pain of hacking costsize.c twice.

Several items are to be discussed after that: formulas to handle
'small' tables, data distribution usage (this one hits an old topic
about auto-partitioning while we are at it), cold state, hot state, ...

PS: there is a very good blocker for the pg_class changes: what
happens on a standby?  Maybe it just opens the door on how to unlock
that, or to find another option to get the information per table but
distinct per server. (Or we don't care, at least for a first
implementation, as for other parameters.)
-- 
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support



[HACKERS] cache estimates, cache access cost

2011-05-15 Thread Cédric Villemain
Hello

cache estimation and cache access cost are currently not accounted
for explicitly: they have a cost associated with them but no constants
(other than effective_cache_size, but it has a very limited usage).

Every IO cost is built from a derivation of seq_page_cost,
random_page_cost and the number of pages. Formulas are used in some
places to adjust the cost up or down, to take into account caching and
data alignment.

There are:

 * an estimation of the pages we will find in the PostgreSQL buffer cache
 * an estimation of the pages we will find in the operating system cache

and they can be computed for:

 * first access
 * repeated access

We currently don't make a distinction between the two cache areas
(there are more cache areas, but we don't care here) and we 'prefer'
to estimate repeated access instead of the first one.

There is also a point related to cost estimation: the estimates change
abruptly. For example, once a sort goes over work_mem, its cost jumps
because page accesses are then accounted for.

The current cost estimations are already very good: most of our
queries run well without those famous 'HINTs', and the planner
provides the best plan in most cases.

But I believe that we now need more tools to improve the cost
estimation even further.
I would like to propose some ideas (not all of them mine; the topic
has been in the air for a long time and probably everything has
already been said, at least around a beer or a pepsi).

Adding a new GUC, cache_page_cost:
- allows costing page access when the page is estimated to be in cache
- allows costing a sort that exceeds work_mem but should not hit disk
- allows using random_page_cost for what it is really meant for.
(I was tempted by a write_page_cost GUC, but I am unsure about this
one at this stage.)

Adding 2 columns to pg_class, oscache_percent and pgcache_percent
(or similar names): they store stats about the percentage of a
relation present in each cache.
- Usage would be to estimate the cost of the first access to pages,
then use the Mackert and Lohman formula for subsequent accesses. The
latter only provides a way to estimate the cost of re-reading.
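
For reference, a rough sketch of that Mackert and Lohman estimate, as
it is documented above index_pages_fetched() in costsize.c (T = pages
in the relation, Ns = number of tuples fetched, b = buffer pages
assumed available; the function name here is mine):

#include <math.h>

/*
 * Expected number of distinct pages fetched when Ns tuples are read
 * at random from a T-page relation with b buffer pages available.
 */
static double
mackert_lohman_pages_fetched(double Ns, double T, double b)
{
	if (T <= b)
		return fmin((2.0 * T * Ns) / (2.0 * T + Ns), T);
	else
	{
		double		lim = (2.0 * T * b) / (2.0 * T - b);

		if (Ns <= lim)
			return (2.0 * T * Ns) / (2.0 * T + Ns);
		else
			return b + (Ns - lim) * (T - b) / T;
	}
}

The new columns would be used to cost the first read of those pages;
this formula already covers the re-read part.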

It is hard to advocate here with a concrete expected performance gain
other than: we will have more options for more precise planner
decisions, and we may reduce the number of reports of bad planning.
(Improving cache estimation is also on the TODO list.)

--

I've already hacked the core a bit for that and added the 2 new
columns with hooks to update them. ANALYZE OSCACHE updates one of
them, and a plugin can be used to provide the estimate (so how it's
filled is not important; most OSes have solutions to estimate it
accurately, if anyone wonders).
It is as-is for a POC, probably not clean enough to go to the
commitfest, and not expected to go there before some consensus is
reached.
http://git.postgresql.org/gitweb?p=users/c2main/postgres.git;a=shortlog;h=refs/heads/analyze_cache

-- 

Hacking costsize.c is ... dangerous, I would say. Breaking something
which already works so well is easy. Changing only one cost function
is not enough to keep a good balance.
The performance farm should help here ... and the full cycle for 9.2 too.

Comments ?
-- 
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support



Re: [HACKERS] cache estimates, cache access cost

2011-05-15 Thread Greg Smith

Cédric Villemain wrote:

http://git.postgresql.org/gitweb?p=users/c2main/postgres.git;a=shortlog;h=refs/heads/analyze_cache
  


This rebases easily to make Cedric's changes move to the end; I just 
pushed a version with that change to 
https://github.com/greg2ndQuadrant/postgres/tree/analyze_cache if anyone 
wants a cleaner one to browse.  I've attached a patch too if that's more 
your thing.


I'd recommend not getting too stuck on the particular hook Cédric has 
added here to compute the cache estimate, which uses mmap and mincore to 
figure it out.  It's possible to compute similar numbers, albeit less 
accurate, using an approach similar to how pg_buffercache inspects 
things.  And I even once wrote a background writer extension that 
collected this sort of data as it was running the LRU scan anyway.  
Discussions of this idea seem to focus on how the "what's in the cache?" 
data is collected, which as far as I'm concerned is the least important 
part.  There are multiple options, some work better than others, and 
there's no reason that can't be swapped out later.  The more important 
question is how to store the data collected and then use it for 
optimizing queries.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us


diff --git a/contrib/Makefile b/contrib/Makefile
index 6967767..47652d5 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -27,6 +27,7 @@ SUBDIRS = \
 		lo		\
 		ltree		\
 		oid2name	\
+		oscache	\
 		pageinspect	\
 		passwordcheck	\
 		pg_archivecleanup \
diff --git a/contrib/oscache/Makefile b/contrib/oscache/Makefile
new file mode 100644
index 000..8d8dcc5
--- /dev/null
+++ b/contrib/oscache/Makefile
@@ -0,0 +1,15 @@
+# contrib/oscache/Makefile
+
+MODULE_big = oscache
+OBJS = oscache.o
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/oscache
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/oscache/oscache.c b/contrib/oscache/oscache.c
new file mode 100644
index 000..1ad7dc2
--- /dev/null
+++ b/contrib/oscache/oscache.c
@@ -0,0 +1,151 @@
+/*-------------------------------------------------------------------------
+ *
+ * oscache.c
+ *
+ *
+ * Copyright (c) 2011, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  contrib/oscache/oscache.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/* { POSIX stuff */
+#include <stdlib.h> /* exit, calloc, free */
+#include <sys/stat.h> /* stat, fstat */
+#include <sys/types.h> /* size_t, mincore */
+#include <unistd.h> /* sysconf, close */
+#include <sys/mman.h> /* mmap, mincore */
+#include <fcntl.h> /* open, O_RDONLY */
+/* } */
+
+/* { PostgreSQL stuff */
+#include "postgres.h" /* general Postgres declarations */
+#include "utils/rel.h" /* Relation */
+#include "storage/bufmgr.h"
+#include "catalog/catalog.h" /* relpath */
+/* } */
+
+PG_MODULE_MAGIC;
+
+void		_PG_init(void);
+
+float4 oscache(Relation, ForkNumber);
+
+/*
+ * Module load callback
+ */
+void
+_PG_init(void)
+{
+	/* Install hook. */
+	OSCache_hook = oscache;
+}
+
+/*
+ * oscache processes the OS cache inspection for the relation.
+ * It returns the percentage of blocks in the OS cache.
+ */
+float4
+oscache(Relation relation, ForkNumber forkNum)
+{
+	int  segment = 0;
+	char *relationpath;
+	char filename[MAXPGPATH];
+	int fd;
+	int64  total_block_disk = 0;
+	int64  total_block_mem  = 0;
+
+	/* OS things */
+	int64 pageSize  = sysconf(_SC_PAGESIZE); /* Page size */
+	register int64 pageIndex;
+
+	relationpath = relpathperm(relation->rd_node, forkNum);
+
+	/*
+	 * For each segment of the relation
+	 */
+	snprintf(filename, MAXPGPATH, "%s", relationpath);
+	while ((fd = open(filename, O_RDONLY)) != -1)
+	{
+		// for stat file
+		struct stat st;
+		// for mmap file
+		void *pa = (char *)0;
+		// for calloc file
+		unsigned char *vec = (unsigned char *)0;
+		int64  block_disk = 0;
+		int64  block_mem  = 0;
+
+		if (fstat(fd, &st) == -1)
+		{
+			close(fd);
+			elog(ERROR, "Can not stat object file : %s",
+				 filename);
+			return 0;
+		}
+
+		/*
+		* if file ok
+		* then process
+		*/
+		if (st.st_size != 0)
+		{
+			/* number of block in the current file */
+			block_disk = st.st_size/pageSize;
+
+			/* TODO We need to split mmap size to be sure (?) to be able to mmap */
+			pa = mmap(NULL, st.st_size, PROT_NONE, MAP_SHARED, fd, 0);
+			if (pa == MAP_FAILED)
+			{
+				close(fd);
+				elog(ERROR, "Can not mmap object file : %s, errno = %i,%s\nThis error can happen if there is not enough space in memory to do the projection. Please mail ced...@2ndquadrant.fr with '[oscache] ENOMEM' as subject.",
+					 filename, errno, strerror(errno));
+				return 0;
+			}
+
+			/* Prepare our vector containing all blocks information */
+			vec = calloc(1, (st.st_size+pageSize-1)/pageSize);
+			if ((void *)0 == vec)
+			{
+munmap(pa, st.st_size);
+