Re: [PERFORM] Caching of Queries

2004-12-23 Thread Pierre-Frédéric Caillaud

I've looked at PREPARE, but apparently it only lasts per-session -
that's worthless in our case (web based service, one connection per
data-requiring connection).
	You don't use persistent connections ???
	Your problem might simply be the connection time overhead (also including
a few TCP roundtrips).

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
   (send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [PERFORM] Caching of Queries

2004-10-07 Thread Tom Lane
Tatsuo Ishii [EMAIL PROTECTED] writes:
 First, it's not a particular problem with pgpool. As far as I know any
 connection pool solution has exactly the same problem. Second, it's
 easy to fix if PostgreSQL provides a functionality such as: drop all
 temporary tables if any.

I don't like that definition exactly --- it would mean that every time
we add more backend-local state, we expect client drivers to know to
issue the right incantation to reset that kind of state.

I'm thinking we need to invent a command like RESET CONNECTION that
resets GUC variables, drops temp tables, forgets active NOTIFYs, and
generally does whatever else needs to be done to make the session state
appear virgin.  When we add more such state, we can fix it inside the
backend without bothering clients.
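The maintenance burden being argued against can be made concrete with a small sketch. The statement list below is illustrative only (not exhaustive or exact), and RESET CONNECTION is the hypothetical command from the proposal, not an existing one:

```python
# Sketch of the coupling the proposal removes. Today every pooler must keep
# its own list of "reset this kind of backend-local state" statements and
# extend it with each release; with the proposed command, that knowledge
# lives in one place, inside the backend.
def reset_statements(server_has_reset_connection):
    if server_has_reset_connection:
        return ["RESET CONNECTION"]     # hypothetical, per the proposal
    return [
        "ABORT",                        # end any open transaction
        "RESET ALL",                    # GUC variables back to defaults
        "UNLISTEN *",                   # forget active NOTIFY registrations
        # ...plus dropping each temp table, deallocating each prepared
        # statement, and whatever future backend-local state appears...
    ]

print(reset_statements(True))           # ['RESET CONNECTION']
```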

I now realize that our RESET ALL command for GUC variables was not
fully thought out.  We could possibly redefine it as doing the above,
but that might break some applications ...

regards, tom lane



Re: [PERFORM] Caching of Queries

2004-10-07 Thread Tatsuo Ishii
 Tatsuo Ishii [EMAIL PROTECTED] writes:
  First, it's not a particular problem with pgpool. As far as I know any
  connection pool solution has exactly the same problem. Second, it's
  easy to fix if PostgreSQL provides a functionality such as: drop all
  temporary tables if any.
 
 I don't like that definition exactly --- it would mean that every time
 we add more backend-local state, we expect client drivers to know to
 issue the right incantation to reset that kind of state.
 
 I'm thinking we need to invent a command like RESET CONNECTION that
 resets GUC variables, drops temp tables, forgets active NOTIFYs, and
 generally does whatever else needs to be done to make the session state
 appear virgin.  When we add more such state, we can fix it inside the
 backend without bothering clients.

Great. It's much better than what I proposed.

 I now realize that our RESET ALL command for GUC variables was not
 fully thought out.  We could possibly redefine it as doing the above,
 but that might break some applications ...
 
   regards, tom lane
 



Re: [PERFORM] Caching of Queries

2004-10-06 Thread Tatsuo Ishii
  I don't know what exactly you are referring to in the above URL 
  when you are talking about the potential pitfalls of pooling. 
  Please explain more.
 
 Sorry, I wasn't implying that pgpool doesn't deal with the issues, just that
 some people aren't necessarily aware of them up front.  For instance, pgpool
 does an 'abort transaction' and a 'reset all' in lieu of a full reconnect
 (of course, since a full reconnect is exactly what we are trying to avoid).
 Is this enough to guarantee that a given pooled connection behaves
 exactly as a non-pooled connection would from a client perspective?  For
 instance, temporary tables are usually dropped at the end of a session, so a
 client (badly coded perhaps) that does not already use persistent
 connections might be confused when the sequence 'connect, create temp table
 foo ..., disconnect, connect, create temp table foo ...' results in the
 error 'Relation 'foo' already exists'.

First, it's not a particular problem with pgpool. As far as I know any
connection pool solution has exactly the same problem. Second, it's
easy to fix if PostgreSQL provides a functionality such as: drop all
temporary tables if any. I think we should implement it if we agree
that connection pooling should be implemented outside the PostgreSQL
engine itself. I think the core team agrees with this.
--
Tatsuo Ishii



Re: [PERFORM] Caching of Queries

2004-10-05 Thread Matt Clark

 I don't know what exactly you are referring to in the above URL 
 when you are talking about the potential pitfalls of pooling. 
 Please explain more.

Sorry, I wasn't implying that pgpool doesn't deal with the issues, just that
some people aren't necessarily aware of them up front.  For instance, pgpool
does an 'abort transaction' and a 'reset all' in lieu of a full reconnect
(of course, since a full reconnect is exactly what we are trying to avoid).
Is this enough to guarantee that a given pooled connection behaves
exactly as a non-pooled connection would from a client perspective?  For
instance, temporary tables are usually dropped at the end of a session, so a
client (badly coded perhaps) that does not already use persistent
connections might be confused when the sequence 'connect, create temp table
foo ..., disconnect, connect, create temp table foo ...' results in the
error 'Relation 'foo' already exists'.
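The failure mode described above is easy to reproduce with any engine that scopes temp tables to a session. A minimal sketch in Python, using sqlite3 purely as a stand-in for a pooled backend (the cleanup query is sqlite-specific; a Postgres pooler would need the equivalent catalog lookup):

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # one pooled physical connection

def client_request(conn):
    # each logical "client" assumes it starts with a fresh session
    conn.execute("CREATE TEMP TABLE foo (id INTEGER)")

client_request(conn)                 # first "connect": fine
try:
    client_request(conn)             # second "connect" reuses the session
except sqlite3.OperationalError as e:
    print(e)                         # table foo already exists

# The cleanup a pooler must apply on release (the 'incantation' at issue):
def release(conn):
    temp_tables = conn.execute(
        "SELECT name FROM sqlite_temp_master WHERE type = 'table'"
    ).fetchall()
    for (name,) in temp_tables:
        conn.execute(f"DROP TABLE temp.{name}")

release(conn)
client_request(conn)                 # now succeeds, as a fresh session would
```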





Re: [PERFORM] Caching of Queries

2004-10-04 Thread Jim C. Nasby
On Fri, Oct 01, 2004 at 10:10:40AM -0700, Josh Berkus wrote:
 Transparent query caching is the industry standard for how these things 
 are handled.   However, Postgres' lack of this feature has made me consider 
 other approaches, and I'm starting to wonder if the standard query caching 
 -- where a materialized query result, or some reduction thereof, is cached in 
 database memory -- isn't the best way to cache things.  I'm going to 
 abbreviate it SQC for the rest of this e-mail.
 
Not to quibble, but are you sure that's the standard? Oracle and DB2
don't do this, and I didn't think MSSQL did either. What they do do is
cache query *plans*. This is a *huge* deal in Oracle; search
http://asktom.oracle.com for 'soft parse'.

In any case, I think a means of marking some specific queries as being
cachable is an excellent idea; perfect for 'static data' scenarios. What
I don't know is how much will be saved.
-- 
Jim C. Nasby, Database Consultant   [EMAIL PROTECTED] 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: Where do you want to go today?
Linux: Where do you want to go tomorrow?
FreeBSD: Are you guys coming, or what?



Re: [PERFORM] Caching of Queries

2004-10-03 Thread Pierre-Frédéric Caillaud

pgpool (which makes some rather questionable claims IMO); any decent web
application language/environment will support connection pooling.
That's why it should not be tied to something specific like pgpool.
	If you want performance, which is the case here, usually you have a  
webserver serving static files, and an application server serving dynamic  
pages.
	This is not necessarily a huge application server, it can be as simple as  
an Apache instance serving static files, with a special path mod_proxy'ed  
to another instance of apache acting as an application server.
	IMHO this is a nice way to do it, because you have a light weight static  
files server which can spawn many processes without using precious  
resources like memory and postgres connections, and a specialized server  
which has a lot less processes, each one having more size, a db  
connection, etc. The connections are permanent, of course, so there is no  
connection overhead. The proxy has an extra advantage buffering the data  
from the app server and sending it back slowly to the client, so the app  
server can then very quickly process the next request instead of hogging a  
db connection while the html is slowly trickled back to the client.
	IMHO the standard PHP way of doing things (just one server) is wrong  
because every server process, even if it's serving static files, hogs a  
connection and thus needs an extra layer for pooling.
	Thus, I see query result caching as a way to push further  
architectures which are already optimized for performance, not as a  
band-aid for poor design solutions like the one-Apache-server-with-pooling setup.

Now, a proposition :
Here is where we are now, a typical slow query :
PREPARE myquery(text,integer)
EXECUTE myquery('john',2)
My proposition :
PREPARE myquery(text,integer)
PLANNED USING ('john',2)
CACHED IF $1 IS NOT NULL AND $2 IS NOT NULL
DEPENDS ON $1, $2
MAXIMUM CACHE TIME '5 minute'::interval
MINIMUM CACHE TIME '1 minute'::interval
MAXIMUM CACHE SIZE 200
AS SELECT count(*) as number FROM mytable WHERE myname=$1 AND myfield=$2;
EXECUTE myquery('john',2)
	Explanations :
	---
	PLANNED USING ('john',2)
	Tells the planner to compute the stored query plan using the given  
parameters. This is independent from caching but could be a nice feature  
as it would avoid the possibility of storing a bad query plan.

	---
	CACHED IF $1 IS NOT NULL AND $2 IS NOT NULL
	Specifies that the result is to be cached. There is an optional condition  
(here, IF ...) telling postgres of when and where it should cache, or not  
cache. It could be useful to avoid wasting cache space.
	---
		DEPENDS ON $1, $2
	Defines the cache key. I don't know if this is useful, as the query  
parameters make a pretty obvious cache key so why repeat them. It could be  
used to add other data as a cache key, like :
		DEPENDS ON (SELECT somefunction($1))
	Also a syntax for specifying which tables should be watched for updates,  
and which should be ignored, could be interesting.
	---
		MAXIMUM CACHE TIME '5 minute'::interval
	Pretty obvious.
	---
		MINIMUM CACHE TIME '1 minute'::interval
	This query is a count and I want a fast but imprecise count. Thus, I  
specify a minimum cache time of 1 minute, meaning that the result will  
stay in the cache even if the tables change. This is dangerous, so I'd  
suggest the following :

		MINIMUM CACHE TIME CASE WHEN result.number > 10 THEN '1 minute'::interval  
ELSE '5 second'::interval END

	Thus the cache time is an expression ; it is evaluated after performing  
the query. There needs to be a way to access the 'count' result, which I  
called 'result.number' because of the SELECT count() as number.
	The result could also be used in the CACHE IF.

	The idea here is that the count will vary over time, but we accept some  
imprecision to gain speed. Who cares if there are 225 or 227 messages in  
a forum thread counter anyway ? However, if there are 2 messages, first  
caching the query is less necessary because it's fast, and second a  
variation in the count will be much easier to spot, thus we specify a  
shorter cache duration for small counts and a longer duration for large  
counts.
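The value-dependent minimum cache time sketched above can be prototyped outside the server. A toy model with an injectable clock; all names and numbers are illustrative, not a proposed implementation:

```python
import time

# Large counts get a long TTL (staleness is invisible), small counts a
# short one (cheap to recompute, easy to spot a wrong value).
class CountCache:
    def __init__(self, compute, clock=time.monotonic):
        self.compute, self.clock = compute, clock
        self.entries = {}                    # key -> (value, expiry time)

    def ttl_for(self, value):
        # CASE WHEN result.number > 10 THEN '1 minute' ELSE '5 second' END
        return 60.0 if value > 10 else 5.0

    def get(self, key):
        hit = self.entries.get(key)
        if hit is not None and self.clock() < hit[1]:
            return hit[0]                    # cached, possibly stale
        value = self.compute(key)
        self.entries[key] = (value, self.clock() + self.ttl_for(value))
        return value

calls = []
def count_rows(thread):                      # stand-in for the count query
    calls.append(thread)
    return {"busy": 225, "quiet": 2}[thread]

now = [0.0]
cache = CountCache(count_rows, clock=lambda: now[0])
cache.get("busy"); cache.get("quiet")        # both computed once
now[0] = 30.0                                # 30 seconds later...
cache.get("busy"); cache.get("quiet")        # big count served stale,
print(calls)                                 # small one recomputed:
                                             # ['busy', 'quiet', 'quiet']
```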

	For queries returning result sets, this is not usable of course, but a  
special feature for speeding count() queries would be welcome !

---
MAXIMUM CACHE SIZE 200
Pretty obvious. Size in bytes.
	For queries returning several rows, MIN/MAX on result rows could be  
useful also :
		MAXIMUM RESULT ROWS nnn
	Or maybe :
		CACHE IF (select count(*) from result) < nnn


	Thinking about it, using prepared queries seems a bad idea ; maybe the  
cache should act on the result of functions. This would force the  
application programmers to put the queries they want to optimize in  

Re: [PERFORM] Caching of Queries

2004-10-03 Thread Pierre-Frédéric Caillaud

1) The materialized data is available in 3 different forms; a list, a detail
view, and a spreadsheet.  Each form has somewhat different columns and
different rules about ordering, which would likely confuse an SQC planner.
In this implementation, all 3 forms are able to share the same cache.
See my proposal to cache function results.
You can create a cached function and :
	SELECT your rows FROM cached_function(parameters) WHERE ... ORDER BY...  
GROUP BY...

	will only fetch the function result from the cache, and then the only  
additional costs are the ORDER and GROUP BY... the query parsing is very  
simple, it's just a select, and a cached function scan

	I think caching can be made much more powerful if it is made usable like  
this. I mean, not only cache a query and its result, but being able to use  
cached queries internally like this and manipulating them, adds value to  
the cached data and allows storing less data in the cache because  
duplicates are avoided. Thus we could use cached results in CHECK()  
conditions, inside plsql functions, anywhere...



Re: [PERFORM] Caching of Queries

2004-10-03 Thread Pierre-Frédéric Caillaud

If it was in pgpool or something similar, I could devote a separate  
machine just for caching results leaving the db server untouched.
	BUT you would be limited to caching complete queries. There is a more  
efficient strategy...




Re: [PERFORM] Caching of Queries

2004-10-03 Thread Tatsuo Ishii
 More to the point though, I think this is a feature that really really 
 should be in the DB, because then it's trivial for people to use.  
 
 
 
 How does putting it into PGPool make it any less trivial for people to
 use?
 
 The answers are at  http://www2b.biglobe.ne.jp/~caco/pgpool/index-e.html 
 .  Specifically, it's a separate application that needs configuration, 
 the homepage has no real discussion of the potential pitfalls of pooling 
 and what this implementation does to get around them, you get the idea.  

I don't know what exactly you are referring to in the above URL when
you are talking about the potential pitfalls of pooling. Please explain
more.
--
Tatsuo Ishii



Re: [PERFORM] Caching of Queries

2004-10-02 Thread Josh Berkus
Aaron,

 I'm not sure I understand your req fully.

I'm not surprised.  I got wrapped up in an overly involved example and 
completely left off the points I was illustrating.   So here are the points, in 
brief:

1) Query caching is not  a single problem, but rather several different 
problems requiring several different solutions.

2) Of these several different solutions, any particular query result caching 
implementation (but particularly MySQL's) is rather limited in its 
applicability, partly due to the tradeoffs required.  Per your explanation, 
Oracle has improved this by offering a number of configurable options.

3) Certain other caching problems would be solved in part by the ability to 
construct in-memory tables which would be non-durable and protected from 
cache-flushing.  This is what I'm interested in chatting about.

BTW, I AM using a summary table.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco



Re: [PERFORM] Caching of Queries

2004-10-02 Thread William Yu
Josh Berkus wrote:
1) Query caching is not  a single problem, but rather several different 
problems requiring several different solutions.

2) Of these several different solutions, any particular query result caching 
implementation (but particularly MySQL's) is rather limited in its 
applicability, partly due to the tradeoffs required.  Per your explanation, 
Oracle has improved this by offering a number of configurable options.

3) Certain other caching problems would be solved in part by the ability to 
construct in-memory tables which would be non-durable and protected from 
cache-flushing.  This is what I'm interested in chatting about.
Just my 2 cents on this whole issue. I would lean towards having result 
caching in pgpool versus the main backend. I want every ounce of memory 
on a database server devoted to the database. Caching results would 
double the effect of cache flushing ... ie, now both the results and the 
pages used to build the results are in memory pushing out other stuff to 
disk that may be just as important.

If it was in pgpool or something similar, I could devote a separate 
machine just for caching results leaving the db server untouched.



Re: [PERFORM] Caching of Queries

2004-10-02 Thread Josh Berkus
William,

 Just my 2 cents on this whole issue. I would lean towards having result
 caching in pgpool versus the main backend. I want every ounce of memory
 on a database server devoted to the database. Caching results would
 double the effect of cache flushing ... ie, now both the results and the
 pages used to build the results are in memory pushing out other stuff to
 disk that may be just as important.

 If it was in pgpool or something similar, I could devote a separate
 machine just for caching results leaving the db server untouched.

Oddly, Joe Conway just mentioned the same idea to me.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco



Re: [PERFORM] Caching of Queries

2004-10-01 Thread Jim C. Nasby
On Fri, Oct 01, 2004 at 06:43:42AM +0100, Matt Clark wrote:
 
 If you're not using a connection pool of some kind then you might as
 well forget query plan caching, because your connect overhead will swamp
 the planning cost. This does not mean you have to use something like
 pgpool (which makes some rather questionable claims IMO); any decent web
 application language/environment will support connection pooling.
  
 
 Hmm, a question of definition -  there's a difference between a pool and 
 a persistent connection.  Pretty much all web apps have one connection 
 per process, which is persistent (i.e. not dropped and remade for each 
 request), but not shared between processes, therefore not pooled.

OK, that'd work too... the point is if you're re-connecting all the time
it doesn't really matter what else you do for performance.
-- 
Jim C. Nasby, Database Consultant   [EMAIL PROTECTED] 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: Where do you want to go today?
Linux: Where do you want to go tomorrow?
FreeBSD: Are you guys coming, or what?



Re: [PERFORM] Caching of Queries

2004-10-01 Thread Matt Clark
 OK, that'd work too... the point is if you're re-connecting 
 all the time it doesn't really matter what else you do for 
 performance.

Yeah, although there is the chap who was asking questions on the list
recently who had some very long-running code on his app servers, so was best
off closing the connection because he had far too many postmaster processes
just sitting there idle all the time!

But you're right, it's a killer usually.

M 




Re: [PERFORM] Caching of Queries

2004-10-01 Thread Aaron Werman
The context of the discussion was a hack to speed queries against static
tables, so MVCC is not relevant.  As soon as any work unit against a
referenced table commits, the cache is invalid, and in fact the table
shouldn't be a candidate for this caching for a while. In fact, this cache
would reduce some of the MVCC 'select count(*) from us_states' type of horrors.

(The attraction of a server side cache is obviously that it could *with no
server or app changes* dramatically improve performance. A materialized view
is a specialized denormalization-ish mechanism to optimize a category of
queries and requires the DBA to sweat the details. It is very hard to cache
things stochastically without writing a server. Trigger managed extracts
won't help you execute 1,000 programs issuing the query 'select sec_level
from sec where division=23' each second, or a big table loaded monthly.)



- Original Message - 
From: Jeff [EMAIL PROTECTED]
To: Mitch Pirtle [EMAIL PROTECTED]
Cc: Aaron Werman [EMAIL PROTECTED]; Scott Kirkwood
[EMAIL PROTECTED]; Neil Conway [EMAIL PROTECTED];
[EMAIL PROTECTED]; Tom Lane [EMAIL PROTECTED]
Sent: Monday, September 27, 2004 2:25 PM
Subject: Re: [PERFORM] Caching of Queries


 [ discussion of server side result caching ]

 and lets not forget PG's major fork it will throw into things:  MVCC
  The results of query A may hold true for txn 1, but not txn 2, and so on.
 That would have to be taken into account as well and would greatly
 complicate things.

 It is always possible to do a poor man's query cache with triggers..
 which would just leave you with basically a materialized view.

 --
 Jeff Trout [EMAIL PROTECTED]
 http://www.jefftrout.com/
 http://www.stuarthamm.net/






Re: [PERFORM] Caching of Queries

2004-10-01 Thread Josh Berkus
People:

Transparent query caching is the industry standard for how these things 
are handled.   However, Postgres' lack of this feature has made me consider 
other approaches, and I'm starting to wonder if the standard query caching 
-- where a materialized query result, or some reduction thereof, is cached in 
database memory -- isn't the best way to cache things.  I'm going to 
abbreviate it SQC for the rest of this e-mail.

Obviously, the draw of SQC is its transparency to developers.   With it, the 
Java/Perl/PHP programmers and the DBA don't have to communicate at all -- you 
set it up, give it some RAM, and it just works.   As someone who frequently 
has to consult based on limited knowledge, I can understand the appeal.

However, one of the problems with SQC, aside from the ones already mentioned 
of stale data and/or cache-clearing, is that (at least in applications like 
MySQL's) it is indiscriminate and caches, at least briefly, unique queries as 
readily as common ones.   Possibly Oracle's implementation is more 
sophisticated; I've not had an opportunity.

The other half of that problem is that an entire query is cached, rather than 
just the relevant data to uniquely identify the request to the application.
This is bad in two respects; one that the entire query needs to be parsed to 
see if a new query is materially equivalent, and that two materially 
different queries which could utilize overlapping ranges of the same 
underlying result set must instead cache their results separately, eating up 
yet more memory.

To explain what I'm talking about, let me give you a counter-example of 
another approach.

I have a data-warehousing application with a web front-end.  The data in the 
application is quite extensive and complex, and only a summary is presented 
to the public users -- but that summary is a query involving about 30 lines 
and 16 joins.  This summary information is available in 3 slightly different 
forms.  Further, the client has indicated that an up to 1/2 hour delay in 
data freshness is acceptable.

The first step is forcing that materialized view of the data into memory.  
Right now I'm working on a reliable way to do that without using Memcached, 
which won't install on our Solaris servers.  Temporary tables have the 
annoying property of being per-connection, which doesn't work in a pool of 60 
connections.

The second step, which I completed first due to the lack of technical 
obstacles, is to replace all queries against this data with calls to a 
Set-Returning Function (SRF).   This allowed me to re-direct where the data 
was coming from -- presumably the same thing could be done through RULES, but 
it would have been considerably harder to implement.

The first thing the SRF does is check the criteria passed to it against a set 
of cached (in a table) criteria with that user's permission level which is 
less than 1/2 hour old.   If the same criteria are found, then the SRF is handed a 
set of row identifiers for the materialized view (MV), and looks up the rows 
in the MV and returns those to the web client.   

If no identical set of criteria are found, then the query is run to get a set 
of identifiers which are then cached, and the SRF returns the queried rows.

Once I surmount the problem of storing all the caching information in 
protected memory, the advantages of this approach over SQC are several:

1) The materialized data is available in 3 different forms; a list, a detail 
view, and a spreadsheet.  Each form has somewhat different columns and 
different rules about ordering, which would likely confuse an SQC planner.   
In this implementation, all 3 forms are able to share the same cache.

2) The application is comparing only sets of unambiguous criteria rather than 
long queries which would need to be compared in planner form in order to 
determine query equivalence. 

3) With the identifier sets, we are able to cache other information as well, 
such as a count of rows, further limiting the number of queries we must run.

4) This approach is ideally suited to the pagination and re-sorting common to 
a web result set.  As only the identifiers are cached, the results can be 
re-sorted and broken into pages after the cache read, a fast, all-in-memory 
operation.
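The scheme in points 1-4 can be sketched compactly: cache only identifier sets keyed by criteria, then sort and paginate in memory against the materialized rows. A toy model with invented table and column names, not the actual application schema:

```python
# Stand-in materialized view: 100 rows of made-up data.
mat_view = {i: {"id": i, "name": f"row{i:03d}", "score": (i * 7) % 50}
            for i in range(100)}
id_cache = {}                                # (criteria, perm) -> [row ids]
query_runs = []

def expensive_query(criteria):               # stand-in for the 16-join query
    query_runs.append(criteria)
    return [i for i in mat_view if i % criteria == 0]

def fetch(criteria, perm, order_by, page, per_page=10):
    key = (criteria, perm)
    if key not in id_cache:
        id_cache[key] = expensive_query(criteria)   # run once, cache ids
    rows = sorted((mat_view[i] for i in id_cache[key]),
                  key=lambda r: r[order_by])        # re-sort in memory
    return rows[page * per_page:(page + 1) * per_page]

fetch(3, "public", "score", 0)               # first request runs the query
fetch(3, "public", "name", 1)                # re-sort + new page: cache only
print(len(query_runs), len(id_cache[(3, "public")]))   # 1 34
```

Note how the cached id list doubles as the row count (point 3) with no extra query.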

In conclusion, what I'm saying is that while forms of transparent query 
caching (plan, materialized or whatever) may be desirable for other reasons, 
it's quite possible to achieve a superior level of query caching through 
tight integration with the front-end application.   

If people are interested in this, I'd love to see some suggestions on ways to 
force the materialized view into dedicated memory.

-- 
Josh Berkus
Aglio Database Solutions
San Francisco



Re: [PERFORM] Caching of Queries

2004-10-01 Thread Aaron Werman
I'm not sure I understand your req fully. If the same request is repeatedly
done with same parameters, you could implement a proxy web server with a
croned script to purge stale pages. If there is substantially the same data
being summarized, doing your own summary tables works; if accessed enough,
they're in memory. I interleaved some notes into your posting.

- Original Message - 
From: Josh Berkus [EMAIL PROTECTED]
To: Postgresql Performance [EMAIL PROTECTED]
Sent: Friday, October 01, 2004 1:10 PM
Subject: Re: [PERFORM] Caching of Queries

 People:

 Transparent query caching is the industry standard for how these things
 are handled.   However, Postgres' lack of this feature has made me consider
 other approaches, and I'm starting to wonder if the standard query caching
 -- where a materialized query result, or some reduction thereof, is cached in
 database memory -- isn't the best way to cache things.  I'm going to
 abbreviate it SQC for the rest of this e-mail.

 Obviously, the draw of SQC is its transparency to developers.   With it, the
 Java/Perl/PHP programmers and the DBA don't have to communicate at all -- you
 set it up, give it some RAM, and it just works.   As someone who frequently
 has to consult based on limited knowledge, I can understand the appeal.

My sense is that pg is currently unique among popular dbmses in having the
majority of applications being homegrown (a chicken / egg / advocacy issue -
if I install a CMS, I'm not the DBA or the PHP programmer - and I don't want
to change the code; we'll see more about this when native WinPg happens).





 However, one of the problems with SQC, aside from the ones already mentioned
 of stale data and/or cache-clearing, is that (at least in applications like
 MySQL's) it is indiscriminate and caches, at least briefly, unique queries as
 readily as common ones.   Possibly Oracle's implementation is more
 sophisticated; I've not had an opportunity.

I'm not sure I agree here. Stale data and caching choice are
optimizer/buffer manager choices and implementation can decide whether to
allow stale data. These are design choices involving development effort and
choices of where to spend server cycles and memory. All buffering choices
cache unique objects, I'm not sure why this is bad (but sensing you want
control of the choices). FWIW, this is my impression of other dbmses.

In MySQL, a global cache can be specified with size and globally, locally,
or through statement hints in queries to suggest caching results. I don't
believe that these could be used as common subexpressions (with an exception
of MERGE table component results). The optimizer knows nothing about the
cached results - SQL select statements are hashed, and can be replaced by
the cached statement/results on a match.
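The statement-text hashing described above is simple to model: the optimizer never sees the cache; an exact text match is answered straight from the stored result set, with coarse invalidation on any write. A toy sketch (run_query and the invalidation policy are stand-ins, not MySQL's actual internals):

```python
import hashlib

result_cache = {}                       # hash(statement text) -> rows
executions = []

def run_query(sql):
    executions.append(sql)              # stand-in for real execution
    return [("us_states", 50)]

def execute(sql):
    key = hashlib.sha1(sql.encode()).hexdigest()
    if key not in result_cache:
        result_cache[key] = run_query(sql)
    return result_cache[key]

def on_table_write():
    result_cache.clear()                # any update invalidates everything

execute("select count(*) from us_states")
execute("select count(*) from us_states")   # exact text match: no execution
execute("SELECT COUNT(*) FROM us_states")   # different text: cache miss
on_table_write()
execute("select count(*) from us_states")   # invalidated: runs again
print(len(executions))                      # 3
```

The third call shows the weakness of text matching: a semantically identical query with different spelling misses the cache entirely.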

In DB2 and Oracle result sets are not cached. They have rich sets of
materialized view features (that match your requirements). They allow a
materialized view to be synchronous with table updates or asynchronous.
Synchronous is often an unrealistic option, and asynchronous materialized
views are refreshed at a specified schedule. The optimizers allow query
rewrite (in Oracle it is a session option) so one can connect to the
database and specify that the optimizer is allowed to replace subexpressions
with data from (possibly stale) materialized views. SQL Server 2K has more
restrictive synchronous MVs, but I've never used them.

So, in your example use in Oracle, you would need to define appropriate MVs
with a ½ hour refresh frequency, and hope that the planner would use them in
your queries. The only change in the app is on connection you would allow
use of asynchronous stale data.

You're suggesting an alternative involving identifying common, but
expensive, subexpressions and generating MVs for them. This is a pretty
sophisticated undertaking, and probably requires some theory research to
determine if it's viable.



 The other half of that problem is that an entire query is cached, rather than
 just the relevant data to uniquely identify the request to the application.
 This is bad in two respects; one that the entire query needs to be parsed to
 see if a new query is materially equivalent, and that two materially
 different queries which could utilize overlapping ranges of the same
 underlying result set must instead cache their results separately, eating
 up yet more memory.

There are two separate issues. The cost of parse/optimization and the cost
of results retrieval. Other dbmses hash statement text. This is a good
thing, and probably 3 orders of magnitude faster than parse and
optimization. (Oracle also has options to replace literals with parameters
and match parse trees instead of text, expecting parse costs to be less than
planning costs.) MySQL on a match simply returns the result set. Oracle and
DB2 attempt to rewrite queries to use the DBA selected extracts. The MySQL
approach seems to be almost what you're describing: all it needs

Re: [PERFORM] Caching of Queries

2004-09-30 Thread Jim C. Nasby
On Mon, Sep 27, 2004 at 09:30:31PM +0100, Matt Clark wrote:
 It's certainly the case that the typical web app (which, along with 
 warehouses, seems to be one half of the needy apps), could probably do 
 worse than use pooling as well.  I'm not well up enough on pooling to 
 know how bulletproof it is though, which is why I included it in my list 
 of things that make me go 'hmm'.  It would be really nice not to 
 have to take both things together.
 
If you're not using a connection pool of some kind then you might as
well forget query plan caching, because your connect overhead will swamp
the planning cost. This does not mean you have to use something like
pgpool (which makes some rather questionable claims IMO); any decent web
application language/environment will support connection pooling.
-- 
Jim C. Nasby, Database Consultant   [EMAIL PROTECTED] 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: Where do you want to go today?
Linux: Where do you want to go tomorrow?
FreeBSD: Are you guys coming, or what?

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] Caching of Queries

2004-09-30 Thread Matt Clark

If you're not using a connection pool of some kind then you might as
well forget query plan caching, because your connect overhead will swamp
the planning cost. This does not mean you have to use something like
pgpool (which makes some rather questionable claims IMO); any decent web
application language/environment will support connection pooling.
 

Hmm, a question of definition -  there's a difference between a pool and 
a persistent connection.  Pretty much all web apps have one connection 
per process, which is persistent (i.e. not dropped and remade for each 
request), but not shared between processes, therefore not pooled.

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
   (send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [PERFORM] Caching of Queries

2004-09-28 Thread Aaron Werman

- Original Message - 
From: Tom Lane [EMAIL PROTECTED]
To: Iain [EMAIL PROTECTED]
Cc: Jim C. Nasby [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Monday, September 27, 2004 11:17 PM
Subject: Re: [PERFORM] Caching of Queries


 Iain [EMAIL PROTECTED] writes:
  I can only tell you (roughly) how it works with Oracle,

 Which unfortunately has little to do with how it works with Postgres.
 This latches stuff is irrelevant to us.

Latches are the Oracle term for semaphores. Both Oracle and pg use
semaphores and spin locks to serialize activity in critical sections. I
believe that the point that blocking/queuing reduces scalability is valid.


 In practice, any repetitive planning in PG is going to be consulting
 catalog rows that it draws from the backend's local catalog caches.
 After the first read of a given catalog row, the backend won't need
 to re-read it unless the associated table has a schema update.  (There
 are some other cases, like a VACUUM FULL of the catalog the rows came
 from, but in practice catalog cache entries don't change often in most
 scenarios.)  We need place only one lock per table referenced in order
 to interlock against schema updates; not one per catalog row used.

 The upshot of all this is that any sort of shared plan cache is going to
 create substantially more contention than exists now --- and that's not
 even counting the costs of managing the cache, ie deciding when to throw
 away entries.

I imagine a design where a shared plan cache would consist of the plans,
indexed by a statement hash and again by dependent objects.  A statement to
be planned would be hashed and matched to the cache. DDL would need to
synchronously destroy all dependent plans. If each plan maintains a validity
flag, changing the cache wouldn't have to block, so I don't see where there
would be contention.
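A minimal sketch of that design (hypothetical structure, single-process for clarity — the real contention questions Tom raises below are exactly about what happens when this is shared memory):

```python
from collections import defaultdict


class CachedPlan:
    def __init__(self, plan):
        self.plan = plan
        self.valid = True  # cleared instead of removing the entry,
                           # so readers need not block on removal


class PlanCache:
    def __init__(self):
        self.by_hash = {}                  # statement hash -> CachedPlan
        self.by_object = defaultdict(set)  # table name -> statement hashes

    def put(self, stmt_hash, plan, depends_on):
        self.by_hash[stmt_hash] = CachedPlan(plan)
        for obj in depends_on:
            self.by_object[obj].add(stmt_hash)

    def get(self, stmt_hash):
        entry = self.by_hash.get(stmt_hash)
        return entry.plan if entry and entry.valid else None

    def invalidate_object(self, obj):
        # DDL on `obj` synchronously flags every dependent plan invalid.
        for h in self.by_object.pop(obj, set()):
            if h in self.by_hash:
                self.by_hash[h].valid = False
```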


 A backend-local plan cache would avoid the contention issues, but would
 of course not allow amortizing planning costs across multiple backends.

 I'm personally dubious that sharing planning costs is a big deal.
 Simple queries generally don't take that long to plan.  Complicated
 queries do, but I think the reusability odds go down with increasing
 query complexity.


I think both the parse and planning are major tasks if the transaction rate
is high. Simple queries can easily take much longer to plan than execute, so
this is a scalability concern. Caching complicated queries is valuable -
apps seem to have lots of similar queries because they are intimately
related to the data model.

 regards, tom lane

 ---(end of broadcast)---
 TIP 6: Have you searched our list archives?

http://archives.postgresql.org


---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [PERFORM] Caching of Queries

2004-09-28 Thread Tom Lane
Aaron Werman [EMAIL PROTECTED] writes:
 I imagine a design where a shared plan cache would consist of the plans,
 indexed by a statement hash and again by dependent objects.  A statement to
 be planned would be hashed and matched to the cache. DDL would need to
 synchronously destroy all dependent plans. If each plan maintains a validity
  ^
 flag, changing the cache wouldn't have to block so I don't see where there
   ^^
 would be contention.

You have contention to access a shared data structure *at all* -- for
instance readers must lock out writers.  Or didn't you notice the self-
contradictions in what you just said?

Our current scalability problems dictate reducing such contention, not
adding whole new sources of it.

regards, tom lane

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [PERFORM] Caching of Queries

2004-09-28 Thread Aaron Werman

- Original Message - 
From: Tom Lane [EMAIL PROTECTED]
To: Aaron Werman [EMAIL PROTECTED]
Cc: Iain [EMAIL PROTECTED]; Jim C. Nasby [EMAIL PROTECTED];
[EMAIL PROTECTED]
Sent: Tuesday, September 28, 2004 9:58 AM
Subject: Re: [PERFORM] Caching of Queries


 Aaron Werman [EMAIL PROTECTED] writes:
  I imagine a design where a shared plan cache would consist of the plans,
  indexed by a statement hash and again by dependent objects.  A statement to
  be planned would be hashed and matched to the cache. DDL would need to
  synchronously destroy all dependent plans. If each plan maintains a validity
   ^
  flag, changing the cache wouldn't have to block so I don't see where there
    ^^
  would be contention.

 You have contention to access a shared data structure *at all* -- for
 instance readers must lock out writers.  Or didn't you notice the self-
 contradictions in what you just said?

 Our current scalability problems dictate reducing such contention, not
 adding whole new sources of it.

You're right - that seems unclear. What I meant is that there can be a
global hash table that is never locked, and the hashes point to chains of
plans that are only locally locked for maintenance, such as gc and chaining
hash collisions. If maintenance was relatively rare and only local, my
assumption is that it wouldn't have global impact.

The nice thing about plan caching is that it can be sloppy, unlike block
cache, because it is only an optimization tweak. So, for example, if the
plan has atomic reference times or counts there is no need to block, since
overwriting is not so bad. If the multiprocessing planner chains the same
plan twice, the second one would ultimately age out.
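One way to read that scheme: the top-level bucket array is immutable and never globally locked, and only the short chain behind a single bucket is ever locked for maintenance. A sketch under that assumption (Python threads standing in for backends; names are hypothetical):

```python
import threading

N_BUCKETS = 256

# The bucket array is created once and never resized or globally locked.
buckets = [[] for _ in range(N_BUCKETS)]
bucket_locks = [threading.Lock() for _ in range(N_BUCKETS)]


def insert_plan(stmt_hash: int, plan) -> None:
    i = stmt_hash % N_BUCKETS
    with bucket_locks[i]:  # contention is local to one chain only
        buckets[i].append((stmt_hash, plan))


def lookup_plan(stmt_hash: int):
    i = stmt_hash % N_BUCKETS
    # Sloppy, lock-free read: worst case we miss a concurrent insert and
    # re-plan, which is a lost optimization, not a correctness bug.
    for h, plan in buckets[i]:
        if h == stmt_hash:
            return plan
    return None
```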

/Aaron


 regards, tom lane


---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PERFORM] Caching of Queries

2004-09-28 Thread Matthew Nuzum
I could spend a week or two tweaking the performance of my database servers
and probably make some sizeable improvements, but I'm not going to.

Why? Because PostgreSQL screams as it is.

I would make sure that if the consensus is to add some sort of caching that
it be done only if there is no hit to current performance and stability.
That being said, I think that server side caching has major buzz and there's
nothing wrong with adding features that sell.

I will disagree with 3 points made on the argument against caching.
Specifically, the benefit of doing caching on the db server is that the
benefits may be reaped by multiple clients, whereas caching on the client
side must be done by each client and may not be as effective.

So what if the caching has a slight chance of returning stale results?  Just
make sure people know about it in advance.  There are some things where
stale results are no big deal and if I can easily benefit from an aggressive
caching system, I will (and I do now with the adodb caching library, but
like I said, caching has to be done for each client).  In fact, I'm all for
using a low-tech cache expiration algorithm to keep complexity down.

Finally, if the caching is not likely to help (or may even hurt) simple
queries but is likely to help complex queries then fine, make sure people
know about it and let them decide if they can benefit. 

Sorry if I'm beating a dead horse or playing the devil's advocate.  Just
felt compelled to chime in.

-- 
Matthew Nuzum  + Man was born free, and everywhere
www.bearfruit.org  :  he is in chains, Rousseau
+~~+ Then you will know the truth, and 
the TRUTH will set you free, Jesus Christ (John 8:32 NIV)

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Tom Lane
Sent: Monday, September 27, 2004 1:19 AM
To: Neil Conway
Cc: Aaron Werman; Scott Kirkwood; [EMAIL PROTECTED]
Subject: Re: [PERFORM] Caching of Queries

Neil Conway [EMAIL PROTECTED] writes:
 I think the conclusion of past discussions about this feature is that
 it's a bad idea. Last I checked, MySQL has to clear the *entire* query
 cache when a single DML statement modifying the table in question is
 issued.

Do they actually make a rigorous guarantee that the cached result is
still accurate when/if it is returned to the client?  (That's an honest
question --- I don't know how MySQL implements this.)

IIRC, in our past threads on this topic, it was suggested that if you
can tolerate not-necessarily-up-to-date results, you should be doing
this sort of caching on the client side and not in the DB server at all.
I wouldn't try that in a true client scenario, but when the DB client
is application-server middleware, it would make some sense to cache in
the application server.

regards, tom lane

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [PERFORM] Caching of Queries

2004-09-27 Thread Jeff
[ discussion of server side result caching ]
and let's not forget the major fork PG will throw into things: MVCC.
The results of query A may hold true for txn 1, but not txn 2, and so on.
That would have to be taken into account as well and would greatly 
complicate things.

It is always possible to do a poor man's query cache with triggers.. 
which would just leave you with basically a materialized view.

--
Jeff Trout [EMAIL PROTECTED]
http://www.jefftrout.com/
http://www.stuarthamm.net/
---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [PERFORM] Caching of Queries

2004-09-27 Thread Mitch Pirtle
On Mon, 27 Sep 2004 18:20:48 +0100, Matt Clark [EMAIL PROTECTED] wrote:
 This is very true.  Client side caching is an enormous win for apps, but it
 requires quite a lot of logic, triggers to update last-modified fields on
 relevant tables, etc etc.  Moving some of this logic to the DB would perhaps
 not usually be quite as efficient as a bespoke client caching solution, but
 it will above all be a lot easier for the application developer!

In the world of PHP it is trivial thanks to PEAR's Cache_Lite.  The
project lead for Mambo implemented page-level caching in a day, and
had all the triggers for clearing the cache included in the content
management interface - not difficult at all.

Basically you set a default in seconds for the HTML results to be
cached, and then have triggers set that force the cache to regenerate
(whenever CRUD happens to the content, for example).
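The pattern is easy to reproduce in any language; a rough Python equivalent of that default-TTL-plus-trigger-invalidation scheme (Cache_Lite itself is PHP; class and method names here are hypothetical):

```python
import time


class PageCache:
    def __init__(self, default_ttl=60.0):
        self.default_ttl = default_ttl
        self.store = {}  # page id -> (rendered HTML, expiry timestamp)

    def get(self, page_id):
        hit = self.store.get(page_id)
        if hit and time.monotonic() < hit[1]:
            return hit[0]  # still-fresh cached HTML
        return None        # expired or never cached; caller regenerates

    def put(self, page_id, html, ttl=None):
        expiry = time.monotonic() + (ttl or self.default_ttl)
        self.store[page_id] = (html, expiry)

    def invalidate(self, page_id):
        # Called from the CRUD path, playing the role of the "trigger"
        # that forces the cache to regenerate.
        self.store.pop(page_id, None)
```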

Can't speak for Perl/Python/Ruby/.Net/Java, but Cache_Lite sure made a
believer out of me!

-- Mitch

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [PERFORM] Caching of Queries

2004-09-27 Thread Jim C. Nasby
On Thu, Sep 23, 2004 at 08:29:25AM -0700, Mr Pink wrote:
 Not knowing anything about the internals of pg, I don't know how this relates,
 but in theory, query plan caching is not just about saving time re-planning
 queries, it's about scalability. Optimizing queries requires shared locks on
 the database metadata, which, as I understand it, causes contention and
 serialization, which kills scalability.

One of the gurus can correct me if I'm wrong here, but AFAIK metadata
lookups use essentially the same access methods as normal queries. This
means MVCC is used and no locking is required. Even if locks were
required, they would be shared read locks which wouldn't block each
other.
-- 
Jim C. Nasby, Database Consultant   [EMAIL PROTECTED] 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: Where do you want to go today?
Linux: Where do you want to go tomorrow?
FreeBSD: Are you guys coming, or what?

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [PERFORM] Caching of Queries

2004-09-27 Thread Matt Clark

Basically you set a default in seconds for the HTML results to be
cached, and then have triggers set that force the cache to regenerate
(whenever CRUD happens to the content, for example).
Can't speak for Perl/Python/Ruby/.Net/Java, but Cache_Lite sure made a
believer out of me!
 

Nice to have it in a library, but if you want to be that simplistic then 
it's easy in any language.  What if a process on server B modifies an 
important value that server A has cached, though?  Coherency (albeit that 
the client may choose to not use it) is a must for a general solution.

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
   (send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [PERFORM] Caching of Queries

2004-09-27 Thread Rod Taylor
 More to the point though, I think this is a feature that really really 
 should be in the DB, because then it's trivial for people to use.  

How does putting it into PGPool make it any less trivial for people to
use?


---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PERFORM] Caching of Queries

2004-09-27 Thread Steve Atkins
On Mon, Sep 27, 2004 at 09:19:12PM +0100, Matt Clark wrote:

 Basically you set a default in seconds for the HTML results to be
 cached, and then have triggers set that force the cache to regenerate
 (whenever CRUD happens to the content, for example).
 
 Can't speak for Perl/Python/Ruby/.Net/Java, but Cache_Lite sure made a
 believer out of me!
 
  
 
 Nice to have it in a library, but if you want to be that simplistic then 
 it's easy in any language.  What if a process on server B modifies an 
 important value that server A has cached, though?  Coherency (albeit that 
 the client may choose to not use it) is a must for a general solution.

memcached is one solution designed for that situation. Easy to use
from most languages.  Works. Lets you use memory on systems where you
have it, rather than using up valuable database server RAM that's
better spent caching disk sectors.

Any competently written application where caching results would be a
suitable performance boost can already implement application or
middleware caching fairly easily, and increase performance much more
than putting result caching into the database would.
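The middleware caching being described is the usual cache-aside idiom. A sketch with a plain dict standing in for a memcached instance (the query-runner function is hypothetical):

```python
cache = {}  # stand-in for a shared memcached instance


def cached_query(sql, params, run_query):
    """Cache-aside: check the cache first, else hit the database and fill it.

    Keying on (sql, params) means logically identical requests from any
    app-server process share one stored result, using app-tier memory
    rather than database server RAM.
    """
    key = (sql, params)
    if key not in cache:
        cache[key] = run_query(sql, params)  # only misses touch the database
    return cache[key]
```

A real deployment would add an expiry and explicit invalidation on writes; the ACID concern raised above is precisely that this layer can serve stale rows.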

I don't see caching results in the database as much of a win for most
well written applications.  Toy benchmarks, sure, but for real apps it
seems it would add a lot of complexity, and violate the whole point of
using an ACID database.

(Caching parse trees or query plans, though? It'd be interesting to
 model what effect that'd have.)

Cheers,
  Steve


---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [PERFORM] Caching of Queries

2004-09-27 Thread Matt Clark

 More to the point though, I think this is a feature that really really 
 should be in the DB, because then it's trivial for people to use.  

 How does putting it into PGPool make it any less trivial for people to
 use?

The answers are at
http://www2b.biglobe.ne.jp/~caco/pgpool/index-e.html . Specifically,
it's a separate application that needs configuration, the homepage has
no real discussion of the potential pitfalls of pooling and what this
implementation does to get around them, you get the idea. I'm sure
it's great software, but it doesn't come as part of the DB server, so
95% of people who would benefit from query caching being implemented in
it never will. If it shipped with and was turned on by default in SUSE
or RedHat that would be a different matter. Which I realise makes me
look like one of those people who doesn't appreciate code unless it's
'popular', but I hope I'm not *that* bad...

Oh OK, I'll say it, this is a perfect example of why My*** has so much
more mindshare. It's not better, but it sure makes the average Joe
_feel_ better. Sorry, I've got my corporate hat on today, I'm sure
I'll feel a little less cynical tomorrow.

M




Re: [PERFORM] Caching of Queries

2004-09-27 Thread Matt Clark

Any competently written application where caching results would be a
suitable performance boost can already implement application or
middleware caching fairly easily, and increase performance much more
than putting result caching into the database would.
 

I guess the performance increase is that you can spend $10,000 on a 
developer, or $10,000 on hardware, and for the most part get a more 
reliable result the second way.  MemcacheD is fine(ish), but it's not a 
panacea, and it's more than easy to shoot yourself in the foot with it.  
Caching is hard enough that lots of people do it badly - I'd rather use 
an implementation from the PG team than almost anywhere else.

I don't see caching results in the database as much of a win for most
well written applications.  Toy benchmarks, sure, but for real apps it
seems it would add a lot of complexity, and violate the whole point of
using an ACID database.
 

Well the point surely is to _remove_ complexity from the application, 
which is written by God Knows Who, and put it in the DB, which is 
written by God And You.  And you can still have ACID (cached data is not 
the same as stale data, although once you have the former, the latter 
can begin to look tempting sometimes).

M
---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [PERFORM] Caching of Queries

2004-09-27 Thread Iain
Jim,

I can only tell you (roughly) how it works with Oracle, and it's a very well
documented and laboured point over there - it's the cornerstone of Oracle's
scalability architecture, so if you don't believe me, or my explanation is
just plain lacking, then it wouldn't be a bad idea to check it out. The
other Tom, aka Thomas Kyte, runs the Ask Tom site which is a great source of
info on this. It's also very well explained in his book Expert One-on-One
Oracle I think it was called. I haven't seen any reason yet as to why the
same issues shouldn't, don't or wouldn't apply to pg.

Your comment is both right and wrong. Yes, metadata lookups are essentially
the same as access methods for normal queries. Any time you read data in
the DB you have to place a shared lock, often called a latch - it's a
lightweight type of lock. The trouble is that while a data page can have
multiple latches set at any time, only 1 process can be placing a latch
on a page at a time. This doesn't sound so serious so far, latches are
lightweight after all, however... even in a database of a billion rows and
100+ tables, the database metadata is a very _small_ area. You must put
latches on the metadata tables to do optimization, so for example, if you
are optimizing a 10 table join, you must queue up 10 times to place your
latches. You then do your optimization and queue up 10 more times to remove
your latches. In fact it is worse than this, because you won't queue up 10
times; it's more likely to be a hundred times, since it is far more complex
than 1 latch per table being optimized (you will be looking up statistics
and other things).

As I already said, even in a huge DB of a billion rows, these latches are
happening on a relatively small and concentrated data set - the metadata.
Even if there is no contention for the application data, the contention for
the metadata may be furious. Consider this scenario: you have 1000 users
constantly submitting queries that must not only be soft parsed (SQL
statement syntax) but hard parsed (optimized) because you have no query
cache. Even if they are looking at completely different data, they'll all be
queuing up for latches on the same little patch of metadata. Doubling your
CPU speed or throwing in a fibre channel disk array will not help here; the
system simply won't scale.

Tom Lane noted that since the query cache would be in shared memory, the
contention issue does not go away. This is true, but I don't think that it's
hard to see that the amount of contention is considerably less in any system
that is taking advantage of the caching facility - ie applications using
bind variables to reduce hard parsing. However, badly written applications
(from the point of view of query cache utilization) could very well
experience a degradation in performance. This could be handled with an
option to disable caching - or, even better, to disable caching of any SQL
not using binds. I don't think even the mighty Oracle has that option.

As you may have guessed, my vote is for implementing a query cache that
includes plans.

I have no specific preference as to data caching. It doesn't seem to be so
important to me.

Regards
Iain



 On Thu, Sep 23, 2004 at 08:29:25AM -0700, Mr Pink wrote:
  Not knowing anything about the internals of pg, I don't know how this
  relates, but in theory, query plan caching is not just about saving time
  re-planning queries, it's about scalability. Optimizing queries requires
  shared locks on the database metadata, which, as I understand it, causes
  contention and serialization, which kills scalability.

 One of the guru's can correct me if I'm wrong here, but AFAIK metadata
 lookups use essentially the same access methods as normal queries. This
 means MVCC is used and no locking is required. Even if locks were
 required, they would be shared read locks which wouldn't block each
 other.
 -- 
 Jim C. Nasby, Database Consultant   [EMAIL PROTECTED]
 Give your computer some brain candy! www.distributed.net Team #1828

 Windows: Where do you want to go today?
 Linux: Where do you want to go tomorrow?
 FreeBSD: Are you guys coming, or what?

 ---(end of broadcast)---
 TIP 8: explain analyze is your friend


---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PERFORM] Caching of Queries

2004-09-27 Thread Tom Lane
Iain [EMAIL PROTECTED] writes:
 I can only tell you (roughly) how it works with Oracle,

Which unfortunately has little to do with how it works with Postgres.
This latches stuff is irrelevant to us.

In practice, any repetitive planning in PG is going to be consulting
catalog rows that it draws from the backend's local catalog caches.
After the first read of a given catalog row, the backend won't need
to re-read it unless the associated table has a schema update.  (There
are some other cases, like a VACUUM FULL of the catalog the rows came
from, but in practice catalog cache entries don't change often in most
scenarios.)  We need place only one lock per table referenced in order
to interlock against schema updates; not one per catalog row used.

The upshot of all this is that any sort of shared plan cache is going to
create substantially more contention than exists now --- and that's not
even counting the costs of managing the cache, ie deciding when to throw
away entries.

A backend-local plan cache would avoid the contention issues, but would
of course not allow amortizing planning costs across multiple backends.

I'm personally dubious that sharing planning costs is a big deal.
Simple queries generally don't take that long to plan.  Complicated
queries do, but I think the reusability odds go down with increasing
query complexity.

regards, tom lane

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] Caching of Queries

2004-09-27 Thread Iain
Hi Tom,

 This "latches" stuff is irrelevant to us.

Well, that's good to know anyway, thanks for setting me straight. Maybe
Oracle could take a leaf out of PG's book instead of the other way around. I
recall that you mentioned the caching of the schema before, so even though I
assumed PG was latching the metadata, I had begun to wonder if it was
actually necessary.

While it's obviously not as critical as I thought, I think there may still
be some potential for query caching by pg. It would be nice to have the
option anyway, as different applications have different needs.

I think that re-use of SQL in applications (ie controlling the proliferation
of SQL statements that are minor variants of each other) is a good goal for
maintainability, even if it doesn't have a major impact on performance as it
seems you are suggesting in the case of pg. Even complex queries that must
be constructed dynamically typically only have a finite number of options
and can still use bind variables, so in a well tuned system, they should
still be viable candidates for caching (ie, if they aren't being bumped out
of the cache by thousands of little queries not using binds).

I'll just finish by saying that developing applications in a way that would
take advantage of any query caching still seems like good practice to me,
even if the target DBMS has no query caching. For now, that's what I plan to
do with future PG/Oracle/Hypersonic (my 3 favourite DBMSs) application
development anyway.

Regards
Iain


- Original Message - 
From: "Tom Lane" [EMAIL PROTECTED]
To: "Iain" [EMAIL PROTECTED]
Cc: "Jim C. Nasby" [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Tuesday, September 28, 2004 12:17 PM
Subject: Re: [PERFORM] Caching of Queries


 "Iain" [EMAIL PROTECTED] writes:
  I can only tell you (roughly) how it works with Oracle,

 Which unfortunately has little to do with how it works with Postgres.
 This "latches" stuff is irrelevant to us.

 In practice, any repetitive planning in PG is going to be consulting
 catalog rows that it draws from the backend's local catalog caches.
 After the first read of a given catalog row, the backend won't need
 to re-read it unless the associated table has a schema update.  (There
 are some other cases, like a VACUUM FULL of the catalog the rows came
 from, but in practice catalog cache entries don't change often in most
 scenarios.)  We need place only one lock per table referenced in order
 to interlock against schema updates; not one per catalog row used.

 The upshot of all this is that any sort of shared plan cache is going to
 create substantially more contention than exists now --- and that's not
 even counting the costs of managing the cache, ie deciding when to throw
 away entries.

 A backend-local plan cache would avoid the contention issues, but would
 of course not allow amortizing planning costs across multiple backends.

 I'm personally dubious that sharing planning costs is a big deal.
 Simple queries generally don't take that long to plan.  Complicated
 queries do, but I think the reusability odds go down with increasing
 query complexity.

 regards, tom lane

 ---(end of broadcast)---
 TIP 6: Have you searched our list archives?

http://archives.postgresql.org


---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])

Re: [PERFORM] Caching of Queries

2004-09-25 Thread Gaetano Mendola
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1
Joshua D. Drake wrote:
|
|Sorry, I meant 30,000 with 300 connections - not 3,000.  The 300
|connections
|/ second is realistic, if not underestimated.  As is the nature of
|our site
|(realtime information about online gaming), there's a huge fan base
|and as a
|big upset happens, we'll do 50,000 page views in a span of 3-5
|minutes.
|
|
|
|First, your posts show no evidences of the CS storm bug.
|
|Second, 300 *new* connections a second is a lot.   Each new connection
|requires a significant amount of both database and OS overhead.   This
|is why all the other web developers use a connection pool.
|
|
|
| I would second this. You need to be running a connection pool and
| probably multiple web servers in
| front of that. You are talking about a huge amount of connections in
| that amount of time.
|
| Josh Drake
|
|
|
|In fact, I wouldn't be surprised if your lockups are on the OS level,
|even; I don't recall that you cited what OS you're using, but I can
|imagine locking up Linux 2.4 trying to spawn 300 new processes a
|second.
Not to mention that a proxy squid mounted in reverse proxy mode will
help a lot.
Regards
Gaetano Mendola


-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.5 (MingW32)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org
iD8DBQFBVaOg7UpzwH2SGd4RAnW4AJ9TYV0oSjYcv8Oxt4Ot/T/nJikoRgCg1Egx
r4KKm14ziu/KWFb3SnTK/U8=
=xgmw
-END PGP SIGNATURE-
---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?
  http://www.postgresql.org/docs/faqs/FAQ.html


Re: [PERFORM] Caching of Queries

2004-09-23 Thread Gary Doades
On 22 Sep 2004 at 15:59, Tom Lane wrote:

 Scott Kirkwood [EMAIL PROTECTED] writes:
  What do you think?
 
 I think this would allow the problems of cached plans to bite
 applications that were previously not subject to them :-(.
 An app that wants plan re-use can use PREPARE to identify the
 queries that are going to be re-executed.
 
   regards, tom lane
 

And then there are the people that would like to upgrade and get a 
performance gain without having to change their programs. A simple 
conf flag could turn query/plan caching off for all those that rely on each 
statement being re-planned.

This is where SQLServer etc. tend to get big wins. I know from direct 
comparisons that SQLServer often takes quite a bit longer to parse/plan 
a select statement than Postgres, but wins out overall from its 
query/plan caching.

Regards,
Gary.


---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [PERFORM] Caching of Queries

2004-09-23 Thread Gaetano Mendola
Neil Conway wrote:
Another idea would be to improve the quality of the plan we generate at PREPARE time:
for instance you could generate 'n' plans for various combinations of
input parameters, and then choose the best query plan at EXECUTE time.
It's a difficult problem to solve, however (consider multiple parameters
to PREPARE, for example).
Do you mean store different plans for each different histogram segment?
Regards
Gaetano Mendola
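
A crude sketch of the EXECUTE-time choice being discussed (toy selectivity estimate and hypothetical plan names; a real planner would consult histogram buckets from its statistics):

```python
# Plans prepared ahead of time for different selectivity regimes.
prepared_plans = {
    "index_scan": "use the index when the predicate is selective",
    "seq_scan":   "scan the whole table when most rows match",
}


def estimated_selectivity(param, histogram):
    # Fraction of sampled rows matching `param` (toy stand-in for
    # consulting a real histogram).
    matching = sum(1 for v in histogram if v == param)
    return matching / len(histogram)


def choose_plan(param, histogram):
    # EXECUTE-time decision: pick the cheapest of the pre-built plans
    # for this parameter value, instead of planning from scratch.
    sel = estimated_selectivity(param, histogram)
    return prepared_plans["index_scan" if sel < 0.1 else "seq_scan"]
```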
---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
 subscribe-nomail command to [EMAIL PROTECTED] so that your
 message can get through to the mailing list cleanly


Re: [PERFORM] Caching of Queries

2004-09-23 Thread Harald Fuchs
In article [EMAIL PROTECTED],
Scott Kirkwood [EMAIL PROTECTED] writes:

 I couldn't find anything in the docs or in the mailing list on this,
 but it is something that Oracle appears to do as does MySQL.
 The idea, I believe, is to do a quick (hash) string lookup of the
 query and if it's exactly the same as another query that has been done
 recently to re-use the old parse tree.

That's not what MySQL is doing.  MySQL caches not the query plan, but
the result set for the (hashed) query string.  If the same query comes
again, it is not executed at all (unless one of the tables involved
has been changed meanwhile).


---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match
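The MySQL-style behaviour Harald describes, caching the result set under a hash of the query string and throwing entries out on any write to an involved table, can be sketched as follows. This is an illustrative toy, not MySQL's actual implementation.

```python
# Toy result-set cache: rows keyed by a hash of the exact query text,
# invalidated whenever an involved table is modified.

import hashlib

class ResultCache:
    def __init__(self):
        self.cache = {}        # query hash -> cached result rows
        self.by_table = {}     # table name -> set of query hashes using it

    @staticmethod
    def key(sql):
        return hashlib.sha1(sql.encode()).hexdigest()

    def get(self, sql):
        return self.cache.get(self.key(sql))

    def put(self, sql, tables, rows):
        k = self.key(sql)
        self.cache[k] = rows
        for t in tables:
            self.by_table.setdefault(t, set()).add(k)

    def invalidate(self, table):
        # Called on any INSERT/UPDATE/DELETE/DDL touching `table`.
        for k in self.by_table.pop(table, set()):
            self.cache.pop(k, None)

rc = ResultCache()
rc.put("SELECT * FROM threads LIMIT 8", ["threads"], [("row1",), ("row2",)])
```

Note the trade-off implicit in Harald's point: a single write to a hot table empties the cache for every query that reads it.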


Re: [PERFORM] Caching of Queries

2004-09-23 Thread Mr Pink
Not knowing anything about the internals of pg, I don't know how this relates, but in
theory, query plan caching is not just about saving time re-planning queries, it's about
scalability. Optimizing queries requires shared locks on the database metadata, which, as
I understand it, causes contention and serialization, which kills scalability.

I read this thread from last to first, and I'm not sure if I missed something, but if pg
isn't caching plans, then I would say plan caching should be a top priority for future
enhancements. It needn't be complex either: if the SQL string is the same, and none of
the tables involved in the query have changed (in structure), then re-use the cached
plan. Basically, DDL and updated statistics would have to invalidate plans for affected
tables.

Preferably, it should work equally for prepared statements and those not pre-prepared.
If you're not using prepare (and bind variables), though, your plan caching is down the
drain anyway...

I don't think that re-optimizing based on values of bind variables is needed. It seems
like it could actually be counter-productive and difficult to assess its impact.

That's the way I see it anyway.

:)

--- Scott Kirkwood [EMAIL PROTECTED] wrote:

 I couldn't find anything in the docs or in the mailing list on this,
 but it is something that Oracle appears to do as does MySQL.
 The idea, I believe, is to do a quick (hash) string lookup of the
 query and if it's exactly the same as another query that has been done
 recently to re-use the old parse tree.
 It should save the time of doing the parsing of the SQL and looking up
 the object in the system tables.
 It should probably go through the planner again because values passed
 as parameters may have changed. Although, for extra points it could
 look at the previous query plan as a hint.
 On the surface it looks like an easy enhancement, but what do I know?
 I suppose it would benefit mostly those programs that use a lot of
 PQexecParams() with simple queries where a greater percentage of the
 time is spent parsing the SQL rather than building the execute plan.
 What do you think?
 
 ---(end of broadcast)---
 TIP 7: don't forget to increase your free space map settings
 




___
Do you Yahoo!?
Declare Yourself - Register online to vote today!
http://vote.yahoo.com

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html
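Mr Pink's invalidation rule above (reuse a plan only if the SQL text matches exactly and no DDL or statistics update has hit the involved tables since planning) can be sketched with per-table generation counters. All names are invented for illustration; this is not how PostgreSQL stores plans.

```python
# Toy plan cache: plans keyed by exact SQL text, validated against
# per-table generation counters bumped on DDL or ANALYZE.

class PlanCache:
    def __init__(self):
        self.versions = {}   # table -> generation counter
        self.plans = {}      # sql text -> (plan, {table: generation at plan time})

    def bump(self, table):
        # Call on DDL or a statistics update against `table`.
        self.versions[table] = self.versions.get(table, 0) + 1

    def lookup(self, sql):
        entry = self.plans.get(sql)
        if entry is None:
            return None
        plan, deps = entry
        # Stale if any dependency changed since the plan was built.
        if any(self.versions.get(t, 0) != gen for t, gen in deps.items()):
            del self.plans[sql]
            return None
        return plan

    def store(self, sql, tables, plan):
        deps = {t: self.versions.get(t, 0) for t in tables}
        self.plans[sql] = (plan, deps)

pc = PlanCache()
pc.store("SELECT id FROM threads WHERE nuked = 0", ["threads"],
         "IndexScan(threads_ix_nuked_lastpost)")
```

Lazy validation at lookup time, as sketched here, avoids having DDL walk the whole cache; eager invalidation (as in the result-cache case) is the other obvious design.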


Re: [PERFORM] Caching of Queries

2004-09-23 Thread Jason Coene
I'm not an expert, but I've been hunting down a killer performance problem
for a while now.  It seems this may be the cause.

At peak load, our database slows to a trickle.  The CPU and disk utilization
are normal - 20-30% used CPU and disk performance good.

All of our postgres processes end up in the semwai state - seemingly
waiting on other queries to complete.  If the system isn't taxed in CPU or
disk, I have a good feeling that this may be the cause.  I didn't know that
planning queries could create such a gridlock, but based on Mr Pink's
explanation, it sounds like a very real possibility.

We're running on SELECT's, and the number of locks on our high traffic
tables grows to the hundreds.  If it's not the SELECT locking (and we don't
get that many INSERT/UPDATE on these tables), could the planner be doing it?

At peak load (~ 1000 queries/sec on highest traffic table, all very
similar), the serialized queries pile up and essentially create a DoS on our
service - requiring a restart of the PG daemon.  Upon stop & start, it's
back to normal.

I've looked at PREPARE, but apparently it only lasts per-session - that's
worthless in our case (web based service, one connection per data-requiring
connection).

Does this sound plausible?  Is there an alternative way to do this that I
don't know about?  Additionally, in our case, I personally don't see any
downside to caching and using the same query plan when the only thing
substituted are variables.  In fact, I'd imagine it would help performance
significantly in high-volume web applications.

Thanks,

Jason

 -Original Message-
 From: [EMAIL PROTECTED] [mailto:pgsql-performance-
 [EMAIL PROTECTED] On Behalf Of Mr Pink
 Sent: Thursday, September 23, 2004 11:29 AM
 To: Scott Kirkwood; [EMAIL PROTECTED]
 Subject: Re: [PERFORM] Caching of Queries
 
 Not knowing anything about the internals of pg, I don't know how this
 relates, but in theory,
 query plan caching is not just about saving time re-planning queries, it's
 about scalability.
 Optimizing queries requires shared locks on the database metadata, which,
 as I understand it
 causes contention and serialization, which kills scalability.
 
 I read this thread from last to first, and I'm not sure if I missed
 something, but if pg isn't
 caching plans, then I would say plan caching should be a top priority for
 future enhancements. It
 needn't be complex either: if the SQL string is the same, and none of the
 tables involved in the
 query have changed (in structure), then re-use the cached plan. Basically,
 DDL and updated
 statistics would have to invalidate plans for affected tables.
 
 Preferably, it should work equally for prepared statements and those not
 pre-prepared. If you're
 not using prepare (and bind variables), though, your plan caching is down
 the drain anyway...
 
 I don't think that re-optimizing based on values of bind variables is
 needed. It seems like it
 could actually be counter-productive and difficult to assess its impact.
 
 That's the way I see it anyway.
 
 :)
 


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PERFORM] Caching of Queries

2004-09-23 Thread Tom Lane
Jason Coene [EMAIL PROTECTED] writes:
 All of our postgres processes end up in the semwai state - seemingly
 waiting on other queries to complete.  If the system isn't taxed in CPU or
 disk, I have a good feeling that this may be the cause.

Whatever that is, I'll bet lunch that it's got 0 to do with caching
query plans.  Can you get stack tracebacks from some of the stuck
processes?  What do they show in ps?

regards, tom lane

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PERFORM] Caching of Queries

2004-09-23 Thread Gaetano Mendola
Jason Coene wrote:
I'm not an expert, but I've been hunting down a killer performance problem
for a while now.  It seems this may be the cause.
At peak load, our database slows to a trickle.  The CPU and disk utilization
are normal - 20-30% used CPU and disk performance good.
For a peak load with only 20-30% CPU used, this means you have reached your IO bottleneck.
All of our postgres processes end up in the semwai state - seemingly
waiting on other queries to complete.  If the system isn't taxed in CPU or
disk, I have a good feeling that this may be the cause.  I didn't know that
planning queries could create such a gridlock, but based on Mr Pink's
explanation, it sounds like a very real possibility.
We're running on SELECT's, and the number of locks on our high traffic
tables grows to the hundreds.  If it's not the SELECT locking (and we don't
get that many INSERT/UPDATE on these tables), could the planner be doing it?
At peak load (~ 1000 queries/sec on highest traffic table, all very
similar), the serialized queries pile up and essentially create a DoS on our
service - requiring a restart of the PG daemon.  Upon stop & start, it's
back to normal.
Give us information on these queries; an explain analyze could be a good starting
point.
I've looked at PREPARE, but apparently it only lasts per-session - that's
worthless in our case (web based service, one connection per data-requiring
connection).
Trust me, PREPARE is not going to work miracles in scenarios like yours. If you use
postgres
in a web service environment, what you can use is a connection pool (look for pgpool,
IIRC);
if you use a CMS, then try to enable the cache in order to avoid hitting the DB for each
request.

Regards
Gaetano Mendola
---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?
  http://www.postgresql.org/docs/faqs/FAQ.html


Re: [PERFORM] Caching of Queries

2004-09-23 Thread jason.servetar
Scott: 

We have seen similar issues when we have had massive load on our web
server. My determination was that simply the act of spawning and
stopping postgres sessions was very heavy on the box, and by
implementing connection pooling (sqlrelay), we got much higher
throughput, and better response on the server than we would get any
other way. 

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Jason Coene
Sent: Thursday, September 23, 2004 10:53 AM
To: 'Mr Pink'; 'Scott Kirkwood'
Cc: [EMAIL PROTECTED]
Subject: Re: [PERFORM] Caching of Queries

I'm not an expert, but I've been hunting down a killer performance
problem
for a while now.  It seems this may be the cause.

At peak load, our database slows to a trickle.  The CPU and disk
utilization
are normal - 20-30% used CPU and disk performance good.

All of our postgres processes end up in the semwai state - seemingly
waiting on other queries to complete.  If the system isn't taxed in CPU
or
disk, I have a good feeling that this may be the cause.  I didn't know
that
planning queries could create such a gridlock, but based on Mr Pink's
explanation, it sounds like a very real possibility.

We're running on SELECT's, and the number of locks on our high traffic
tables grows to the hundreds.  If it's not the SELECT locking (and we
don't
get that many INSERT/UPDATE on these tables), could the planner be doing
it?

At peak load (~ 1000 queries/sec on highest traffic table, all very
similar), the serialized queries pile up and essentially create a DoS on
our
service - requiring a restart of the PG daemon.  Upon stop & start, it's
back to normal.

I've looked at PREPARE, but apparently it only lasts per-session -
that's
worthless in our case (web based service, one connection per
data-requiring
connection).

Does this sound plausible?  Is there an alternative way to do this that
I
don't know about?  Additionally, in our case, I personally don't see any
downside to caching and using the same query plan when the only thing
substituted are variables.  In fact, I'd imagine it would help
performance
significantly in high-volume web applications.

Thanks,

Jason

 -Original Message-
 From: [EMAIL PROTECTED]
[mailto:pgsql-performance-
 [EMAIL PROTECTED] On Behalf Of Mr Pink
 Sent: Thursday, September 23, 2004 11:29 AM
 To: Scott Kirkwood; [EMAIL PROTECTED]
 Subject: Re: [PERFORM] Caching of Queries
 
 Not knowing anything about the internals of pg, I don't know how this
 relates, but in theory,
 query plan caching is not just about saving time re-planning queries,
it's
 about scalability.
 Optimizing queries requires shared locks on the database metadata,
which,
 as I understand it
 causes contention and serialization, which kills scalability.
 
 I read this thread from last to first, and I'm not sure if I missed
 something, but if pg isn't
 caching plans, then I would say plan caching should be a top priority
for
 future enhancements. It
 needn't be complex either: if the SQL string is the same, and none of
the
 tables involved in the
 query have changed (in structure), then re-use the cached plan.
Basically,
 DDL and updated
 statistics would have to invalidate plans for affected tables.
 
 Preferably, it should work equally for prepared statements and those
not
 pre-prepared. If you're
 not using prepare (and bind variables), though, your plan caching is down
 the
 drain anyway...
 
 I don't think that re-optimizing based on values of bind variables is
 needed. It seems like it
 could actually be counter-productive and difficult to assess its
 impact.
 
 That's the way I see it anyway.
 
 :)
 


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PERFORM] Caching of Queries

2004-09-23 Thread Jason Coene
Hi Tom,

Easily recreated with Apache benchmark, ab -n 30000 -c 3000
http://webserver .  This runs 1 query per page, everything else is cached
on webserver. 

The lone query:

SELECT 
id, 
gameid, 
forumid, 
subject 
  FROM threads 
  WHERE nuked = 0 
  ORDER BY nuked DESC, 
lastpost DESC LIMIT 8

Limit  (cost=0.00..1.99 rows=8 width=39) (actual time=27.865..28.027 rows=8
loops=1)
   ->  Index Scan Backward using threads_ix_nuked_lastpost on threads
       (cost=0.00..16824.36 rows=67511 width=39) (actual time=27.856..27.989
       rows=8 loops=1)
         Filter: (nuked = 0)
 Total runtime: 28.175 ms

I'm not sure how I go about getting the stack traceback you need.  Any info
on this?  Results of ps below.  System is dual xeon 2.6, 2gb ram, hardware
raid 10 running FreeBSD 5.2.1.

Jason

last pid: 96094;  load averages:  0.22,  0.35,  0.38
up 19+20:50:37  13:10:45
161 processes: 2 running, 151 sleeping, 8 lock
CPU states: 12.2% user,  0.0% nice, 16.9% system,  1.6% interrupt, 69.4%
idle
Mem: 120M Active, 1544M Inact, 194M Wired, 62M Cache, 112M Buf, 2996K Free
Swap: 4096M Total, 4096M Free

  PID USERNAME PRI NICE   SIZERES STATE  C   TIME   WCPUCPU COMMAND
50557 pgsql 980 95276K  4860K select 0  24:00  0.59%  0.59% postgres
95969 pgsql  40 96048K 34272K sbwait 0   0:00  2.10%  0.29% postgres
95977 pgsql -40 96048K 29620K semwai 2   0:00  1.40%  0.20% postgres
96017 pgsql  40 96048K 34280K sbwait 0   0:00  2.05%  0.20% postgres
95976 pgsql -40 96048K 30564K semwai 3   0:00  1.05%  0.15% postgres
95970 pgsql -40 96048K 24404K semwai 1   0:00  1.05%  0.15% postgres
95972 pgsql -40 96048K 21060K semwai 1   0:00  1.05%  0.15% postgres
96053 pgsql -40 96048K 24140K semwai 3   0:00  1.54%  0.15% postgres
96024 pgsql -40 96048K 22192K semwai 3   0:00  1.54%  0.15% postgres
95985 pgsql -40 96048K 15208K semwai 3   0:00  1.54%  0.15% postgres
96033 pgsql 980 95992K  7812K *Giant 2   0:00  1.54%  0.15% postgres
95973 pgsql -40 96048K 30936K semwai 3   0:00  0.70%  0.10% postgres
95966 pgsql  40 96048K 34272K sbwait 0   0:00  0.70%  0.10% postgres
95983 pgsql  40 96048K 34272K sbwait 2   0:00  1.03%  0.10% postgres
95962 pgsql  40 96048K 34268K sbwait 2   0:00  0.70%  0.10% postgres
95968 pgsql -40 96048K 26232K semwai 2   0:00  0.70%  0.10% postgres
95959 pgsql  40 96048K 34268K sbwait 2   0:00  0.70%  0.10% postgres

 -Original Message-
 From: Tom Lane [mailto:[EMAIL PROTECTED]
 Sent: Thursday, September 23, 2004 1:06 PM
 To: Jason Coene
 Cc: 'Mr Pink'; 'Scott Kirkwood'; [EMAIL PROTECTED]
 Subject: Re: [PERFORM] Caching of Queries
 
 Jason Coene [EMAIL PROTECTED] writes:
  All of our postgres processes end up in the semwai state - seemingly
  waiting on other queries to complete.  If the system isn't taxed in CPU
 or
  disk, I have a good feeling that this may be the cause.
 
 Whatever that is, I'll bet lunch that it's got 0 to do with caching
 query plans.  Can you get stack tracebacks from some of the stuck
 processes?  What do they show in ps?
 
   regards, tom lane


---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] Caching of Queries

2004-09-23 Thread Greg Stark

Jason Coene [EMAIL PROTECTED] writes:

 All of our postgres processes end up in the semwai state - seemingly
 waiting on other queries to complete.  If the system isn't taxed in CPU or
 disk, I have a good feeling that this may be the cause.  

Well, it's possible contention of some sort is an issue but it's not clear
that it's planning related contention.

 We're running on SELECT's, and the number of locks on our high traffic
 tables grows to the hundreds.  

Where are you seeing this? What information do you have about these locks?

 I've looked at PREPARE, but apparently it only lasts per-session - that's
 worthless in our case (web based service, one connection per data-requiring
 connection).

Well, the connection time in postgres is pretty quick. But a lot of other
things, prepared queries among them, are a lot more effective if you have
long-lived sessions.

I would strongly recommend you consider some sort of persistent database
connection for your application. Most web based services run queries from a
single source base where all the queries are written in-house. In that
situation you can ensure that one request never leaves the session in an
unusual state (like setting guc variables strangely, or leaving a transaction
open, or whatever).

That saves you the reconnect time, which as I said is actually small, but
could still be contributing to your problem. I think it also makes the buffer
cache more effective as well. And It also means you can prepare all your
queries and reuse them on subsequent requests.

The nice thing about web based services is that while each page only executes
each query once, you tend to get the same pages over and over thousands of
times. So if they prepare their queries the first time around they can reuse
those prepared queries thousands of times.

Using a text cache of the query string on the server side is just a
work-around for failing to do that on the client side. It's much more
efficient and more flexible to do it on the client-side.

-- 
greg


---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster
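Greg's suggestion, preparing each query once per long-lived connection and reusing it on every later request, amounts to a client-side, per-connection statement cache. A minimal sketch follows, using the stdlib sqlite3 module as a stand-in for a persistent PostgreSQL connection; with PostgreSQL you would issue PREPARE/EXECUTE (or use your driver's prepared-statement support) instead, and the cache would live for the lifetime of the pooled connection.

```python
# Per-connection statement cache: the first request for a given SQL text
# sets up a dedicated cursor (standing in for a server-side prepared
# statement); later requests reuse it.

import sqlite3

class StatementCache:
    def __init__(self, conn):
        self.conn = conn
        self.cursors = {}   # sql text -> dedicated cursor for that statement

    def execute(self, sql, params=()):
        cur = self.cursors.get(sql)
        if cur is None:
            # First request for this query on this connection: "prepare" it.
            cur = self.conn.cursor()
            self.cursors[sql] = cur
        return cur.execute(sql, params).fetchall()

# Stand-in for a persistent database connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE threads (id INTEGER, nuked INTEGER)")
conn.executemany("INSERT INTO threads VALUES (?, ?)", [(1, 0), (2, 1)])

sc = StatementCache(conn)
rows = sc.execute("SELECT id FROM threads WHERE nuked = ?", (0,))
```

Because a web app serves the same handful of pages thousands of times, each pooled connection pays the parse/plan cost once per distinct query, which is exactly Greg's point about doing the caching client-side.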


Re: [PERFORM] Caching of Queries

2004-09-23 Thread Tom Lane
Jason Coene [EMAIL PROTECTED] writes:
 I'm not sure how I go about getting the stack traceback you need.  Any info
 on this?  Results of ps below.  System is dual xeon 2.6, 2gb ram, hardware
 raid 10 running FreeBSD 5.2.1.

Hmm.  Dual Xeon sets off alarm bells ...

I think you are probably looking at the same problem previously reported
by Josh Berkus among others.  Does the rate of context swaps shown by
vmstat go through the roof when this happens?  If you strace or ktrace
one of the backends, do you see lots of semop()s and little else?

Check the archives for this thread among others:
http://archives.postgresql.org/pgsql-performance/2004-04/msg00249.php
The test case you are talking about is a tight indexscan loop, which
is pretty much the same scenario as here:
http://archives.postgresql.org/pgsql-performance/2004-04/msg00280.php

The fundamental problem is heavy contention for access to a shared data
structure.  We're still looking for good solutions, but in the context
of this thread it's worth pointing out that a shared query-plan cache
would itself be subject to heavy contention, and arguably would make
this sort of problem worse not better.

regards, tom lane

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [PERFORM] Caching of Queries

2004-09-23 Thread Matt Clark
 I've looked at PREPARE, but apparently it only lasts 
 per-session - that's worthless in our case (web based 
 service, one connection per data-requiring connection).

That's a non-sequitur.  Most 'normal' high volume web apps have persistent
DB connections, one per http server process.  Are you really dropping DB
connections and reconnecting each time a new HTTP request comes in?

M


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PERFORM] Caching of Queries

2004-09-23 Thread Gaetano Mendola
Jason Coene wrote:
Hi Tom,
Easily recreated with Apache benchmark, ab -n 30000 -c 3000
http://webserver .  This runs 1 query per page, everything else is cached
on webserver. 
That test requires 30000 accesses with 3000 connections; that is not a normal
load. Describe your HW for us.
3000 connections means a very huge load; could you also provide the result of
vmstat 5? My webserver already thrashes with -c 120!
How many connections can your postgres manage?
You have to consider using a connection pool with that amount of connections.
Regards
Gaetano Mendola
---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
 subscribe-nomail command to [EMAIL PROTECTED] so that your
 message can get through to the mailing list cleanly


Re: [PERFORM] Caching of Queries

2004-09-23 Thread Markus Schaber
Hi, Jason,

On Thu, 23 Sep 2004 12:53:25 -0400
Jason Coene [EMAIL PROTECTED] wrote:

 I've looked at PREPARE, but apparently it only lasts per-session - that's
 worthless in our case (web based service, one connection per data-requiring
 connection).

This sounds like the load of connection init and close may be the
reason for the slowdown. Can you use connection pooling in your service?

HTH,
Markus


-- 
markus schaber | dipl. informatiker
logi-track ag | rennweg 14-16 | ch 8001 zürich
phone +41-43-888 62 52 | fax +41-43-888 62 53
mailto:[EMAIL PROTECTED] | www.logi-track.com

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings
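Several replies in this thread converge on connection pooling as the fix. A minimal blocking pool looks like the sketch below, with a stub connect() so it is self-contained; real pools (pgpool, SQLRelay, or an application-side pool) add health checks and session reset on release.

```python
# Minimal blocking connection pool: pay the connection/startup cost once,
# then hand connections out and take them back.

import queue

class ConnectionPool:
    def __init__(self, connect, size):
        # All connections are opened up front.
        self._pool = queue.Queue()
        for _ in range(size):
            self._pool.put(connect())

    def acquire(self):
        # Blocks until a connection is free, bounding backend count.
        return self._pool.get()

    def release(self, conn):
        self._pool.put(conn)

# Stub connection factory so the sketch runs without a database.
made = []
def stub_connect():
    made.append(object())
    return made[-1]

pool = ConnectionPool(stub_connect, size=3)
c = pool.acquire()
pool.release(c)
```

Beyond skipping reconnect cost, a fixed-size pool caps the number of backends, which directly addresses the 300-new-connections-per-second pattern described later in the thread.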


Re: [PERFORM] Caching of Queries

2004-09-23 Thread Josh Berkus
Tom,

 I think you are probably looking at the same problem previously reported
 by Josh Berkus among others.  Does the rate of context swaps shown by
 vmstat go through the roof when this happens?  If you strace or ktrace
 one of the backends, do you see lots of semop()s and little else?

That would be interesting.   Previously we've only demonstrated the problem on 
long-running queries, but I suppose it could also affect massive concurrent 
query access.

-- 
Josh Berkus
Aglio Database Solutions
San Francisco

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PERFORM] Caching of Queries

2004-09-23 Thread Tom Lane
Josh Berkus [EMAIL PROTECTED] writes:
 I think you are probably looking at the same problem previously reported
 by Josh Berkus among others.

 That would be interesting.  Previously we've only demonstrated the
 problem on long-running queries, but I suppose it could also affect
 massive concurrent query access.

Well, the test cases we used were designed to get the system into a
tight loop of grabbing and releasing shared buffers --- a long-running
index scan is certainly one of the best ways to do that, but there are
others.

I hadn't focused before on the point that Jason is launching a new
connection for every query.  In that scenario I think the bulk of the
cycles are going to go into loading the per-backend catalog caches with
the system catalog rows that are needed to parse and plan the query.
The catalog fetches to get those rows are effectively mini-queries
with preset indexscan plans, so it's not hard to believe that they'd be
hitting the BufMgrLock nearly as hard as a tight indexscan loop.  Once
all the pages needed are cached in shared buffers, there's no I/O delays
to break the loop, and so you could indeed get into the context swap
storm regime we saw before.

I concur with the thought that using persistent connections might go a
long way towards alleviating his problem.

regards, tom lane

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [PERFORM] Caching of Queries

2004-09-23 Thread Jason Coene
Hi All,

It does sound like we should be pooling connections somehow.  I'll be
looking at implementing that shortly.  I'd really like to understand what
the actual problem is, though.

Sorry, I meant 30,000 with 300 connections - not 3,000.  The 300 connections
/ second is realistic, if not underestimated.  As is the nature of our site
(realtime information about online gaming), there's a huge fan base and as a
big upset happens, we'll do 50,000 page views in a span of 3-5 minutes.

I get the same results with:

ab -n 1 -c 150 http://www.gotfrag.com/portal/news/

I've attached results from the above test, showing open locks, top output,
and vmstat 5.

Tom, I've run the test described in:

http://archives.postgresql.org/pgsql-performance/2004-04/msg00280.php

Results attached in mptest.txt.  The box did experience the same problems as
we've seen before.  I ran it under a separate database (test), and it still
caused our other queries to slow significantly from our production database
(gf) - semwait again.

It does look like the cs column under CPU (which I'd assume is Context
Swap) does bump up significantly (10x or better) during both my ab test, and
the test you suggested in that archived message.

Reading the first thread you pointed out (2004-04/msg00249.php), Josh Berkus
was questioning the ServerWorks chipsets.  We're running on the Intel E7501
Chipset (MSI board).  Our CPU's are 2.66 GHz with 533MHz FSB, Hyperthreading
enabled.  Unfortunately, I don't have physical access to the machine to turn
HT off.


Thanks,

Jason



 -Original Message-
 From: Gaetano Mendola [mailto:[EMAIL PROTECTED]
 Sent: Thursday, September 23, 2004 1:41 PM
 To: Jason Coene
 Subject: Re: Caching of Queries
 
 Jason Coene wrote:
  Hi Tom,
 
   Easily recreated with Apache benchmark, ab -n 30000 -c 3000
  http://webserver .  This runs 1 query per page, everything else is
 cached
  on webserver.
 
  That test requires 30000 accesses with 3000 connections; that is not a normal
  load. Describe your HW for us.
  
  3000 connections means a very huge load; could you also provide the result
  of vmstat 5? My webserver already thrashes with -c 120!
  
  How many connections can your postgres manage?
  
  You have to consider using a connection pool with that amount of
  connections.
 
 
 Regards
 Gaetano Mendola

gf=#  SELECT r.relname, l.mode, count(*) AS numlocks
   FROM pg_locks l, pg_class r
  WHERE r.oid = l.relation
  GROUP BY r.relname, l.mode
  ORDER BY count(*) DESC;

 relname  |  mode   | numlocks
--+-+--
 threads  | AccessShareLock |   63
 threads_ix_nuked_lastpost| AccessShareLock |   47
 threads_ix_nuked_gameid_lastpost | AccessShareLock |7
 pg_class | AccessShareLock |5
 pg_opclass_am_name_nsp_index | AccessShareLock |3
 pg_opclass   | AccessShareLock |3
 pg_class_oid_index   | AccessShareLock |3
 pg_type  | AccessShareLock |2
 pg_statistic | AccessShareLock |2
 pg_attribute | AccessShareLock |2
 pg_amop_opc_strategy_index   | AccessShareLock |2
 pg_attrdef   | AccessShareLock |2
 pg_trigger_tgrelid_tgname_index  | AccessShareLock |2
 pg_trigger   | AccessShareLock |2
 users| AccessShareLock |2
 pg_statistic_relid_att_index | AccessShareLock |2
 pg_type_oid_index| AccessShareLock |2
 pg_amop  | AccessShareLock |2
 pg_attribute_relid_attnum_index  | AccessShareLock |2
 comments | AccessShareLock |2
 pg_shadow| AccessShareLock |2
 acls | AccessShareLock |1
 pg_index_indexrelid_index| AccessShareLock |1
 pg_attrdef_adrelid_adnum_index   | AccessShareLock |1
 surveyresults_ix_userid  | AccessShareLock |1
 pg_cast  | AccessShareLock |1
 pg_shadow_usesysid_index | AccessShareLock |1
 pg_index | AccessShareLock |1
 games| AccessShareLock |1
 usersessions | AccessShareLock |1
 surveyoptions| AccessShareLock |1
 countries| AccessShareLock |1
 surveyresults| AccessShareLock |1
 vopenlocks   | AccessShareLock |1
 pg_class_relname_nsp_index   | AccessShareLock |1
 pg_namespace | AccessShareLock |1
 pg_database  | AccessShareLock |1
 surveys  | AccessShareLock |  

Re: [PERFORM] Caching of Queries

2004-09-23 Thread Jason Coene
Update:

I just tried running the same test (ab with 150 concurrent connections)
while connecting to postgres through 35 persistent connections (PHP
library), and had roughly the same type of results.  This should eliminate
the new connection overhead.  I've attached top and vmstat.  I let it run
until it had completed 800 requests.  Unless I'm missing something, there's
more than the new connection IO load here.

Jason

 -Original Message-
 From: Jason Coene [mailto:[EMAIL PROTECTED]
 Sent: Thursday, September 23, 2004 3:08 PM
 To: [EMAIL PROTECTED]
 Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
 Subject: RE: Caching of Queries
 
 Hi All,
 
 It does sound like we should be pooling connections somehow.  I'll be
 looking at implementing that shortly.  I'd really like to understand what
 the actual problem is, though.
 
 Sorry, I meant 30,000 with 300 connections - not 3,000.  The 300
 connections
 / second is realistic, if not underestimated.  As is the nature of our
 site
 (realtime information about online gaming), there's a huge fan base and as
 a
 big upset happens, we'll do 50,000 page views in a span of 3-5 minutes.
 
 I get the same results with:
 
 ab -n 1 -c 150 http://www.gotfrag.com/portal/news/
 
 I've attached results from the above test, showing open locks, top output,
 and vmstat 5.
 
 Tom, I've run the test described in:
 
 http://archives.postgresql.org/pgsql-performance/2004-04/msg00280.php
 
 Results attached in mptest.txt.  The box did experience the same problems
 as
 we've seen before.  I ran it under a separate database (test), and it
 still
 caused our other queries to slow significantly from our production
 database
 (gf) - semwait again.
 
 It does look like the cs column under CPU (which I'd assume is Context
 Swap) does bump up significantly (10x or better) during both my ab test,
 and
 the test you suggested in that archived message.
 
 Reading the first thread you pointed out (2004-04/msg00249.php), Josh
 Berkus
 was questioning the ServerWorks chipsets.  We're running on the Intel
 E7501
 Chipset (MSI board).  Our CPU's are 2.66 GHz with 533MHz FSB,
 Hyperthreading
 enabled.  Unfortunately, I don't have physical access to the machine to
 turn
 HT off.
 
 
 Thanks,
 
 Jason
 
 
 
  -Original Message-
  From: Gaetano Mendola [mailto:[EMAIL PROTECTED]
  Sent: Thursday, September 23, 2004 1:41 PM
  To: Jason Coene
  Subject: Re: Caching of Queries
 
  Jason Coene wrote:
   Hi Tom,
  
    Easily recreated with Apache benchmark, ab -n 30000 -c 3000
   http://webserver .  This runs 1 query per page, everything else is
  cached
   on webserver.
 
   That test requires 30000 accesses with 3000 connections; that is not a
  normal
   load. Describe your HW for us.
  
   3000 connections means a very huge load; could you also provide the result
   of vmstat 5? My webserver already thrashes with -c 120!
  
   How many connections can your postgres manage?
  
   You have to consider using a connection pool with that amount of
   connections.
 
 
  Regards
  Gaetano Mendola

last pid: 48239;  load averages:  5.83,  2.43,  1.50   up 19+22:59:04  15:19:12
127 processes: 16 running, 111 sleeping
CPU states: 17.7% user,  0.0% nice, 20.0% system,  1.0% interrupt, 61.3% idle
Mem: 125M Active, 1456M Inact, 193M Wired, 96M Cache, 112M Buf, 54M Free
Swap: 4096M Total, 4096M Free

  PID USERNAME PRI NICE   SIZERES STATE  C   TIME   WCPUCPU COMMAND
48190 pgsql -40 97408K 86416K semwai 1   0:01  3.35%  1.32% postgres
47761 pgsql -40 96816K 56708K semwai 2   0:01  0.90%  0.88% postgres
47765 pgsql  40 96816K 56708K sbwait 3   0:01  0.90%  0.88% postgres
47754 pgsql -40 96816K 56708K semwai 2   0:01  0.85%  0.83% postgres
47763 pgsql -40 96816K 56708K semwai 0   0:01  0.85%  0.83% postgres
47741 pgsql  40 96816K 56708K sbwait 3   0:01  0.75%  0.73% postgres
47674 pgsql -40 96264K 38992K semwai 1   0:01  0.74%  0.73% postgres
47753 pgsql -40 96816K 56708K semwai 1   0:00  0.65%  0.63% postgres
48204 pgsql -40 96856K 46752K semwai 0   0:00  2.15%  0.63% postgres
47698 pgsql  40 96240K 37792K sbwait 3   0:01  0.59%  0.59% postgres
47757 pgsql -40 96816K 56708K semwai 3   0:01  0.60%  0.59% postgres
47740 pgsql  40 96240K 37768K sbwait 0   0:01  0.55%  0.54% postgres
47759 pgsql  40 96816K 56708K sbwait 0   0:01  0.50%  0.49% postgres
47735 pgsql -40 96240K 37772K semwai 0   0:00  0.50%  0.49% postgres
48223 pgsql -40 96984K 55980K semwai 1   0:00  2.69%  0.49% postgres
48102 pgsql  40 96136K 54956K sbwait 0   0:00  0.69%  0.44% postgres
47718 pgsql  40 96816K 56716K sbwait 1   0:01  0.40%  0.39% postgres
48225 pgsql1230 96272K 57156K RUN0   0:00  2.80%  0.39% postgres
48053 pgsql -40 96136K 55040K semwai 0   0:00  0.48%  0.34% postgres
48041 pgsql  40 96136K 54992K sbwait 1   0:00  0.47%  0.34% postgres
48222 pgsql -4

Re: [PERFORM] Caching of Queries

2004-09-23 Thread Josh Berkus
Jason,

 Sorry, I meant 30,000 with 300 connections - not 3,000.  The 300
 connections
 / second is realistic, if not underestimated.  As is the nature of
 our site
 (realtime information about online gaming), there's a huge fan base
 and as a
 big upset happens, we'll do 50,000 page views in a span of 3-5
 minutes.

First, your posts show no evidence of the CS storm bug.

Second, 300 *new* connections a second is a lot.   Each new connection
requires a significant amount of both database and OS overhead.   This
is why all the other web developers use a connection pool.

In fact, I wouldn't be surprised if your lockups are on the OS level,
even; I don't recall that you cited what OS you're using, but I can
imagine locking up Linux 2.4 trying to spawn 300 new processes a
second.

--Josh

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [PERFORM] Caching of Queries

2004-09-23 Thread Joshua D. Drake

Sorry, I meant 30,000 with 300 connections - not 3,000.  The 300
connections
/ second is realistic, if not underestimated.  As is the nature of
our site
(realtime information about online gaming), there's a huge fan base
and as a
big upset happens, we'll do 50,000 page views in a span of 3-5
minutes.

First, your posts show no evidence of the CS storm bug.

Second, 300 *new* connections a second is a lot.   Each new connection
requires a significant amount of both database and OS overhead.   This
is why all the other web developers use a connection pool.

  

I would second this. You need to be running a connection pool, and
probably multiple web servers in front of that. You are talking about
a huge number of connections in that amount of time.

Josh Drake




  In fact, I wouldn't be surprised if your lockups are on the OS level,
even; I don't recall that you cited what OS you're using, but I can
imagine locking up Linux 2.4 trying to spawn 300 new processes a
second.

--Josh

---(end of broadcast)---
TIP 8: explain analyze is your friend
  



-- 
Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC
Postgresql support, programming shared hosting and dedicated hosting.
+1-503-667-4564 - [EMAIL PROTECTED] - http://www.commandprompt.com
PostgreSQL Replicator -- production quality replication for PostgreSQL




Re: [PERFORM] Caching of Queries (now with pgpool)

2004-09-23 Thread Jason Coene
Hi Josh,

I just tried using pgpool to pool the connections, and ran:

ab -n 1000 -c 50 http://wstg.int/portal/news/

I ran some previous queries to get pgpool to pre-establish all the
connections, and ab ran for a few minutes (with one query per page, eek!).
It was still exhibiting the same problems as before.  While so many new
connections at once can surely make the problem worse (and pgpool will
surely help there), shouldn't this prove that it's not the only issue?

We're running FreeBSD 5.2.1

I've attached open locks, running queries, query plans, top output and
vmstat 5 output for while ab was running, from start to finish.

Any ideas?

Jason


 -Original Message-
 From: [EMAIL PROTECTED] [mailto:pgsql-performance-
 [EMAIL PROTECTED] On Behalf Of Josh Berkus
 Sent: Thursday, September 23, 2004 8:06 PM
 To: Jason Coene; [EMAIL PROTECTED]
 Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
 Subject: Re: [PERFORM] Caching of Queries
 
 Jason,
 
  Sorry, I meant 30,000 with 300 connections - not 3,000.  The 300
  connections
  / second is realistic, if not underestimated.  As is the nature of
  our site
  (realtime information about online gaming), there's a huge fan base
  and as a
  big upset happens, we'll do 50,000 page views in a span of 3-5
  minutes.
 
 First, your posts show no evidence of the CS storm bug.
 
 Second, 300 *new* connections a second is a lot.   Each new connection
 requires a significant amount of both database and OS overhead.   This
 is why all the other web developers use a connection pool.
 
 In fact, I wouldn't be surprised if your lockups are on the OS level,
 even; I don't recall that you cited what OS you're using, but I can
 imagine locking up Linux 2.4 trying to spawn 300 new processes a
 second.
 
 --Josh
 
 ---(end of broadcast)---
 TIP 8: explain analyze is your friend
OPEN LOCKS:


gf=#  SELECT r.relname, l.mode, count(*) AS numlocks
   FROM pg_locks l, pg_class r
  WHERE r.oid = l.relation
  GROUP BY r.relname, l.mode
  ORDER BY count(*) DESC;

 relname  |  mode   | numlocks
----------------------------------+-----------------+----------
 threads  | AccessShareLock |   43
 threads_ix_nuked_lastpost| AccessShareLock |   35
 threads_ix_nuked_gameid_lastpost | AccessShareLock |7
 pg_attribute | AccessShareLock |1
 v_locks  | AccessShareLock |1
 pg_class | AccessShareLock |1
 usersessions | AccessShareLock |1
 countries| AccessShareLock |1
 users| AccessShareLock |1
 userstats_ix_id  | AccessShareLock |1
 pg_statistic_relid_att_index | AccessShareLock |1
 pg_attribute_relid_attnum_index  | AccessShareLock |1
 userstats| AccessShareLock |1
 demos| AccessShareLock |1
 pg_cast_source_target_index  | AccessShareLock |1
 pg_locks | AccessShareLock |1
 users_ix_id  | AccessShareLock |1
 buddies  | AccessShareLock |1
 buddies_ix_userid| AccessShareLock |1
 pg_cast  | AccessShareLock |1
 pg_statistic | AccessShareLock |1
(21 rows)

RUNNING QUERIES (AND HOW LONG FOR):

gf=# SELECT pg_stat_activity.usename, round(date_part('epoch'::text, now() - 
pg_stat_activity.query_start)) AS duration, pg_stat_activity.current_query
   FROM pg_stat_activity
  ORDER BY round(date_part('epoch'::text, now() - pg_stat_activity.query_start)) DESC;

 usename | duration |                         current_query
---------+----------+------------------------------------------------------------
 gf  |4 | SELECT id, gameid, forumid, subject FROM threads WHERE nuked = 0 
AND gameid = 1 ORDER BY nuked DESC, gameid DESC, lastpost DESC LIMIT 8
 gf  |3 | SELECT id, gameid, forumid, subject FROM threads WHERE nuked = 0 
ORDER BY nuked DESC, lastpost DESC LIMIT 8
 gf  |3 | SELECT id, gameid, forumid, subject FROM threads WHERE nuked = 0 
ORDER BY nuked DESC, lastpost DESC LIMIT 8
 gf  |3 | SELECT id, gameid, forumid, subject FROM threads WHERE nuked = 0 
ORDER BY nuked DESC, lastpost DESC LIMIT 8
 gf  |3 | IDLE
 gf  |3 | SELECT id, gameid, forumid, subject FROM threads WHERE nuked = 0 
ORDER BY nuked DESC, lastpost DESC LIMIT 8
 gf  |3 | SELECT id, gameid, forumid

Re: [PERFORM] Caching of Queries (now with pgpool)

2004-09-23 Thread Michael Adler
On Thu, Sep 23, 2004 at 09:23:51PM -0400, Jason Coene wrote:
 I ran some previous queries to get pgpool to pre-establish all the
 connections, and ab ran for a few minutes (with one query per page, eek!).
 It was still exhibiting the same problems as before.  While so many new
 connections at once can surely make the problem worse (and pgpool will
 surely help there), shouldn't this prove that it's not the only issue?

 Any ideas?

Now that your connections are persistent, you may benefit from using
PREPAREd queries.

-Mike
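
For the query this thread keeps showing in pg_stat_activity, a PREPAREd
version might look like the following sketch (the statement name and the
choice of parameters are illustrative, not from the original posts):

```sql
-- Parse and plan once per session; reuse the plan on every page view.
PREPARE recent_threads (int, int) AS
    SELECT id, gameid, forumid, subject
      FROM threads
     WHERE nuked = 0 AND gameid = $1
     ORDER BY nuked DESC, gameid DESC, lastpost DESC
     LIMIT $2;

-- Each subsequent page view pays only for execution, not parse/plan:
EXECUTE recent_threads(1, 8);
```

Since prepared statements are per-session, this only helps once a pool
like pgpool keeps sessions alive across page views.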

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [PERFORM] Caching of Queries

2004-09-22 Thread Tom Lane
Scott Kirkwood [EMAIL PROTECTED] writes:
 What do you think?

I think this would allow the problems of cached plans to bite
applications that were previously not subject to them :-(.
An app that wants plan re-use can use PREPARE to identify the
queries that are going to be re-executed.

regards, tom lane

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] Caching of Queries

2004-09-22 Thread Neil Conway
On Thu, 2004-09-23 at 05:59, Tom Lane wrote:
 I think this would allow the problems of cached plans to bite
 applications that were previously not subject to them :-(.
 An app that wants plan re-use can use PREPARE to identify the
 queries that are going to be re-executed.

I agree; if you want to do some work in this area, making improvements
to PREPARE would IMHO be the best bet. For example, some people have
talked about having PREPARE store queries in shared memory. Another idea
would be to improve the quality of the plan we generate at PREPARE time:
for instance you could generate 'n' plans for various combinations of
input parameters, and then choose the best query plan at EXECUTE time.
It's a difficult problem to solve, however (consider multiple parameters
to PREPARE, for example).
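
The trade-off described above is visible with EXPLAIN: a plan built at
PREPARE time must work for any parameter value, so it can differ from
the plan chosen for a specific constant. A sketch (table and parameter
names are illustrative):

```sql
PREPARE get_threads (int) AS
    SELECT id, subject FROM threads WHERE gameid = $1;

-- Plan chosen without knowing the value of $1:
EXPLAIN EXECUTE get_threads(1);

-- Plan for the same query with a literal, where the planner can use
-- the column statistics for that specific value:
EXPLAIN SELECT id, subject FROM threads WHERE gameid = 1;
```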

-Neil



---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org