Re: 7.5 change documentation (was Re: [HACKERS] cache control?)

2004-01-27 Thread Simon Riggs
 Bruce Momjian wrote:
 Tom Lane wrote:
  Simon Riggs [EMAIL PROTECTED] writes:
   If the TODO-list-with-dash isn't the correct place to have looked, is
   there another list of committed changes for the next release?
 
  We tend to rely on the CVS commit logs as the definitive source.  You
  can pull the info from the CVS server (I use cvs2cl.pl to format the
  results nicely), or read the archives of pgsql-committers.
 
  In theory there should be a section at the head of release.sgml
  mentioning the major changes done-so-far, but for various reasons this
  hasn't gotten installed in the 7.5 branch yet.  (Look at the CVS
  versions during 7.4 development to see how we did it last time.)
 
  As far as the ARC change goes, I believe Jan still considers it a
  work-in-progress, so it may not be appropriate to list yet anyway.
  (Jan, where are you on that exactly?)
 
   Do we need such a list? (I'd be happy to compile and maintain this if
   it is agreed that it is a good idea to have such a document or process
   as separate from TODO - I'll be doing this anyway before I pass
   further comments!)
 
  If you wanted to go through the existing 7.5 commits and write up a
  new done-so-far section, it'd save someone else (like me or Bruce)
  from having to do it sometime soon ...
 
 Doesn't Robert Treat's News Bits list all the major changes weekly?
 That would be a good source.
 

Bruce - The excellent work that both you and Robert do gives a slightly
different view from what I had in mind, though I agree they are all
aspects of the same information. I'm posting a first output of this now,
so we can discuss whether such a thing is useful, and whether it can
ever be all of useful/accurate/timely at once.

I'll happily add this to the HEAD of release.sgml, though let's agree
the content/direction first, before I spend time on a more formal
publication mechanism.

Best Regards, Simon Riggs




Re: [HACKERS] cache control?

2004-01-26 Thread Simon Riggs
Jan,

Happy to continue the discussion...though without changing my suggestion
that we defer any further, more specialised improvements for now.

 Jan Wieck replied to...
 Simon Riggs wrote:
  If we know ahead of time that a large scan is going to have this
  effect, why wait for the ARC to play its course, why not take exactly
  the same action?
  Have large scans call StrategyHint also. (Maybe rename it...?)...of
  course, some extra code to establish it IS a large scan...
  ...large table lookup should wait until a shared catalog cache is
  implemented
 
 The problem with this is a) how to detect that something will be a
 large scan, and b) how to decide what is a large scan in the first
 place.
 

My thoughts are that we know immediately prior to execution whether or
not a plan calls for a full table scan (FTS). We also know the table and
therefore its size. A large table in this context is one that would
disrupt the cache if it made it onto T2. We can discuss an appropriate
and usefully simple rule, perhaps sizeoftable(T) > 2*C?
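
As a minimal sketch of that rule (RelationGetNumberOfBlocks() and
NBuffers are real, but the helper itself is hypothetical, not part of
any patch), the check could run once at scan start:

#include "postgres.h"
#include "storage/bufmgr.h"     /* NBuffers is C, the buffer cache size */
#include "utils/rel.h"

/*
 * Hypothetical helper: decide, just before a heap scan begins, whether
 * the relation is "large" in the cache-spoiling sense discussed above,
 * i.e. sizeoftable(T) > 2*C.
 */
static bool
scan_is_large(Relation rel)
{
    BlockNumber nblocks = RelationGetNumberOfBlocks(rel);

    return nblocks > (BlockNumber) (2 * NBuffers);
}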

 Large sequential scans in warehousing are often part of more complex
 join operations. 

Yes, I agree. PostgreSQL is particularly prone to this currently,
because of the high number of plans that resolve to FTS. Complexity of
plan shouldn't affect the basic situation: we are reading all the
blocks of a table and putting them sequentially into T1 and then
working on them. Plan complexity may increase the time that a T1 block
stays in memory, with a subsequent increase in the probability of
promotion to T2.

 And just because something returns a large number of
 result rows doesn't mean that the input data was that much.

I agree also that overall execution time may be unrelated to whether a
large table is involved. The number of output rows shouldn't have any
effect on input rows and thus data blocks that need to be cached.

(Jan gives a detailed analysis...ending with)
 Honestly, I don't even know what type of application could possibly
 produce such a screwed access pattern. And I am absolutely confident
 one can find corner cases to wring down Oracle's complicated
 configuration harness more easily.

I agree with everything you say. The algorithm copes well with almost
every sequential pattern of access, and there is significant benefit in
ignoring the very, very rare cases that might give it problems.

My thoughts are about multiple concurrent accesses, specifically FTS on
large tables, rather than sequential ones.

 Buffers evicted from T1 are remembered in B1, and because of that even
 repeated sequential scans of the same large relation will only cycle
 through T1 blocks, never cause any turbulence in T2 or B2.

If we have a situation where a single backend makes repeated scans of
the same table, these will be sequential and will have no effect on T1.

In a DW situation, you are likely to have one or more very popular large
tables (maybe think of this as the Fact table, if you have a
dimensional design). The tables are large and therefore query execution
times will be extended (and accepted by the user). In this situation it
is very likely that:
i) a single user/app submits multiple requests from other
windows/threads, or simply
ii) multiple users access the popular table.

The common effect will be concurrent, rather than sequential, access to
the popular table. Different SQL statements will have different plans
and will perform scans of the same table at different rates because of
other joins, more complex WHERE clauses etc. Like waves at a beach
moving at different rates. Every time one scan catches up with another,
it will cause T1 hits for almost the whole of the T1 list, promoting all
of these blocks to the top of the T2 MRU and thus spoiling the cache -
if it hits one it will probably hit most of them. This will not happen
ALL the time, but I don't want it to happen EVER. Even in a DW
situation, I still want to be inserting data regularly (that's how the
table got big!), so I have index blocks and other stuff that I want
almost permanently in memory. Concurrent access via an index might have
the same effect, though less dramatically.
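
To see the scale of the effect, here is a toy model (ordinary C, not
PostgreSQL code; the table size, offset, and T1 residency figures are
illustrative assumptions) of two scans of the same table offset by 500
blocks, under a simplified ARC-like rule: a block re-touched while it
is still resident in T1 gets promoted to T2.

#include <stdio.h>
#include <stdbool.h>

#define TABLE_BLOCKS 10000
#define T1_RESIDENCY 2000       /* assumed T1 lifetime, in block touches */

int main(void)
{
    long last_touch[TABLE_BLOCKS];
    bool in_t2[TABLE_BLOCKS] = { false };
    long tick = 0, promoted = 0;
    int  i, s;

    for (i = 0; i < TABLE_BLOCKS; i++)
        last_touch[i] = -1;

    /* leading scan at block i, trailing scan 500 blocks behind */
    for (i = 0; i < TABLE_BLOCKS; i++)
    {
        int pos[2] = { i, i - 500 };

        for (s = 0; s < 2; s++)
        {
            int b = pos[s];

            if (b < 0)
                continue;
            /* re-touched while still in T1: ARC promotes it to T2 */
            if (last_touch[b] >= 0 &&
                tick - last_touch[b] < T1_RESIDENCY && !in_t2[b])
            {
                in_t2[b] = true;
                promoted++;
            }
            last_touch[b] = tick++;
        }
    }
    printf("promoted to T2: %ld of %d blocks\n", promoted, TABLE_BLOCKS);
    return 0;
}

With these numbers the trailing scan promotes 9500 of the 10000 blocks:
one overlapping scan is enough to push nearly the whole table into T2.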

The closer the size of a table is to C, the greater the likelihood that
these spurious cache hits will occur. (Of course, it might be argued
that these are worthwhile and genuine cache hits - I argue that they are
not wanted, and this is the main basis of my discussion). Of course, if
a table does fit in memory then that is very good. If a table were, say,
2*C, then spurious cache hits would occur often and spoil the whole of
T2.

The DBT-3 workload is very similar to the TPC-D/TPC-H workload. The test
consists of a power test (all queries sequentially in random order) and
a throughput test (1 or more concurrent streams, each stream executing
all queries in a random order). When this benchmark first came out most
vendors chose to perform the throughput test with only 1 stream (though
with parallel processing)...I would say one reason for this is poor
cache management...hence recent changes in various commercial products.

Re: [HACKERS] cache control?

2004-01-26 Thread Jan Wieck
Simon Riggs wrote:
 Jan,
 [...]
 My thoughts are about multiple concurrent accesses, specifically FTS on
 large tables, rather than sequential ones.

Single or multiple backends is irrelevant here, because a data block
only exists once and therefore we have only one shared buffer cache.


  Buffers evicted from T1 are remembered in B1, and because of that even
  repeated sequential scans of the same large relation will only cycle
  through T1 blocks, never cause any turbulence in T2 or B2.
 If we have a situation where a single backend makes repeated scans of
 the same table, these will be sequential and will have no effect on T1.

You really have to look at this a bit more globally, not per-table.
The strategy of ARC is this:

In an unknown access pattern, if a specific block is accessed less
frequently than every C requests, it will only go into T1, age, get
evicted, and its CDB moves to B1, where it will eventually be removed
and forgotten. Every block that is accessed more frequently than that
will still be, at its next access, in one of the four queues of the
directory, and will immediately go into T2.

The adjustment of the target T1 size is an attempt to catch as many
newcomers as possible. If an application does many inserts, it will
access new blocks again very soon, so a small T1 is sufficient to hold
them in memory until their next access, at which point they move into
T2. For an application that does non-uniform random access to blocks
(there are always bestsellers and less frequently asked-for items), a
larger T1 might better satisfy that access pattern.
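
A very condensed sketch of that promotion rule (plain C; the CDB struct
and queue names here are illustrative, not the actual 7.5 data
structures):

typedef enum { Q_T1, Q_T2, Q_B1, Q_B2, Q_NONE } ArcQueue;

typedef struct CDB
{
    ArcQueue queue;             /* which directory queue the entry is on */
    /* ... buffer tag, LRU list links, etc. ... */
} CDB;

static void
arc_access(CDB *cdb)
{
    if (cdb->queue == Q_NONE)
    {
        cdb->queue = Q_T1;      /* first sighting: enter T1 at the MRU */
        return;
    }

    /*
     * Ghost hits adjust the adaptation target: a hit in B1 grows the
     * T1 target, a hit in B2 shrinks it (bookkeeping omitted here).
     * Either way, a block the directory still remembers has been seen
     * more often than every C requests, so it goes to the T2 MRU.
     */
    cdb->queue = Q_T2;
}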

 In a DW situation, you are likely to have one or more very popular
 large tables (maybe think of this as the Fact table, if you have a
 dimensional design). The tables are large and therefore query execution
 times will be extended (and accepted by the user). In this situation it
 is very likely that:
 i) a single user/app submits multiple requests from other
 windows/threads, or simply
 ii) multiple users access the popular table.

If that causes its blocks to be requested more frequently than every C
lookups, they belong in T2.

 The common effect will be concurrent, rather than sequential, access to
 the popular table. Different SQL statements will have different plans
 and will perform scans of the same table at different rates because of
 other joins, more complex WHERE clauses etc. Like waves at a beach
 moving at different rates. Every time one scan catches up with another,
 it will cause T1 hits for almost the whole of the T1 list, promoting
 all of these blocks to the top of the T2 MRU and thus spoiling the
 cache - if it hits one it will probably hit most of them. This will not
 happen ALL the time, but I don't want it to happen EVER. Even in a DW
 situation, I still want to be inserting data regularly (that's how the
 table got big!), so I have index blocks and other stuff that I want
 almost permanently in memory. Concurrent access via an index might have
 the same effect, though less dramatically.
 The closer the size of a table is to C, the greater the likelihood that
 these spurious cache hits will occur. (Of course, it might be argued
 that these are worthwhile and genuine cache hits - I argue that they
 are not wanted, and this is the main basis of my discussion). Of
 course, if a table does fit in memory then that is very good. If a
 table were, say, 2*C, then spurious cache hits would occur often and
 spoil the whole of T2.

How can any generic algorithm ever sense that when the application is
accessing the same blocks multiple times, it should NOT cache them? Are
you asking for fine-grained tuning of cache priorities and behaviour on
a per-table basis?

 The DBT-3 workload is very similar to the TPC-D/TPC-H workload. The
 test consists of a power test (all queries sequentially in random
 order) and a throughput test (1 or more concurrent streams, each stream
 executing all queries in a random order). When this benchmark first
 came out most vendors chose to perform the throughput test with only 1
 stream (though with parallel processing)...I would say one reason for
 this is poor cache management...hence recent changes in various
 commercial products.
 In summary, I believe there is a reasonably common effect in DW
 situations where concurrent query access to large and popular tables
 will result in undesirable cache spoiling. This effect will still occur
 even after the ARC improvements are introduced - though in every other
 case I can think of, the ARC code is a major improvement on earlier
 strategies and should be hailed as a major improvement in automatic
 performance adaptation.
 There are two solution ideas:
 i) change the code so that FTS on large tables use the no cache
 strategy that has already been developed to support Vacuum.
 ii) more complex: synchronise the FTS of the large table so that all
 backends that want scans produce only one set of I/Os and they share
 the block many times (yet still don't put it in cache!). FTS don't
 start at the beginning every time, they start 

Re: [HACKERS] cache control?

2004-01-26 Thread Simon Riggs
Jan,

I think we should suspend further discussion for now...in summary:

ARC Buffer management is an important new performance feature for 7.5;
the implementation is a good one and should have positive benefit for
everybody's workload. ARC will adapt to a variety of situations and has
been designed to allow Vacuum to avoid interfering with user
applications.

That's the important bit: The implementation notes are detailed; I've
read them a few times to ensure I've got it straight. I am confident
that the situation I described CAN exist with regard to multiple
concurrent queries performing full table scans upon a single large
table. Further debate on that point is continuing because of my poor
explanation of that situation; forgive me. Thanks very much for your
further explanations and examples.

I will take a more practical tack on this now: providing evidence of a
real query mix that exhibits the described properties and quantifying
the effects and their frequency. If it IS worth it, and I accept that it
may not be, I'll have a hack at the very specialised improvement I was
suggesting, for very specific workload types.

Best Regards

Simon Riggs





Re: [HACKERS] cache control?

2004-01-23 Thread Simon Riggs
 Jan Wieck wrote:

 have you read src/backend/storage/buffer/README of current CVS tip?

 Tom Lane wrote:

 Um, did you read the discussion of the ARC buffer management algorithm
 that's already been implemented for 7.5?
 

Tom, Jan, 

No, I hadn't read it. Thank you both for taking the time and trouble to
point it out to me.

My understanding, possibly faulty, was that if a piece of work was
completed, it appeared on the TODO list with a dash in front of it. The
new cache management strategy isn't mentioned there, so I was not aware
that any work was completed (or even in progress). No finger pointing,
just an observation of how the development process works...

If the TODO-list-with-dash isn't the correct place to have looked, is
there another list of committed changes for the next release? The latest
README in CVS doesn't have a list of what's new in 7.5 or similar.

Do we need such a list? (I'd be happy to compile and maintain this if it
is agreed that it is a good idea to have such a document or process as
separate from TODO - I'll be doing this anyway before I pass further
comments!)

Regards, Simon





Re: [HACKERS] cache control?

2004-01-23 Thread Simon Riggs

 Jan Wieck wrote:

 have you read src/backend/storage/buffer/README of current CVS tip?
 
 The algorithm in the new replacement strategy is an attempt to figure
 that SMALL_TABLE_THRESHOLD automatically. Do you see anything that can
 be improved in that algorithm?
 

Jan,

I've read src/backend/storage/buffer/README rev 1.6 as you suggest. The
new algorithm looks great - many thanks for implementing that.

I'm not able to improve on this for the general case - I especially like
the automatic management that it gives, allowing you to avoid additional
DBA-set parameters (and the coding to add these option
settings/keywords).

My concern was for DBT-3 performance and general Decision Support (DSS)
workloads, where a large proportion of table scans occur (though not in
the DBT-3 single-threaded test). The new strategy is much better than
the older one and is likely to have a positive effect in this area. I
don't think, right now, that anything further should be changed, in the
interests of stability.

For the record/for the future: My observation was that two commercial
databases focused on DSS use a strategy which, in terms of the new ARC
implementation, is effectively: place blocks in T1 (RECENCY/RECYCLE
buffer) and NEVER promote them to T2 (FREQUENCY/KEEP buffer) when
performing large object scans.

In the new README, you note that:

   StrategyHintVacuum(bool vacuum_active)

   Because vacuum reads all relations of the entire database
   through the buffer manager, it can greatly disturb the
   buffer replacement strategy. This function is used by vacuum
   to inform that all subsequent buffer lookups are caused
   by vacuum scanning relations.

...I would say that scans of very large tables also greatly disturb the
buffer replacement strategy, i.e. have exactly the same effect on the
cache as the Vacuum utility. 

You'd clearly thought of the idea before me, though with regard to
Vacuum. 

If we know ahead of time that a large scan is going to have this effect,
why wait for the ARC to play its course, why not take exactly the same
action?
Have large scans call StrategyHint also. (Maybe rename it...?)...of
course, some extra code to establish it IS a large scan...
...large table lookup should wait until a shared catalog cache is
implemented
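
As a sketch of what that could look like (StrategyHintVacuum() is the
real 7.5 entry point; StrategyHintLargeScan() and the surrounding helper
are purely hypothetical, named by analogy):

#include "postgres.h"
#include "storage/bufmgr.h"     /* NBuffers, RelationGetNumberOfBlocks */
#include "utils/rel.h"

extern void StrategyHintVacuum(bool vacuum_active);  /* existing call */
extern void StrategyHintLargeScan(bool scan_active); /* hypothetical */

static void
begin_heap_scan(Relation rel)
{
    /* treat the relation as "large" if it exceeds twice the cache */
    if (RelationGetNumberOfBlocks(rel) > (BlockNumber) (2 * NBuffers))
        StrategyHintLargeScan(true);

    /* ... run the scan, then call StrategyHintLargeScan(false) ... */
}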
 
Anyway, this idea can wait at least until we have done extensive
performance tuning of 7.5 on DBT-3. Thanks again for adding the new
algorithm.

Best Regards, Simon 

... 
 Simon Riggs wrote:
... 
 
  My suggestion would be to:
  - split the buffer cache into two, just as Oracle does: KEEP & RECYCLE.
  This could default to KEEP=66% of total memory available, but could
  also be settable by init parameter.
  [changes to the memory management routines]
  - if we do a scan on a table whose size in blocks is more than some
  fraction (25%?) of the KEEP bufferpool then we place the blocks into
  the RECYCLE bufferpool. This can be decided immediately following
  optimization, rather than including it within the optimizer decision
  process, since we aren't going to change the way the statement
  executes; we're just going to stop it from having an adverse effect on
  other current or future statements.
  [additional test to set parameter, then work out where to note it]
 




7.5 change documentation (was Re: [HACKERS] cache control?)

2004-01-23 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 If the TODO-list-with-dash isn't the correct place to have looked, is
 there another list of committed changes for the next release?

We tend to rely on the CVS commit logs as the definitive source.  You
can pull the info from the CVS server (I use cvs2cl.pl to format the
results nicely), or read the archives of pgsql-committers.

In theory there should be a section at the head of release.sgml
mentioning the major changes done-so-far, but for various reasons this
hasn't gotten installed in the 7.5 branch yet.  (Look at the CVS
versions during 7.4 development to see how we did it last time.)

As far as the ARC change goes, I believe Jan still considers it a
work-in-progress, so it may not be appropriate to list yet anyway.
(Jan, where are you on that exactly?)

 Do we need such a list? (I'd be happy to compile and maintain this if it
 is agreed that it is a good idea to have such a document or process as
 separate from TODO - I'll be doing this anyway before I pass further
 comments!)

If you wanted to go through the existing 7.5 commits and write up a
new done-so-far section, it'd save someone else (like me or Bruce)
from having to do it sometime soon ...

regards, tom lane



Re: [HACKERS] cache control?

2004-01-22 Thread Jan Wieck
Simon,

have you read src/backend/storage/buffer/README of current CVS tip?

The algorithm in the new replacement strategy is an attempt to figure 
that SMALL_TABLE_THRESHOLD automatically. Do you see anything that can 
be improved in that algorithm?

Jan

Simon Riggs wrote:

This discussion seems likely to have a major effect on DBT-3 (DSS-type)
performance from PostgreSQL...
On Fri, 16 Jan 2004, Michael Brusser wrote:

 Is there a way to force database to load
 a frequently-accessed table into cache and keep it there?
Scott Marlowe replied...

 Nope.  But there is a new cache buffer handler that may make it into
 7.5 that would make that happen automagically.

The important question here is "what forces blocks out of cache?" rather
than thinking about how to directly keep them there.
Jeroen T. Vermeulen wrote:
 (Sent: Friday, January 16, 2004 23:02, Subject: [HACKERS] Reverse
 scans?)

 Would it be doable, and would it be useful, to try to alternate the
 directions of table and index scans every time each table/index was
 fully scanned?
 I was thinking that it could help cache performance at various levels
 in cases where data at the end of a large table, say, that remained in
 memory after a scan, would otherwise be flushed out by a new scan of
 the same table.  If the next scan of the same table was to go in the
 other direction, any remains of the last time around that were still in
 the filesystem cache, buffer pool, hard disk cache etc. would stand a
 greater chance of being reused.
Jeroen's idea is a good one when we consider the current behaviour,
which is that large scans are placed into the block buffer cache, which
then forces other data out. I would like to question the latter
behaviour, so we can address the cause and not just the symptom.

Earlier versions of Oracle had a parameter called something like
SMALL_TABLE_THRESHOLD. Any scan of a table larger than that threshold
was NOT placed into the buffer cache, but was consumed directly by the
shadow process (the backend). Teradata also uses a similar buffer
control technique for large table scans.
If a table is too large to fit into the buffer cache, it clearly wasn't
going to be cached properly in the first place; Jeroen's idea only works
well for tables near to the size of the cache. If the table is MUCH
bigger then it will have very little gain. Good original thinking,
though I'm not sure it's worth it.
Oracle 9i now offers some variety for buffer cache management (as does
DB2). You can specify at the tablespace and object level whether to use
one of two buffer pools, KEEP or RECYCLE. The idea is to separate out
the two types of blocks - ones that are there because they're well used
and other blocks that are there at the moment, but unlikely to stay.
My suggestion would be to:
- split the buffer cache into two, just as Oracle does: KEEP & RECYCLE.
This could default to KEEP=66% of total memory available, but could also
be settable by init parameter.
[changes to the memory management routines]
- if we do a scan on a table whose size in blocks is more than some
fraction (25%?) of the KEEP bufferpool then we place the blocks into the
RECYCLE bufferpool. This can be decided immediately following
optimization, rather than including it within the optimizer decision
process, since we aren't going to change the way the statement executes;
we're just going to stop it from having an adverse effect on other
current or future statements.
[additional test to set parameter, then work out where to note it]
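
A minimal sketch of that decision rule (plain C; the pool names,
fractions, and function come from the proposal above, not from any
existing PostgreSQL code):

#include <stdbool.h>

typedef unsigned int BlockNumber;
typedef enum { POOL_KEEP, POOL_RECYCLE } BufferPool;

#define KEEP_FRACTION  0.66     /* KEEP = 66% of shared buffers */
#define SCAN_THRESHOLD 0.25     /* scans > 25% of KEEP go to RECYCLE */

/* decided once per scan, immediately after planning */
static BufferPool
choose_pool(BlockNumber table_blocks, long shared_buffers)
{
    long keep_blocks = (long) (shared_buffers * KEEP_FRACTION);

    if (table_blocks > (BlockNumber) (keep_blocks * SCAN_THRESHOLD))
        return POOL_RECYCLE;    /* too big: don't evict hot KEEP blocks */
    return POOL_KEEP;
}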
Notice that I haven't suggested that the KEEP/RECYCLE option could be
specified at table level. That optionality sounds like a lot of extra
work, when what is needed is the automatic avoidance of cache-spoiling
behaviour. (This would still mean that very large indexes with random
request patterns would still spoil cache...maybe implement that later?)
This would remove most reasons for spoiling the cache and blocks would
then leave the cache only when they were genuinely no longer wanted.
Any comments?? Takers?



--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #


Re: [HACKERS] cache control?

2004-01-22 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 Any comments?? Takers?

Um, did you read the discussion of the ARC buffer management algorithm
that's already been implemented for 7.5?

The main objection I have to doing it Oracle's way is that it creates
more parameters that DBAs have to get right for reasonable performance.
Self-tuning algorithms are better, when available.

regards, tom lane



Re: [HACKERS] cache control?

2004-01-18 Thread Reinoud van Leeuwen
On Fri, Jan 16, 2004 at 12:00:08PM -0500, Michael Brusser wrote:
 Is there a way to force database to load
 a frequently-accessed table into cache and keep it there?

If it is frequently accessed, I guess it would be in the cache
permanently.

-- 
__
Nothing is as subjective as reality
Reinoud van Leeuwen[EMAIL PROTECTED]
http://www.xs4all.nl/~reinoud
__



Re: [HACKERS] cache control?

2004-01-16 Thread Neil Conway
Michael Brusser [EMAIL PROTECTED] writes:
 Is there a way to force database to load a frequently-accessed table
 into cache and keep it there?

No.

BTW, this idea has been suggested in the past, so check the archives
for the prior discussions on this topic. The usual response is that
the PostgreSQL bufmgr should already be placing hot pages into the
cache, so there isn't really a need for this mechanism. (And if the
bufmgr doesn't do this well enough, we should improve the bufmgr -- as
Jan has done for 7.5.)

-Neil




Re: [HACKERS] cache control?

2004-01-16 Thread scott.marlowe
On Fri, 16 Jan 2004, Michael Brusser wrote:

 Is there a way to force database to load
 a frequently-accessed table into cache and keep it there?

Nope.  But there is a new cache buffer handler that may make it into 7.5 
that would make that happen automagically.

