Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-04-02 Thread Bruce Momjian

This is a test version, but I am putting it in the queue so we can track it there.

Your patch has been added to the PostgreSQL unapplied patches list at:

http://momjian.postgresql.org/cgi-bin/pgpatches

It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.

---


Simon Riggs wrote:
 On Mon, 2007-03-12 at 09:14 +, Simon Riggs wrote:
  On Mon, 2007-03-12 at 16:21 +0900, ITAGAKI Takahiro wrote:
 
   With the default
   value of scan_recycle_buffers (=0), VACUUM seems to use all of the
   buffers in the pool, just like existing sequential scans. Is this
   intended?
  
  Yes, but it's not very useful for testing to have done that. I'll do
  another version within the hour that sets N=0 (only) back to current
  behaviour for VACUUM.
 
 New test version enclosed, where scan_recycle_buffers = 0 doesn't change
 existing VACUUM behaviour.
 
 -- 
   Simon Riggs 
   EnterpriseDB   http://www.enterprisedb.com
 

[ Attachment, skipping... ]

 
 ---(end of broadcast)---
 TIP 3: Have you checked our extensive FAQ?
 
http://www.postgresql.org/docs/faq

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-26 Thread Bruce Momjian

Simon, is this patch ready to be added to the patch queue? I assume not.

---

Simon Riggs wrote:
 On Mon, 2007-03-12 at 09:14 +, Simon Riggs wrote:
  On Mon, 2007-03-12 at 16:21 +0900, ITAGAKI Takahiro wrote:
 
   With the default
   value of scan_recycle_buffers (=0), VACUUM seems to use all of the
   buffers in the pool, just like existing sequential scans. Is this
   intended?
  
  Yes, but it's not very useful for testing to have done that. I'll do
  another version within the hour that sets N=0 (only) back to current
  behaviour for VACUUM.
 
 New test version enclosed, where scan_recycle_buffers = 0 doesn't change
 existing VACUUM behaviour.
 
 -- 
   Simon Riggs 
   EnterpriseDB   http://www.enterprisedb.com
 

[ Attachment, skipping... ]

 
 ---(end of broadcast)---
 TIP 3: Have you checked our extensive FAQ?
 
http://www.postgresql.org/docs/faq

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-13 Thread Simon Riggs
On Mon, 2007-03-12 at 22:16 -0700, Luke Lonergan wrote:

 You may know we've built something similar and have seen similar gains.

Cool

 We're planning a modification that I think you should consider: when there
 is a sequential scan of a table larger than the size of shared_buffers, we
 are allowing the scan to write through the shared_buffers cache.

Write? For which operations?

I was thinking to do this for bulk writes also, but it would require
changes to bgwriter's cleaning sequence. Are you saying to write say ~32
buffers then fsync them, rather than letting bgwriter do that? Then
allow those buffers to be reused?

 The hypothesis is that if a relation is of a size equal to or less than the
 size of shared_buffers, it is cacheable and should use the standard LRU
 approach to provide for reuse.

Sounds reasonable. Please say more.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com



---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-13 Thread Simon Riggs
On Tue, 2007-03-13 at 13:40 +0900, ITAGAKI Takahiro wrote:
 Simon Riggs [EMAIL PROTECTED] wrote:
 
With the default
 value of scan_recycle_buffers (=0), VACUUM seems to use all of the buffers
 in the pool, just like existing sequential scans. Is this intended?
   
  New test version enclosed, where scan_recycle_buffers = 0 doesn't change
  existing VACUUM behaviour.
 
 This is a result with scan_recycle_buffers.v3.patch. I used normal VACUUM
 with a background load from a slowed-down pgbench in this instance. I believe
 the patch is useful in normal cases, not only for VACUUM FREEZE.
 
    N |  time  | WAL flush(*)
 -----+--------+--------------
    0 | 112.8s | 44.3%
    1 | 148.9s | 52.1%
    8 | 105.1s | 17.6%
   16 |  96.9s |  8.7%
   32 | 103.9s |  6.3%
   64 |  89.4s |  6.6%
  128 |  80.0s |  3.8%

Looks good.

Not sure what value of N to pick for normal use. The objectives are
i) don't let VACUUMs spoil the cache
ii) speed up standalone VACUUMs
iii) don't let VACUUM cause others to repeatedly flush WAL

I'm thinking N=16 meets all 3 objectives. We could make VACUUM go faster
still, but only by knocking more blocks out of cache that someone doing real
work might need. That will slow them down almost as much as forcing them
to flush WAL, so I'd want to be conservative with VACUUM.

Does anybody think we need a new parameter for this, or are we happy at
16 buffers in the recycle loop for VACUUM?

At this point I should note something I haven't mentioned before.
VACUUMs force other backends to flush out WAL only when we have an I/O
bound workload. If the database already fits in memory then BufferAlloc
never needs to run and therefore we don't need to flush WAL. So I can
understand that the effect of WAL flushing may not have been noticed by
many testers. 

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com



---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-13 Thread Luke Lonergan
Simon,

On 3/13/07 2:37 AM, Simon Riggs [EMAIL PROTECTED] wrote:

 We're planning a modification that I think you should consider: when there
 is a sequential scan of a table larger than the size of shared_buffers, we
 are allowing the scan to write through the shared_buffers cache.
 
 Write? For which operations?

I'm actually just referring to the sequential scan writing into the shared
buffers cache, sorry for the "write through" :-)
  
 I was thinking to do this for bulk writes also, but it would require
 changes to bgwriter's cleaning sequence. Are you saying to write say ~32
 buffers then fsync them, rather than letting bgwriter do that? Then
 allow those buffers to be reused?

Off topic, but we think we just found the reason(s) for the abysmal heap
insert performance of pgsql and are working on a fix to that as well.  It
involves two main things: the ping-ponging seeks used to extend a relfile
and the bgwriter not flushing aggressively enough.  We're hoping to move the
net heap insert rate from 12MB/s for a single stream to something more like
100 MB/s per stream, but it may take a week to get some early results and
find out if we're on the right track.  We've been wrong on this before ;-)

- Luke   



---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-12 Thread ITAGAKI Takahiro

Simon Riggs [EMAIL PROTECTED] wrote:

 I've implemented buffer recycling, as previously described, patch being
 posted now to -patches as scan_recycle_buffers.
 
 - for VACUUMs of any size, with the objective of reducing WAL thrashing
 whilst keeping VACUUM's behaviour of not spoiling the buffer cache (as
 originally suggested by Itagaki-san, just with a different
 implementation).

I tested your patch with VACUUM FREEZE. The performance was improved when
I set scan_recycle_buffers >= 32. I used VACUUM FREEZE to increase WAL traffic,
but this patch should be useful for normal VACUUMs with background jobs!

   N | time  | WAL flush(*)
-----+-------+--------------
   0 | 58.7s |  0.01%
   1 | 80.3s | 81.76%
   8 | 73.4s | 16.73%
  16 | 64.2s |  9.24%
  32 | 59.0s |  4.88%
  64 | 56.7s |  2.63%
 128 | 55.1s |  1.41%

(*) WAL flush is the fraction of buffer recycles that required a WAL fsync.

# SET scan_recycle_buffers = 0;
# UPDATE accounts SET aid=aid WHERE random() < 0.005;
# CHECKPOINT;
# SET scan_recycle_buffers = N;
# VACUUM FREEZE accounts;


BTW, does the patch change the default usage of buffers in vacuum? From what
I've seen, scan_recycle_buffers = 1 is the same as before. With the default
value of scan_recycle_buffers (=0), VACUUM seems to use all of the buffers in
the pool, just like existing sequential scans. Is this intended?

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center



---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-12 Thread Simon Riggs
On Mon, 2007-03-12 at 16:21 +0900, ITAGAKI Takahiro wrote:
 Simon Riggs [EMAIL PROTECTED] wrote:
 
  I've implemented buffer recycling, as previously described, patch being
  posted now to -patches as scan_recycle_buffers.
  
  - for VACUUMs of any size, with the objective of reducing WAL thrashing
  whilst keeping VACUUM's behaviour of not spoiling the buffer cache (as
  originally suggested by Itagaki-san, just with a different
  implementation).
 
 I tested your patch with VACUUM FREEZE. The performance was improved when
 I set scan_recycle_buffers >= 32. I used VACUUM FREEZE to increase WAL traffic,
 but this patch should be useful for normal VACUUMs with background jobs!

Thanks.

     N | time  | WAL flush(*)
  -----+-------+--------------
     0 | 58.7s |  0.01%
     1 | 80.3s | 81.76%
     8 | 73.4s | 16.73%
    16 | 64.2s |  9.24%
    32 | 59.0s |  4.88%
    64 | 56.7s |  2.63%
   128 | 55.1s |  1.41%
 
 (*) WAL flush is the fraction of buffer recycles that required a WAL fsync.

Do you have the same measurement without the patch applied? I'd be
interested in the current state also (the N=0 path is modified as well
for VACUUM, in this patch).

 # SET scan_recycle_buffers = 0;
 # UPDATE accounts SET aid=aid WHERE random() < 0.005;
 # CHECKPOINT;
 # SET scan_recycle_buffers = N;
 # VACUUM FREEZE accounts;

Very good results, thanks. I'd be interested in the same thing for just
VACUUM and for varying ratios of dirty/clean blocks during vacuum.

 BTW, does the patch change the default usage of buffers in vacuum? From what
 I've seen, scan_recycle_buffers = 1 is the same as before. 

That was the intention.

 With the default
 value of scan_recycle_buffers (=0), VACUUM seems to use all of the buffers in
 the pool, just like existing sequential scans. Is this intended?

Yes, but it's not very useful for testing to have done that. I'll do
another version within the hour that sets N=0 (only) back to current
behaviour for VACUUM.

One of the objectives of the patch was to prevent VACUUM from tripping
up other backends. I'm confident that it will improve that situation for
OLTP workloads running regular concurrent VACUUMs, but we will need to
wait a couple of days to get those results in also.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com



---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-12 Thread Simon Riggs
On Mon, 2007-03-12 at 09:14 +, Simon Riggs wrote:
 On Mon, 2007-03-12 at 16:21 +0900, ITAGAKI Takahiro wrote:

  With the default
  value of scan_recycle_buffers (=0), VACUUM seems to use all of the
  buffers in the pool, just like existing sequential scans. Is this
  intended?
 
 Yes, but it's not very useful for testing to have done that. I'll do
 another version within the hour that sets N=0 (only) back to current
 behaviour for VACUUM.

New test version enclosed, where scan_recycle_buffers = 0 doesn't change
existing VACUUM behaviour.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com

[ Attachment, skipping... ]

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-12 Thread Tom Lane
ITAGAKI Takahiro [EMAIL PROTECTED] writes:
 I tested your patch with VACUUM FREEZE. The performance was improved when
 I set scan_recycle_buffers >= 32. I used VACUUM FREEZE to increase WAL traffic,
 but this patch should be useful for normal VACUUMs with background jobs!

Proving that you can see a difference in a worst-case scenario is not the
same as proving that the patch is useful in normal cases.

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-12 Thread Simon Riggs
On Mon, 2007-03-12 at 10:30 -0400, Tom Lane wrote:
 ITAGAKI Takahiro [EMAIL PROTECTED] writes:
  I tested your patch with VACUUM FREEZE. The performance was improved when
  I set scan_recycle_buffers >= 32. I used VACUUM FREEZE to increase WAL
  traffic, but this patch should be useful for normal VACUUMs with
  background jobs!
 
 Proving that you can see a difference in a worst-case scenario is not the
 same as proving that the patch is useful in normal cases.

I agree, but I think that this VACUUM FREEZE test does actually
represent the normal case, here's why:

The poor buffer manager behaviour occurs if the block is dirty as a
result of WAL-logged changes. It only takes the removal of a single row
for us to have WAL logged this and dirtied the block.

If we invoke VACUUM from autovacuum, we do this by default when 20% of
the rows have been updated, which means with many distributions of
UPDATEs we'll have touched a very large proportion of blocks before we
VACUUM. That isn't true for *all* distributions of UPDATEs, but it soon
will be: the Dead Space Map will deliver only dirty blocks for us.

So running a VACUUM FREEZE is a reasonable simulation of running a large
VACUUM on a real production system with default autovacuum enabled, as
will be the case for 8.3. 

It is possible that we run VACUUM when fewer dirty blocks are generated,
but this won't be the common situation and is not something we should
optimise for.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com



---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-12 Thread ITAGAKI Takahiro

Simon Riggs [EMAIL PROTECTED] wrote:

   With the default
   value of scan_recycle_buffers (=0), VACUUM seems to use all of the
   buffers in the pool, just like existing sequential scans. Is this
   intended?
  
 New test version enclosed, where scan_recycle_buffers = 0 doesn't change
 existing VACUUM behaviour.

This is a result with scan_recycle_buffers.v3.patch. I used normal VACUUM
with a background load from a slowed-down pgbench in this instance. I believe
the patch is useful in normal cases, not only for VACUUM FREEZE.

   N |  time  | WAL flush(*)
-----+--------+--------------
   0 | 112.8s | 44.3%
   1 | 148.9s | 52.1%
   8 | 105.1s | 17.6%
  16 |  96.9s |  8.7%
  32 | 103.9s |  6.3%
  64 |  89.4s |  6.6%
 128 |  80.0s |  3.8%

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center



---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-12 Thread Luke Lonergan
Simon,

You may know we've built something similar and have seen similar gains.
We're planning a modification that I think you should consider: when there
is a sequential scan of a table larger than the size of shared_buffers, we
are allowing the scan to write through the shared_buffers cache.

The hypothesis is that if a relation is of a size equal to or less than the
size of shared_buffers, it is cacheable and should use the standard LRU
approach to provide for reuse.

- Luke

On 3/12/07 3:08 AM, Simon Riggs [EMAIL PROTECTED] wrote:

 On Mon, 2007-03-12 at 09:14 +, Simon Riggs wrote:
 On Mon, 2007-03-12 at 16:21 +0900, ITAGAKI Takahiro wrote:
 
 With the default
 value of scan_recycle_buffers (=0), VACUUM seems to use all of the buffers
 in the pool, just like existing sequential scans. Is this intended?
 
 Yes, but it's not very useful for testing to have done that. I'll do
 another version within the hour that sets N=0 (only) back to current
 behaviour for VACUUM.
 
 New test version enclosed, where scan_recycle_buffers = 0 doesn't change
 existing VACUUM behaviour.



---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-09 Thread Luke Lonergan
Cool!

- Luke

Msg is shrt cuz m on ma treo

 -Original Message-
From:   Simon Riggs [mailto:[EMAIL PROTECTED]
Sent:   Friday, March 09, 2007 02:32 PM Eastern Standard Time
To: Luke Lonergan; ITAGAKI Takahiro
Cc: Sherry Moore; Tom Lane; Mark Kirkwood; Pavan Deolasee; Gavin Sherry; 
PGSQL Hackers; Doug Rady
Subject:Re: [HACKERS] Bug: Buffer cache is not scan resistant

On Tue, 2007-03-06 at 22:32 -0500, Luke Lonergan wrote:
 Incidentally, we tried triggering NTA (L2 cache bypass)
 unconditionally and in various patterns and did not see the
 substantial gain we saw with reducing the working set size.
 
 My conclusion: Fixing the OS is not sufficient to alleviate the issue.
 We see a 2x penalty (1700MB/s versus 3500MB/s) at the higher data
 rates due to this effect.
 
I've implemented buffer recycling, as previously described, patch being
posted now to -patches as scan_recycle_buffers.

This version includes buffer recycling

- for SeqScans larger than shared buffers, with the objective of
improving L2 cache efficiency *and* reducing the effects of shared
buffer cache spoiling (both as previously discussed on this thread)

- for VACUUMs of any size, with the objective of reducing WAL thrashing
whilst keeping VACUUM's behaviour of not spoiling the buffer cache (as
originally suggested by Itagaki-san, just with a different
implementation).

Behaviour is not activated by default in this patch. To request buffer
recycling, set the USERSET GUC
SET scan_recycle_buffers = N
tested with 1,4,8,16, but only >= 8 seems sensible, IMHO.

Patch affects StrategyGetBuffer, so only affects the disk-cache path.
The idea is that if it's already in the shared buffer cache then we get
substantial benefit already and nothing else is needed. So for the
general case, the patch adds a single if test into the I/O path.

The parameter is picked up at the start of SeqScan and VACUUM
(currently). Any change mid-scan will be ignored.
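
To make the mechanism concrete, here is a toy model (invented for
illustration; none of these names or numbers come from the patch) of why a
larger recycle ring cuts forced WAL flushes: a recycled dirty buffer can
only be evicted once its WAL is on disk, and the older the buffer is by the
time it comes around again, the more likely some concurrent commit has
already flushed that WAL:

#include <stdio.h>

#define COMMIT_INTERVAL 64   /* background load flushes WAL this often */

static long insert_lsn = 0;  /* WAL insert pointer, advanced per page */
static long flushed_lsn = 0; /* how far WAL has been flushed */
static long forced_flushes = 0;

static void
recycle(long *ring, int n, long blocks)
{
	int next = 0;
	long b;

	for (b = 0; b < blocks; b++)
	{
		/* Reuse the oldest slot; if its WAL isn't flushed yet, we
		 * must flush it ourselves before the page can be evicted. */
		if (ring[next] > flushed_lsn)
		{
			forced_flushes++;
			flushed_lsn = ring[next];
		}

		insert_lsn++;            /* dirty the page, emit WAL */
		ring[next] = insert_lsn; /* remember the page's WAL position */
		next = (next + 1) % n;

		if (insert_lsn % COMMIT_INTERVAL == 0)
			flushed_lsn = insert_lsn; /* a concurrent commit flushes WAL */
	}
}

int
main(void)
{
	static long ring[1024];
	int n;

	for (n = 1; n <= 128; n *= 2)
	{
		long i;

		insert_lsn = flushed_lsn = forced_flushes = 0;
		for (i = 0; i < 1024; i++)
			ring[i] = 0;
		recycle(ring, n, 100000);
		printf("N = %3d: forced WAL flushes = %ld\n", n, forced_flushes);
	}
	return 0;
}

The absolute numbers are meaningless, but the shape matches the tables
above: at N=1 almost every recycle forces its own flush, and once N exceeds
the background commit interval the recycled buffer's WAL is usually on disk
already.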

IMHO it's possible to do this and to allow Synch Scans at the same time,
with some thought. There is no need for us to rely on cache spoiling
behaviour of scans to implement that feature as well.

Independent performance tests requested, so that we can discuss this
objectively.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-08 Thread Sherry Moore
Hi Simon,

 and what you haven't said
 
 - all of this is orthogonal to the issue of buffer cache spoiling in
 PostgreSQL itself. That issue does still exist as a non-OS issue, but
 we've been discussing in detail the specific case of L2 cache effects
 with specific kernel calls. All of the test results have been
 stand-alone, so we've not done any measurements in that area. I say this
 because you make the point that reducing the working set size of write
 workloads has no effect on the L2 cache issue, but ISTM it's still
 potentially a cache spoiling issue.

What I wanted to point out was that (reiterating to avoid requoting),

- My test was simply to demonstrate that the observed performance
  difference with VACUUM was caused by whether the size of the
  user buffer caused L2 thrashing.

- In general, application should reduce the size of the working set
  to reduce the penalty of TLB misses and cache misses.

- If the application access pattern meets the NTA trigger condition,
  the benefit of reducing the working set size will be much smaller.

Whatever I said is probably orthogonal to the buffer cache issue you
guys have been discussing, but I haven't read all the email exchange
on the subject.

Thanks,
Sherry
-- 
Sherry Moore, Solaris Kernel Developmenthttp://blogs.sun.com/sherrym

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-07 Thread Marko Kreen

On 3/7/07, Hannu Krosing [EMAIL PROTECTED] wrote:

Do any of you know about a way to READ PAGE ONLY IF IN CACHE in *nix
systems?


Supposedly you could mmap() a file and then do mincore() on the
area to see which pages are cached.
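
As a rough sketch of that approach (illustrative only, with minimal error
handling; on Linux the mincore() vector is unsigned char, some systems
declare it as char):

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int
main(int argc, char **argv)
{
	struct stat st;
	int fd;

	if (argc != 2 || (fd = open(argv[1], O_RDONLY)) < 0)
		return 1;
	fstat(fd, &st);

	void *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	long pagesize = sysconf(_SC_PAGESIZE);
	size_t npages = (st.st_size + pagesize - 1) / pagesize;
	unsigned char *vec = malloc(npages);

	if (map != MAP_FAILED && mincore(map, st.st_size, vec) == 0)
	{
		size_t resident = 0, i;

		for (i = 0; i < npages; i++)
			if (vec[i] & 1)        /* low bit set: page is resident */
				resident++;
		printf("%zu of %zu pages resident in OS cache\n",
		       resident, npages);
	}
	return 0;
}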

But you were talking about the postgres cache before; there it should
be easily implementable.

--
marko

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-06 Thread Simon Riggs
On Tue, 2007-03-06 at 00:54 +0100, Florian G. Pflug wrote:
 Simon Riggs wrote:

 But it would break the idea of letting a second seqscan follow in the
 first's hot cache trail, no?

No, but it would make it somewhat harder to achieve without direct
synchronization between scans. It could still work; let's see.

I'm not sure that's an argument against fixing the problem with the
buffer strategy though. We really want both, not just one or the other.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com



---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-06 Thread Sherry Moore
Hi Tom,

Sorry about the delay.  I have been away from computers all day.

In the current Solaris release in development (Code name Nevada,
available for download at http://opensolaris.org), I have implemented
non-temporal access (NTA) which bypasses L2 for most writes, and reads
larger than copyout_max_cached (patchable, default to 128K).  The block
size used by Postgres is 8KB.  If I patch copyout_max_cached to 4KB to
trigger NTA for reads, the access times with a 16KB buffer and a 128MB
buffer are very close.

I wrote readtest to simulate the access pattern of VACUUM (attached).
tread is a 4-socket dual-core Opteron box.

81 tread ./readtest -h
Usage: readtest [-v] [-N] -s size -n iter [-d delta] [-c count]
-v: Verbose mode
-N: Normalize results by number of reads
-s size:  Working set size (may specify K,M,G suffix)
-n iter:Number of test iterations
-f filename:Name of the file to read from
-d [+|-]delta:  Distance between subsequent reads
-c count:   Number of reads
-h: Print this help

With copyout_max_cached at 128K (in nanoseconds, NTA not triggered):

82 tread ./readtest -s 16k -f boot_archive   
46445262
83 tread ./readtest -s 128M -f boot_archive  
118294230
84 tread ./readtest -s 16k -f boot_archive -n 100
4230210856
85 tread ./readtest -s 128M -f boot_archive -n 100
6343619546

With copyout_max_cached at 4K (in nanoseconds, NTA triggered):

89 tread ./readtest -s 16k -f boot_archive
43606882
90 tread ./readtest -s 128M -f boot_archive 
100547909
91 tread ./readtest -s 16k -f boot_archive -n 100
4251823995
92 tread ./readtest -s 128M -f boot_archive -n 100
4205491984

When the iteration is 1 (the default), the timing difference between
using 16k buffer and 128M buffer is much bigger for both
copyout_max_cached sizes, mostly due to the cost of TLB misses.  When
the iteration count is bigger, most of the page tables would be in Page
Descriptor Cache for the later page accesses so the overhead of TLB
misses become smaller.  As you can see, when we do bypass L2, the
performance with either buffer size is comparable.

I am sure your next question is why the 128K limitation for reads.
Here are the main reasons:

- Based on a lot of the benchmarks and workloads I traced, the
  target buffer of read operations are typically accessed again
  shortly after the read, while writes are usually not.  Therefore,
  the default operation mode is to bypass L2 for writes, but not
  for reads.

- The Opteron's L1 cache size is 64K.  If reads are larger than
  128KB, it would have displacement flushed itself anyway, so for
  large reads, I will also bypass L2. I am working on dynamically
  setting copyout_max_cached based on the L1 D-cache size on the
  system.

The above heuristic should have worked well in Luke's test case.
However, due to the fact that the reads were done as 16,000 8K reads
rather than one 128MB read, the NTA code was not triggered.

Since the OS code has to be general enough to handle most
workloads, we have to pick some defaults that might not work best for
some specific operations.  It is a calculated balance.

Thanks,
Sherry


On Mon, Mar 05, 2007 at 10:58:40PM -0500, Tom Lane wrote:
 Luke Lonergan [EMAIL PROTECTED] writes:
  Good info - it's the same in Solaris, the routine is uiomove (Sherry
  wrote it).
 
 Cool.  Maybe Sherry can comment on the question whether it's possible
 for a large-scale-memcpy to not take a hit on filling a cache line
 that wasn't previously in cache?
 
 I looked a bit at the Linux code that's being used here, but it's all
 x86_64 assembler which is something I've never studied :-(.
 
   regards, tom lane

-- 
Sherry Moore, Solaris Kernel Developmenthttp://blogs.sun.com/sherrym
#include <stdlib.h>
#include <stdio.h>
#include <ctype.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/param.h>
#include <sys/time.h>
#include <sys/mman.h>
#include <errno.h>
#include <thread.h>
#include <signal.h>
#include <strings.h>
#include <libgen.h>

#define KB(a)   (a*1024)
#define MB(a)   (KB(a)*1024)

static void
usage(char *s)
{
	fprintf(stderr,
	    "Usage: %s [-v] [-N] -s size -n iter "
	    "[-d delta] [-c count]\n", s);
	fprintf(stderr,
	    "\t-v:\t\tVerbose mode\n"
	    "\t-N:\t\tNormalize results by number of reads\n"
	    "\t-s size:\tWorking set size (may specify K,M,G suffix)\n"
	    "\t-n iter:\tNumber of test iterations\n"
	    "\t-f filename:\tName of the file to read from\n"
	    "\t-d [+|-]delta:\tDistance between subsequent reads\n"
	    "\t-c count:\tNumber of reads\n"
	    "\t-h:\t\tPrint this help\n");
	exit(1);
}

#define ABS(x) ((x) >= 0 ? (x) : -(x))

static void
format_num(size_t v, size_t *new, char *code)
{
	if (v % (1024 * 1024 * 1024) == 0) {
		*new = v / (1024 * 1024 * 

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-06 Thread Jeff Davis
On Mon, 2007-03-05 at 21:02 -0700, Jim Nasby wrote:
 On Mar 5, 2007, at 2:03 PM, Heikki Linnakangas wrote:
  Another approach I proposed back in December is to not have a  
  variable like that at all, but scan the buffer cache for pages  
  belonging to the table you're scanning to initialize the scan.  
  Scanning all the BufferDescs is a fairly CPU and lock heavy  
  operation, but it might be ok given that we're talking about large  
  I/O bound sequential scans. It would require no DBA tuning and  
  would work more robustly in varying conditions. I'm not sure where  
  you would continue after scanning the in-cache pages. At the  
  highest in-cache block number, perhaps.
 
 If there was some way to do that, it'd be what I'd vote for.
 

I still don't know how to make this take advantage of the OS buffer
cache. 

However, no DBA tuning is a huge advantage, I agree with that.

If I were to implement this idea, I think Heikki's bitmap of pages
already read is the way to go. Can you guys give me some pointers about
how to walk through the shared buffers, reading the pages that I need,
while being sure not to read a page that's been evicted, and also not
potentially causing a performance regression somewhere else?

 Given the partitioning of the buffer lock that Tom did it might not  
 be that horrible for many cases, either, since you'd only need to  
 scan through one partition.
 
 We also don't need an exact count, either. Perhaps there's some way  
 we could keep a counter or something...

Exact count of what? The pages already in cache?

Regards,
Jeff Davis


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-06 Thread Tom Lane
Jeff Davis [EMAIL PROTECTED] writes:
 If I were to implement this idea, I think Heikki's bitmap of pages
 already read is the way to go.

I think that's a good way to guarantee that you'll not finish in time
for 8.3.  Heikki's idea is just at the handwaving stage at this point,
and I'm not even convinced that it will offer any win.  (Pages in
cache will be picked up by a seqscan already.)

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-06 Thread Jeff Davis
On Tue, 2007-03-06 at 12:59 -0500, Tom Lane wrote:
 Jeff Davis [EMAIL PROTECTED] writes:
  If I were to implement this idea, I think Heikki's bitmap of pages
  already read is the way to go.
 
 I think that's a good way to guarantee that you'll not finish in time
 for 8.3.  Heikki's idea is just at the handwaving stage at this point,
 and I'm not even convinced that it will offer any win.  (Pages in
 cache will be picked up by a seqscan already.)
 

I agree that it's a good idea to stick with the current implementation
which is, as far as I can see, meeting all of my performance goals.

Regards,
Jeff Davis




---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-06 Thread Heikki Linnakangas

Jeff Davis wrote:

On Mon, 2007-03-05 at 21:02 -0700, Jim Nasby wrote:

On Mar 5, 2007, at 2:03 PM, Heikki Linnakangas wrote:
Another approach I proposed back in December is to not have a  
variable like that at all, but scan the buffer cache for pages  
belonging to the table you're scanning to initialize the scan.  
Scanning all the BufferDescs is a fairly CPU and lock heavy  
operation, but it might be ok given that we're talking about large  
I/O bound sequential scans. It would require no DBA tuning and  
would work more robustly in varying conditions. I'm not sure where  
you would continue after scanning the in-cache pages. At the  
highest in-cache block number, perhaps.

If there was some way to do that, it'd be what I'd vote for.



I still don't know how to make this take advantage of the OS buffer
cache. 


Yep, I don't see any way to do that. I think we could live with that, 
though. If we went with the sync_scan_offset approach, you'd have to 
leave a lot of safety margin in that as well.



However, no DBA tuning is a huge advantage, I agree with that.

If I were to implement this idea, I think Heikki's bitmap of pages
already read is the way to go. Can you guys give me some pointers about
how to walk through the shared buffers, reading the pages that I need,
while being sure not to read a page that's been evicted, and also not
potentially causing a performance regression somewhere else?


You could take a look at BufferSync, for example. It walks through the 
buffer cache, syncing all dirty buffers.


FWIW, I've attached a function I wrote some time ago when I was playing 
with the same idea for vacuums. A call to the new function loops through 
the buffer cache and returns the next buffer that belongs to a certain 
relation. I'm not sure that it's correct and safe, and there aren't many 
comments, but it should work if you want to play with it...


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Index: src/backend/storage/buffer/bufmgr.c
===
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.214
diff -c -r1.214 bufmgr.c
*** src/backend/storage/buffer/bufmgr.c	5 Jan 2007 22:19:37 -0000	1.214
--- src/backend/storage/buffer/bufmgr.c	22 Jan 2007 16:38:37 -0000
***
*** 97,102 
--- 97,134 
  static void AtProcExit_Buffers(int code, Datum arg);
  
  
+ Buffer
+ ReadAnyBufferForRelation(Relation reln)
+ {
+ 	static int last_buf_id = 0;
+ 	int new_buf_id;
+ 	volatile BufferDesc *bufHdr;
+ 
+ 	/* Make sure we will have room to remember the buffer pin */
+ 	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+ 
+ 	new_buf_id = last_buf_id;
+ 	do
+ 	{
+ 		if (++new_buf_id >= NBuffers)
+ 			new_buf_id = 0;
+ 
+ 		bufHdr = &BufferDescriptors[new_buf_id];
+ 		LockBufHdr(bufHdr);
+ 
+ 		if ((bufHdr->flags & BM_VALID) && RelFileNodeEquals(bufHdr->tag.rnode, reln->rd_node))
+ 		{
+ 			PinBuffer_Locked(bufHdr);
+ 			last_buf_id = new_buf_id;
+ 			return BufferDescriptorGetBuffer(bufHdr);
+ 		}
+ 		UnlockBufHdr(bufHdr);
+ 	} while(new_buf_id != last_buf_id);
+ 	last_buf_id = new_buf_id;
+ 	return InvalidBuffer;
+ }
+ 
+ 
  /*
   * ReadBuffer -- returns a buffer containing the requested
   *		block of the requested relation.  If the blknum

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-06 Thread Heikki Linnakangas

Tom Lane wrote:

Jeff Davis [EMAIL PROTECTED] writes:

If I were to implement this idea, I think Heikki's bitmap of pages
already read is the way to go.


I think that's a good way to guarantee that you'll not finish in time
for 8.3.  Heikki's idea is just at the handwaving stage at this point,
and I'm not even convinced that it will offer any win.  (Pages in
cache will be picked up by a seqscan already.)


The scenario that I'm worried about is that you have a table that's 
slightly larger than RAM. If you issue many seqscans on that table, one 
at a time, every seqscan will have to read the whole table from disk, 
even though say 90% of it is in cache when the scan starts.


This can be alleviated by using a large enough sync_scan_offset, but a 
single setting like that is tricky to tune, especially if your workload 
is not completely constant. Tune it too low, and you don't get much 
benefit, tune it too high and your scans diverge and you lose all benefit.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-06 Thread Simon Riggs
On Mon, 2007-03-05 at 21:34 -0800, Sherry Moore wrote:

 - Based on a lot of the benchmarks and workloads I traced, the
   target buffer of read operations are typically accessed again
   shortly after the read, while writes are usually not.  Therefore,
   the default operation mode is to bypass L2 for writes, but not
   for reads.

Hi Sherry,

I'm trying to relate what you've said to how we should proceed from
here. My understanding of what you've said is:

- Tom's assessment that the observed performance quirk could be fixed in
the OS kernel is correct and you have the numbers to prove it

- currently Solaris only does NTA for 128K reads, which we don't
currently do. If we were to request 16 blocks at a time, we would get this
benefit on Solaris, at least. The copyout_max_cached parameter can be
patched, but isn't a normal system tunable.

- other workloads you've traced *do* reuse the same buffer again very
soon afterwards when reading sequentially (not writes). Reducing the
working set size is an effective technique in improving performance if
we don't have a kernel that does NTA or we don't read in big enough
chunks (we need both to get NTA to kick in).

and what you haven't said

- all of this is orthogonal to the issue of buffer cache spoiling in
PostgreSQL itself. That issue does still exist as a non-OS issue, but
we've been discussing in detail the specific case of L2 cache effects
with specific kernel calls. All of the test results have been
stand-alone, so we've not done any measurements in that area. I say this
because you make the point that reducing the working set size of write
workloads has no effect on the L2 cache issue, but ISTM it's still
potentially a cache spoiling issue.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com



---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-06 Thread Jeff Davis
On Tue, 2007-03-06 at 18:47 +, Heikki Linnakangas wrote:
 Tom Lane wrote:
  Jeff Davis [EMAIL PROTECTED] writes:
  If I were to implement this idea, I think Heikki's bitmap of pages
  already read is the way to go.
  
  I think that's a good way to guarantee that you'll not finish in time
  for 8.3.  Heikki's idea is just at the handwaving stage at this point,
  and I'm not even convinced that it will offer any win.  (Pages in
  cache will be picked up by a seqscan already.)
 
 The scenario that I'm worried about is that you have a table that's 
 slightly larger than RAM. If you issue many seqscans on that table, one 
 at a time, every seqscan will have to read the whole table from disk, 
 even though say 90% of it is in cache when the scan starts.
 

If you're issuing sequential scans one at a time, that 90% of the table
that was cached is probably not cached any more, unless the scans are
close together in time without overlapping (serial sequential scans).
And the problem you describe is no worse than current behavior, where
you have exactly the same problem.

 This can be alleviated by using a large enough sync_scan_offset, but a 
 single setting like that is tricky to tune, especially if your workload 
 is not completely constant. Tune it too low, and you don't get much 
 benefit, tune it too high and your scans diverge and you lose all benefit.
 

I see why you don't want to manually tune this setting; however, it's
really not that tricky. You can be quite conservative and still use a
good fraction of your physical memory. I will come up with some numbers
and see how much we have to gain.

Regards,
Jeff Davis


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-06 Thread Jim Nasby

On Mar 6, 2007, at 12:17 AM, Tom Lane wrote:

Jim Nasby [EMAIL PROTECTED] writes:

An idea I've been thinking about would be to have the bgwriter or
some other background process actually try and keep the free list
populated,


The bgwriter already tries to keep pages just in front of the clock
sweep pointer clean.


True, but that still means that each backend has to run the clock  
sweep. AFAICT that's something that backends will serialize on (due  
to BufFreelistLock), so it would be best to make StrategyGetBuffer as  
fast as possible. It certainly seems like grabbing a buffer off the  
free list is going to be a lot faster than running the clock sweep.  
That's why I think it'd be better to have the bgwriter run the clock  
sweep and put enough buffers on the free list to try and keep up with  
demand.
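
A hedged sketch of that division of labour (all names below are invented,
not PostgreSQL's real bufmgr API): the bgwriter does the expensive clock
sweep in the background and keeps a small lock-protected free list topped
up, so a backend's allocation is normally a constant-time pop.

#include <pthread.h>
#include <stdbool.h>

#define FREELIST_TARGET 256

static int freelist[FREELIST_TARGET];
static int freelist_len = 0;
static pthread_mutex_t freelist_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for the real clock sweep: scans buffer headers, expensive. */
static int
run_clock_sweep(void)
{
	static int next = 0;

	return next++ % 1024;
}

/* Backend path: pop under a short lock; sweep ourselves only if the
 * bgwriter has fallen behind demand. */
int
get_free_buffer(void)
{
	int buf = -1;

	pthread_mutex_lock(&freelist_lock);
	if (freelist_len > 0)
		buf = freelist[--freelist_len];
	pthread_mutex_unlock(&freelist_lock);

	if (buf < 0)
		buf = run_clock_sweep();
	return buf;
}

/* bgwriter path: refill to the target, doing the sweep unlocked. */
void
bgwriter_refill(void)
{
	for (;;)
	{
		bool full;
		int buf;

		pthread_mutex_lock(&freelist_lock);
		full = (freelist_len >= FREELIST_TARGET);
		pthread_mutex_unlock(&freelist_lock);
		if (full)
			break;

		buf = run_clock_sweep();

		pthread_mutex_lock(&freelist_lock);
		if (freelist_len < FREELIST_TARGET)
			freelist[freelist_len++] = buf;
		pthread_mutex_unlock(&freelist_lock);
	}
}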

--
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)



---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-06 Thread Jim Nasby

On Mar 6, 2007, at 10:56 AM, Jeff Davis wrote:

We also don't need an exact count, either. Perhaps there's some way
we could keep a counter or something...


Exact count of what? The pages already in cache?


Yes. The idea being if you see there's 10k pages in cache, you can  
likely start 9k pages behind the current scan point and still pick  
everything up.


But this is nowhere near as useful as the bitmap idea, so I'd only  
look at it if it's impossible to make the bitmaps work. And like  
others have said, that should wait until there's at least a
first-generation patch that's going to make it into 8.3.

--
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)



---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-06 Thread Jeff Davis
On Tue, 2007-03-06 at 17:43 -0700, Jim Nasby wrote:
 On Mar 6, 2007, at 10:56 AM, Jeff Davis wrote:
  We also don't need an exact count, either. Perhaps there's some way
  we could keep a counter or something...
 
  Exact count of what? The pages already in cache?
 
 Yes. The idea being if you see there's 10k pages in cache, you can  
 likely start 9k pages behind the current scan point and still pick  
 everything up.
 
 But this is nowhere near as useful as the bitmap idea, so I'd only  
 look at it if it's impossible to make the bitmaps work. And like  
 others have said, that should wait until there's at least a first- 
 generation patch that's going to make it into 8.3.

You still haven't told me how we take advantage of the OS buffer cache
with the bitmap idea. What makes you think that my current
implementation is nowhere near as useful as the bitmap idea?

My current implementation is making use of OS buffers + shared memory;
the bitmap idea can only make use of shared memory, and is likely
throwing the OS buffers away completely.

I also suspect that the bitmap idea relies too much on the idea that
there's a contiguous cache trail in the shared buffers alone. Any
devation from that -- which could be caused by PG's page replacement
algorithm, especially in combination with a varied load pattern -- would
negate any benefit from the bitmap idea. I feel much more confident that
there will exist a trail of pages that are cached in *either* the PG
shared buffers *or* the OS buffer cache. There may be holes/gaps in
either one, but it's much more likely that they combine into a
contiguous series of cached pages. Do you have an idea how I might test
this claim?

Regards,
Jeff Davis



---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-06 Thread Jeff Davis
On Tue, 2007-03-06 at 18:29 +, Heikki Linnakangas wrote:
 Jeff Davis wrote:
  On Mon, 2007-03-05 at 21:02 -0700, Jim Nasby wrote:
  On Mar 5, 2007, at 2:03 PM, Heikki Linnakangas wrote:
  Another approach I proposed back in December is to not have a  
  variable like that at all, but scan the buffer cache for pages  
  belonging to the table you're scanning to initialize the scan.  
  Scanning all the BufferDescs is a fairly CPU and lock heavy  
  operation, but it might be ok given that we're talking about large  
  I/O bound sequential scans. It would require no DBA tuning and  
  would work more robustly in varying conditions. I'm not sure where  
  you would continue after scanning the in-cache pages. At the  
  highest in-cache block number, perhaps.
  If there was some way to do that, it'd be what I'd vote for.
 
  
  I still don't know how to make this take advantage of the OS buffer
  cache. 
 
 Yep, I don't see any way to do that. I think we could live with that, 
 though. If we went with the sync_scan_offset approach, you'd have to 
 leave a lot of safety margin in that as well.
 

Right, there would certainly have to be a safety margin with
sync_scan_offset. However, your plan only works when the shared buffers
are dominated by this sequential scan. Let's say you have 40% of
physical memory for shared buffers, and say that 50% are being used for
hot pages in other parts of the database. That means you have access to
only 20% of physical memory to optimize for this sequential scan, and
20% of the physical memory is basically unavailable (being used for
other parts of the database).

In my current implementation, you could set sync_scan_offset to 1.0
(meaning 1.0 x shared_buffers), giving you 40% of physical memory that
would be used for starting this sequential scan. In this case, that
should be a good margin of error, considering that as much as 80% of the
physical memory might actually be in cache (OS or PG cache).

This all needs to be backed up by testing, of course. I'm just
extrapolating some numbers that look vaguely reasonable to me.

Regards,
Jeff Davis


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-06 Thread Luke Lonergan
Incidentally, we tried triggering NTA (L2 cache bypass) unconditionally and in 
various patterns and did not see the substantial gain we saw with reducing the 
working set size.

My conclusion: Fixing the OS is not sufficient to alleviate the issue.  We see 
a 2x penalty (1700MB/s versus 3500MB/s) at the higher data rates due to this 
effect.

- Luke

Msg is shrt cuz m on ma treo

 -Original Message-
From:   Sherry Moore [mailto:[EMAIL PROTECTED]
Sent:   Tuesday, March 06, 2007 10:05 PM Eastern Standard Time
To: Simon Riggs
Cc: Sherry Moore; Tom Lane; Luke Lonergan; Mark Kirkwood; Pavan Deolasee; 
Gavin Sherry; PGSQL Hackers; Doug Rady
Subject:Re: [HACKERS] Bug: Buffer cache is not scan resistant

Hi Simon,

 and what you haven't said
 
 - all of this is orthogonal to the issue of buffer cache spoiling in
 PostgreSQL itself. That issue does still exist as a non-OS issue, but
 we've been discussing in detail the specific case of L2 cache effects
 with specific kernel calls. All of the test results have been
 stand-alone, so we've not done any measurements in that area. I say this
 because you make the point that reducing the working set size of write
 workloads has no effect on the L2 cache issue, but ISTM it's still
 potentially a cache spoiling issue.

What I wanted to point out was that (reiterating to avoid requoting),

- My test was simply to demonstrate that the observed performance
  difference with VACUUM was caused by whether the size of the
  user buffer caused L2 thrashing.

- In general, application should reduce the size of the working set
  to reduce the penalty of TLB misses and cache misses.

- If the application access pattern meets the NTA trigger condition,
  the benefit of reducing the working set size will be much smaller.

Whatever I said is probably orthogonal to the buffer cache issue you
guys have been discussing, but I haven't read all the email exchange
on the subject.

Thanks,
Sherry
-- 
Sherry Moore, Solaris Kernel Developmenthttp://blogs.sun.com/sherrym



Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-06 Thread Hannu Krosing
On Tue, 2007-03-06 at 18:28, Jeff Davis wrote:
 On Tue, 2007-03-06 at 18:29 +, Heikki Linnakangas wrote:
  Jeff Davis wrote:
   On Mon, 2007-03-05 at 21:02 -0700, Jim Nasby wrote:
   On Mar 5, 2007, at 2:03 PM, Heikki Linnakangas wrote:
   Another approach I proposed back in December is to not have a  
   variable like that at all, but scan the buffer cache for pages  
   belonging to the table you're scanning to initialize the scan.  
   Scanning all the BufferDescs is a fairly CPU and lock heavy  
   operation, but it might be ok given that we're talking about large  
   I/O bound sequential scans. It would require no DBA tuning and  
   would work more robustly in varying conditions. I'm not sure where  
   you would continue after scanning the in-cache pages. At the  
   highest in-cache block number, perhaps.
   If there was some way to do that, it'd be what I'd vote for.
  
   
   I still don't know how to make this take advantage of the OS buffer
   cache. 

Maybe it should not?

Mostly, the OS cache can be of use only if it is much bigger than the
shared buffer cache. It may make sense to forget about the OS cache and
just tell those who can make use of sync scans to set most of memory
aside for shared buffers.

Then we can do better predictions/lookups of how much of a table is
actually in memory.

Dual caching is usually not very beneficial anyway, not to mention the
difficulty of predicting any dual-caching effects.

  Yep, I don't see any way to do that. I think we could live with that, 
  though. If we went with the sync_scan_offset approach, you'd have to 
  leave a lot of safety margin in that as well.
  
 
 Right, there would certainly have to be a safety margin with
 sync_scan_offset. However, your plan only works when the shared buffers
 are dominated by this sequential scan. Let's say you have 40% of
 physical memory for shared buffers, and say that 50% are being used for
 hot pages in other parts of the database. That means you have access to
 only 20% of physical memory to optimize for this sequential scan, and
 20% of the physical memory is basically unavailable (being used for
 other parts of the database).

The simplest thing, in case the table is much bigger than the buffer cache
usable for it, is to start the second scan at the point the first scan is
traversing *now*, and hope that the scans will stay together. Or start
at some fixed lag, which makes the first scan always the one issuing
reads while the second just free-rides on buffers already in cache. It
may even be a good idea to throttle the second scan to stay N pages
behind if the OS readahead gets confused when the same file is read from
multiple processes.

If the table is smaller than the cache, then just scan it without
syncing.

Trying to read buffers in the same order, starting from near the point
where pages are still in the shared buffer cache, seems good mostly for the
case where the table is as big as, or just a little larger than, the cache.

 In my current implementation, you could set sync_scan_offset to 1.0
 (meaning 1.0 x shared_buffers), giving you 40% of physical memory that
 would be used for starting this sequential scan. In this case, that
 should be a good margin of error, considering that as much as 80% of the
 physical memory might actually be in cache (OS or PG cache).
 
 This all needs to be backed up by testing, of course. I'm just
 extrapolating some numbers that look vaguely reasonable to me.

If there is an easy way to tell PG "give me this page only if it is in
shared cache already", then a good approach might be to start the 2nd scan
at the point where the 1st is now, and move in both directions
simultaneously, like this:

First scan is at page N.

Second scan:

M = N-1
WHILE NOT ALL PAGES ARE READ:
    IF PAGE N IS IN CACHE:                 -- FOLLOW FIRST READER
        READ PAGE N
        N++
    ELSE IF M >= 0 AND PAGE M IS IN CACHE: -- READ OLDER CACHED PAGES
        READ PAGE M
        M--
    ELSE IF FIRST READER STILL GOING:      -- NO OLDER PAGES, WAIT FOR 1st
        WAIT FOR PAGE N TO BECOME AVAILABLE
        READ PAGE N
        N++
    ELSE:                                  -- BECOME 1st READER
        READ PAGE N
        N++
    PROCESS PAGE
    --
    IF N > PAGES_IN_TABLE: N = 0
    IF M < 0: M = PAGES_IN_TABLE


This should work reasonably well for LRU caches and it may be made to
work with a clock sweep scheme if the sweep arranges pages to purge in
file order.

If we could make the IF PAGE x IS IN CACHE part also know about the OS
cache, this could also make use of the OS cache.


Do any of you know about a way to READ PAGE ONLY IF IN CACHE in *nix
systems?

-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com



---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Tom Lane
Grzegorz Jaskiewicz [EMAIL PROTECTED] writes:
 On Mar 5, 2007, at 2:36 AM, Tom Lane wrote:
 I'm also less than convinced that it'd be helpful for a big seqscan:
 won't reading a new disk page into memory via DMA cause that memory to
 get flushed from the processor cache anyway?

 Nope. DMA is writing directly into main memory. If the area was in  
 the L2/L1 cache, it will get invalidated. But if it isn't there, it  
 is okay.

So either way, it isn't in processor cache after the read.  So how can
there be any performance benefit?

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Luke Lonergan
 So either way, it isn't in processor cache after the read.  
 So how can there be any performance benefit?

It's the copy from kernel IO cache to the buffer cache that is L2
sensitive.  When the shared buffer cache is polluted, it thrashes the L2
cache.  When the set of pages being written to by the kernel-to-user-space
copies fits in L2, the L2 lines are written through (see the link below,
page 264, on the write-combining features of the Opteron, for example) and
the writes to main memory are deferred.

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/
25112.PDF

- Luke


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Luke Lonergan
Hi Tom,

 Now this may only prove that the disk subsystem on this 
 machine is too cheap to let the system show any CPU-related 
 issues. 

Try it with a warm IO cache.  As I posted before, we see double the
performance of a VACUUM from a table in IO cache when the shared buffer
cache isn't being polluted.  The speed with large buffer cache should be
about 450 MB/s and the speed with a buffer cache smaller than L2 should
be about 800 MB/s.

The real issue here isn't the L2 behavior, though that's important when
trying to reach very high IO speeds, the issue is that we're seeing the
buffer cache pollution in the first place.  When we instrument the
blocks selected by the buffer page selection algorithm, we see that they
iterate sequentially, filling the shared buffer cache.  That's the
source of the problem here.

Do we have a regression test somewhere for this?

- Luke


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Grzegorz Jaskiewicz


On Mar 5, 2007, at 2:36 AM, Tom Lane wrote:


I'm also less than convinced that it'd be helpful for a big seqscan:
won't reading a new disk page into memory via DMA cause that memory to
get flushed from the processor cache anyway?


Nope. DMA is writing directly into main memory. If the area was in  
the L2/L1 cache, it will get invalidated. But if it isn't there, it  
is okay.


--
Grzegorz Jaskiewicz
[EMAIL PROTECTED]




---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Tom Lane
Luke Lonergan [EMAIL PROTECTED] writes:
 So either way, it isn't in processor cache after the read.  
 So how can there be any performance benefit?

 It's the copy from kernel IO cache to the buffer cache that is L2
 sensitive.  When the shared buffer cache is polluted, it thrashes the L2
 cache.  When the number of pages being written to in the kernel-user
 space writes fits in L2, then the L2 lines are written through (see
 the link below on page 264 for the write combining features of the
 opteron for example) and the writes to main memory are deferred.

That makes absolutely zero sense.  The data coming from the disk was
certainly not in processor cache to start with, and I hope you're not
suggesting that it matters whether the *target* page of a memcpy was
already in processor cache.  If the latter, it is not our bug to fix.

 http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/
 25112.PDF

Even granting that your conclusions are accurate, we are not in the
business of optimizing Postgres for a single CPU architecture.

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Luke Lonergan
 
Hi Tom,

 Even granting that your conclusions are accurate, we are not 
 in the business of optimizing Postgres for a single CPU architecture.

I think you're missing my/our point:

The Postgres shared buffer cache algorithm appears to have a bug.  When
there is a sequential scan the blocks are filling the entire shared
buffer cache.  This should be fixed.

My proposal for a fix: ensure that when relations larger (much larger?)
than buffer cache are scanned, they are mapped to a single page in the
shared buffer cache.

- Luke


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Heikki Linnakangas

Luke Lonergan wrote:

The Postgres shared buffer cache algorithm appears to have a bug.  When
there is a sequential scan the blocks are filling the entire shared
buffer cache.  This should be fixed.

My proposal for a fix: ensure that when relations larger (much larger?)
than buffer cache are scanned, they are mapped to a single page in the
shared buffer cache.


It's not that simple. Using the whole buffer cache for a single seqscan 
is ok, if there's currently no better use for the buffer cache. Running 
a single select will indeed use the whole cache, but if you run any 
other smaller queries, the pages they need should stay in cache and the 
seqscan will loop through the other buffers.


In fact, the pages that are left in the cache after the seqscan finishes 
would be useful for the next seqscan of the same table if we were smart 
enough to read those pages first. That'd make a big difference for 
seqscanning a table that's say 1.5x your RAM size. Hmm, I wonder if 
Jeff's sync seqscan patch addresses that.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Hannu Krosing
Ühel kenal päeval, E, 2007-03-05 kell 03:51, kirjutas Luke Lonergan:
  Hi Tom,
 
  Even granting that your conclusions are accurate, we are not 
  in the business of optimizing Postgres for a single CPU architecture.
 
 I think you're missing my/our point:
 
 The Postgres shared buffer cache algorithm appears to have a bug.  When
 there is a sequential scan the blocks are filling the entire shared
 buffer cache.  This should be fixed.
 
 My proposal for a fix: ensure that when relations larger (much larger?)
 than buffer cache are scanned, they are mapped to a single page in the
 shared buffer cache.

How will this approach play together with synchronized scan patches ?

Or should synchronized scan rely on systems cache only ?

 - Luke
 
 
 ---(end of broadcast)---
 TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly
-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Tom Lane
Luke Lonergan [EMAIL PROTECTED] writes:
 I think you're missing my/our point:

 The Postgres shared buffer cache algorithm appears to have a bug.  When
 there is a sequential scan the blocks are filling the entire shared
 buffer cache.  This should be fixed.

No, this is not a bug; it is operating as designed.  The point of the
current bufmgr algorithm is to replace the page least recently used,
and that's what it's doing.

If you want to lobby for changing the algorithm, then you need to
explain why one test case on one platform justifies de-optimizing
for a lot of other cases.  In almost any concurrent-access situation
I think that what you are suggesting would be a dead loss --- for
instance we might as well forget about Jeff Davis' synchronized-scan
work.

In any case, I'm still not convinced that you've identified the problem
correctly, because your explanation makes no sense to me.  How can the
processor's L2 cache improve access to data that it hasn't got yet?

regards, tom lane

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Florian Weimer
* Tom Lane:

 That makes absolutely zero sense.  The data coming from the disk was
 certainly not in processor cache to start with, and I hope you're not
 suggesting that it matters whether the *target* page of a memcpy was
 already in processor cache.  If the latter, it is not our bug to fix.

Uhm, if it's not in the cache, you typically need to evict some cache
lines to make room for the data, so I'd expect an indirect performance
hit.  I could be mistaken, though.

-- 
Florian Weimer[EMAIL PROTECTED]
BFK edv-consulting GmbH   http://www.bfk.de/
Kriegsstraße 100  tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Hannu Krosing
Ühel kenal päeval, E, 2007-03-05 kell 04:15, kirjutas Tom Lane:
 Luke Lonergan [EMAIL PROTECTED] writes:
  I think you're missing my/our point:
 
  The Postgres shared buffer cache algorithm appears to have a bug.  When
  there is a sequential scan the blocks are filling the entire shared
  buffer cache.  This should be fixed.
 
 No, this is not a bug; it is operating as designed.  

Maybe he means that there is an oversight (aka bug) in the design ;)

 The point of the
 current bufmgr algorithm is to replace the page least recently used,
 and that's what it's doing.
 
 If you want to lobby for changing the algorithm, then you need to
 explain why one test case on one platform justifies de-optimizing
 for a lot of other cases. 

If you know beforehand that you will definitely overflow cache and not
reuse it anytime soon, then it seems quite reasonable to not even start
polluting the cache. Especially, if you get a noticable boost in
performance while doing so.

 In almost any concurrent-access situation
 I think that what you are suggesting would be a dead loss 

Only if the concurrent access pattern is over data mostly fitting in
buffer cache. If we can avoid polluting buffer cache with data we know
we will use only once, more useful data will be available.

 --- for
 instance we might as well forget about Jeff Davis' synchronized-scan
 work.

Depends on the ratio of system cache to shared buffer cache. I don't think
Jeff's patch is anywhere near the point where it needs to start worrying
about data swapping between system cache and shared buffers, or L2 cache
usage.

 In any case, I'm still not convinced that you've identified the problem
 correctly, because your explanation makes no sense to me.  How can the
 processor's L2 cache improve access to data that it hasn't got yet?
 
   regards, tom lane
 
 ---(end of broadcast)---
 TIP 7: You can help support the PostgreSQL project by donating at
 
 http://www.postgresql.org/about/donate
-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com



---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Luke Lonergan
 

  The Postgres shared buffer cache algorithm appears to have a bug.  
  When there is a sequential scan the blocks are filling the entire 
  shared buffer cache.  This should be fixed.
 
 No, this is not a bug; it is operating as designed.  The 
 point of the current bufmgr algorithm is to replace the page 
 least recently used, and that's what it's doing.

At least we've established that for certain.
 
 If you want to lobby for changing the algorithm, then you 
 need to explain why one test case on one platform justifies 
 de-optimizing for a lot of other cases.  In almost any 
 concurrent-access situation I think that what you are 
 suggesting would be a dead loss --- for instance we might as 
 well forget about Jeff Davis' synchronized-scan work.

Instead of forgetting about it, we'd need to change it.
 
 In any case, I'm still not convinced that you've identified 
 the problem correctly, because your explanation makes no 
 sense to me.  How can the processor's L2 cache improve access 
 to data that it hasn't got yet?

The evidence seems to clearly indicate reduced memory writing due to an
L2 related effect.  The actual data shows a dramatic reduction in main
memory writing when the destination of the written data fits in the L2
cache.

I'll try to fit a hypothesis to explain it.  Assume you've got a warm IO
cache in the OS.

The heapscan algorithm now works like this:
0) select a destination user buffer
1) uiomove->kcopy memory from the IO cache to the user buffer
1A) step 1: read from kernel space
1B) step 2: write to user space
2) the user buffer is accessed many times by the executor nodes above
Repeat

There are two situations we are evaluating: one where the addresses of
the user buffer are scattered over a space larger than the size of L2
(caseA) and one where they are confined to the size of L2 (caseB).  Note
that we could also consider another situation where the addresses are
scattered over a space smaller than the TLB entries mapped by the L2
cache (512 max) and larger than the size of L2, but we've tried that and
it proved uninteresting.

For both cases step 1A is the same: each 8 KB block read in (1A) pulls data
from the IO cache into 128 L2 lines (64 B each), evicting the previous data
there.

In step 1B for caseA the destination for the writes is mostly an address
not currently mapped into L2 cache, so 128 victim L2 lines are found
(LRU), stored into, and writes are flushed to main memory.

In step 1B for caseB, the destination for the writes is located in L2
already.  The 128 L2 lines are stored into, and the write to main memory
is delayed under the assumption that these lines are hot as they were
already in L2.

I don't know enough to be sure this is the right answer, but it does fit
the experimental data.
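
For what it's worth, the write-destination part of this hypothesis can be
poked at outside the database. A standalone sketch (not PostgreSQL code;
the sizes are assumptions for a 2 MB L2):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BLCKSZ  8192
#define NCOPIES 65536                       /* 512 MB moved per run */

static double
run(size_t arena_bytes)
{
    char       *src = malloc(BLCKSZ);
    char       *dst = malloc(arena_bytes);
    size_t      nslots = arena_bytes / BLCKSZ;
    size_t      i;
    struct timespec t0, t1;

    memset(src, 'x', BLCKSZ);
    memset(dst, 0, arena_bytes);            /* touch the arena once */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < NCOPIES; i++)           /* step 1B above */
        memcpy(dst + (i % nslots) * BLCKSZ, src, BLCKSZ);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    free(src);
    free(dst);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int
main(void)
{
    /* caseB: destinations confined to 1 MB; caseA: scattered over 256 MB */
    printf("caseB (1 MB arena):   %.3f s\n", run((size_t) 1 << 20));
    printf("caseA (256 MB arena): %.3f s\n", run((size_t) 1 << 28));
    return 0;
}

If the hypothesis holds, caseA should show noticeably more time per byte
moved than caseB.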

- Luke


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Mark Kirkwood

Gavin Sherry wrote:

On Mon, 5 Mar 2007, Mark Kirkwood wrote:


To add a little to this - forgetting the scan resistant point for the
moment... cranking down shared_buffers to be smaller than the L2 cache
seems to help *any* sequential scan immensely, even on quite modest HW:


(snipped)

When I've profiled this activity, I've seen a lot of time spent
searching for/allocating a new buffer for each page being fetched.
Obviously having fewer of them to search through will help, but having
less than an L2-cache-size worth of 'em seems to help a whole lot!


Could you demonstrate that point by showing us timings for shared_buffers
sizes from 512K up to, say, 2 MB? The two numbers you give there might
just have to do with managing a large buffer.


Yeah - good point:

PIII 1.26 GHz, 512 KB L2 cache, 2 GB RAM

Test is elapsed time for: SELECT count(*) FROM lineitem

lineitem has 1535724 pages (11997 MB)

Shared Buffers  Elapsed  IO rate (from vmstat)
--------------  -------  ---------------------
400MB             101 s  122 MB/s

2MB               100 s
1MB                97 s
768KB              93 s
512KB              86 s
256KB              77 s
128KB              74 s  166 MB/s

I've added the observed IO rate for the two extreme cases (the rest can 
be pretty much deduced via interpolation).


Note that the system will do about 220 MB/s with the now (in)famous dd 
test, so we have a bit of headroom (not too bad for a PIII).


Cheers

Mark

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Luke Lonergan
Hi Mark,

 lineitem has 1535724 pages (11997 MB)
 
 Shared Buffers  Elapsed  IO rate (from vmstat)
 --------------  -------  ---------------------
 400MB             101 s  122 MB/s
 
 2MB               100 s
 1MB                97 s
 768KB              93 s
 512KB              86 s
 256KB              77 s
 128KB              74 s  166 MB/s
 
 I've added the observed IO rate for the two extreme cases 
 (the rest can be pretty much deduced via interpolation).
 
 Note that the system will do about 220 MB/s with the now 
 (in)famous dd test, so we have a bit of headroom (not too bad 
 for a PIII).

What's really interesting: try this with a table that fits into I/O
cache (say half your system memory), and run VACUUM on the table.  That
way the effect will stand out more dramatically.

- Luke


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Gregory Stark
Luke Lonergan [EMAIL PROTECTED] writes:

 The evidence seems to clearly indicate reduced memory writing due to an
 L2 related effect.  

You might try using valgrind's cachegrind tool which I understand can actually
emulate various processors' cache to show how efficiently code uses it. I
haven't done much with it though so I don't know how applicable it would be to
a large-scale effect like this. 

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Tom Lane
Mark Kirkwood [EMAIL PROTECTED] writes:
 Shared Buffers  Elapsed  IO rate (from vmstat)
 --------------  -------  ---------------------
 400MB             101 s  122 MB/s
 2MB               100 s
 1MB                97 s
 768KB              93 s
 512KB              86 s
 256KB              77 s
 128KB              74 s  166 MB/s

Hm, that seems to blow the it's an L2 cache effect theory out of the
water.  If it were a cache effect then there should be a performance
cliff at the point where the cache size is exceeded.  I see no such
cliff, in fact the middle part of the curve is darn near a straight
line on a log scale ...

So I'm back to asking what we're really measuring here.  Buffer manager
inefficiency of some sort, but what?  Have you tried oprofile?

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Pavan Deolasee

Tom Lane wrote:

Mark Kirkwood [EMAIL PROTECTED] writes:
  

Shared Buffers  Elapsed  IO rate (from vmstat)
--------------  -------  ---------------------
400MB             101 s  122 MB/s
2MB               100 s
1MB                97 s
768KB              93 s
512KB              86 s
256KB              77 s
128KB              74 s  166 MB/s



So I'm back to asking what we're really measuring here.  Buffer manager
inefficiency of some sort, but what?  Have you tried oprofile?
  

Isn't the size of the shared buffer pool itself acting as a performance
penalty in this case? Maybe StrategyGetBuffer() needs to make multiple
passes over the buffers before the usage_count of any buffer is reduced
to zero and the buffer is chosen as replacement victim.

There is no real advantage to having a larger shared buffer pool in this
particular test. A heap buffer is hardly accessed again once the seqscan
passes over it. Can we try a smaller value for
BM_MAX_USAGE_COUNT and see if that has any positive impact
for a large shared pool in this case?

Thanks,
Pavan


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Luke Lonergan
Hi Tom,

On 3/5/07 8:53 AM, Tom Lane [EMAIL PROTECTED] wrote:

 Hm, that seems to blow the it's an L2 cache effect theory out of the
 water.  If it were a cache effect then there should be a performance
 cliff at the point where the cache size is exceeded.  I see no such
 cliff, in fact the middle part of the curve is darn near a straight
 line on a log scale ...
 
 So I'm back to asking what we're really measuring here.  Buffer manager
 inefficiency of some sort, but what?  Have you tried oprofile?

How about looking at the CPU performance counters directly using cpustat:
  cpustat -c BU_fill_into_L2,umask=0x1 1

This shows us how many L2 fills there are on all four cores (we use all
four).  Below is the trace of L2 fills for the case without buffer cache
pollution.  In the no-pollution case we fill 27 million lines; in the
pollution case we fill 44 million lines.

VACUUM orders (no buffer pollution):
 51.006   1  tick   2754293
 51.006   2  tick   3159565
 51.006   3  tick   2971929
 51.007   0  tick   3577487
 52.006   1  tick   4214179
 52.006   3  tick   3650193
 52.006   2  tick   3905828
 52.007   0  tick   3465261
 53.006   1  tick   1818766
 53.006   3  tick   1546018
 53.006   2  tick   1709385
 53.007   0  tick   1483371

And here is the case with buffer pollution:
VACUUM orders (with buffer pollution)
 22.006   0  tick   1576114
 22.006   1  tick   1542604
 22.006   2  tick   1987366
 22.006   3  tick   1784567
 23.006   3  tick   2706059
 23.006   2  tick   2362048
 23.006   0  tick   2190719
 23.006   1  tick   2088827
 24.006   0  tick   2247473
 24.006   1  tick   2153850
 24.006   2  tick   2422730
 24.006   3  tick   2758795
 25.006   0  tick   2419436
 25.006   1  tick   2229602
 25.006   2  tick   2619333
 25.006   3  tick   2712332
 26.006   1  tick   1827923
 26.006   2  tick   1886556
 26.006   3  tick   2909746
 26.006   0  tick   1467164



---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Tom Lane
Pavan Deolasee [EMAIL PROTECTED] writes:
 Isn't the size of the shared buffer pool itself acting as a performance
 penalty in this case? Maybe StrategyGetBuffer() needs to make multiple
 passes over the buffers before the usage_count of any buffer is reduced
 to zero and the buffer is chosen as replacement victim.

I read that and thought you were onto something, but it's not acting
quite the way I expect.  I made a quick hack in StrategyGetBuffer() to
count the number of buffers it looks at before finding a victim.
Running it with just 32 buffers on a large count(*), the behavior after
the initial startup transient is quite odd:

got buffer 2 after 4 tries
got buffer 3 after 1 tries
got buffer 4 after 1 tries
got buffer 5 after 1 tries
got buffer 6 after 1 tries
got buffer 7 after 1 tries
got buffer 8 after 1 tries
got buffer 12 after 4 tries
got buffer 14 after 2 tries
got buffer 21 after 7 tries
got buffer 26 after 5 tries
got buffer 27 after 1 tries
got buffer 31 after 4 tries
got buffer 0 after 1 tries
got buffer 1 after 1 tries
got buffer 9 after 8 tries
got buffer 10 after 1 tries
got buffer 11 after 1 tries
got buffer 13 after 2 tries
got buffer 15 after 2 tries
got buffer 16 after 1 tries
got buffer 17 after 1 tries
got buffer 18 after 1 tries
got buffer 19 after 1 tries
got buffer 20 after 1 tries
got buffer 22 after 2 tries
got buffer 23 after 1 tries
got buffer 24 after 1 tries
got buffer 25 after 1 tries
got buffer 28 after 3 tries
got buffer 29 after 1 tries
got buffer 30 after 1 tries
got buffer 2 after 4 tries
[... the same 32-buffer cycle repeats ...]

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Josh Berkus
Tom,

 Yes, autovacuum is off, and bgwriter shouldn't have anything useful to
 do either, so I'm a bit at a loss what's going on --- but in any case,
 it doesn't look like we are cycling through the entire buffer space
 for each fetch.

I'd be happy to DTrace it, but I'm a little lost as to where to look in the 
kernel.   I'll see if I can find someone who knows more about memory 
management than me (that ought to be easy).

-- 
Josh Berkus
PostgreSQL @ Sun
San Francisco

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Luke Lonergan
Tom,

On 3/5/07 8:53 AM, Tom Lane [EMAIL PROTECTED] wrote:

 Hm, that seems to blow the it's an L2 cache effect theory out of the
 water.  If it were a cache effect then there should be a performance
 cliff at the point where the cache size is exceeded.  I see no such
 cliff, in fact the middle part of the curve is darn near a straight
 line on a log scale ...

Here's that cliff you were looking for:

Size of Orders table: 7178MB
Blocksize: 8KB

Shared_buffers  Select Count  Vacuum
(KB)            (s)           (s)
====================================
248             5.52          2.46
368             4.77          2.40
552             5.82          2.40
824             6.20          2.43
1232            5.60          3.59
1848            6.02          3.14
2768            5.53          4.56

All of these were run three times and the *lowest* time reported.  Also, the
behavior of fast VACUUM after SELECT begins abruptly at 1232KB of
shared_buffers.

These are Opterons with 2MB of L2 cache shared between two cores.

- Luke




---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Tom Lane
I wrote:
 Pavan Deolasee [EMAIL PROTECTED] writes:
 Isn't the size of the shared buffer pool itself acting as a performance
 penalty in this case? Maybe StrategyGetBuffer() needs to make multiple
 passes over the buffers before the usage_count of any buffer is reduced
 to zero and the buffer is chosen as replacement victim.

 I read that and thought you were onto something, but it's not acting
 quite the way I expect.  I made a quick hack in StrategyGetBuffer() to
 count the number of buffers it looks at before finding a victim.
 ...
 Yes, autovacuum is off, and bgwriter shouldn't have anything useful to
 do either, so I'm a bit at a loss what's going on --- but in any case,
 it doesn't look like we are cycling through the entire buffer space
 for each fetch.

Nope, Pavan's nailed it: the problem is that after using a buffer, the
seqscan leaves it with usage_count = 1, which means it has to be passed
over once by the clock sweep before it can be re-used.  I was misled in
the 32-buffer case because catalog accesses during startup had left the
buffer state pretty confused, so that there was no long stretch before
hitting something available.  With a large number of buffers, the
behavior is that the seqscan fills all of shared memory with buffers
having usage_count 1.  Once the clock sweep returns to the first of
these buffers, it will have to pass over all of them, reducing all of
their counts to 0, before it returns to the first one and finds it now
usable.  Subsequent tries find a buffer immediately, of course, until we
have again filled shared_buffers with usage_count 1 everywhere.  So the
problem is not so much the clock sweep overhead as that it's paid in a
very nonuniform fashion: with N buffers you pay O(N) once every N reads
and O(1) the rest of the time.  This is no doubt slowing things down
enough to delay that one read, instead of leaving it nicely I/O bound
all the time.  Mark, can you detect hiccups in the read rate using
your setup?
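
A standalone toy model of that payment pattern (not the real bufmgr code;
NBUFFERS is arbitrary):

#include <stdio.h>

#define NBUFFERS 1024

static int usage_count[NBUFFERS];
static int hand;

/* One victim search; *tries counts the buffers inspected on the way. */
static int
get_victim(int *tries)
{
    *tries = 0;
    for (;;)
    {
        int     b = hand;

        hand = (hand + 1) % NBUFFERS;
        (*tries)++;
        if (usage_count[b] == 0)
            return b;
        usage_count[b]--;       /* pass over it once, as described above */
    }
}

int
main(void)
{
    long    reads = 10L * NBUFFERS, total = 0;
    long    r;
    int     max_tries = 0;

    for (r = 0; r < reads; r++)
    {
        int     tries;
        int     b = get_victim(&tries);

        usage_count[b] = 1;     /* a seqscan leaves each page at count 1 */
        total += tries;
        if (tries > max_tries)
            max_tries = tries;
    }
    /* average near 2 inspections per read, worst case NBUFFERS + 1 */
    printf("avg %.2f, max %d\n", (double) total / reads, max_tries);
    return 0;
}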

I seem to recall that we've previously discussed the idea of letting the
clock sweep decrement the usage_count before testing for 0, so that a
buffer could be reused on the first sweep after it was initially used,
but that we rejected it as being a bad idea.  But at least with large
shared_buffers it doesn't sound like such a bad idea.

Another issue nearby to this is whether to avoid selecting buffers that
are dirty --- IIRC someone brought that up again recently.  Maybe
predecrement for clean buffers, postdecrement for dirty ones would be a
cute compromise.
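
With made-up field names, that compromise might look like:

/* Decrement before the test for clean buffers, so a once-used clean page
 * is reusable on the first sweep after use; decrement after the test for
 * dirty ones, so they survive one extra pass.  Hypothetical sketch only. */
typedef struct
{
    int     refcount;
    int     usage_count;
    int     dirty;
} Buf;

static int
is_victim(Buf *buf)
{
    if (!buf->dirty && buf->usage_count > 0)
        buf->usage_count--;             /* predecrement: clean */
    if (buf->refcount == 0 && buf->usage_count == 0)
        return 1;                       /* usable as replacement victim */
    if (buf->dirty && buf->usage_count > 0)
        buf->usage_count--;             /* postdecrement: dirty */
    return 0;
}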

regards, tom lane

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Luke Lonergan
Here are five more points on the curve - I'd use a Dirac delta function for
your curve fit ;-)

Shared_buffers  Select Count  Vacuum
(KB)            (s)           (s)
====================================
248             5.52          2.46
368             4.77          2.40
552             5.82          2.40
824             6.20          2.43
1232            5.60          3.59
1848            6.02          3.14
2768            5.53          4.56
5536            6.05          3.95
8304            5.80          4.37
12456           5.86          4.12
18680           5.83          4.10
28016           6.11          4.46

WRT what you found in the selection algorithm, I think it might also explain
the L2 effects.

I'm also still of the opinion that polluting the shared buffer cache for a
seq scan does not make sense.

- Luke

On 3/5/07 10:21 AM, Luke Lonergan [EMAIL PROTECTED] wrote:

 Tom,
 
 On 3/5/07 8:53 AM, Tom Lane [EMAIL PROTECTED] wrote:
 
 Hm, that seems to blow the it's an L2 cache effect theory out of the
 water.  If it were a cache effect then there should be a performance
 cliff at the point where the cache size is exceeded.  I see no such
 cliff, in fact the middle part of the curve is darn near a straight
 line on a log scale ...
 
 Here's that cliff you were looking for:
 
 Size of Orders table: 7178MB
 Blocksize: 8KB
 
 Shared_buffers  Select Count  Vacuum
 (KB)            (s)           (s)
 ====================================
 248             5.52          2.46
 368             4.77          2.40
 552             5.82          2.40
 824             6.20          2.43
 1232            5.60          3.59
 1848            6.02          3.14
 2768            5.53          4.56
 
 All of these were run three times and the *lowest* time reported.  Also, the
 behavior of fast VACUUM after SELECT begins abruptly at 1232KB of
 shared_buffers.
 
 These are Opterons with 2MB of L2 cache shared between two cores.
 
 - Luke
 
 
 
 
 ---(end of broadcast)---
 TIP 6: explain analyze is your friend
 



---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Josh Berkus
Tom,

 I seem to recall that we've previously discussed the idea of letting the
 clock sweep decrement the usage_count before testing for 0, so that a
 buffer could be reused on the first sweep after it was initially used,
 but that we rejected it as being a bad idea.  But at least with large
 shared_buffers it doesn't sound like such a bad idea.

We did discuss a number of formulas for setting buffers with different 
clock-sweep numbers, including ones with higher usage_count for indexes and 
starting numbers of 0 for large seq scans as well as vacuums.  However, we 
didn't have any way to prove that any of these complex algorithms would 
result in higher performance, so went with the simplest formula, with the 
idea of tinkering with it when we had more data.  So maybe now's the time.

Note, though, that the current algorithm is working very, very well for OLTP 
benchmarks, so we'd want to be careful not to gain performance in one area at 
the expense of another.  In TPC-E testing, we've been able to increase 
shared_buffers to 10GB with beneficial performance effect (numbers posted 
when I have them) and even found that taking over RAM with the 
shared_buffers (ala Oracle) gave us equivalent performance to using the FS 
cache.  (yes, this means with a little I/O management engineering we could 
contemplate discarding use of the FS cache for a net performance gain.  Maybe 
for 8.4)

-- 
Josh Berkus
PostgreSQL @ Sun
San Francisco

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Pavan Deolasee

Tom Lane wrote:


Nope, Pavan's nailed it: the problem is that after using a buffer, the
seqscan leaves it with usage_count = 1, which means it has to be passed
over once by the clock sweep before it can be re-used.  I was misled in
the 32-buffer case because catalog accesses during startup had left the
buffer state pretty confused, so that there was no long stretch before
hitting something available.  With a large number of buffers, the
behavior is that the seqscan fills all of shared memory with buffers
having usage_count 1.  Once the clock sweep returns to the first of
these buffers, it will have to pass over all of them, reducing all of
their counts to 0, before it returns to the first one and finds it now
usable.  Subsequent tries find a buffer immediately, of course, until we
have again filled shared_buffers with usage_count 1 everywhere.  So the
problem is not so much the clock sweep overhead as that it's paid in a
very nonuniform fashion: with N buffers you pay O(N) once every N reads
and O(1) the rest of the time.  This is no doubt slowing things down
enough to delay that one read, instead of leaving it nicely I/O bound
all the time.  Mark, can you detect hiccups in the read rate using
your setup?

  


Cool. You posted the same analysis before I could hit the send button :)

I am wondering whether a seqscan would set the usage_count to 1 or to a
higher value. usage_count is incremented while unpinning the buffer. Even
if we use page-at-a-time mode, won't the buffer itself get pinned/unpinned
every time the seqscan returns a tuple? If that's the case, the overhead
would be O(BM_MAX_USAGE_COUNT * N) for every N reads.


I seem to recall that we've previously discussed the idea of letting the
clock sweep decrement the usage_count before testing for 0, so that a
buffer could be reused on the first sweep after it was initially used,
but that we rejected it as being a bad idea.  But at least with large
shared_buffers it doesn't sound like such a bad idea.

  

How about a smaller value for BM_MAX_USAGE_COUNT?


Another issue nearby to this is whether to avoid selecting buffers that
are dirty --- IIRC someone brought that up again recently.  Maybe
predecrement for clean buffers, postdecrement for dirty ones would be a
cute compromise.
  
Can we use a 2-bit counter where the higher bit is set if the buffer is
dirty and the lower bit is set whenever the buffer is used? The clock sweep
then decrements this counter and chooses a victim with counter value 0.
ISTM that we should optimize for the large shared buffer pool case,
because that will be more common in the coming days. RAM is getting
cheaper every day.
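
In miniature, the 2-bit scheme might look like this (hypothetical sketch):

/* Bit 1 tracks dirtiness, bit 0 a recent use.  The sweep decrements the
 * combined value, so a dirty buffer survives extra passes; a buffer that
 * has reached zero is the replacement victim. */
static unsigned
on_use(int dirty)
{
    return (dirty ? 2u : 0u) | 1u;      /* 3 if dirty, 1 if clean */
}

static unsigned
on_sweep(unsigned c)
{
    return c > 0 ? c - 1 : 0;           /* clock hand decrements */
}

static int
is_victim(unsigned c)
{
    return c == 0;
}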

Thanks,
Pavan



---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Tom Lane
Pavan Deolasee [EMAIL PROTECTED] writes:
 I am wondering whether a seqscan would set the usage_count to 1 or to a
 higher value. usage_count is incremented while unpinning the buffer. Even
 if we use page-at-a-time mode, won't the buffer itself get pinned/unpinned
 every time the seqscan returns a tuple? If that's the case, the overhead
 would be O(BM_MAX_USAGE_COUNT * N) for every N reads.

No, it's only once per page.  There's a good deal of PrivateRefCount
thrashing that goes on while examining the individual tuples, but the
shared state only changes when we leave the page, because we hold pin
continuously on the current page of a seqscan.  If you don't believe
me, insert some debug printouts for yourself.

 How about a smaller value for BM_MAX_USAGE_COUNT?

This is not relevant to the problem: we are concerned about usage count
1 versus 0, not the other end of the range.

regards, tom lane

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Gregory Stark

Tom Lane [EMAIL PROTECTED] writes:

 I seem to recall that we've previously discussed the idea of letting the
 clock sweep decrement the usage_count before testing for 0, so that a
 buffer could be reused on the first sweep after it was initially used,
 but that we rejected it as being a bad idea.  But at least with large
 shared_buffers it doesn't sound like such a bad idea.

I seem to recall the classic clock sweep algorithm was to just use a single
bit. Either a buffer was touched recently or it wasn't.

I also vaguely recall a refinement involving keeping a bitmask and shifting it
right each time the clock hand comes around. So you have a precise history of
which recent clock sweeps the buffer was used and which it wasn't.
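
That refinement is easy to sketch (standalone, hypothetical):

/* Each buffer keeps a small usage history.  A touch sets the high bit,
 * the clock hand shifts the history right on every pass, and a buffer
 * whose history has decayed to zero is a victim. */
typedef struct
{
    unsigned char history;
} BufUsage;

static void touch(BufUsage *u)      { u->history |= 0x80; }
static void sweep_tick(BufUsage *u) { u->history >>= 1; }
static int  is_victim(BufUsage *u)  { return u->history == 0; }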

I think the coarseness of not caring how heavily it was used is a key part
of the algorithm. By not caring whether it was lightly or heavily used
during the clock sweep, just that it was used at least once, it avoids
sticking with incorrect deductions about things like sequential scans, even
if multiple sequential scans pass by. As soon as they stop touching the
buffer, it immediately reacts and discards the buffer.

I would check out my OS book from school but it's on another continent :(

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Simon Riggs
On Mon, 2007-03-05 at 10:46 -0800, Josh Berkus wrote:
 Tom,
 
  I seem to recall that we've previously discussed the idea of letting the
  clock sweep decrement the usage_count before testing for 0, so that a
  buffer could be reused on the first sweep after it was initially used,
  but that we rejected it as being a bad idea.  But at least with large
  shared_buffers it doesn't sound like such a bad idea.

 Note, though, that the current algorithm is working very, very well for OLTP 
 benchmarks, so we'd want to be careful not to gain performance in one area at 
 the expense of another. 

Agreed.

What we should also add to the analysis is that this effect only occurs
when a uniform workload is present, like SeqScan, VACUUM or COPY.
When you have lots of indexed access, the scan workloads don't have as
much effect on cache pollution as we are seeing in these tests.

Itagaki-san and I were discussing in January the idea of cache-looping,
whereby a process begins to reuse its own buffers in a ring of ~32
buffers. When we cycle back round, if usage_count==1 then we assume that
we can reuse that buffer. This avoids cache swamping for read and write
workloads, plus avoids too-frequent WAL writing for VACUUM.

It would be simple to implement the ring buffer and enable/disable it
with a hint StrategyHintCyclicBufferReuse() in a similar manner to the
hint VACUUM provides now.

This would maintain the beneficial behaviour for OLTP, while keeping
data within the L2 cache for DSS and bulk workloads.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com



---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 Itagaki-san and I were discussing in January the idea of cache-looping,
 whereby a process begins to reuse its own buffers in a ring of ~32
 buffers. When we cycle back round, if usage_count==1 then we assume that
 we can reuse that buffer. This avoids cache swamping for read and write
 workloads, plus avoids too-frequent WAL writing for VACUUM.

 This would maintain the beneficial behaviour for OLTP,

Justify that claim.  It sounds to me like this would act very nearly the
same as having shared_buffers == 32 ...

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Luke Lonergan
This sounds like a good idea.

- Luke

Msg is shrt cuz m on ma treo

 -Original Message-
From:   Simon Riggs [mailto:[EMAIL PROTECTED]
Sent:   Monday, March 05, 2007 02:37 PM Eastern Standard Time
To: Josh Berkus; Tom Lane; Pavan Deolasee; Mark Kirkwood; Gavin Sherry; 
Luke Lonergan; PGSQL Hackers; Doug Rady; Sherry Moore
Cc: pgsql-hackers@postgresql.org
Subject:Re: [HACKERS] Bug: Buffer cache is not scan resistant

On Mon, 2007-03-05 at 10:46 -0800, Josh Berkus wrote:
 Tom,
 
  I seem to recall that we've previously discussed the idea of letting the
  clock sweep decrement the usage_count before testing for 0, so that a
  buffer could be reused on the first sweep after it was initially used,
  but that we rejected it as being a bad idea.  But at least with large
  shared_buffers it doesn't sound like such a bad idea.

 Note, though, that the current algorithm is working very, very well for OLTP 
 benchmarks, so we'd want to be careful not to gain performance in one area at 
 the expense of another. 

Agreed.

What we should also add to the analysis is that this effect only occurs
when a uniform workload is present, like SeqScan, VACUUM or COPY.
When you have lots of indexed access, the scan workloads don't have as
much effect on cache pollution as we are seeing in these tests.

Itagaki-san and I were discussing in January the idea of cache-looping,
whereby a process begins to reuse its own buffers in a ring of ~32
buffers. When we cycle back round, if usage_count==1 then we assume that
we can reuse that buffer. This avoids cache swamping for read and write
workloads, plus avoids too-frequent WAL writing for VACUUM.

It would be simple to implement the ring buffer and enable/disable it
with a hint StrategyHintCyclicBufferReuse() in a similar manner to the
hint VACUUM provides now.

This would maintain the beneficial behaviour for OLTP, while keeping
data within the L2 cache for DSS and bulk workloads.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Simon Riggs
On Mon, 2007-03-05 at 14:41 -0500, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  Itagaki-san and I were discussing in January the idea of cache-looping,
  whereby a process begins to reuse its own buffers in a ring of ~32
  buffers. When we cycle back round, if usage_count==1 then we assume that
  we can reuse that buffer. This avoids cache swamping for read and write
  workloads, plus avoids too-frequent WAL writing for VACUUM.
 
  This would maintain the beneficial behaviour for OLTP,
 
 Justify that claim.  It sounds to me like this would act very nearly the
 same as having shared_buffers == 32 ...

Sure. We wouldn't set the hint for IndexScans or Inserts, only for
SeqScans, VACUUM and COPY.

So OLTP-only workloads would be entirely unaffected. In the presence of
a mixed workload, the scan tasks would have only a limited effect on the
cache, maintaining performance for the response-time-critical tasks. So
it's an OLTP benefit because of cache protection and WAL-flush reduction
during VACUUM.

As we've seen, the scan tasks look like they'll go faster with this.

The assumption that we can reuse the buffer if usage_count = 1 seems
valid. If another user had requested the block, then the usage_count
would be > 1, unless the buffer has been used, unpinned and then a cycle
of the buffer cache had spun round, all within the time taken to process
32 blocks sequentially. We do have to reuse one of the buffers, so
cyclical reuse seems like a better bet most of the time than more
arbitrary block reuse, as we see in a larger cache.

Best way is to prove it, though. It seems like not too much work to have a
private ring data structure when the hint is enabled. The extra
bookkeeping is easily going to be outweighed by the reduction in mem-L2
cache fetches. I'll do it tomorrow, if no other volunteers.
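
Roughly like this (a hypothetical sketch; GetRefCount() and GetUsageCount()
stand in for the real buffer-header accesses, which need the header
spinlock):

#define RING_SIZE 32

extern int GetRefCount(int buf_id);     /* assumed provided elsewhere */
extern int GetUsageCount(int buf_id);

typedef struct
{
    int     buf_ids[RING_SIZE];     /* buffers this scan used, -1 if unset */
    int     next;                   /* oldest slot, the next to recycle */
} ScanRing;

/* Returns a buffer to reuse, or -1 to fall back to the shared sweep. */
static int
ring_get_buffer(ScanRing *ring)
{
    int     b = ring->buf_ids[ring->next];

    /* usage_count == 1 means nobody but this scan touched the buffer
     * since we last came around the ring; anything higher means another
     * backend wanted it, so leave it to the main pool. */
    if (b >= 0 && GetRefCount(b) == 0 && GetUsageCount(b) == 1)
        return b;
    return -1;
}

/* Remember a buffer we used, advancing the ring. */
static void
ring_note_buffer(ScanRing *ring, int buf_id)
{
    ring->buf_ids[ring->next] = buf_id;
    ring->next = (ring->next + 1) % RING_SIZE;
}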

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com



---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Jeff Davis
On Mon, 2007-03-05 at 03:51 -0500, Luke Lonergan wrote:
 The Postgres shared buffer cache algorithm appears to have a bug.  When
 there is a sequential scan the blocks are filling the entire shared
 buffer cache.  This should be fixed.
 
 My proposal for a fix: ensure that when relations larger (much larger?)
 than buffer cache are scanned, they are mapped to a single page in the
 shared buffer cache.
 

I don't see why we should strictly limit sequential scans to one buffer
per scan. I assume you mean one buffer per scan, but that raises these
two questions:

(1) What happens when there are more seq scans than cold buffers
available?
(2) What happens when two sequential scans need the same page, do we
double-up?

Also, the first time we access any heap page of a big table, we are very
unsure whether we will access it again, regardless of whether it's part
of a seq scan or not.

In our current system of 4 LRU lists (depending on how many times a
buffer has been referenced), we could start pages that are more likely to
be reused (e.g. system catalogs, index pages) in a higher list, and heap
reads from big tables in the lowest possible list. Assuming, of course,
that this has any benefit (frequently accessed cache pages are likely to
move up in the lists very quickly anyway).

But I don't think we should eliminate caching of heap pages in big
tables all together. A few buffers might be polluted during the scan,
but most of the time they will be replacing other low-priority pages
(perhaps from another seq scan) and probably be replaced again quickly.
If that's not happening, and it's polluting frequently-accessed pages, I
agree that's a bug.

Regards,
Jeff Davis


---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Jeff Davis
On Mon, 2007-03-05 at 11:10 +0200, Hannu Krosing wrote:
  My proposal for a fix: ensure that when relations larger (much larger?)
  than buffer cache are scanned, they are mapped to a single page in the
  shared buffer cache.
 
 How will this approach play together with synchronized scan patches ?
 

Thanks for considering my patch in this discussion. I will test by
turning shared_buffers down as low as I can, and see if that makes a big
difference.

 Or should synchronized scan rely on systems cache only ?
 

I don't know what the performance impact of that will be; still good
compared to reading from disk, but I assume much more CPU time.

Regards,
Jeff Davis


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 Best way is to prove it though. Seems like not too much work to have a
 private ring data structure when the hint is enabled. The extra
 bookeeping is easily going to be outweighed by the reduction in mem-L2
 cache fetches. I'll do it tomorrow, if no other volunteers.

[ shrug... ]  No one has yet proven to my satisfaction that L2 cache has
anything to do with this.  The notion that you can read a new disk page
into a shared buffer and have that buffer still be live in the processor
cache is so obviously bogus that I think there must be some other effect
at work.

regards, tom lane

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Jeff Davis
On Mon, 2007-03-05 at 09:09 +, Heikki Linnakangas wrote:
 In fact, the pages that are left in the cache after the seqscan finishes 
 would be useful for the next seqscan of the same table if we were smart 
 enough to read those pages first. That'd make a big difference for 
 seqscanning a table that's say 1.5x your RAM size. Hmm, I wonder if 
 Jeff's sync seqscan patch addresses that.
 

Absolutely. I've got a parameter in my patch sync_scan_offset that
starts a seq scan N pages before the position of the last seq scan
running on that table (or a current seq scan if there's still a scan
going). 

If the last scan is not still in progress, the pages are less likely to
be in cache. If the pages are in cache, great; if not, it doesn't matter
where we start anyway.

If the last scan is still in progress, those recently-read pages are
very likely to be in cache (shared buffers or OS buffer cache).
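
The start-point arithmetic is then simple modular math. A sketch with
simplified types (hypothetical; the actual patch differs in detail):

#include <stdint.h>

typedef uint32_t BlockNumber;   /* stand-in for PostgreSQL's BlockNumber */

/* Begin a new scan sync_scan_offset pages behind the last reported scan
 * position on the relation, wrapping around the start of the table. */
static BlockNumber
sync_scan_start(BlockNumber nblocks, BlockNumber last_pos,
                BlockNumber sync_scan_offset)
{
    sync_scan_offset %= nblocks;    /* cap offsets larger than the table */
    return (last_pos + nblocks - sync_scan_offset) % nblocks;
}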

Regards,
Jeff Davis


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Tom Lane
Jeff Davis [EMAIL PROTECTED] writes:
 Absolutely. I've got a parameter in my patch sync_scan_offset that
 starts a seq scan N pages before the position of the last seq scan
 running on that table (or a current seq scan if there's still a scan
 going). 

Strikes me that expressing that parameter as a percentage of
shared_buffers might make it less in need of manual tuning ...

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Jeff Davis
On Mon, 2007-03-05 at 15:30 -0500, Tom Lane wrote:
 Jeff Davis [EMAIL PROTECTED] writes:
  Absolutely. I've got a parameter in my patch sync_scan_offset that
  starts a seq scan N pages before the position of the last seq scan
  running on that table (or a current seq scan if there's still a scan
  going). 
 
 Strikes me that expressing that parameter as a percentage of
 shared_buffers might make it less in need of manual tuning ...
 

The original patch was a percentage of effective_cache_size, because in
theory it may be helpful to have this parameter larger than shared
buffers. Synchronized Scanning can take advantage of the OS buffer cache as
well.

Someone convinced me to change it to be an independent variable.

I don't have a strong opinion, but now I have three different opinions:
(1) Independent parameter
(2) Percentage of shared_buffers
(3) Percentage of effective_cache_size

Regards,
Jeff Davis


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Tom Lane
Jeff Davis [EMAIL PROTECTED] writes:
 On Mon, 2007-03-05 at 15:30 -0500, Tom Lane wrote:
 Strikes me that expressing that parameter as a percentage of
 shared_buffers might make it less in need of manual tuning ...

 The original patch was a percentage of effective_cache_size, because in
 theory it may be helpful to have this parameter larger than shared
 buffers. Synchronized Scannning can take advantage of OS buffer cache as
 well.

I didn't say you couldn't allow it to be more than 100% ;-).  But basing
it on effective_cache_size strikes me as a bad idea because that parameter
is seldom better than a wild guess.  shared_buffers at least means
something.

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Heikki Linnakangas

Jeff Davis wrote:

On Mon, 2007-03-05 at 15:30 -0500, Tom Lane wrote:

Jeff Davis [EMAIL PROTECTED] writes:

Absolutely. I've got a parameter in my patch sync_scan_offset that
starts a seq scan N pages before the position of the last seq scan
running on that table (or a current seq scan if there's still a scan
going). 

Strikes me that expressing that parameter as a percentage of
shared_buffers might make it less in need of manual tuning ...



The original patch was a percentage of effective_cache_size, because in
theory it may be helpful to have this parameter larger than shared
buffers. Synchronized Scanning can take advantage of the OS buffer cache as
well.

Someone convinced me to change it to be an independent variable.

I don't have a strong opinion, but now I have three different opinions:
(1) Independent parameter
(2) Percentage of shared_buffers
(3) Percentage of effective_cache_size


Another approach I proposed back in December is to not have a variable 
like that at all, but scan the buffer cache for pages belonging to the 
table you're scanning to initialize the scan. Scanning all the 
BufferDescs is a fairly CPU- and lock-heavy operation, but it might be OK 
given that we're talking about large I/O bound sequential scans. It 
would require no DBA tuning and would work more robustly in varying 
conditions. I'm not sure where you would continue after scanning the 
in-cache pages. At the highest in-cache block number, perhaps.
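
Very roughly (a sketch against the 8.2-era structures, with simplified
field access and none of the locking the real thing would need):

/* Walk the buffer descriptors and find the highest cached block of the
 * given relation; InvalidBlockNumber if nothing is cached. */
BlockNumber
highest_cached_block(Oid relid)
{
    BlockNumber best = InvalidBlockNumber;
    int         i;

    for (i = 0; i < NBuffers; i++)
    {
        BufferDesc *buf = &BufferDescriptors[i];

        if (buf->tag.rnode.relNode == relid &&
            (best == InvalidBlockNumber || buf->tag.blockNum > best))
            best = buf->tag.blockNum;
    }
    return best;            /* start the seqscan here, wrap at the end */
}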


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Mark Kirkwood

Tom Lane wrote:


So the
problem is not so much the clock sweep overhead as that it's paid in a
very nonuniform fashion: with N buffers you pay O(N) once every N reads
and O(1) the rest of the time.  This is no doubt slowing things down
enough to delay that one read, instead of leaving it nicely I/O bound
all the time.  Mark, can you detect hiccups in the read rate using
your setup?



I think so, here's the vmstat output for 400MB of shared_buffers during
the scan:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so   bi   bo   in   cs us sy id wa
 1  0764  51772  0 199068800 120422 2 1546  1755 16 37 46  1
 1  0764  53640  0 198879200 120422 2 1544  1446 14 40 46  1
 1  0788  54900  0 198756400 11674615 1470  3067 15 39 44  2
 1  0788  52800  0 198955200 11919920 1488  2216 14 37 47  1
 1  0788  52372  0 19900 122880 7 1532  1203 15 39 45  1
 1  0788  54592  0 198787205 124928 5 1557  1058 17 38 46  0
 2  0788  54052  0 198783600 118787 0 1500  2469 16 36 47  1
 1  0788  52552  0 198989200 120419 0 1506  2531 15 36 48  1
 1  0788  53452  0 198935600 119195 2 1501  1698 15 37 47  1
 1  0788  52680  0 198979600 120424 2 1521  1610 16 37 47  1



Cheers

Mark






Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Tom Lane
Mark Kirkwood [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 Mark, can you detect hiccups in the read rate using
 your setup?

 I think so, here's the vmstat output for 400MB of shared_buffers during
 the scan:

Hm, not really a smoking gun there.  But just for grins, would you try
this patch and see if the numbers change?

regards, tom lane

*** src/backend/storage/buffer/freelist.c.orig  Fri Jan  5 18:02:12 2007
--- src/backend/storage/buffer/freelist.c   Mon Mar  5 16:33:11 2007
***************
*** 104,116 ****
           * it; decrement the usage_count and keep scanning.
           */
          LockBufHdr(buf);
-         if (buf->refcount == 0 && buf->usage_count == 0)
-             return buf;
          if (buf->usage_count > 0)
          {
              buf->usage_count--;
!             trycounter = NBuffers;
          }
          else if (--trycounter == 0)
          {
              /*
--- 104,116 ----
           * it; decrement the usage_count and keep scanning.
           */
          LockBufHdr(buf);
          if (buf->usage_count > 0)
          {
              buf->usage_count--;
!             trycounter = NBuffers + 1;
          }
+         if (buf->refcount == 0 && buf->usage_count == 0)
+             return buf;
          else if (--trycounter == 0)
          {
              /*
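
In prose: the patch decrements usage_count before testing whether the
buffer is reclaimable, so a buffer untouched for one full sweep can be
taken immediately, and it bumps the try counter to keep the exhaustion
test honest. A standalone restatement of the modified loop body, with toy
types standing in for the real bufmgr ones (the real code also holds the
buffer-header spinlock and raises an error when the counter runs out):

typedef struct
{
    int refcount;      /* pins held by backends */
    int usage_count;   /* clock-sweep popularity counter */
} ToyBuffer;

/*
 * One clock-sweep visit under the patched logic: decrement first,
 * then test, so a buffer last touched a full sweep ago is reclaimed
 * on this visit.  Returns 1 if buf can be recycled, 0 to keep
 * sweeping, -1 if every buffer was tried without success.
 */
int
sweep_visit(ToyBuffer *buf, int NBuffers, int *trycounter)
{
    if (buf->usage_count > 0)
    {
        buf->usage_count--;
        *trycounter = NBuffers + 1;
    }
    if (buf->refcount == 0 && buf->usage_count == 0)
        return 1;
    if (--*trycounter == 0)
        return -1;
    return 0;
}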



Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Mark Kirkwood

Tom Lane wrote:



Hm, not really a smoking gun there.  But just for grins, would you try
this patch and see if the numbers change?



Applied to 8.2.3 (don't have lineitem loaded in HEAD yet) - no change 
that I can see:


procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so   bi   bo   in   cs us sy id wa
 1  0764  55300  0 198803200 11879732 1532  1775 15 39 44  2
 1  0764  54476  0 198972000 115507 9 1456  3970 15 39 45  1
 1  0788  54540  0 198959200 121651 0 1508  3221 16 37 47  0
 1  0788  52808  0 199132000 124109 0 1532  1236 15 38 46  0
 1  0788  52504  0 199178400 124518 0 1547  1005 16 39 45  0
 2  0788  54544  0 198974005 117965 5 1491  2012 15 36 47  2
 1  0788  53596  0 199118400 120424 0 1504  1910 16 37 46  1


Elapsed time is exactly the same (101 s). Is it expected that HEAD would 
behave differently?


Cheers

Mark



Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Tom Lane
Mark Kirkwood [EMAIL PROTECTED] writes:
 Elapsed time is exactly the same (101 s). Is it expected that HEAD would 
 behave differently?

Offhand I don't think so.  But what I wanted to see was the curve of
elapsed time vs shared_buffers?

regards, tom lane



Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Jeff Davis
On Mon, 2007-03-05 at 21:03 +, Heikki Linnakangas wrote:
 Another approach I proposed back in December is to not have a variable 
 like that at all, but scan the buffer cache for pages belonging to the 
 table you're scanning to initialize the scan. Scanning all the 
 BufferDescs is a fairly CPU and lock heavy operation, but it might be ok 
 given that we're talking about large I/O bound sequential scans. It 
 would require no DBA tuning and would work more robustly in varying 
 conditions. I'm not sure where you would continue after scanning the 
 in-cache pages. At the highest in-cache block number, perhaps.
 

I assume you're referring to this:

each backend keeps a bitmap of pages it has processed during the scan,
and read the pages in the order they're available in cache.

which I think is a great idea. However, I was unable to devise a good
answer to all these questions at once:

* How do we attempt to maintain sequential reads on the underlying I/O
layer?

* My current implementation takes advantage of the OS buffer cache; how
could we maintain that advantage from PostgreSQL-specific cache logic?

* How do I test to see whether it actually helps in a realistic
scenario? It seems like it would help the most when scans are
progressing at different rates, but how often do people have CPU-bound
queries on tables that don't fit into physical memory (and how long
would it take for me to benchmark such a query)?

It seems like your idea is more analytical, and my current
implementation is more guesswork. I like the analytical approach, but I
don't know that we have enough information to pull it off because we're
missing what's in the OS buffer cache. The OS buffer cache is crucial to
Synchronized Scanning, because shared buffers are evicted based on a
more complex set of circumstances, whereas the OS buffer cache is
usually LRU and forms a nicer cache trail (upon which Synchronized
Scanning is largely based). 

If you have some tests you'd like me to run, I'm planning to do some
benchmarks this week and next. I can see if my current patch holds up
under the scenarios you're worried about.

Regards,
Jeff Davis






Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Florian G. Pflug

Simon Riggs wrote:

On Mon, 2007-03-05 at 14:41 -0500, Tom Lane wrote:

Simon Riggs [EMAIL PROTECTED] writes:

Itagaki-san and I were discussing in January the idea of cache-looping,
whereby a process begins to reuse its own buffers in a ring of ~32
buffers. When we cycle back round, if usage_count==1 then we assume that
we can reuse that buffer. This avoids cache swamping for read and write
workloads, plus avoids too-frequent WAL writing for VACUUM.
This would maintain the beneficial behaviour for OLTP,

Justify that claim.  It sounds to me like this would act very nearly the
same as having shared_buffers == 32 ...


Sure. We wouldn't set the hint for IndexScans or Inserts, only for
SeqScans, VACUUM and COPY.

So OLTP-only workloads would be entirely unaffected. In the presence of
a mixed workload the scan tasks would have only a limited effect on the
cache, maintaining performance for the response time critical tasks. So
its an OLTP benefit because of cache protection and WAL-flush reduction
during VACUUM.

As we've seen, the scan tasks look like they'll go faster with this.


But it would break the idea of letting a second seqscan follow in the
first's hot cache trail, no?

greetings, Florian Pflug
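
To picture the cache-looping idea concretely, here is a hedged sketch of
a 32-buffer ring; the helper declarations are stand-ins for real bufmgr
calls and every name is invented, so this illustrates the shape of the
scan_recycle_buffers idea rather than Simon's actual patch.

#define RING_SIZE 32

/* stand-ins for real bufmgr operations (declarations only) */
extern int alloc_buffer_from_pool(void);
extern int buffer_usage_count(int buf);

typedef struct
{
    int ring[RING_SIZE];   /* buffer ids owned by this scan */
    int nfilled;           /* slots populated so far */
    int next;              /* next slot to recycle */
} ScanRing;

/*
 * Return a buffer for the next page of a large scan.  While filling,
 * take buffers from the shared pool; once full, cycle around and
 * reuse our own previous buffer unless another backend touched it in
 * the meantime (usage_count above 1), in which case leave that page
 * to the main cache and grab a fresh buffer instead.
 */
int
ring_get_buffer(ScanRing *r)
{
    int buf;

    if (r->nfilled < RING_SIZE)
    {
        buf = alloc_buffer_from_pool();
        r->ring[r->nfilled++] = buf;
        return buf;
    }
    buf = r->ring[r->next];
    if (buffer_usage_count(buf) > 1)
    {
        buf = alloc_buffer_from_pool();   /* don't steal a hot page */
        r->ring[r->next] = buf;
    }
    r->next = (r->next + 1) % RING_SIZE;
    return buf;
}

Note the usage_count test: a page some other backend has found useful is
left in the main cache rather than recycled, which is arguably what keeps
the ring gentler to a trailing scan than shared_buffers=32 would be.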






Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Mark Kirkwood

Tom Lane wrote:


But what I wanted to see was the curve of
elapsed time vs shared_buffers?



Of course! (let's just write that off to me being pre-coffee...).

With the patch applied:

Shared Buffers  Elapsed  vmstat IO rate
--------------  -------  --------------
400MB           101 s    122 MB/s
2MB             101 s
1MB              97 s
768KB            94 s
512KB            84 s
256KB            79 s
128KB            75 s    166 MB/s

Looks *very* similar.

Cheers

Mark



Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Tom Lane
Mark Kirkwood [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 But what I wanted to see was the curve of
 elapsed time vs shared_buffers?
 ...
 Looks *very* similar.

Yup, thanks for checking.

I've been poking into this myself.  I find that I can reproduce the
behavior to some extent even with a slow disk drive (this machine is a
dual 2.8GHz Xeon EM64T running Fedora Core 5; the dd-to-dev-null test
shows the disk read speed as 43MB/sec or so).  Test case is a
several-gig table, no indexes, fully vacuumed so that neither VACUUM nor
COUNT(*) have to do anything but seqscan as fast as they can.  Given a
*freshly started* postmaster, I see

regression=# show shared_buffers;
 shared_buffers
----------------
 128MB
(1 row)

regression=# \timing
Timing is on.
regression=# vacuum lineitem;
VACUUM
Time: 63652.333 ms
regression=# vacuum lineitem;
VACUUM
Time: 63562.303 ms
regression=# select count(*) from lineitem;
  count
----------
  1024
(1 row)

Time: 63142.174 ms
regression=# vacuum lineitem;
VACUUM
Time: 61638.421 ms
regression=# vacuum lineitem;
VACUUM
Time: 61785.905 ms

I didn't show it here, but you can repeat the VACUUM all you want before
the SELECT, and its times are stable; and you can repeat all you want
after the SELECT, and the times are stable but a couple seconds lower.
Restart the postmaster and it goes back to the slower behavior.  (I'm
keeping autovac off so it doesn't change the results.)

I decided to get down and dirty with oprofile, and soon found that the
user-space CPU consumption is indistinguishable in both states:

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 24
GLOBAL_POWER_E...|
  samples|  %|
--
   141065 73.8193 /usr/lib/debug/lib/modules/2.6.18-1.2200.fc5/vmlinux
26368 13.7984 /home/tgl/testversion/bin/postgres
12765  6.6799 /libata
 2238  1.1711 /lib64/libc-2.4.so
 1112  0.5819 /dm_mod

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 24
GLOBAL_POWER_E...|
  samples|  %|
--
   113177 70.2169 /usr/lib/debug/lib/modules/2.6.18-1.2200.fc5/vmlinux
26284 16.3070 /home/tgl/testversion/bin/postgres
12004  7.4475 /libata
 2093  1.2985 /lib64/libc-2.4.so
  996  0.6179 /dm_mod

Inside the kernel, there's only one routine that's significantly different:

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 24
samples  %symbol name
57779    40.9591  copy_user_generic
18175    12.8841  __delay
3994  2.8313  _raw_spin_lock
2388  1.6928  put_page
2184  1.5482  mwait_idle
2083  1.4766  _raw_write_unlock
1909  1.3533  _raw_write_lock

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 24
samples  %symbol name
37437    33.0783  copy_user_generic
17891    15.8080  __delay
3372  2.9794  _raw_spin_lock
2218  1.9598  mwait_idle
2067  1.8263  _raw_write_unlock
1837  1.6231  _raw_write_lock
1531  1.3527  put_page

So that's part of the mystery: apparently copy_user_generic is coded in
such a way that it's faster to copy into memory that's already in
processor cache.  This strikes me as something that probably
could/should be fixed in the kernel; I don't see any good reason why
overwriting a whole cache line oughtn't be the same speed either way.

The other thing that was bothering me is why does the SELECT change
VACUUM's behavior?  A debugging printout added to ReadBuffer gave the
answer: after postmaster start, we see things like

read block 353094 into buffer 11386
read block 353095 into buffer 11387
read block 353096 into buffer 11388
read block 353097 into buffer 11389
read block 353098 into buffer 11390
read block 353099 into buffer 11391
read block 353100 into buffer 11392
read block 353101 into buffer 11393
read block 353102 into buffer 11394
read block 353103 into buffer 11395

and after the SELECT it behaves like

read block 336761 into buffer 9403
read block 336762 into buffer 9402
read block 336763 into buffer 9403
read block 336764 into buffer 9402
read block 336765 into buffer 9403
read block 336766 into buffer 9402
read block 336767 into buffer 9403
read block 336768 into buffer 9402
read block 336769 into buffer 9403
read block 336770 into buffer 9402

What's going on is that VACUUM puts each buffer it's finished with on
the tail of the freelist.  In the post-SELECT state, there are just two
buffers cycling through the freelist (not sure why not only one, but it
doesn't matter) and so the cache 
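
The message is truncated in the archive, but the mechanism it describes
is easy to model: with allocations popping from the freelist head and
VACUUM pushing each finished buffer back on the tail, a nearly-empty
freelist degenerates into a couple of buffers used round-robin, which is
exactly the two-buffer trace above. A toy demonstration (not the real
freelist code, which threads a list through the BufferDesc array):

#include <stdio.h>

#define NSLOTS 4

/* toy FIFO freelist holding buffer ids */
typedef struct
{
    int items[NSLOTS];
    int head, tail, count;
} FreeList;

void fl_push(FreeList *fl, int buf)            /* onto the tail */
{
    fl->items[fl->tail] = buf;
    fl->tail = (fl->tail + 1) % NSLOTS;
    fl->count++;
}

int fl_pop(FreeList *fl)                       /* off the head */
{
    int buf = fl->items[fl->head];
    fl->head = (fl->head + 1) % NSLOTS;
    fl->count--;
    return buf;
}

int main(void)
{
    FreeList fl = {{0}, 0, 0, 0};
    int i;

    /* after the SELECT, suppose only buffers 9402 and 9403 are free */
    fl_push(&fl, 9402);
    fl_push(&fl, 9403);

    /* VACUUM: take a buffer for each block, then hand it straight back */
    for (i = 0; i < 6; i++)
    {
        int buf = fl_pop(&fl);
        printf("read next block into buffer %d\n", buf);
        fl_push(&fl, buf);   /* finished with it: back on the tail */
    }
    return 0;
}

Run as-is, it prints buffers 9402 and 9403 alternating, matching the
debugging trace quoted above.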

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Gregory Stark

Tom Lane [EMAIL PROTECTED] writes:

 I don't see any good reason why overwriting a whole cache line oughtn't be
 the same speed either way.

I can think of a couple theories, but I don't know if they're reasonable. The
one that comes to mind is the inter-processor cache coherency protocol. When
writing to a cache line the processor already owns, maybe it can skip having to
check for other processors owning that cache line?

What happens if VACUUM comes across buffers that *are* already in the buffer
cache. Does it throw those on the freelist too? That seems like it would be
dangerous if they were in the buffer cache for a reason.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com



Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Luke Lonergan
Hi Tom,

Good info - it's the same in Solaris, the routine is uiomove (Sherry wrote it).


- Luke

Msg is shrt cuz m on ma treo

 -Original Message-
From:   Tom Lane [mailto:[EMAIL PROTECTED]
Sent:   Monday, March 05, 2007 07:43 PM Eastern Standard Time
To: Mark Kirkwood
Cc: Pavan Deolasee; Gavin Sherry; Luke Lonergan; PGSQL Hackers; Doug Rady; 
Sherry Moore
Subject:Re: [HACKERS] Bug: Buffer cache is not scan resistant

[ Quoted original message snipped -- identical to Tom Lane's post above. ]

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Tom Lane
Gregory Stark [EMAIL PROTECTED] writes:
 What happens if VACUUM comes across buffers that *are* already in the buffer
 cache. Does it throw those on the freelist too?

Not unless they have usage_count 0, in which case they'd be subject to
recycling by the next clock sweep anyway.

regards, tom lane



Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Tom Lane
Luke Lonergan [EMAIL PROTECTED] writes:
 Good info - it's the same in Solaris, the routine is uiomove (Sherry
 wrote it).

Cool.  Maybe Sherry can comment on the question of whether it's possible
for a large-scale memcpy to not take a hit when filling a cache line
that wasn't previously in cache?

I looked a bit at the Linux code that's being used here, but it's all
x86_64 assembler which is something I've never studied :-(.

regards, tom lane



Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Luke Lonergan
Tom,

On 3/5/07 7:58 PM, Tom Lane [EMAIL PROTECTED] wrote:

 I looked a bit at the Linux code that's being used here, but it's all
 x86_64 assembler which is something I've never studied :-(.

Here's the C wrapper routine in Solaris:

http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/os/move.c

Here's the x86 assembler routine for Solaris:
  
http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/intel/ia32/ml/copy.s

The actual uiomove routine is a simple wrapper that calls the assembler
kcopy or xcopyout routines.  There are two versions (for Opteron), one that
uses the NTA instructions that bypass the L2 cache on writing to avoid L2
cache pollution, and the second writes normally - through the L2 cache.
Which one is used depends on a parameter (global) based on the size of the
I/O. It is tuned to identify operations that might pollute the L2 cache
(sound familiar?) 

I think what we're seeing is a generic artifact of the write-through
behavior of the cache.  I wouldn't expect this to get any better with
DIRECTIO to the shared_buffers in pgsql - if we iterate over a large number
of user space buffers we'll still hit the increased L2 thrashing.

I think we're best off with a hybrid approach - when we detect a seq scan
larger (much larger?) than buffer cache, we can switch into the cache
bypass behavior, much like the above code uses the NTA instruction when
appropriate.

We can handle syncscan using a small buffer space.

- Luke
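
For reference, the cache-bypassing stores Luke is referring to are
reachable from C via SSE2 intrinsics. A hedged sketch of such a copy loop
follows, as an illustration of the instruction rather than proposed
PostgreSQL code; alignment and tail handling are deliberately simplified.

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>

/*
 * Copy len bytes with non-temporal (streaming) stores, which bypass
 * the cache on the write side -- the software analogue of the NTA
 * path described above.  Assumes 16-byte-aligned pointers and a
 * multiple-of-16 length; a real version would handle the tails.
 */
void
copy_nta(void *dst, const void *src, size_t len)
{
    __m128i       *d = (__m128i *) dst;
    const __m128i *s = (const __m128i *) src;
    size_t         i;

    for (i = 0; i < len / 16; i++)
        _mm_stream_si128(&d[i], _mm_load_si128(&s[i]));

    _mm_sfence();   /* order the streamed stores before returning */
}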





Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Tom Lane
Luke Lonergan [EMAIL PROTECTED] writes:
 Here's the x86 assembler routine for Solaris:
 http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/intel/ia32
 /ml/copy.s
 The actual uiomove routine is a simple wrapper that calls the assembler
 kcopy or xcopyout routines.  There are two versions (for Opteron), one that
 uses the NTA instructions that bypass the L2 cache on writing to avoid L2
 cache pollution, and the second writes normally - through the L2 cache.

Hm.  If it were bypassing the L2 cache then the hit would be taken when
PG later tries to read the data.  This is clearly not what's happening
in the Linux system, but I've not seen any oprofile-equivalent data for
Solaris?

regards, tom lane



Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Jim Nasby

On Mar 5, 2007, at 11:46 AM, Josh Berkus wrote:

Tom,

I seem to recall that we've previously discussed the idea of letting the
clock sweep decrement the usage_count before testing for 0, so that a
buffer could be reused on the first sweep after it was initially used,
but that we rejected it as being a bad idea.  But at least with large
shared_buffers it doesn't sound like such a bad idea.


We did discuss a number of formulas for setting buffers with different
clock-sweep numbers, including ones with higher usage_count for indexes
and starting numbers of 0 for large seq scans as well as vacuums.
However, we didn't have any way to prove that any of these complex
algorithms would result in higher performance, so went with the simplest
formula, with the idea of tinkering with it when we had more data.  So
maybe now's the time.

Note, though, that the current algorithm is working very, very well for
OLTP benchmarks, so we'd want to be careful not to gain performance in
one area at the expense of another.  In TPCE testing, we've been able to
increase shared_buffers to 10GB with beneficial performance effect
(numbers posted when I have them) and even found that taking over RAM
with the shared_buffers (ala Oracle) gave us equivalent performance to
using the FS cache.  (yes, this means with a little I/O management
engineering we could contemplate discarding use of the FS cache for a
net performance gain.  Maybe for 8.4)


An idea I've been thinking about would be to have the bgwriter or some
other background process actually try and keep the free list populated,
so that backends needing to grab a page would be much more likely to
find one there (and not have to wait to scan through the entire buffer
pool, perhaps multiple times).

My thought is to keep track of how many page requests occurred during a
given interval, and use that value (probably averaged over time) to
determine how many pages we'd like to see on the free list. The
background process would then run through the buffers decrementing usage
counts until it found enough for the free list. Before putting a buffer
on the free list, it would write the buffer out; I'm not sure if it
would make sense to de-associate the buffer with whatever it had been
storing or not, though. If we don't do that, that would mean that we
could pull pages back off the free list if we wanted to. That would be
helpful if the background process got a bit over-zealous.

--
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)
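
A sketch of the bookkeeping that idea implies, assuming a simple
exponential moving average as the "averaged over time" estimate; every
name below is hypothetical and the helper declarations stand in for real
bufmgr operations.

/* stand-ins for real bufmgr operations (declarations only) */
extern int  freelist_length(void);
extern int  find_and_clean_victim(void);  /* sweep, write out, return id */
extern void push_on_freelist(int buf);

typedef struct
{
    double avg_requests;   /* smoothed page requests per interval */
} FreelistGovernor;

/*
 * Called once per bgwriter interval: fold in the latest demand
 * figure, then top the freelist up toward that level, stopping if
 * no evictable buffer can be found right now.
 */
void
maintain_freelist(FreelistGovernor *g, int requests_this_interval)
{
    int target;

    /* EMA with smoothing factor 0.2 (an arbitrary choice) */
    g->avg_requests = 0.8 * g->avg_requests
                    + 0.2 * (double) requests_this_interval;
    target = (int) g->avg_requests;

    while (freelist_length() < target)
    {
        int buf = find_and_clean_victim();

        if (buf < 0)
            break;            /* nothing evictable at the moment */
        push_on_freelist(buf);
    }
}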





Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Jim Nasby

On Mar 5, 2007, at 2:03 PM, Heikki Linnakangas wrote:

Another approach I proposed back in December is to not have a variable
like that at all, but scan the buffer cache for pages belonging to the
table you're scanning to initialize the scan. Scanning all the
BufferDescs is a fairly CPU and lock heavy operation, but it might be ok
given that we're talking about large I/O bound sequential scans. It
would require no DBA tuning and would work more robustly in varying
conditions. I'm not sure where you would continue after scanning the
in-cache pages. At the highest in-cache block number, perhaps.


If there was some way to do that, it'd be what I'd vote for.

Given the partitioning of the buffer lock that Tom did it might not be
that horrible for many cases, either, since you'd only need to scan
through one partition.

We also don't need an exact count, either. Perhaps there's some way we
could keep a counter or something...

--
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)





Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Tom Lane
Jim Nasby [EMAIL PROTECTED] writes:
 An idea I've been thinking about would be to have the bgwriter or some
 other background process actually try and keep the free list populated,

The bgwriter already tries to keep pages just in front of the clock
sweep pointer clean.

regards, tom lane



[HACKERS] Bug: Buffer cache is not scan resistant

2007-03-04 Thread Luke Lonergan
I'm putting this out there before we publish a fix so that we can discuss
how best to fix it.

Doug and Sherry recently found the source of an important performance issue
with the Postgres shared buffer cache.

The issue is summarized like this: the buffer cache in PGSQL is not scan
resistant as advertised.  A sequential scan of a table larger than cache
will pollute the buffer cache in almost all circumstances.

Here is performance of GPDB 2.301 (Postgres 8.1.6) on a single X4500
(thumper-3) with 4 cores where bigtable is a table 2x the size of RAM and
memtable is a table that fits into I/O cache:

With our default setting of shared_buffers (16MB):

Operation        memtable   bigtable
-------------------------------------
SELECT COUNT(*)  1221 MB/s  973 MB/s
VACUUM           1709 MB/s  1206 MB/s

We had observed that VACUUM would perform better when done right after a
SELECT.  In the above example, the faster rate from disk was 1608 MB/s,
compared to the normal rate of 1206 MB/s.

We verified this behavior on Postgres 8.2 as well.  The buffer selection
algorithm is choosing buffer pages scattered throughout the buffer cache in
almost all circumstances.

Sherry traced the behavior to the processor repeatedly flushing the L2
cache.  Doug found that we weren't using the Postgres buffer cache the way
we expected, instead we were loading the scanned data from disk into the
cache even though there was no possibility of reusing it.  In addition to
pushing other, possibly useful pages from the cache, it has the additional
behavior of invalidating the L2 cache for the remainder of the executor path
that uses the data.

To prove that the buffer cache was the source of the problem, we dropped the
shared buffer size to fit into L2 cache (1MB per Opteron core), and this is
what we saw (drop size of shared buffers to 680KB):

Operation        memtable   bigtable
-------------------------------------
SELECT COUNT(*)  1320 MB/s  1059 MB/s
VACUUM           3033 MB/s  1597 MB/s

These results do not vary with the order of operations.

Thoughts on the best way to fix the buffer selection algorithm?  Ideally,
one page would be used in the buffer cache in circumstances where the table
to be scanned is (significantly?) larger than the size of the buffer cache.
 
- Luke
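
One way to frame the fix being asked for here: decide at scan start, from
the relation size versus the pool size, whether to constrain the scan's
cache footprint. A hedged sketch of that heuristic follows; the
one-quarter threshold is a guess and the strategy names are invented, so
this mirrors the ring-buffer ideas discussed later in the thread rather
than any committed code.

typedef unsigned int BlockNumber;

typedef enum
{
    STRATEGY_NORMAL,   /* use the whole shared buffer pool */
    STRATEGY_RING      /* recycle a small set of private buffers */
} ScanStrategy;

/*
 * Decide at scan start whether this sequential scan could swamp the
 * cache.  A table much bigger than shared_buffers cannot profit from
 * having its pages cached once each, so give it a tiny footprint.
 */
ScanStrategy
choose_seqscan_strategy(BlockNumber table_blocks, int NBuffers)
{
    if (table_blocks > (BlockNumber) NBuffers / 4)
        return STRATEGY_RING;
    return STRATEGY_NORMAL;
}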





Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-04 Thread Tom Lane
Luke Lonergan [EMAIL PROTECTED] writes:
 The issue is summarized like this: the buffer cache in PGSQL is not scan
 resistant as advertised.

Sure it is.  As near as I can tell, your real complaint is that the
bufmgr doesn't attempt to limit its usage footprint to fit in L2 cache;
which is hardly surprising considering it doesn't know the size of L2
cache.  That's not a consideration that we've ever taken into account.

I'm also less than convinced that it'd be helpful for a big seqscan:
won't reading a new disk page into memory via DMA cause that memory to
get flushed from the processor cache anyway?  I wonder whether your
numbers are explained by some other consideration than you think.

regards, tom lane



Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-04 Thread Luke Lonergan
When we instrument the page selections made within the buffer cache, they are 
sequential and span the entire address space of the cache.

With respect to whether it's L2, it's a conclusion based on the experimental 
results.  It's not the TLB, as we also tested for the 512 entries of each L2 
TLB.

One thing I left out of the previous post: the difference between fast and slow 
behavior was that in the fast case, the buffer selection alternated between two 
buffer pages.  This was the case only when the preceding statement was a SELECT 
and the current statement was VACUUM.

- Luke

Msg is shrt cuz m on ma treo

 -Original Message-
From:   Tom Lane [mailto:[EMAIL PROTECTED]
Sent:   Sunday, March 04, 2007 08:36 PM Eastern Standard Time
To: Luke Lonergan
Cc: PGSQL Hackers; Doug Rady; Sherry Moore
Subject:Re: [HACKERS] Bug: Buffer cache is not scan resistant

[ Quoted original message snipped -- identical to Tom Lane's reply above. ]



Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-04 Thread Luke Lonergan
One more thing: the L2 is invalidated when re-written from the kernel IO cache, 
but the pages addressed in L2 retain their values when 'written through', which 
allows the new data to be re-used up the executor chain.

- Luke

Msg is shrt cuz m on ma treo

 -Original Message-
From:   Tom Lane [mailto:[EMAIL PROTECTED]
Sent:   Sunday, March 04, 2007 08:36 PM Eastern Standard Time
To: Luke Lonergan
Cc: PGSQL Hackers; Doug Rady; Sherry Moore
Subject:Re: [HACKERS] Bug: Buffer cache is not scan resistant

[ Quoted original message snipped -- identical to Tom Lane's reply above. ]



Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-04 Thread Mark Kirkwood

Tom Lane wrote:

Luke Lonergan [EMAIL PROTECTED] writes:

The issue is summarized like this: the buffer cache in PGSQL is not scan
resistant as advertised.


Sure it is.  As near as I can tell, your real complaint is that the
bufmgr doesn't attempt to limit its usage footprint to fit in L2 cache;
which is hardly surprising considering it doesn't know the size of L2
cache.  That's not a consideration that we've ever taken into account.



To add a little to this - forgetting the scan resistant point for the
moment... cranking down shared_buffers to be smaller than the L2 cache
seems to help *any* sequential scan immensely, even on quite modest HW:

e.g.: PIII 1.26GHz, 512KB L2 cache, 2GB RAM,

SELECT count(*) FROM lineitem (which is about 11GB) performance:

Shared_buffers  Elapsed
--------------  -------
400MB           101 s
128KB            74 s

When I've profiled this activity, I've seen a lot of time spent
searching for/allocating a new buffer for each page being fetched.
Obviously having fewer of them to search through will help, but having
less than the L2 cache-size worth of 'em seems to help a whole lot!

Cheers

Mark







Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-04 Thread Gavin Sherry
On Mon, 5 Mar 2007, Mark Kirkwood wrote:

 To add a little to this - forgetting the scan resistant point for the
 moment... cranking down shared_buffers to be smaller than the L2 cache
 seems to help *any* sequential scan immensely, even on quite modest HW:

 e.g.: PIII 1.26GHz, 512KB L2 cache, 2GB RAM,

 SELECT count(*) FROM lineitem (which is about 11GB) performance:

 Shared_buffers  Elapsed
 --------------  -------
 400MB           101 s
 128KB            74 s

 When I've profiled this activity, I've seen a lot of time spent
 searching for/allocating a new buffer for each page being fetched.
 Obviously having fewer of them to search through will help, but having
 less than the L2 cache-size worth of 'em seems to help a whole lot!

Could you demonstrate that point by showing us timings for shared_buffers
sizes from 512K up to, say, 2 MB? The two numbers you give there might
just have to do with managing a large buffer.

Thanks,

Gavin



Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-04 Thread Luke Lonergan
Gavin, Mark,

 Could you demonstrate that point by showing us timings for 
 shared_buffers sizes from 512K up to, say, 2 MB? The two 
 numbers you give there might just have to do with managing a 
 large buffer.

I suggest two experiments that we've already done:
1) increase shared buffers to double the L2 cache size, you should see
that the behavior reverts to the slow performance and is constant at
larger sizes

2) instrument the calls to BufferGetPage() (a macro) and note that the
buffer block numbers returned increase sequentially during scans of
tables larger than the buffer size

- Luke




Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-04 Thread Tom Lane
Gavin Sherry [EMAIL PROTECTED] writes:
 Could you demonstrate that point by showing us timings for shared_buffers
 sizes from 512K up to, say, 2 MB? The two numbers you give there might
 just have to do with managing a large buffer.

Using PG CVS HEAD on 64-bit Intel Xeon (1MB L2 cache), Fedora Core 5,
I don't measure any noticeable difference in seqscan speed for
shared_buffers set to 32MB or 256kB.  I note that the code would
not let me choose the latter setting without a large decrease in
max_connections, which might be expected to cause some performance
changes in itself.

Now this may only prove that the disk subsystem on this machine is
too cheap to let the system show any CPU-related issues.  I'm seeing
a scan rate of about 43MB/sec for both count(*) and plain ol' wc,
which is a factor of 4 or so less than Mark's numbers suggest...
but top shows CPU usage of less than 5%, so even with a 4x faster
disk I'd not really expect that CPU speed would become interesting.

(This is indeed a milestone, btw, because it wasn't so long ago that
count(*) was nowhere near disk speed.)

regards, tom lane
