Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-08 Thread Sherry Moore
Hi Simon,

 and what you haven't said
 
 - all of this is orthogonal to the issue of buffer cache spoiling in
 PostgreSQL itself. That issue does still exist as a non-OS issue, but
 we've been discussing in detail the specific case of L2 cache effects
 with specific kernel calls. All of the test results have been
 stand-alone, so we've not done any measurements in that area. I say this
 because you make the point that reducing the working set size of write
 workloads has no effect on the L2 cache issue, but ISTM it's still
 potentially a cache spoiling issue.

What I wanted to point out was that (reiterating to avoid requoting),

- My test was simply to demonstrate that the observed performance
  difference with VACUUM came down to whether the size of the
  user buffer caused L2 thrashing.

- In general, an application should reduce the size of its working set
  to reduce the penalty of TLB misses and cache misses.

- If the application's access pattern meets the NTA trigger condition,
  the benefit of reducing the working set size will be much smaller.
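
  (The NTA trigger condition is spelled out in the original mail later
  in this thread: most writes bypass L2 unconditionally, while reads do
  so only when they are larger than copyout_max_cached, 128K by default.)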

Whatever I said is probably orthogonal to the buffer cache issue you
guys have been discussing, but I haven't read the whole email exchange
on the subject.

Thanks,
Sherry
-- 
Sherry Moore, Solaris Kernel Development        http://blogs.sun.com/sherrym



Re: [HACKERS] NTA access on Solaris

2007-03-06 Thread Sherry Moore
 With copyout_max_cached being 8K:
                                ^^
                                4K

 Working set   16K   32K   64K   128K  256K  512K  1M    2M    128M
 Seconds       4.8   4.8   4.9   4.9   5.0   5.0   5.0   5.0   5.1

Sherry




Re: [HACKERS] NTA access on Solaris

2007-03-06 Thread Sherry Moore
On a 1P system with a 512K L2, it is more obvious why we shouldn't
bypass L2 for small reads:

The same readtest as in my previous mail, invoked as follows:

./readtest -s working-set-size -f /platform/i86pc/boot_archive -n 100

With copyout_max_cached being 128K:

Working set   16K   32K   64K   128K  256K  512K  1M    2M    128M
Seconds       4.2   4.0   4.1   4.1   5.7   7.0   7.1   7.0   7.5

With copyout_max_cached being 8K:

Working set   16K   32K   64K   128K  256K  512K  1M    2M    128M
Seconds       4.8   4.8   4.9   4.9   5.0   5.0   5.0   5.0   5.1
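
(Reading the two tables together: with the 128K threshold, working sets
up to 128K finish in 4.0-4.2 seconds because the reads stay cached,
versus 4.8-4.9 seconds when NTA is forced with the 4K threshold; for
working sets of 256K and above, the cached path degrades to 5.7-7.5
seconds while the NTA path holds steady at about 5 seconds.)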


Sherry

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-06 Thread Sherry Moore
Hi Tom,

Sorry about the delay.  I have been away from computers all day.

In the current Solaris release in development (code name Nevada,
available for download at http://opensolaris.org), I have implemented
non-temporal access (NTA), which bypasses L2 for most writes and for
reads larger than copyout_max_cached (patchable; defaults to 128K).
The block size used by Postgres is 8KB.  If I patch copyout_max_cached
to 4KB to trigger NTA for reads, the access times with a 16KB buffer
and a 128MB buffer are very close.
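
To illustrate the mechanism (this is not the Solaris uiomove()/copyout()
code, just a user-level sketch): on x86, a non-temporal copy can be
written with SSE2 streaming stores, which go out to memory without
allocating lines in L2.  The sketch assumes 16-byte-aligned buffers and
a length that is a multiple of 16:

#include <emmintrin.h>  /* _mm_load_si128, _mm_stream_si128, _mm_sfence */
#include <stddef.h>

/*
 * Illustrative only: stream len bytes from src to dst with
 * non-temporal stores so the destination data does not displace
 * existing L2 contents.
 */
static void
copy_nta(void *dst, const void *src, size_t len)
{
        __m128i *d = (__m128i *)dst;
        const __m128i *s = (const __m128i *)src;
        size_t i;

        for (i = 0; i < len / 16; i++) {
                __m128i v = _mm_load_si128(&s[i]);      /* cached load */
                _mm_stream_si128(&d[i], v);             /* non-temporal store */
        }
        _mm_sfence();   /* order the streamed stores before returning */
}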

I wrote readtest (attached) to simulate the access pattern of VACUUM.
tread, the machine in the transcripts below, is a 4-socket dual-core
Opteron box.

81 tread ./readtest -h
Usage: readtest [-v] [-N] -s size -n iter [-d delta] [-c count]
-v:             Verbose mode
-N:             Normalize results by number of reads
-s size:        Working set size (may specify K,M,G suffix)
-n iter:        Number of test iterations
-f filename:    Name of the file to read from
-d [+|-]delta:  Distance between subsequent reads
-c count:       Number of reads
-h:             Print this help

With copyout_max_cached at 128K (in nanoseconds, NTA not triggered):

82 tread ./readtest -s 16k -f boot_archive   
46445262
83 tread ./readtest -s 128M -f boot_archive  
118294230
84 tread ./readtest -s 16k -f boot_archive -n 100
4230210856
85 tread ./readtest -s 128M -f boot_archive -n 100
6343619546

With copyout_max_cached at 4K (in nanoseconds, NTA triggered):

89 tread ./readtest -s 16k -f boot_archive
43606882
90 tread ./readtest -s 128M -f boot_archive 
100547909
91 tread ./readtest -s 16k -f boot_archive -n 100
4251823995
92 tread ./readtest -s 128M -f boot_archive -n 100
4205491984

When the iteration count is 1 (the default), the timing difference
between the 16k buffer and the 128M buffer is much bigger for both
copyout_max_cached sizes, mostly due to the cost of TLB misses.  When
the iteration count is bigger, most of the page tables are already in
the Page Descriptor Cache for the later page accesses, so the overhead
of TLB misses becomes smaller.  As you can see, when we do bypass L2,
the performance with either buffer size is comparable.
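
(Normalizing the 100-iteration runs makes this concrete: with
copyout_max_cached at 128K, the average per iteration is about 42.3 ms
for the 16k buffer versus 63.4 ms for the 128M buffer; with it at 4K,
the averages are about 42.5 ms and 42.1 ms, essentially identical.)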

I am sure your next question is why the 128K limitation for reads.
Here are the main reasons:

- Based on a lot of the benchmarks and workloads I traced, the
  target buffer of a read operation is typically accessed again
  shortly after the read, while that of a write usually is not.
  Therefore, the default operation mode is to bypass L2 for writes,
  but not for reads.

- The Opteron's L1 D-cache size is 64K.  A read larger than 128KB
  would displacement-flush its own earlier lines anyway, so for
  large reads I also bypass L2.  I am working on dynamically
  setting copyout_max_cached based on the L1 D-cache size of the
  system (the decision is sketched just below).
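
A minimal sketch of that decision (hypothetical: copy_cached(),
copy_nontemporal(), and the is_write flag are stand-ins here, not the
actual kernel interfaces):

#include <stddef.h>
#include <string.h>

size_t copyout_max_cached = 128 * 1024;         /* tunable; default 128K */

/* Stand-ins for the kernel's cached and non-temporal copy routines. */
static void copy_cached(void *d, const void *s, size_t n) { memcpy(d, s, n); }
static void copy_nontemporal(void *d, const void *s, size_t n) { memcpy(d, s, n); }

/*
 * Writes always bypass L2; reads bypass it only when they are
 * larger than the copyout_max_cached threshold.
 */
static void
copy_select(void *dst, const void *src, size_t len, int is_write)
{
        if (is_write || len > copyout_max_cached)
                copy_nontemporal(dst, src, len);        /* bypass L2 */
        else
                copy_cached(dst, src, len);             /* cached copy */
}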

The above heuristic should have worked well in Luke's test case.
However, because the transfer was done as 16,000 8KB reads rather than
one 128MB read, each individual read fell below the 128K threshold and
the NTA code was not triggered.

Since the OS code has to be general enough to handle most workloads,
we have to pick some defaults that might not work best for some
specific operations.  It is a calculated balance.

Thanks,
Sherry


On Mon, Mar 05, 2007 at 10:58:40PM -0500, Tom Lane wrote:
 Luke Lonergan [EMAIL PROTECTED] writes:
  Good info - it's the same in Solaris, the routine is uiomove (Sherry
  wrote it).
 
 Cool.  Maybe Sherry can comment on the question whether it's possible
 for a large-scale-memcpy to not take a hit on filling a cache line
 that wasn't previously in cache?
 
 I looked a bit at the Linux code that's being used here, but it's all
 x86_64 assembler which is something I've never studied :-(.
 
   regards, tom lane

-- 
Sherry Moore, Solaris Kernel Development        http://blogs.sun.com/sherrym
#include <stdlib.h>
#include <stdio.h>
#include <ctype.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/param.h>
#include <sys/time.h>
#include <sys/mman.h>
#include <errno.h>
#include <thread.h>
#include <signal.h>
#include <strings.h>
#include <libgen.h>

#define KB(a)   ((a) * 1024)
#define MB(a)   (KB(a) * 1024)

static void
usage(char *s)
{
        fprintf(stderr,
            "Usage: %s [-v] [-N] -s size -n iter "
            "[-d delta] [-c count]\n", s);
        fprintf(stderr,
            "\t-v:\t\tVerbose mode\n"
            "\t-N:\t\tNormalize results by number of reads\n"
            "\t-s size:\tWorking set size (may specify K,M,G suffix)\n"
            "\t-n iter:\tNumber of test iterations\n"
            "\t-f filename:\tName of the file to read from\n"
            "\t-d [+|-]delta:\tDistance between subsequent reads\n"
            "\t-c count:\tNumber of reads\n"
            "\t-h:\t\tPrint this help\n");
        exit(1);
}

#define ABS(x)  ((x) >= 0 ? (x) : -(x))

/* Reduce v to the largest unit (G, M, or K) that divides it evenly. */
static void
format_num(size_t v, size_t *new, char *code)
{
        if (v % (1024 * 1024 * 1024) == 0) {
                *new = v / (1024 * 1024 * 1024);
                *code = 'G';
        } else if (v % (1024 * 1024) == 0) {
                *new = v / (1024 * 1024);
                *code = 'M';
        } else if (v % 1024 == 0) {
                *new = v / 1024;
                *code = 'K';
        } else {
                *new = v;
                *code = ' ';
        }
}