Re: [HACKERS] O_DIRECT in freebsd

2003-10-30 Thread Jordan Henderson
Personally, I think it is useful to have features.  I quite understand the
difficulties in maintaining some features, however.  Also, having worked on
internals for commercial DB engines, I have seen specifically how code/data
paths can be shortened.  I would not make the choice for someone to be forced
into using a product in a specific manner.  Instead, I would let them decide
whether to choose a simple setup or, if they are up to it, something with
better performance.  I would not prune the options out.  In doing so, we
limit what a knowledgeable person can do a priori.

Jordan Henderson

On Thursday 30 October 2003 19:59, Sailesh Krishnamurthy wrote:
> > "Jordan" == Jordan Henderson <[EMAIL PROTECTED]> writes:
>
> Jordan> significantly better results.  I would not say it requires
> Jordan> considerable tuning, but an understanding of data, storage
> Jordan> and access patterns.  Additionally, these features did not
> Jordan> cause our group considerable administrative overhead.
>
> I won't dispute the specifics. I have only worked on the DB2 engine -
> never written an app for it nor administered it. You're right - the
> bottomline is that you can get a significant performance advantage
> provided you care enough to understand what's going on.
>
> Anyway, I merely responded to provide a data point. Will PostgreSQL
> users/administrators care for additional knobs or is there a
> preference for "keep it simple, stupid" ?




Re: [HACKERS] O_DIRECT in freebsd

2003-10-30 Thread Sailesh Krishnamurthy
> "Jordan" == Jordan Henderson <[EMAIL PROTECTED]> writes:

Jordan> significantly better results.  I would not say it requires
Jordan> considerable tuning, but an understanding of data, storage
Jordan> and access patterns.  Additionally, these features did not
Jordan> cause our group considerable administrative overhead.

I won't dispute the specifics. I have only worked on the DB2 engine -
never written an app for it nor administered it. You're right - the
bottomline is that you can get a significant performance advantage
provided you care enough to understand what's going on. 

Anyway, I merely responded to provide a data point. Will PostgreSQL
users/administrators care for additional knobs or is there a
preference for "keep it simple, stupid" ?

-- 
Pip-pip
Sailesh
http://www.cs.berkeley.edu/~sailesh





Re: [HACKERS] O_DIRECT in freebsd

2003-10-30 Thread Dann Corbit
> -Original Message-
> From: Jordan Henderson [mailto:[EMAIL PROTECTED] 
> Sent: Thursday, October 30, 2003 4:31 PM
> To: [EMAIL PROTECTED]; Doug McNaught
> Cc: Christopher Kings-Lynne; PostgreSQL-development
> Subject: Re: [HACKERS] O_DIRECT in freebsd
> 
> My experience with DB2 showed that properly set up DMS tablespaces
> provided a significant performance benefit.  I have also seen that the
> average DBA does not generally understand the data or access patterns in
> the database.  Given that, they don't correctly set up table spaces in
> general, filesystem or raw.  Likewise, where it is possible to tie a
> tablespace to a memory buffer pool, the average DBA does not set it up to
> a performance advantage either.  However, are we talking about well tuned
> setups by someone who does understand the data and the general access
> patterns?  For a DBA like that, they should be able to take advantage of
> these features and get significantly better results.  I would not say it
> requires considerable tuning, but an understanding of data, storage and
> access patterns.  Additionally, these features did not cause our group
> considerable administrative overhead.

If it is possible for a human with knowledge of this domain to make good
decisions, it ought to be possible to store the same information into an
algorithm that operates off of collected statistics.  After some time
has elapsed, and an average access pattern of some sort has been
reached, the available resources could be divided in a fairly efficient
way.  It might be nice to be able to tweak it, but I would rather have
the computer make the calculations for me.

Just a thought.



Re: [HACKERS] O_DIRECT in freebsd

2003-10-30 Thread Jordan Henderson
My experience with DB2 showed that properly set up DMS tablespaces provided a
significant performance benefit.  I have also seen that the average DBA does
not generally understand the data or access patterns in the database.  Given
that, they don't correctly set up table spaces in general, filesystem or raw.
Likewise, where it is possible to tie a tablespace to a memory buffer pool,
the average DBA does not set it up to a performance advantage either.
However, are we talking about well tuned setups by someone who does
understand the data and the general access patterns?  For a DBA like that,
they should be able to take advantage of these features and get significantly
better results.  I would not say it requires considerable tuning, but an
understanding of data, storage and access patterns.  Additionally, these
features did not cause our group considerable administrative overhead.

Jordan Henderson

On Thursday 30 October 2003 12:55, Sailesh Krishnamurthy wrote:
> DB2 supports cooked and raw file systems - SMS (System Managed Space)
> and DMS (Database Managed Space) tablespaces.
>
> The DB2 experience is that DMS tends to outperform SMS but requires
> considerable tuning and administrative overhead to see these wins.




Re: [HACKERS] O_DIRECT in freebsd

2003-10-30 Thread Sailesh Krishnamurthy

DB2 supports cooked and raw file systems - SMS (System Managed Space)
and DMS (Database Managed Space) tablespaces. 

The DB2 experience is that DMS tends to outperform SMS but requires
considerable tuning and administrative overhead to see these wins. 

-- 
Pip-pip
Sailesh
http://www.cs.berkeley.edu/~sailesh





Re: [HACKERS] O_DIRECT in freebsd

2003-10-30 Thread Manfred Spraul
Greg Stark wrote:

> Manfred Spraul <[EMAIL PROTECTED]> writes:
>
>> One problem for WAL is that O_DIRECT would disable the write cache -
>> each operation would block until the data arrived on disk, and that might block
>> other backends that try to access WALWriteLock.
>> Perhaps a dedicated backend that does the writeback could fix that.
>
> aio seems a better fit.
>
>> Has anyone tried to use posix_fadvise for the wal logs?
>> http://www.opengroup.org/onlinepubs/007904975/functions/posix_fadvise.html
>> Linux supports posix_fadvise, it seems to be part of xopen2k.
>
> Odd, I don't see it anywhere in the kernel. I don't know what syscall it's
> using to do this tweaking.

At least in 2.6 it's in linux/mm/fadvise.c; the syscall is fadvise64 or fadvise64_64.

> This is the only option that seems useful for postgres for both the WAL and
> vacuum (though in other threads it seems the problems with vacuum lie
> elsewhere):
>
>   POSIX_FADV_DONTNEED attempts to free cached pages associated with the
>   specified region. This is useful, for example, while streaming large
>   files. A program may periodically request the kernel to free cached
>   data that has already been used, so that more useful cached pages are
>   not discarded instead.
>   Pages that have not yet been written out will be unaffected, so if the
>   application wishes to guarantee that pages will be released, it should
>   call fsync or fdatasync first.

I agree. Either immediately after each flush syscall, or just before
closing a log file and switching to the next.

> Perhaps POSIX_FADV_RANDOM and POSIX_FADV_SEQUENTIAL could be useful in a
> backend before starting a sequential scan or index scan, but I kind of doubt
> it.

IIRC the recommendation is ~20% total memory for the postgres user space
buffers. That's quite a lot - it might be sufficient to protect that
cache from vacuum or sequential scans. AddBufferToFreeList already
contains a comment that this is the right place to try buffer
replacement strategies.

--
   Manfred


Re: [HACKERS] O_DIRECT in freebsd

2003-10-29 Thread Greg Stark

Manfred Spraul <[EMAIL PROTECTED]> writes:

> One problem for WAL is that O_DIRECT would disable the write cache -
> each operation would block until the data arrived on disk, and that might block
> other backends that try to access WALWriteLock.
> Perhaps a dedicated backend that does the writeback could fix that.

aio seems a better fit.
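
For illustration, here is a minimal sketch of queueing a WAL-style write with
POSIX aio instead of blocking on a synchronous write; the file name, page
size, and polling loop are placeholders, not anything from the PostgreSQL tree.

    /* Sketch only: queue a write with POSIX aio so the caller is not blocked
     * the way a synchronous (e.g. O_DIRECT) write would be.
     * On Linux, link with -lrt. */
    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("walfile.test", O_WRONLY | O_CREAT, 0600);  /* placeholder path */
        if (fd < 0) { perror("open"); return 1; }

        static char page[8192];                /* pretend this is a WAL page */
        memset(page, 'x', sizeof page);

        struct aiocb cb;
        memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fd;
        cb.aio_buf = page;
        cb.aio_nbytes = sizeof page;
        cb.aio_offset = 0;

        if (aio_write(&cb) != 0) { perror("aio_write"); return 1; }

        /* The caller could do other work here; we just poll for completion. */
        while (aio_error(&cb) == EINPROGRESS)
            usleep(1000);

        printf("aio write completed: %zd bytes\n", (ssize_t) aio_return(&cb));
        close(fd);
        return 0;
    }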

> Has anyone tried to use posix_fadvise for the wal logs?
> http://www.opengroup.org/onlinepubs/007904975/functions/posix_fadvise.html
> 
> Linux supports posix_fadvise, it seems to be part of xopen2k.

Odd, I don't see it anywhere in the kernel. I don't know what syscall it's
using to do this tweaking.

This is the only option that seems useful for postgres for both the WAL and
vacuum (though in other threads it seems the problems with vacuum lie
elsewhere):

   POSIX_FADV_DONTNEED attempts to free cached pages associated with the
   specified region. This is useful, for example, while streaming large
   files. A program may periodically request the kernel to free cached
   data that has already been used, so that more useful cached pages are
   not discarded instead.

   Pages that have not yet been written out will be unaffected, so if the
   application wishes to guarantee that pages will be released, it should
   call fsync or fdatasync first.
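
To make the pattern concrete, here is a minimal sketch of the flush-then-drop
sequence described above; the file name is a placeholder, and it assumes a
platform that actually implements posix_fadvise.

    /* Sketch of the fsync-then-POSIX_FADV_DONTNEED sequence described above. */
    #define _XOPEN_SOURCE 600
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("segment.test", O_WRONLY | O_CREAT, 0600);  /* placeholder path */
        if (fd < 0) { perror("open"); return 1; }

        char page[8192];
        memset(page, 'w', sizeof page);
        if (write(fd, page, sizeof page) < 0) { perror("write"); return 1; }

        /* Pages not yet written back are unaffected by DONTNEED, so flush first. */
        if (fsync(fd) != 0) { perror("fsync"); return 1; }

        /* Ask the kernel to drop the cached pages for the whole file (len 0). */
        int rc = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
        if (rc != 0)
            fprintf(stderr, "posix_fadvise: %s\n", strerror(rc));

        close(fd);
        return 0;
    }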

Perhaps POSIX_FADV_RANDOM and POSIX_FADV_SEQUENTIAL could be useful in a
backend before starting a sequential scan or index scan, but I kind of doubt
it.

-- 
greg




Re: [HACKERS] O_DIRECT in freebsd

2003-10-29 Thread Manfred Spraul
Tom Lane wrote:

> Not for WAL --- we never read the WAL at all in normal operation.  (If
> it works for writes, then we would want to use it for writing WAL, but
> that's not apparent from what Christopher quoted.)

At least under Linux, it works for writes. Oracle uses O_DIRECT to 
access (both read and write) disks that are shared between multiple 
nodes in a cluster - their database kernel must know when the data is 
visible to the other nodes.
One problem for WAL is that O_DIRECT would disable the write cache - 
each operation would block until the data arrived on disk, and that 
might block other backends that try to access WALWriteLock.
Perhaps a dedicated backend that does the writeback could fix that.

Has anyone tried to use posix_fadvise for the wal logs?
http://www.opengroup.org/onlinepubs/007904975/functions/posix_fadvise.html
Linux supports posix_fadvise, it seems to be part of xopen2k.

--
   Manfred


Re: [HACKERS] O_DIRECT in freebsd

2003-10-29 Thread Tom Lane
Doug McNaught <[EMAIL PROTECTED]> writes:
> Christopher Kings-Lynne <[EMAIL PROTECTED]> writes:
>> A new DIRECTIO kernel option enables support for read operations that
>> bypass the buffer cache and put data directly into a userland
>> buffer. This feature requires that the O_DIRECT flag is set on the
>> file descriptor and that both the offset and length for the read
>> operation are multiples of the physical media sector size.

> Linux and Solaris have had this for a while.  I'm pretty sure it's
> been discussed before--search the archives.  I think the consensus
> was that it might be useful for WAL writes, but would be a fair amount
> of work and would introduce portability issues...

Not for WAL --- we never read the WAL at all in normal operation.  (If
it works for writes, then we would want to use it for writing WAL, but
that's not apparent from what Christopher quoted.)

IIRC there was speculation that this would be useful for large seqscans
and for vacuuming.  It'd take some hacking to propagate the knowledge of
that context down to where the fopen occurs, though.

regards, tom lane



Re: [HACKERS] O_DIRECT in freebsd

2003-10-29 Thread Doug McNaught
"scott.marlowe" <[EMAIL PROTECTED]> writes:

> I would think the biggest savings could come from using directIO for 
> vacuuming, so it doesn't cause the kernel to flush buffers.
> 
> Would that be just as hard to implement?  

Two words: "cache coherency".

-Doug



Re: [HACKERS] O_DIRECT in freebsd

2003-10-29 Thread scott.marlowe
On 29 Oct 2003, Doug McNaught wrote:

> Christopher Kings-Lynne <[EMAIL PROTECTED]> writes:
> 
> > FreeBSD 4.9 was released today.  In the release notes was:
> > 
> > 2.2.6 File Systems
> > 
> > A new DIRECTIO kernel option enables support for read operations that
> > bypass the buffer cache and put data directly into a userland
> > buffer. This feature requires that the O_DIRECT flag is set on the
> > file descriptor and that both the offset and length for the read
> > operation are multiples of the physical media sector size.
> > 
> > Is that of any use?
> 
> Linux and Solaris have had this for a while.  I'm pretty sure it's
> been discussed before--search the archives.  I think the consensus
> was that it might be useful for WAL writes, but would be a fair amount
> of work and would introduce portability issues...

I would think the biggest savings could come from using directIO for 
vacuuming, so it doesn't cause the kernel to flush buffers.

Would that be just as hard to implement?  




Re: [HACKERS] O_DIRECT in freebsd

2003-10-29 Thread Doug McNaught
Christopher Kings-Lynne <[EMAIL PROTECTED]> writes:

> FreeBSD 4.9 was released today.  In the release notes was:
> 
> 2.2.6 File Systems
> 
> A new DIRECTIO kernel option enables support for read operations that
> bypass the buffer cache and put data directly into a userland
> buffer. This feature requires that the O_DIRECT flag is set on the
> file descriptor and that both the offset and length for the read
> operation are multiples of the physical media sector size.
> 
> Is that of any use?

Linux and Solaris have had this for a while.  I'm pretty sure it's
been discussed before--search the archives.  I think the consensus
was that it might be useful for WAL writes, but would be a fair amount
of work and would introduce portability issues...

-Doug



Re: [HACKERS] O_DIRECT in freebsd

2003-06-22 Thread Bruce Momjian
Sean Chittenden wrote:
> > > Nor could it ever be a win unless the cache was populated via
> > > O_DIRECT, actually.  Big PG cache == 2 extra copies of data, once
> > > in the kernel and once in PG.  Doing caching at the kernel level,
> > > however means only one copy of data (for the most part).  Only
> > > problem with this being that it's not always that easy or an
> > > option to reconfig a kernel to have a bigger FS cache.  That said,
> > > triple copying a chunk of mem is generally faster than even a
> > > single disk read.  If PostgreSQL ever wanted to have a platform
> > > agnostic way of doing efficient caching, it'd likely have to be in
> > > the userland and would require the use of O_DIRECT.
> > 
> > Actually, I think of O_DIRECT as platform-dependent.
> 
> FreeBSD, IRIX, and AIX implement it, and ... *smiles with pride*
> looks like Linux does too given the number of security vulnerabilities
> associated with the call.  :-]

OK, that's 4 of 15 platforms.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [HACKERS] O_DIRECT in freebsd

2003-06-22 Thread Sean Chittenden
> > Nor could it ever be a win unless the cache was populated via
> > O_DIRECT, actually.  Big PG cache == 2 extra copies of data, once
> > in the kernel and once in PG.  Doing caching at the kernel level,
> > however means only one copy of data (for the most part).  Only
> > problem with this being that it's not always that easy or an
> > option to reconfig a kernel to have a bigger FS cache.  That said,
> > triple copying a chunk of mem is generally faster than even a
> > single disk read.  If PostgreSQL ever wanted to have a platform
> > agnostic way of doing efficient caching, it'd likely have to be in
> > the userland and would require the use of O_DIRECT.
> 
> Actually, I think of O_DIRECT as platform-dependent.

FreeBSD, IRIX, and AIX implement it, and ... *smiles with pride*
looks like Linux does too given the number of security vulnerabilities
associated with the call.  :-]

-sc

-- 
Sean Chittenden



Re: [HACKERS] O_DIRECT in freebsd

2003-06-22 Thread Sean Chittenden
> >> it doesn't seem totally out of the question.  I'd kinda like to
> >> see some experimental evidence that it's worth doing though.
> >> Anyone care to make a quick-hack prototype and do some
> >> measurements?
> 
> > What would you like to measure?  Overall system performance when a
> > query is using O_DIRECT or are you looking for negative/positive
> > impact of read() not using the FS cache?  The latter is much
> > easier to do than the former...  recreating a valid load
> > environment that'd let any O_DIRECT benchmark be useful isn't
> > trivial.
> 
> If this stuff were easy, we'd have done it already ;-).

What do you mean?  Bits don't just hit the tree randomly because of a
possible speed improvement hinted at by a man page reference?  :-]

> The first problem is to figure out what makes sense to measure.

Egh, yeah, and this isn't trivial either; benchmarking around vfs
caching makes it hard to get good results (been down that primrose
path before with sendfile() happiness).

> Given that the request is for a quick-and-dirty test, I'd be willing
> to cut you some slack on the measurement process.  That is, it's
> okay to pick something easier to measure over something harder to
> measure, as long as you can make a fair argument that what you're
> measuring is of any interest at all...

hrm, well, given the easy part is thumping out the code, how's the
following sound as a test procedure:

1) Write out several files at varying sizes using O_DIRECT (512KB,
   1MB, 5MB, 10MB, 50MB, 100MB, 512MB, 1GB) to avoid having the FS
   cache polluted by the writes.

2) Open two new procs that read the above created files with and
   without O_DIRECT (each test iteration must rewrite the files
   above).

3) Before each read() call (does PostgreSQL use fread(3) or read(2)?),
   use gettimeofday(2) to get high resolution timing of time required
   to perform each system call.

4) Perform each of the tests above 4 times, averaging the last three
   and throwing out the 1st case (though reporting its value may be of
   interest).
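
Something like the following rough harness could cover step 3, assuming the
test file already exists; the block size is a placeholder, and the optional
O_DIRECT flag is only compiled in where the platform defines it.

    /* Rough harness for step 3: time each read(2) with gettimeofday(2).
     * Usage: ./readtime <file> [direct]   -- "direct" adds O_DIRECT if available. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>
    #include <unistd.h>

    #define BLOCKSZ 8192                  /* multiple of the sector size */

    int main(int argc, char **argv)
    {
        if (argc < 2) { fprintf(stderr, "usage: %s file [direct]\n", argv[0]); return 1; }

        int flags = O_RDONLY;
    #ifdef O_DIRECT
        if (argc > 2)
            flags |= O_DIRECT;            /* the direct-I/O case */
    #endif
        int fd = open(argv[1], flags);
        if (fd < 0) { perror("open"); return 1; }

        /* Aligned buffer, in case the direct-I/O case requires it. */
        void *buf = NULL;
        if (posix_memalign(&buf, 4096, BLOCKSZ) != 0)
        { fprintf(stderr, "posix_memalign failed\n"); return 1; }

        struct timeval t0, t1;
        long reads = 0;
        double total_us = 0.0;
        ssize_t n;

        for (;;)
        {
            gettimeofday(&t0, NULL);
            n = read(fd, buf, BLOCKSZ);
            gettimeofday(&t1, NULL);
            if (n <= 0)
                break;
            total_us += (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
            reads++;
        }

        printf("%ld reads, %.1f us per read on average\n",
               reads, reads ? total_us / reads : 0.0);
        free(buf);
        close(fd);
        return 0;
    }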


I'm not that wild about writing anything threaded unless there's
strong enough interest in a write() to an O_DIRECT'ed fd to see what
happens.  I'm not convinced we'll see anything worth while unless I
setup an example that's doing a ton of write disk io.

As things stand, because O_DIRECT is an execution fast path through
the vfs subsystem, I expect the speed difference to be greater on
faster HDDs with high RPMs than on slower IDE machines at only
5400RPM... thus trivializing any benchmark I'll do on my laptop.  And
actually, if the app can't keep up with the disk, I bet the fs cache
case will be faster.  If the read()'s are able to keep up at the rate
of the HDD, however, this could be a big win in the speed dept, but if
things lag for an instant, the platter will have to make another
rotation before the call comes back to the userland.

Now that I think about it, the optimal case would be to anonymously
mmap() a private buffer that the read() writes into; that way the
HDD could just DMA the data into the mmap()'ed buffer, making it a
zero-copy read operation, though stirring any interest with my
mmap() benchmarks from a while back seems to have been lost in the
fray.  :)

-sc

-- 
Sean Chittenden



Re: [HACKERS] O_DIRECT in freebsd

2003-06-22 Thread Bruce Momjian
Sean Chittenden wrote:
> Nor could it ever be a win unless the cache was populated via
> O_DIRECT, actually.  Big PG cache == 2 extra copies of data, once in
> the kernel and once in PG.  Doing caching at the kernel level, however
> means only one copy of data (for the most part).  Only problem with
> this being that it's not always that easy or an option to reconfig a
> kernel to have a bigger FS cache.  That said, triple copying a chunk
> of mem is generally faster than even a single disk read.  If
> PostgreSQL ever wanted to have a platform agnostic way of doing
> efficient caching, it'd likely have to be in the userland and would
> require the use of O_DIRECT.

Actually, I think of O_DIRECT as platform-dependent.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [HACKERS] O_DIRECT in freebsd

2003-06-22 Thread Sean Chittenden
> > True, it is a cost/benefit issue.  My assumption was that once we have
> > free-behind in the PostgreSQL shared buffer cache, the kernel cache
> > issues would be minimal, but I am willing to be found wrong.
> 
> If you are running on the
> small-shared-buffers-and-large-kernel-cache theory, then getting the
> kernel cache to behave right is much more important than making the
> PG cache behave right.  If you favor the other theory then
> free-behind in the PG cache is the important thing.  However, I've
> not yet seen any convincing evidence that large PG cache with small
> kernel cache is the way to go.

Nor could it ever be a win unless the cache was populated via
O_DIRECT, actually.  Big PG cache == 2 extra copies of data, once in
the kernel and once in PG.  Doing caching at the kernel level, however
means only one copy of data (for the most part).  Only problem with
this being that it's not always that easy or an option to reconfig a
kernel to have a bigger FS cache.  That said, triple copying a chunk
of mem is generally faster than even a single disk read.  If
PostgreSQL ever wanted to have a platform agnostic way of doing
efficient caching, it'd likely have to be in the userland and would
require the use of O_DIRECT.

-sc


PS Triple copy == disk buffer into kernel (data is normally DMA'ed,
   not technically a copy), fs cache into user land, userland into PG
   cache, pg cache into application.  O_DIRECT eliminates one of these
   copies; never mind the doubling up of data in RAM.

-- 
Sean Chittenden



Re: [HACKERS] O_DIRECT in freebsd

2003-06-22 Thread Tom Lane
Sean Chittenden <[EMAIL PROTECTED]> writes:
>> it doesn't seem totally out of the question.  I'd kinda like to see
>> some experimental evidence that it's worth doing though.  Anyone
>> care to make a quick-hack prototype and do some measurements?

> What would you like to measure?  Overall system performance when a
> query is using O_DIRECT or are you looking for negative/positive
> impact of read() not using the FS cache?  The latter is much easier to
> do than the former...  recreating a valid load environment that'd let
> any O_DIRECT benchmark be useful isn't trivial.

If this stuff were easy, we'd have done it already ;-).  The first
problem is to figure out what makes sense to measure.

Given that the request is for a quick-and-dirty test, I'd be willing to
cut you some slack on the measurement process.  That is, it's okay to
pick something easier to measure over something harder to measure, as
long as you can make a fair argument that what you're measuring is of
any interest at all...

regards, tom lane



Re: [HACKERS] O_DIRECT in freebsd

2003-06-22 Thread Tom Lane
Bruce Momjian <[EMAIL PROTECTED]> writes:
> True, it is a cost/benefit issue.  My assumption was that once we have
> free-behind in the PostgreSQL shared buffer cache, the kernel cache
> issues would be minimal, but I am willing to be found wrong.

If you are running on the small-shared-buffers-and-large-kernel-cache
theory, then getting the kernel cache to behave right is much more
important than making the PG cache behave right.  If you favor the
other theory then free-behind in the PG cache is the important thing.
However, I've not yet seen any convincing evidence that large PG
cache with small kernel cache is the way to go.

The scenario that I'm envisioning would require passing a this-is-a-
seqscan-read flag down through bufmgr to smgr to fd.c.  So the info
would be available to bufmgr to implement a free-behind policy.  But
I suspect that's not really where any big win would come from.

regards, tom lane



Re: [HACKERS] O_DIRECT in freebsd

2003-06-22 Thread Sean Chittenden
> > > What about cache coherency problems with other backends not
> > > opening with O_DIRECT?
> > 
> > If O_DIRECT introduces cache coherency problems against other
> > processes not using O_DIRECT then the whole idea is a nonstarter,
> > but I can't imagine any kernel hackers would have been stupid
> > enough to allow that to happen ...

Cache coherency isn't an issue: read() called on an fd that has
O_DIRECT set on it will pull data from the FS cache if a write has
occurred to the file pointed to by the reading/writing fd (though the
write of that page gets done immediately and isn't postponed; the
rationale for that is still pending).

> Seeing how the buffer exists in user space, I would be interested
> how they prevent coherency problems with good performance --- maybe
> they map the same page into multiple processes --- that would be
> interesting, though it would require some locking.

It's in src/kern/vfs_bio.c, src/ufs/ffs_vnops.c, src/ufs/ffs_rawread.c

http://lists.freebsd.org/pipermail/cvs-all/2003-March/000226.html

phk's likely to revamp some of the FS buffer cache in FreeBSD 6, but
the userland API will remain stable (it'll just get more efficient in
6, not that it's not fast as is).

http://lists.freebsd.org/pipermail/cvs-all/2003-March/000261.html

-sc

-- 
Sean Chittenden



Re: [HACKERS] O_DIRECT in freebsd

2003-06-22 Thread Sean Chittenden
> > Basically, I think we need free-behind rather than O_DIRECT.
> 
> There are two separate issues here --- one is what's happening in
> our own cache, and one is what's happening in the kernel disk cache.
> Implementing our own free-behind code would help in our own cache
> but does nothing for the kernel cache.
> 
> My thought on this is that for large seqscans we could think about
> doing reads through a file descriptor that's opened with O_DIRECT.
> But writes should never go through O_DIRECT.  In some scenarios this
> would mean having two FDs open for the same relation file.  This'd
> require moderately extensive changes to the smgr-related APIs, but
> it doesn't seem totally out of the question.  I'd kinda like to see
> some experimental evidence that it's worth doing though.  Anyone
> care to make a quick-hack prototype and do some measurements?

What would you like to measure?  Overall system performance when a
query is using O_DIRECT or are you looking for negative/positive
impact of read() not using the FS cache?  The latter is much easier to
do than the former...  recreating a valid load environment that'd let
any O_DIRECT benchmark be useful isn't trivial.

-sc

-- 
Sean Chittenden



Re: [HACKERS] O_DIRECT in freebsd

2003-06-22 Thread Bruce Momjian
Tom Lane wrote:
> Bruce Momjian <[EMAIL PROTECTED]> writes:
> > _That_ is an excellent point.  However, do we know at the time we open
> > the file descriptor if we will be doing this?
> 
> We'd have to say on a per-read basis whether we want O_DIRECT or not,
> and fd.c would need to provide a suitable file descriptor.

OK

> > What about cache
> > coherency problems with other backends not opening with O_DIRECT?
> 
> If O_DIRECT introduces cache coherency problems against other processes
> not using O_DIRECT then the whole idea is a nonstarter, but I can't
> imagine any kernel hackers would have been stupid enough to allow that
> to happen ...

Seeing how the buffer exists in user space, I would be interested how
they prevent coherency problems with good performance --- maybe they map
the same page into multiple processes --- that would be interesting,
though it would require some locking.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [HACKERS] O_DIRECT in freebsd

2003-06-22 Thread Bruce Momjian
Tom Lane wrote:
> Bruce Momjian <[EMAIL PROTECTED]> writes:
> > Basically, I think we need free-behind rather than O_DIRECT.
> 
> There are two separate issues here --- one is what's happening in our
> own cache, and one is what's happening in the kernel disk cache.
> Implementing our own free-behind code would help in our own cache but
> does nothing for the kernel cache.

Right.

> My thought on this is that for large seqscans we could think about
> doing reads through a file descriptor that's opened with O_DIRECT.
> But writes should never go through O_DIRECT.  In some scenarios this
> would mean having two FDs open for the same relation file.  This'd
> require moderately extensive changes to the smgr-related APIs, but
> it doesn't seem totally out of the question.  I'd kinda like to see
> some experimental evidence that it's worth doing though.  Anyone
> care to make a quick-hack prototype and do some measurements?

True, it is a cost/benefit issue.  My assumption was that once we have
free-behind in the PostgreSQL shared buffer cache, the kernel cache
issues would be minimal, but I am willing to be found wrong.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [HACKERS] O_DIRECT in freebsd

2003-06-22 Thread Tom Lane
Bruce Momjian <[EMAIL PROTECTED]> writes:
> _That_ is an excellent point.  However, do we know at the time we open
> the file descriptor if we will be doing this?

We'd have to say on a per-read basis whether we want O_DIRECT or not,
and fd.c would need to provide a suitable file descriptor.

> What about cache
> coherency problems with other backends not opening with O_DIRECT?

If O_DIRECT introduces cache coherency problems against other processes
not using O_DIRECT then the whole idea is a nonstarter, but I can't
imagine any kernel hackers would have been stupid enough to allow that
to happen ...

regards, tom lane



Re: [HACKERS] O_DIRECT in freebsd

2003-06-22 Thread Tom Lane
Bruce Momjian <[EMAIL PROTECTED]> writes:
> Basically, I think we need free-behind rather than O_DIRECT.

There are two separate issues here --- one is what's happening in our
own cache, and one is what's happening in the kernel disk cache.
Implementing our own free-behind code would help in our own cache but
does nothing for the kernel cache.

My thought on this is that for large seqscans we could think about
doing reads through a file descriptor that's opened with O_DIRECT.
But writes should never go through O_DIRECT.  In some scenarios this
would mean having two FDs open for the same relation file.  This'd
require moderately extensive changes to the smgr-related APIs, but
it doesn't seem totally out of the question.  I'd kinda like to see
some experimental evidence that it's worth doing though.  Anyone
care to make a quick-hack prototype and do some measurements?

regards, tom lane



Re: [HACKERS] O_DIRECT in freebsd

2003-06-22 Thread Sean Chittenden
> > > Basically, we don't know when we read a buffer whether this is a
> > > read-only or read/write.  In fact, we could read it in, and
> > > another backend could write it for us.
> > 
> > Um, wait.  The cache is shared between backends?  I don't think
> > so, but it shouldn't matter because there has to be a semaphore
> > locking the cache to prevent the coherency issue you describe.  If
> > PostgreSQL didn't, it'd be having problems with this now.  I'd
> > also think that MVCC would handle the case of updated data in the
> > cache as that has to be a common case.  At what point is the
> > cached result invalidated and fetched from the OS?
> 
> Uh, it's called the _shared_ buffer cache in postgresql.conf, and we
> lock pages only while we are reading/writing them, not for the duration
> they are in the cache.

*smacks forehead* Duh, you're right.  I always just turn up the FS
cache in the OS instead.

The shared buffer cache has got to have enormous churn though if
everything ends up in the userland cache.  Is it really an exhaustive
cache?  I thought the bulk of the caching happened in the kernel and
not in the userland.  Is the userland cache just for the SysCache and
friends, or does it cache everything that moves through PostgreSQL?

> > > The big issue is that when we do a write, we don't wait for it
> > > to get to disk.
> > 
> > Only in the case when fsync() is turned off, but again, that's up to
> > the OS to manage that can of worms, which I think BSD takes care of
> > that.  From conf/NOTES:
> 
> Nope.  When you don't have a kernel buffer cache, and you do a
> write, where do you expect it to go?  I assume it goes to the drive,
> and you have to wait for that.

Correct, a write call blocks until the bits hit the disk in the
absence of enough buffer space.  In the event of enough buffer,
however, the buffer houses the bits until they are written to disk and
the kernel returns control to the userland app.

Consensus is that FreeBSD does the right thing and hands back data
from the FS buffer even though the fd was marked O_DIRECT (see
bottom).

> > I don't see how this'd be an issue as buffers populated via a
> > read(), that are updated, and then written out, would occupy a new
> > chunk of disk to satisfy MVCC.  Why would we need to mark a buffer
> > as read only and carry around/check its state?
> 
> We update the expired flags on the tuple during update/delete.

*nods* Okay, I don't see where the problem would be then with
O_DIRECT.  I'm going to ask Dillon about O_DIRECT since he
implemented it, likely for the backplane database that he's writing.
I'll let 'ya know what he says.

-sc


Here's a snip from the conv I had with someone that has mega vfs foo
in FreeBSD:

17:58  * seanc has a question about O_DIRECT
17:58 <@zb^3> ask
17:59 <@seanc> assume two procs have a file open, one proc writes using
   buffered IO, the other uses O_DIRECT to read from the file, is
   read() smart enough to hand back the data in the buffer that
   hasn't hit the disk yet or will there be syncing issues?
18:00 <@zb^3> O_DIRECT in the incarnation from matt dillon will break shit
18:00 <@zb^3> basically, any data read will be set non-cacheable
18:01 <@zb^3> and you'll experience writes earlier than you should
18:01 <@seanc> zb^3: hrm, I don't want to write to the fd + O_DIRECT though
18:02 <@seanc> zb^3: basically you're saying an O_DIRECT fd doesn't consult the
   FS cache before reading from disk?
18:03 <@zb^3> no, it does
18:03 <@zb^3> but it immediately puts any read blocks on the ass end of the LRU
18:03 <@zb^3> so if you write a block, then read it with O_DIRECT it will get
  written out early :(
18:04 <@seanc> zb^3: ah, got it...  it's not a data coherency issue, it's a
   priority issue and O_DIRECT makes writes jump the gun
18:04 <@seanc> got it
18:05 <@seanc> zb^3: is that required in the implementation or is it a bug?
18:06  * seanc is wondering whether or not he should bug dillion about this to
  get things working correctly
18:07 <@zb^3> it's a bug in the implementation
18:08 <@zb^3> to fix it you have to pass flags all the way down into the
  getblk-like layer
18:08 <@zb^3> and dillon was opposed to that
18:09 <@seanc> zb^3: hrm, thx... I'll go bug him about it now and see what's up
   in backplane land

-- 
Sean Chittenden



Re: [HACKERS] O_DIRECT in freebsd

2003-06-22 Thread Bruce Momjian
Sean Chittenden wrote:
> > Basically, we don't know when we read a buffer whether this is a
> > read-only or read/write.  In fact, we could read it in, and another
> > backend could write it for us.
> 
> Um, wait.  The cache is shared between backends?  I don't think so,
> but it shouldn't matter because there has to be a semaphore locking
> the cache to prevent the coherency issue you describe.  If PostgreSQL
> didn't, it'd be having problems with this now.  I'd also think that
> MVCC would handle the case of updated data in the cache as that has to
> be a common case.  At what point is the cached result invalidated and
> fetched from the OS?

Uh, it's called the _shared_ buffer cache in postgresql.conf, and we
lock pages only while we are reading/writing them, not for the duration
they are in the cache.

> > The big issue is that when we do a write, we don't wait for it to
> > get to disk.
> 
> Only in the case when fsync() is turned off, but again, that's up to
> the OS to manage that can of worms, which I think BSD takes care of
> that.  From conf/NOTES:

Nope.  When you don't have a kernel buffer cache, and you do a write,
where do you expect it to go?  I assume it goes to the drive, and you
have to wait for that.

> 
> # Attempt to bypass the buffer cache and put data directly into the
> # userland buffer for read operation when O_DIRECT flag is set on the
> # file.  Both offset and length of the read operation must be
> # multiples of the physical media sector size.
> #
> #optionsDIRECTIO
> 
> The offsets and length bit kinda bothers me though, but I think that's
> stuff that the kernel has to take into account, not the userland calls;
> I wonder if that's actually accurate any more or affects userland
> calls...  seems like that'd be a bit too much work to have the user
> do, esp given the lack of documentation on the flag... should just be a
> drop-in additional flag, afaict.
> 
> > It seems to use O_DIRECT, we would have to read the buffer in a
> > special way to mark it as read-only, which seems kind of strange.  I
> > see no reason we can't use free-behind in the PostgreSQL buffer
> > cache to handle most of the benefits of O_DIRECT, without the
> > read-only buffer restriction.
> 
> I don't see how this'd be an issue as buffers populated via a read(),
> that are updated, and then written out, would occupy a new chunk of
> disk to satisfy MVCC.  Why would we need to mark a buffer as read only
> and carry around/check its state?

We update the expired flags on the tuple during update/delete.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [HACKERS] O_DIRECT in freebsd

2003-06-22 Thread Sean Chittenden
> Basically, we don't know when we read a buffer whether this is a
> read-only or read/write.  In fact, we could read it in, and another
> backend could write it for us.

Um, wait.  The cache is shared between backends?  I don't think so,
but it shouldn't matter because there has to be a semaphore locking
the cache to prevent the coherency issue you describe.  If PostgreSQL
didn't, it'd be having problems with this now.  I'd also think that
MVCC would handle the case of updated data in the cache as that has to
be a common case.  At what point is the cached result invalidated and
fetched from the OS?

> The big issue is that when we do a write, we don't wait for it to
> get to disk.

Only in the case when fsync() is turned off, but again, that's up to
the OS to manage that can of worms, which I think BSD takes care of
that.  From conf/NOTES:

# Attempt to bypass the buffer cache and put data directly into the
# userland buffer for read operation when O_DIRECT flag is set on the
# file.  Both offset and length of the read operation must be
# multiples of the physical media sector size.
#
#optionsDIRECTIO

The offsets and length bit kinda bothers me though, but I think that's
stuff that the kernel has to take into account, not the userland calls;
I wonder if that's actually accurate any more or affects userland
calls...  seems like that'd be a bit too much work to have the user
do, esp given the lack of documentation on the flag... should just be a
drop-in additional flag, afaict.
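
For what it's worth, a small sketch of what that constraint looks like from
userland; the 512-byte sector size and the file name are assumptions made
purely for the example.

    /* Illustration of the constraint quoted above: with O_DIRECT both the offset
     * and the length of a read must be multiples of the media sector size.
     * 512 bytes is assumed here only for the example. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define SECTOR 512

    int main(void)
    {
        int flags = O_RDONLY;
    #ifdef O_DIRECT
        flags |= O_DIRECT;
    #endif
        int fd = open("relation.test", flags);      /* placeholder file name */
        if (fd < 0) { perror("open"); return 1; }

        /* An aligned buffer keeps the example safe across platforms. */
        void *buf = NULL;
        if (posix_memalign(&buf, SECTOR, 16 * SECTOR) != 0)
        { fprintf(stderr, "posix_memalign failed\n"); return 1; }

        /* Both the offset and the length are multiples of SECTOR. */
        ssize_t n = pread(fd, buf, 16 * SECTOR, (off_t) (8 * SECTOR));
        if (n < 0)
            perror("pread");
        else
            printf("read %zd bytes\n", n);

        free(buf);
        close(fd);
        return 0;
    }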

> It seems to use O_DIRECT, we would have to read the buffer in a
> special way to mark it as read-only, which seems kind of strange.  I
> see no reason we can't use free-behind in the PostgreSQL buffer
> cache to handle most of the benefits of O_DIRECT, without the
> read-only buffer restriction.

I don't see how this'd be an issue as buffers populated via a read(),
that are updated, and then written out, would occupy a new chunk of
disk to satisfy MVCC.  Why would we need to mark a buffer as read only
and carry around/check its state?

-sc

-- 
Sean Chittenden



Re: [HACKERS] O_DIRECT in freebsd

2003-06-22 Thread Bruce Momjian

Basically, we don't know when we read a buffer whether this is a
read-only or read/write.  In fact, we could read it in, and another
backend could write it for us.

The big issue is that when we do a write, we don't wait for it to get to
disk.

It seems that to use O_DIRECT, we would have to read the buffer in a special
way to mark it as read-only, which seems kind of strange.  I see no
reason we can't use free-behind in the PostgreSQL buffer cache to handle
most of the benefits of O_DIRECT, without the read-only buffer restriction.

---

Sean Chittenden wrote:
> > _That_ is an excellent point.  However, do we know at the time we
> > open the file descriptor if we will be doing this?
> 
> Doesn't matter, it's an option to fcntl().
> 
> > What about cache coherency problems with other backends not opening
> > with O_DIRECT?
> 
> That's a problem for the kernel VM, if you mean cache coherency in the
> VM.  If you mean inside of the backend, that could be a stickier
> issue, I think.  I don't know enough of the internals yet to know if
> this is a problem or not, but you're right, it's certainly something
> to consider.  Is the cache a write behind cache or is it a read
> through cache?  If it's a read through cache, which I think it is,
> then the backend would have to dirty all cache entries pertaining to
> the relations being opened with O_DIRECT.  The use case for that
> being:
> 
> 1) a transaction begins
> 2) a few rows out of the huge table are read
> 3) a huge query is performed that triggers the use of O_DIRECT
> 4) the rows selected in step 2 are updated (this step should poison or
>update the cache, actually, and act as a write through cache if the
>data is in the cache)
> 5) the same few rows are read in again
> 6) transaction is committed
> 
> Provided the cache is poisoned or updated in step 4, I can't see how
> or where this would be a problem.  Please enlighten if there's a
> different case that would need to be taken into account.  I can't
> imagine ever wanting to write out data using O_DIRECT and think that
> it's a read only optimization in an attempt to minimize the turn over
> in the OS's cache.  From fcntl(2):
> 
>  O_DIRECT Minimize or eliminate the cache effects of reading and writ-
>   ing.  The system will attempt to avoid caching the data you
>   read or write.  If it cannot avoid caching the data, it will
>   minimize the impact the data has on the cache.  Use of this
>   flag can drastically reduce performance if not used with
>   care.
> 
> 
> > And finally, how do we deal with the fact that writes to O_DIRECT
> > files will wait until the data hits the disk because there is no
> > kernel buffer cache?
> 
> Well, two things.
> 
> 1) O_DIRECT should never be used on writes... I can't think of a case
>where you'd want it off, even when COPY'ing data and restoring a
>DB, it just doesn't make sense to use it.  The write buffer is
>emptied as soon as the pages hit the disk unless something is
>reading those bits, but I'd imagine the write buffer would be used
>to make sure that as much writing is done to the platter in a
>single write by the OS as possible, circumventing that would be
>insane (though useful possibly for embedded devices with low RAM,
>solid state drives, or some super nice EMC fiber channel storage
>device that basically has its own huge disk cache).
> 
> 2) Last I checked PostgreSQL wasn't a threaded app and doesn't use
>non-blocking IO.  The backend would block until the call returns,
>where's the problem?  :)
> 
> If anything O_DIRECT would shake out any bugs in PostgreSQL's caching
> code, if there are any.  -sc
> 
> -- 
> Sean Chittenden
> 

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [HACKERS] O_DIRECT in freebsd

2003-06-22 Thread Sean Chittenden
> > > What you really want is Solaris's free-behind, where it detects
> > > if a scan is exceeding a certain percentage of the OS cache and
> > > moves the pages to the _front_ of the to-be-reused list.  I am
> > > not sure what other OS's support this, but we need this on our
> > > own buffer manager code as well.
> > > 
> > > Our TODO already has:
> > > 
> > >   * Add free-behind capability for large sequential scans (Bruce)
> > > 
> > > Basically, I think we need free-behind rather than O_DIRECT.
> > 
> > I suppose, but you've already polluted the cache by the time the
> > above mentioned mechanism kicks in and takes effect.  Given that
> > the planner has an idea of how much data it's going to read in in
> > order to complete the query, seems easier/better to mark the fd
> > O_DIRECT.  *shrug*
> 
> _That_ is an excellent point.  However, do we know at the time we
> open the file descriptor if we will be doing this?

Doesn't matter, it's an option to fcntl().
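
A tiny sketch of what toggling the flag per-read could look like; whether a
given platform honors O_DIRECT through F_SETFL rather than only at open()
time varies, so treat this as illustrative only.

    /* Sketch: turn direct I/O on or off for an already-open descriptor via
     * fcntl(), as suggested above.  Not all platforms honor O_DIRECT set
     * through F_SETFL. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    static int set_direct_io(int fd, int enable)
    {
    #ifdef O_DIRECT
        int flags = fcntl(fd, F_GETFL);
        if (flags < 0)
            return -1;
        flags = enable ? (flags | O_DIRECT) : (flags & ~O_DIRECT);
        return fcntl(fd, F_SETFL, flags);
    #else
        (void) fd; (void) enable;
        return -1;                          /* O_DIRECT not available here */
    #endif
    }

    int main(void)
    {
        int fd = open("relation.test", O_RDONLY);   /* placeholder file name */
        if (fd < 0) { perror("open"); return 1; }

        if (set_direct_io(fd, 1) != 0)
            perror("enable O_DIRECT");
        /* ... the large sequential read would happen here ... */
        if (set_direct_io(fd, 0) != 0)
            perror("disable O_DIRECT");

        close(fd);
        return 0;
    }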

> What about cache coherency problems with other backends not opening
> with O_DIRECT?

That's a problem for the kernel VM, if you mean cache coherency in the
VM.  If you mean inside of the backend, that could be a stickier
issue, I think.  I don't know enough of the internals yet to know if
this is a problem or not, but you're right, it's certainly something
to consider.  Is the cache a write behind cache or is it a read
through cache?  If it's a read through cache, which I think it is,
then the backend would have to dirty all cache entries pertaining to
the relations being opened with O_DIRECT.  The use case for that
being:

1) a transaction begins
2) a few rows out of the huge table are read
3) a huge query is performed that triggers the use of O_DIRECT
4) the rows selected in step 2 are updated (this step should poison or
   update the cache, actually, and act as a write through cache if the
   data is in the cache)
5) the same few rows are read in again
6) transaction is committed

Provided the cache is poisoned or updated in step 4, I can't see how
or where this would be a problem.  Please enlighten if there's a
different case that would need to be taken into account.  I can't
imagine ever wanting to write out data using O_DIRECT and think that
it's a read only optimization in an attempt to minimize the turn over
in the OS's cache.  From fcntl(2):

 O_DIRECT Minimize or eliminate the cache effects of reading and writ-
  ing.  The system will attempt to avoid caching the data you
  read or write.  If it cannot avoid caching the data, it will
  minimize the impact the data has on the cache.  Use of this
  flag can drastically reduce performance if not used with
  care.


> And finally, how do we deal with the fact that writes to O_DIRECT
> files will wait until the data hits the disk because there is no
> kernel buffer cache?

Well, two things.

1) O_DIRECT should never be used on writes... I can't think of a case
   where you'd want it off, even when COPY'ing data and restoring a
   DB, it just doesn't make sense to use it.  The write buffer is
   emptied as soon as the pages hit the disk unless something is
   reading those bits, but I'd imagine the write buffer would be used
   to make sure that as much writing is done to the platter in a
   single write by the OS as possible, circumventing that would be
   insane (though useful possibly for embedded devices with low RAM,
   solid state drives, or some super nice EMC fiber channel storage
   device that basically has its own huge disk cache).

2) Last I checked PostgreSQL wasn't a threaded app and doesn't use
   non-blocking IO.  The backend would block until the call returns,
   where's the problem?  :)

If anything O_DIRECT would shake out any bugs in PostgreSQL's caching
code, if there are any.  -sc

-- 
Sean Chittenden



Re: [HACKERS] O_DIRECT in freebsd

2003-06-22 Thread Bruce Momjian
Sean Chittenden wrote:
> > What you really want is Solaris's free-behind, where it detects if a
> > scan is exceeding a certain percentage of the OS cache and moves the
> > pages to the _front_ of the to-be-reused list.  I am not sure what
> > other OS's support this, but we need this on our own buffer manager
> > code as well.
> > 
> > Our TODO already has:
> > 
> > * Add free-behind capability for large sequential scans (Bruce)
> > 
> > Basically, I think we need free-behind rather than O_DIRECT.
> 
> I suppose, but you've already polluted the cache by the time the above
> mentioned mechanism kicks in and takes effect.  Given that the planner
> has an idea of how much data it's going to read in in order to
> complete the query, seems easier/better to mark the fd O_DIRECT.
> *shrug*

_That_ is an excellent point.  However, do we know at the time we open
the file descriptor if we will be doing this?  What about cache
coherency problems with other backends not opening with O_DIRECT?  And
finally, how do we deal with the fact that writes to O_DIRECT files will
wait until the data hits the disk because there is no kernel buffer cache?

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [HACKERS] O_DIRECT in freebsd

2003-06-22 Thread Sean Chittenden
> What you really want is Solaris's free-behind, where it detects if a
> scan is exceeding a certain percentage of the OS cache and moves the
> pages to the _front_ of the to-be-reused list.  I am not sure what
> other OS's support this, but we need this on our own buffer manager
> code as well.
> 
> Our TODO already has:
> 
>   * Add free-behind capability for large sequential scans (Bruce)
> 
> Basically, I think we need free-behind rather than O_DIRECT.

I suppose, but you've already polluted the cache by the time the above
mentioned mechanism kicks in and takes effect.  Given that the planner
has an idea of how much data it's going to read in in order to
complete the query, seems easier/better to mark the fd O_DIRECT.
*shrug*

-sc

-- 
Sean Chittenden



Re: [HACKERS] O_DIRECT in freebsd

2003-06-22 Thread Bruce Momjian

What you really want is Solaris's free-behind, where it detects if a
scan is exceeding a certain percentage of the OS cache and moves the
pages to the _front_ of the to-be-reused list.  I am not sure what other
OS's support this, but we need this on our own buffer manager code as
well.

Our TODO already has:

* Add free-behind capability for large sequential scans (Bruce)

Basically, I think we need free-behind rather than O_DIRECT.

---

Sean Chittenden wrote:
> > > > > The reason I mention it is that Postgres already supports
> > > > > O_DIRECT I think on some other platforms (for whatever
> > > > > reason).
> > > > 
> > > > [ sounds of grepping... ]  No.  The only occurrence of O_DIRECT in the
> > > > source tree is in TODO:
> > > > 
> > > > * Consider use of open/fcntl(O_DIRECT) to minimize OS caching
> > > > 
> > > > I personally disagree with this TODO item for the same reason
> > > > Sean cited: Postgres is designed and tuned to rely on OS-level
> > > > disk caching, and bypassing that cache is far more likely to
> > > > hurt our performance than help it.
> > > 
> > > DB2 and Oracle, from memory, allow users to pass hints to the
> > > planner to use/not use file system caching. This could be useful
> > > if you had an application retrieving a large amount of data on an
> > > ad hoc basis. The large retrieval would empty out the disk cache
> > > thereby negatively impacting upon other applications operating on
> > > data which should be cached.
> >  
> > I've recently been bitten by this. On DB2, I could change what
> > bufferpool the large tables were using and set it fairly small, but
> > obviously not an option with PGSQL. But, if pgsql could stop caching
> > from occurring on user-specified queries, large table or index scans,
> > etc., it would be very helpful.
> 
> Actually, now that I think about this, if the planner is going to read
> more than X number of bytes as specified in a GUC, it would be useful
> to have the fd marked as O_DIRECT to avoid polluting the disk
> cache... I have a few tables with about 300M rows (~9GB on disk) that
> I have to perform nightly seq scans over for reports and it does wipe
> out some of the other fast movers that come through and depend on the
> disk cache to be there for their speed.  Because they're performed in
> the middle of the night, I don't care that much, but my avg query
> times during that period of time are slower... whether it's load or
> the disk buffer being emptied and having to be refilled, I'm not sure,
> but thinking about it, use of a GUC threshold to have an FD marked as
> O_DIRECT does make sense (0 == disabled and the default, but tunable
> in Kbytes as an admin sees fit) and could be nice for big queries that
> have lots of smaller queries running around at the same time.
> 
> -sc
> 
> -- 
> Sean Chittenden
> 

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [HACKERS] O_DIRECT in freebsd

2003-06-22 Thread Sean Chittenden
> > > > The reason I mention it is that Postgres already supports
> > > > O_DIRECT I think on some other platforms (for whatever
> > > > reason).
> > > 
> > > [ sounds of grepping... ]  No.  The only occurrence of O_DIRECT in the
> > > source tree is in TODO:
> > > 
> > > * Consider use of open/fcntl(O_DIRECT) to minimize OS caching
> > > 
> > > I personally disagree with this TODO item for the same reason
> > > Sean cited: Postgres is designed and tuned to rely on OS-level
> > > disk caching, and bypassing that cache is far more likely to
> > > hurt our performance than help it.
> > 
> > DB2 and Oracle, from memory, allow users to pass hints to the
> > planner to use/not use file system caching. This could be useful
> > if you had an application retrieving a large amount of data on an
> > ad hoc basis. The large retrieval would empty out the disk cache,
> > thereby negatively impacting other applications operating on
> > data which should be cached.
>  
> I've recently been bitten by this. On DB2, I could change which
> bufferpool the large tables were using and set it fairly small, but
> that's obviously not an option with PGSQL. But if pgsql could stop
> caching from occurring on user-specified queries, large table or
> index scans, etc., it would be very helpful.

Actually, now that I think about this, if the planner is going to read
more than X number of bytes as specified in a GUC, it would be useful
to have the fd marked as O_DIRECT to avoid polluting the disk
cache... I have a few tables with about 300M rows (~9GB on disk) that
I have to perform nightly seq scans over for reports and it does wipe
out some of the other fast movers that come through and depend on the
disk cache to be there for their speed.  Because they're performed in
the middle of the night, I don't care that much, but my avg query
times during that period of time are slower... whether it's load or
the disk buffer being emptied and having to be refilled, I'm not sure,
but thinking about it, use of a GUC threshold to have an FD marked as
O_DIRECT does make sense (0 == disabled and the default, but tunable
in Kbytes as an admin sees fit) and could be nice for big queries that
have lots of smaller queries running around at the same time.
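
To make the shape of the knob concrete, a minimal sketch (the setting
name and the open-time hook are made up; neither exists in PostgreSQL):

/*
 * Illustrative sketch only.  "direct_io_threshold_kb" is a made-up
 * setting (0 == disabled, otherwise a size in kilobytes); it is not
 * an existing GUC, and this is not how relations are actually opened.
 */
#include <fcntl.h>

int direct_io_threshold_kb = 0;     /* default: never use O_DIRECT */

static int
open_for_scan(const char *path, long expected_kb)
{
    int flags = O_RDONLY;

    if (direct_io_threshold_kb > 0 && expected_kb >= direct_io_threshold_kb)
        flags |= O_DIRECT;          /* big scan: don't pollute the cache */

    return open(path, flags);
}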

-sc

-- 
Sean Chittenden



Re: [HACKERS] O_DIRECT in freebsd

2003-06-20 Thread Jim C. Nasby
On Wed, Jun 18, 2003 at 10:01:37AM +1000, Gavin Sherry wrote:
> On Tue, 17 Jun 2003, Tom Lane wrote:
> 
> > "Christopher Kings-Lynne" <[EMAIL PROTECTED]> writes:
> > > The reason I mention it is that Postgres already supports O_DIRECT I think
> > > on some other platforms (for whatever reason).
> > 
> > [ sounds of grepping... ]  No.  The only occurrence of O_DIRECT in the
> > source tree is in TODO:
> > 
> > * Consider use of open/fcntl(O_DIRECT) to minimize OS caching
> > 
> > I personally disagree with this TODO item for the same reason Sean
> > cited: Postgres is designed and tuned to rely on OS-level disk caching,
> > and bypassing that cache is far more likely to hurt our performance than
> > help it.
> 
> DB2 and Oracle, from memory, allow users to pass hints to the planner to
> use/not use file system caching. This could be useful if you had an
> application retrieving a large amount of data on an ad hoc basis. The large
> retrieval would empty out the disk cache, thereby negatively impacting
> other applications operating on data which should be cached.
 
I've recently been bitten by this. On DB2, I could change which
bufferpool the large tables were using and set it fairly small, but
that's obviously not an option with PGSQL. But if pgsql could stop
caching from occurring on user-specified queries, large table or
index scans, etc., it would be very helpful.
-- 
Jim C. Nasby (aka Decibel!)[EMAIL PROTECTED]
Member: Triangle Fraternity, Sports Car Club of America
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"



Re: [HACKERS] O_DIRECT in freebsd

2003-06-18 Thread Bruce Momjian

Also, keep in mind that writes through O_DIRECT have to wait for the data
to get onto the platters rather than just into the kernel cache.
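
A quick way to see the difference (a standalone toy, not PostgreSQL
code; assumes a platform that defines O_DIRECT and has posix_memalign()):

/*
 * Toy comparison of a cached write versus an O_DIRECT write.  The
 * cached write returns once the kernel has copied the data; the
 * O_DIRECT write waits for the device.  The buffer is aligned because
 * O_DIRECT requires it.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

static double
timed_write(const char *path, int extra_flags, const void *buf, size_t len)
{
    struct timeval  t0, t1;
    int             fd;

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | extra_flags, 0600);
    if (fd < 0)
    {
        perror("open");
        return -1.0;
    }
    gettimeofday(&t0, NULL);
    if (write(fd, buf, len) != (ssize_t) len)
        perror("write");
    gettimeofday(&t1, NULL);
    close(fd);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
}

int
main(void)
{
    size_t  len = 1024 * 1024;
    void   *buf;

    if (posix_memalign(&buf, 4096, len) != 0)
        return 1;
    memset(buf, 'x', len);

    printf("cached write:   %.6f s\n", timed_write("cached.tmp", 0, buf, len));
    printf("O_DIRECT write: %.6f s\n", timed_write("direct.tmp", O_DIRECT, buf, len));
    return 0;
}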

---

Tom Lane wrote:
> "Jim C. Nasby" <[EMAIL PROTECTED]> writes:
> >> DB2 and Oracle, from memory, allow users to pass hints to the planner to
> >> use/not use file system caching.
>  
> > Might it make sense to do this for on-disk sorts, since sort_mem is
> > essentially being used as a disk cache (at least for reads)?
> 
> If sort_mem were actually being used that way, it might be ... but it
> isn't, and so I doubt O_DIRECT would be an improvement.  It seems more
> likely to force disk I/O that otherwise might not happen at all, if the
> kernel happens to have sufficient buffer space on hand.
> 
> I'll concede though that a large sort would probably have the effect of
> blowing out the kernel's disk cache.  So while O_DIRECT might be a net
> pessimization as far as the sort itself is concerned, it would probably
> be more friendly to the rest of the system, by leaving disk buffers free
> for more productive uses.  It'd all depend on your workload ...
> 
>   regards, tom lane
> 

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [HACKERS] O_DIRECT in freebsd

2003-06-18 Thread Tom Lane
"Jim C. Nasby" <[EMAIL PROTECTED]> writes:
>> DB2 and Oracle, from memory, allow users to pass hints to the planner to
>> use/not use file system caching.
 
> Might it make sense to do this for on-disk sorts, since sort_mem is
> essentially being used as a disk cache (at least for reads)?

If sort_mem were actually being used that way, it might be ... but it
isn't, and so I doubt O_DIRECT would be an improvement.  It seems more
likely to force disk I/O that otherwise might not happen at all, if the
kernel happens to have sufficient buffer space on hand.

I'll concede though that a large sort would probably have the effect of
blowing out the kernel's disk cache.  So while O_DIRECT might be a net
pessimization as far as the sort itself is concerned, it would probably
be more friendly to the rest of the system, by leaving disk buffers free
for more productive uses.  It'd all depend on your workload ...

regards, tom lane



Re: [HACKERS] O_DIRECT in freebsd

2003-06-17 Thread Jim C. Nasby
On Wed, Jun 18, 2003 at 10:01:37AM +1000, Gavin Sherry wrote:
> On Tue, 17 Jun 2003, Tom Lane wrote:
> > * Consider use of open/fcntl(O_DIRECT) to minimize OS caching
> > 
> > I personally disagree with this TODO item for the same reason Sean
> > cited: Postgres is designed and tuned to rely on OS-level disk caching,
> > and bypassing that cache is far more likely to hurt our performance than
> > help it.
> 
> DB2 and Oracle, from memory, allow users to pass hints to the planner to
> use/not use file system caching. This could be useful if you had an
> application retrieving a large amount of data on an ad hoc basis. The large
> retrieval would empty out the disk cache, thereby negatively impacting
> other applications operating on data which should be cached.
 
Might it make sense to do this for on-disk sorts, since sort_mem is
essentially being used as a disk cache (at least for reads)?
-- 
Jim C. Nasby (aka Decibel!)[EMAIL PROTECTED]
Member: Triangle Fraternity, Sports Car Club of America
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"



Re: [HACKERS] O_DIRECT in freebsd

2003-06-17 Thread Gavin Sherry
On Tue, 17 Jun 2003, Tom Lane wrote:

> "Christopher Kings-Lynne" <[EMAIL PROTECTED]> writes:
> > The reason I mention it is that Postgres already supports O_DIRECT I think
> > on some other platforms (for whatever reason).
> 
> [ sounds of grepping... ]  No.  The only occurrence of O_DIRECT in the
> source tree is in TODO:
> 
> * Consider use of open/fcntl(O_DIRECT) to minimize OS caching
> 
> I personally disagree with this TODO item for the same reason Sean
> cited: Postgres is designed and tuned to rely on OS-level disk caching,
> and bypassing that cache is far more likely to hurt our performance than
> help it.

DB2 and Oracle, from memory, allow users to pass hints to the planner to
use/not use file system caching. This could be useful if you had an
application retrieving a large amount of data on an ad hoc basis. The large
retrieval would empty out the disk cache, thereby negatively impacting
other applications operating on data which should be cached.

Gavin




Re: [HACKERS] O_DIRECT in freebsd

2003-06-17 Thread Tom Lane
"Christopher Kings-Lynne" <[EMAIL PROTECTED]> writes:
> The reason I mention it is that Postgres already supports O_DIRECT I think
> on some other platforms (for whatever reason).

[ sounds of grepping... ]  No.  The only occurrence of O_DIRECT in the
source tree is in TODO:

* Consider use of open/fcntl(O_DIRECT) to minimize OS caching

I personally disagree with this TODO item for the same reason Sean
cited: Postgres is designed and tuned to rely on OS-level disk caching,
and bypassing that cache is far more likely to hurt our performance than
help it.

However, if someone wants to do some experimentation with O_DIRECT, I'd
be as interested as anyone to find out what happens...

regards, tom lane



Re: [HACKERS] O_DIRECT in freebsd

2003-06-17 Thread Curt Sampson
On Tue, 17 Jun 2003, Christopher Kings-Lynne wrote:

> "A new DIRECTIO kernel option enables support for read operations that
> bypass the buffer cache and put data directly into a userland buffer ..."
>
> Will PostgreSQL pick this up automatically, or do we need to add extra
> checks?

You don't want it to. It's more efficient just to use mmap, because then
all the paging and caching issues are taken care of for you.
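
A minimal sketch of that approach (no error handling; MADV_SEQUENTIAL
and MADV_DONTNEED are the usual hints on FreeBSD and Linux):

/*
 * mmap-based sequential scan sketch.  The VM system does the caching;
 * madvise() tells it the pages are read once, front to back, and can
 * be reclaimed early.  Error handling omitted for brevity.
 */
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static long
scan_file(const char *path)
{
    struct stat st;
    int         fd = open(path, O_RDONLY);
    char       *p;
    off_t       i;
    long        sum = 0;

    fstat(fd, &st);
    p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);

    /* we will read straight through, once */
    madvise(p, st.st_size, MADV_SEQUENTIAL);

    for (i = 0; i < st.st_size; i++)
        sum += p[i];                /* stand-in for real tuple processing */

    /* the pages can be reclaimed as soon as we are done */
    madvise(p, st.st_size, MADV_DONTNEED);
    munmap(p, st.st_size);
    close(fd);
    return sum;
}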

cjs
-- 
Curt Sampson  <[EMAIL PROTECTED]>   +81 90 7737 2974   http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light.  --XTC



Re: [HACKERS] O_DIRECT in freebsd

2003-06-17 Thread Christopher Kings-Lynne
> > Will PostgreSQL pick this up automatically, or do we need to add
> > extra checks?
>
> Extra checks, though I'm not sure why you'd want this.  This is the
> equivalent of a nice way of handling raw I/O for read-only
> operations... which would be bad.  Call me crazy, but unless you're on

The reason I mention it is that Postgres already supports O_DIRECT I think
on some other platforms (for whatever reason).

Chris




Re: [HACKERS] O_DIRECT in freebsd

2003-06-17 Thread Sean Chittenden
> I noticed this in the FreeBSD 5.1 release notes:
> 
> "A new DIRECTIO kernel option enables support for read operations
> that bypass the buffer cache and put data directly into a userland
> buffer. This feature requires that the O_DIRECT flag is set on the
> file descriptor and that both the offset and length for the read
> operation are multiples of the physical media sector size. [MERGED]"
> 
> MERGED means that it should also appear in FreeBSD 4.9.
> 
> Will PostgreSQL pick this up automatically, or do we need to add
> extra checks?

Extra checks, though I'm not sure why you'd want this.  This is the
equivalent of a nice way of handling raw I/O for read-only
operations... which would be bad.  Call me crazy, but unless you're on
an embedded device with a non-existent FS cache, or you're Oracle and
handle the I/O buffers in user space, the buffer cache is what helps
speed up PostgreSQL, since PostgreSQL leaves all of the caching
operations and optimization/alignment of pages up to the OS (madvise()
could potentially speed things up on the VM side, but I won't get into
that... see TODO.mmap).
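
For anyone who wants to play with the new FreeBSD knob anyway, a
minimal sketch of a read that satisfies the rules quoted above (a
512-byte sector size is assumed; posix_memalign() is used only to get
an aligned buffer):

/*
 * Sketch of a direct read under the DIRECTIO rules: O_DIRECT on the
 * descriptor, offset and length both multiples of the sector size,
 * and an aligned buffer.  Illustrative only.
 */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define SECTOR 512                  /* assumed physical sector size */

static ssize_t
direct_read(const char *path, off_t offset, size_t len, void **out)
{
    int     fd;
    ssize_t n = -1;

    /* both the offset and the length must be sector multiples */
    if (offset % SECTOR != 0 || len % SECTOR != 0)
        return -1;

    if (posix_memalign(out, SECTOR, len) != 0)
        return -1;

    fd = open(path, O_RDONLY | O_DIRECT);
    if (fd >= 0)
    {
        n = pread(fd, *out, len, offset);
        close(fd);
    }
    return n;
}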

-sc

-- 
Sean Chittenden
