Re: Prereading using posix_fadvise (was Re: [HACKERS] Commitfest patches)

2008-03-28 Thread Zeugswetter Andreas OSB SD

Heikki wrote:
 It seems that the worst case for this patch is a scan on a table that
 doesn't fit in shared_buffers, but is fully cached in the OS cache. In
 that case, the posix_fadvise calls would be a certain waste of time.

I think this is a misunderstanding: the fadvise is not issued to read the
whole table, and it is not issued for table scans at all (and if it were, it
would only advise for the next N pages).

So it has nothing to do with table size. The fadvise calls need to be (and
are) limited to what can be used in the near future, and not issued for the
whole statement.

E.g. the N relevant next-level index pages, or the N relevant heap pages that
one index leaf page points at. Maybe in the index case N does not need to be
limited, since we have a natural limit on how many pointers fit on one page.

In general I think separate reader processes (or threads :-) that read
directly into the buffer pool would be a more portable and efficient
implementation. E.g. they could use scatter/gather I/O. So I think we should
make sure that the fadvise solution does not obstruct that path, and I think
it does not.

Gregory wrote:
 A more invasive form of this patch would be to assign and pin a buffer when
 the preread is done. That would mean subsequently we would have a pinned
 buffer ready to go and not need to go back to the buffer manager a second
 time. We would instead just complete the i/o by issuing a normal read call.

I guess you would rather need to mark the buffer as reserved for this page,
but let whichever backend needs it first pin it and issue the read.
I think the fadviser should not pin it in advance, since it cannot guarantee
that it will actually read the page [soon]. Rather, remember the buffer, and
later check and pin it for the read. Otherwise you might be blocking the
buffer. But I think doing something like this might be good, since it avoids
issuing duplicate fadvises.

Andreas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Prereading using posix_fadvise (was Re: [HACKERS] Commitfest patches)

2008-03-28 Thread Heikki Linnakangas

Zeugswetter Andreas OSB SD wrote:

 Heikki wrote:
  It seems that the worst case for this patch is a scan on a table that
  doesn't fit in shared_buffers, but is fully cached in the OS cache. In
  that case, the posix_fadvise calls would be a certain waste of time.

 I think this is a misunderstanding: the fadvise is not issued to read the
 whole table, and it is not issued for table scans at all (and if it were, it
 would only advise for the next N pages).

 So it has nothing to do with table size. The fadvise calls need to be (and
 are) limited to what can be used in the near future, and not issued for the
 whole statement.


Right, I was sloppy. Instead of table size, what matters is the amount 
of data the scan needs to access. The point remains that if the data is 
already in OS cache, the posix_fadvise calls are a waste of time, 
regardless of how many pages ahead you advise.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: Prereading using posix_fadvise (was Re: [HACKERS] Commitfest patches)

2008-03-28 Thread Bruce Momjian
Heikki Linnakangas wrote:
  So it has nothing to do with table size. The fadvise calls need to be (and
  are) limited to what can be used in the near future, and not issued for the
  whole statement.
 
 Right, I was sloppy. Instead of table size, what matters is the amount 
 of data the scan needs to access. The point remains that if the data is 
 already in OS cache, the posix_fadvise calls are a waste of time, 
 regardless of how many pages ahead you advise.

I now understand what posix_fadvise() is allowing us to do. 
posix_fadvise(POSIX_FADV_WILLNEED) allows us to tell the kernel we will
need a certain block in the future --- this seems much cheaper than a
background reader.
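
As a rough illustration of the call being discussed, here is a minimal sketch
in Python, whose os.posix_fadvise wraps the same system call. The 8192-byte
block size and the helper names (advise_willneed, read_block, demo) are
assumptions made for the example, not anything taken from the patch.

```python
import os
import tempfile

BLOCK_SIZE = 8192  # assumed page size (PostgreSQL's default BLCKSZ)


def advise_willneed(fd, block_no):
    """Hint that block_no will be read soon; return True if the hint was sent."""
    if not hasattr(os, "posix_fadvise"):
        return False  # this platform does not expose posix_fadvise
    try:
        os.posix_fadvise(fd, block_no * BLOCK_SIZE, BLOCK_SIZE,
                         os.POSIX_FADV_WILLNEED)
    except OSError:
        return False  # the call is only advice, so a failure is not fatal
    return True


def read_block(fd, block_no):
    """The real read; it still has to happen whether or not a hint was sent."""
    return os.pread(fd, BLOCK_SIZE, block_no * BLOCK_SIZE)


def demo():
    """Write four blocks to a scratch file, hint block 2, then read it."""
    with tempfile.TemporaryFile() as f:
        for i in range(4):
            f.write(bytes([i]) * BLOCK_SIZE)  # block i is filled with byte i
        f.flush()
        advise_willneed(f.fileno(), 2)
        return read_block(f.fileno(), 2)


if __name__ == "__main__":
    page = demo()
    print(len(page), page[0])
```

The hint returns without waiting for the data; the later read is what actually
delivers the block, which is why no separate reader process is needed.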

We know we will need the blocks, and telling the kernel can't hurt,
except that there is overhead in telling the kernel.  Has anyone
measured how much overhead?  I would be interested in a test program
that reads the same page over and over again from the kernel, with and
without a posix_fadvise() call.
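
A test program along those lines could look roughly like the sketch below. It
is written in Python for brevity, so interpreter overhead will inflate both
numbers; a C version would be needed for a trustworthy measurement. The
function names, the 8192-byte block size and the iteration count are
assumptions for illustration only.

```python
import os
import tempfile
import time

BLOCK_SIZE = 8192  # assumed page size


def time_reads(fd, iterations, use_fadvise):
    """Read block 0 `iterations` times and return the elapsed seconds."""
    do_advise = use_fadvise and hasattr(os, "posix_fadvise")
    start = time.perf_counter()
    for _ in range(iterations):
        if do_advise:
            os.posix_fadvise(fd, 0, BLOCK_SIZE, os.POSIX_FADV_WILLNEED)
        data = os.pread(fd, BLOCK_SIZE, 0)
        if len(data) != BLOCK_SIZE:
            raise IOError("short read")
    return time.perf_counter() - start


def run_benchmark(iterations=100_000):
    """Return (seconds without fadvise, seconds with fadvise)."""
    with tempfile.TemporaryFile() as f:
        f.write(b"x" * BLOCK_SIZE)
        f.flush()
        fd = f.fileno()
        os.pread(fd, BLOCK_SIZE, 0)  # warm-up so the page is in the OS cache
        plain = time_reads(fd, iterations, use_fadvise=False)
        advised = time_reads(fd, iterations, use_fadvise=True)
    return plain, advised


if __name__ == "__main__":
    n = 100_000
    plain, advised = run_benchmark(n)
    print("read only:      %.3f us/iteration" % (plain / n * 1e6))
    print("fadvise + read: %.3f us/iteration" % (advised / n * 1e6))
```

Because the page stays in the OS cache for the whole run, the difference
between the two figures approximates the pure cost of the extra call in the
already-cached case that Heikki is worried about.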

Should we consider only telling the kernel X pages ahead, meaning when
we are on page 10 we tell it about page 16?

-- 
  Bruce Momjian  [EMAIL PROTECTED]        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: Prereading using posix_fadvise (was Re: [HACKERS] Commitfest patches)

2008-03-28 Thread Martijn van Oosterhout
On Fri, Mar 28, 2008 at 11:41:58AM -0400, Bruce Momjian wrote:
 Should we consider only telling the kernel X pages ahead, meaning when
 we are on page 10 we tell it about page 16?

It's not so interesting for sequential reads; the kernel can work that
out for itself. Disk reads are usually in blocks of at least 128K
anyway, so there's a real good chance you have block 16 already.

The interesting case is an index scan, where you do a posix_fadvise() on
the next block *before* returning the items in the current block. Then
by the time you've processed these tuples, the next block will
hopefully have been read in and we can proceed without delay.

Another option is fadvising all the tuples referred to from an index page
at once, so the kernel can determine the optimal order to fetch them. The
read() calls will still be in tuple order, but the delay will (hopefully)
be less.
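
The second idea can be sketched as follows, again in Python rather than
backend C. The (block number, item number) pairs stand in for the heap
pointers on one index leaf page; that representation, the 8192-byte block
size and all function names are assumptions for illustration.

```python
import os
import tempfile

BLOCK_SIZE = 8192  # assumed page size


def make_demo_file(n_blocks):
    """Scratch file in which block i is filled with the byte value i."""
    f = tempfile.TemporaryFile()
    for i in range(n_blocks):
        f.write(bytes([i]) * BLOCK_SIZE)
    f.flush()
    return f


def prefetch_then_fetch(fd, tids):
    """Advise every distinct heap block first, then read in index order.

    tids is a list of (block_no, item_no) pairs in the order the index
    returns them. Returns the advised blocks and the fetched rows.
    """
    distinct = list(dict.fromkeys(blk for blk, _ in tids))  # dedupe, keep order
    if hasattr(os, "posix_fadvise"):
        for blk in distinct:
            os.posix_fadvise(fd, blk * BLOCK_SIZE, BLOCK_SIZE,
                             os.POSIX_FADV_WILLNEED)
    rows = []
    for blk, item in tids:  # the reads still follow index order
        page = os.pread(fd, BLOCK_SIZE, blk * BLOCK_SIZE)
        rows.append((blk, item, page[0]))
    return distinct, rows


if __name__ == "__main__":
    f = make_demo_file(8)
    print(prefetch_then_fetch(f.fileno(), [(5, 1), (2, 3), (5, 2), (0, 1)]))
    f.close()
```

All hints are issued before the first read, so the kernel sees the whole set
of wanted blocks at once and is free to schedule the I/O in whatever order it
prefers.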

Have a nice day,
-- 
Martijn van Oosterhout   [EMAIL PROTECTED]   http://svana.org/kleptog/
 Please line up in a tree and maintain the heap invariant while 
 boarding. Thank you for flying nlogn airlines.




Re: Prereading using posix_fadvise (was Re: [HACKERS] Commitfest patches)

2008-03-28 Thread Heikki Linnakangas

Bruce Momjian wrote:

 Heikki Linnakangas wrote:
   So it has nothing to do with table size. The fadvise calls need to be
   (and are) limited by what can be used in the near future, and not for
   the whole statement.

  Right, I was sloppy. Instead of table size, what matters is the amount
  of data the scan needs to access. The point remains that if the data is
  already in OS cache, the posix_fadvise calls are a waste of time,
  regardless of how many pages ahead you advise.

 I now understand what posix_fadvise() is allowing us to do.
 posix_fadvise(POSIX_FADV_WILLNEED) allows us to tell the kernel we will
 need a certain block in the future --- this seems much cheaper than a
 background reader.


Yep.


 We know we will need the blocks, and telling the kernel can't hurt,
 except that there is overhead in telling the kernel.  Has anyone
 measured how much overhead?  I would be interested in a test program
 that reads the same page over and over again from the kernel, with and
 without a posix_fadvise() call.


Agreed, that needs to be benchmarked next. There's also some overhead in 
doing the buffer manager hash table lookup to check whether the page is 
in shared_buffers. We could reduce that by the more complex approach 
Greg mentioned of allocating a buffer in shared_buffers when we do 
posix_fadvise.



 Should we consider only telling the kernel X pages ahead, meaning when
 we are on page 10 we tell it about page 16?


Yes. You don't want to fire off thousands of posix_fadvise calls 
upfront. That'll just flood the kernel, and it will most likely ignore 
any advice after the first few hundred or so. I'm not sure what the 
appropriate amount of read-ahead would be, though. It probably depends a 
lot on the OS and hardware, and needs to be adjustable.
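
A bounded look-ahead of that kind can be sketched as a sliding window: while
reading the block at position i, keep hints outstanding only up to position
i + distance. The Python below is an illustrative sketch, not the patch; the
`distance` parameter plays the role of the adjustable setting, and every name
in it is an assumption. With distance=6, being on page 10 triggers the hint
for page 16, as in Bruce's example.

```python
import os
import tempfile

BLOCK_SIZE = 8192  # assumed page size


def make_demo_file(n_blocks):
    """Scratch file in which block i is filled with the byte value i."""
    f = tempfile.TemporaryFile()
    for i in range(n_blocks):
        f.write(bytes([i]) * BLOCK_SIZE)
    f.flush()
    return f


def fadvise_willneed(fd, blk):
    """Send a WILLNEED hint for one block where the platform supports it."""
    if hasattr(os, "posix_fadvise"):
        os.posix_fadvise(fd, blk * BLOCK_SIZE, BLOCK_SIZE,
                         os.POSIX_FADV_WILLNEED)


def scan_with_lookahead(fd, blocks, distance, advise=fadvise_willneed):
    """Read blocks in order, hinting at most `distance` blocks ahead.

    Returns the sequence of ("advise", blk) and ("read", blk) events.
    """
    blocks = list(blocks)
    next_to_advise = 0  # index of the first block not yet hinted
    events = []
    for i, blk in enumerate(blocks):
        next_to_advise = max(next_to_advise, i + 1)  # never hint the current block
        limit = min(len(blocks), i + 1 + distance)   # end of the window (exclusive)
        while next_to_advise < limit:
            advise(fd, blocks[next_to_advise])
            events.append(("advise", blocks[next_to_advise]))
            next_to_advise += 1
        os.pread(fd, BLOCK_SIZE, blk * BLOCK_SIZE)
        events.append(("read", blk))
    return events


if __name__ == "__main__":
    f = make_demo_file(6)
    for event in scan_with_lookahead(f.fileno(), range(6), distance=2):
        print(event)
    f.close()
```

After the initial burst of `distance` hints, each read is preceded by exactly
one new hint, so the number of outstanding requests stays bounded no matter
how long the scan is.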


In some cases we can't easily read ahead more than a certain number of 
pages. For example, in a regular index scan, we can easily fire off 
posix_fadvise calls for all the heap pages referenced by a single index 
page, but reading ahead more than that becomes much more complex.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: Prereading using posix_fadvise (was Re: [HACKERS] Commitfest patches)

2008-03-28 Thread Bruce Momjian
Heikki Linnakangas wrote:
  Should we consider only telling the kernel X pages ahead, meaning when
  we are on page 10 we tell it about page 16?
 
 Yes. You don't want to fire off thousands of posix_fadvise calls 
 upfront. That'll just flood the kernel, and it will most likely ignore 
 any advice after the first few hundred or so. I'm not sure what the 
 appropriate amount of read-ahead would be, though. It probably depends a 
 lot on the OS and hardware, and needs to be adjustable.
 
 In some cases we can't easily read ahead more than a certain number of 
 pages. For example, in a regular index scan, we can easily fire off 
 posix_fadvise calls for all the heap pages referenced by a single index 
 page, but reading ahead more than that becomes much more complex.

And if you read ahead too far, the pages might get pushed out of the
kernel cache before you ask to read them.

-- 
  Bruce Momjian  [EMAIL PROTECTED]        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
