Re: [HACKERS] [PATCH] Prefetch index pages for B-Tree index scans

2012-11-06 Thread John Lumby

Bruce Momjian wrote:
>
> On Fri, Nov 2, 2012 at 09:59:08AM -0400, John Lumby wrote:
> > However,the OP describes an implementation based on libaio.
> > Today what we have (for linux) is librt,  which is quite different.
>
> Well, good thing we didn't switch to using libaio, now that it is gone.
>
Yes,  I think you are correct.    Although I should correct myself about
status of libaio -  it seems many distros continue to provide it and at least
one other popular database (MySQL) uses it,   but as far as I can tell
the content has not been updated by the original authors for around 10 years.
That is perhaps not surprising since it does very little other than wrap
the linux kernel syscalls.

Set against the CPU-overhead disadvantage of librt,  I think the three
main advantages of librt vs libaio/kernel-aio for postgresql are :
  .   posix standard,  and probably easier to provide very similar
  implementation on windows  (I see at least one posix aio lib for windows)
  .   no restrictions on the way files are accessed  (kernel-aio imposes 
restrictions
  on open() flags and buffer alignment etc)
  .   it seems (from the recent postings about the earlier attempt to implement
  async io using libaio) that the posix threads style lends itself better 
to 
  fitting in with the postgresql backend model.

John

  

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Prefetch index pages for B-Tree index scans

2012-11-05 Thread Bruce Momjian
On Fri, Nov  2, 2012 at 09:59:08AM -0400, John Lumby wrote:
> Thanks for the mentioning this posting.    Interesting.
> However,    the OP describes an implementation based on libaio. 
> Today what we have (for linux) is librt,  which is quite different.
> It is arguable worse than libaio (well actually I am sure it is worse)
> since it is essentially just an encapsulation of using threads to do
> synchronous ios  -  you can look at it as making it easier to do what the 
> application could do itself if it set up its own pthreads. The linux
> kernel does not know about it and so the CPU overhead of checking for
> completion is higher.

Well, good thing we didn't switch to using libaio, now that it is gone.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Prefetch index pages for B-Tree index scans

2012-11-02 Thread John Lumby

Claudio Freire wrote:
>
> On Thu, Nov 1, 2012 at 10:59 PM, Greg Smith  wrote:
> > On 11/1/12 6:13 PM, Claudio Freire wrote:
> >
> >> posix_fadvise what's the trouble there, but the fact that the kernel
> >> stops doing read-ahead when a call to posix_fadvise comes. I noticed
> >> the performance hit, and checked the kernel's code. It effectively
> >> changes the prediction mode from sequential to fadvise, negating the
> >> (assumed) kernel's prefetch logic.
> >
> > The Linux posix_fadvise implementation never seemed like it was well liked
> > by the kernel developers. Quirky stuff like this popped up all the time
> > during that period, when effective_io_concurrency was being added. I wonder
> > how far back the fadvise/read-ahead conflict goes back.
>
> Well, to be precise it's not so much as a problem in posix_fadvise
> itself, it's a problem in how it interacts with readahead. Since
> readahead works at the memory mapper level, and only when actually
> performing I/O (which would seem at first glance quite sensible), it
> doesn't get to see fadvise activity.
>
> FADV_WILLNEED is implemented as a forced readahead, which doesn't
> update any of the readahead context structures. Again, at first
> glance, this would seem sensible (explicit hints shouldn't interfere
> with pattern detection logic). However, since those pages are (after
> the fadvise call) under async I/O, next time the memory mapper needs
> that page, instead of requesting I/O through readahead logic, it will
> wait for async I/O to complete.
>
> IOW, what was sequential in fact, became invisible to readahead,
> indistinguishable from random I/O. Whatever page fadvise failed to
> predict will be treated as random I/O, and here the trouble lies.

And this may be one other advantage of async io over posix_fadvise in
the linux environment (with the present mmap behaviour) :
that async io achives the same objective of improving disk/processing overlap
without the mentioned interference with read-ahead.
Although to confirm this would ideally require 3-way comparing
   posix-fadvise + existing readahead behaviour
   posix-fadvise + modify existing readahead behaviour 
  to not force waiting for current async io
                              (i.e. just check the aio and continue normal 
readahead if in progress)
   async io wth no posix-fadvise

It seems in general to be preferable to have an implementation that is
less dependent on specific behaviour of the OS read-head mechanism.

>
> >> I've mused about the possibility to batch async_io requests, and use
> >> the scatter/gather API instead of sending tons of requests to the
> >
> >> kernel. I think doing so would enable a zero-copy path that could very
> >> possibly imply big speed improvements when memory bandwidth is the
> >> bottleneck.
> >
> > Another possibly useful bit of history here for you. Greg Stark wrote a
> > test program that used async I/O effectively on both Linux and Solaris.
> > Unfortunately, it was hard to get that to work given how Postgres does its
> > buffer I/O, and using processes instead of threads. This looks like the
> > place he commented on why:
> >
> > http://postgresql.1045698.n5.nabble.com/Multi-CPU-Queries-Feedback-and-or-suggestions-wanted-td1993361i20.html
> >
> > The part I think was relevant there from him:
> >
> > "In the libaio view of the world you initiate io and either get a
> > callback or call another syscall to test if it's complete. Either
> > approach has problems for Postgres. If the process that initiated io
> > is in the middle of a long query it might take a long time, or not even
> > never get back to complete the io. The callbacks use threads...
> >
> > And polling for completion has the problem that another process could
> > be waiting on the io and can't issue a read as long as the first
> > process has the buffer locked and io in progress. I think aio makes a
> > lot more sense if you're using threads so you can start a thread to
> > wait for the io to complete."
>
> I noticed that. I always envisioned async I/O as managed by some
> dedicated process. One that could check for completion or receive
> callbacks. Postmaster, for instance.

Thanks for the mentioning this posting.    Interesting.
However,    the OP describes an implementation based on libaio. 
Today what we have (for linux) is librt,  which is quite different.
It is arguable worse than libaio (well actually I am sure it is worse)
since it is essentially just an encapsulation of using threads to do
synchronous ios  -  you can look at it as making it easier to do what the 
application could do itself if it set up its own pthreads. The linux
kernel does not know about it and so the CPU overhead of checking for
completion is higher.
But if async io is used *only* for prefetching, and not for the actual
ReadBuffer itself   (which is what I did),   then the problem
mentioned by the OP 
 "If the process that initiated io is in the middle of a long 

Re: [HACKERS] [PATCH] Prefetch index pages for B-Tree index scans

2012-11-01 Thread Claudio Freire
On Thu, Nov 1, 2012 at 10:59 PM, Greg Smith  wrote:
> On 11/1/12 6:13 PM, Claudio Freire wrote:
>
>> posix_fadvise what's the trouble there, but the fact that the kernel
>> stops doing read-ahead when a call to posix_fadvise comes. I noticed
>> the performance hit, and checked the kernel's code. It effectively
>> changes the prediction mode from sequential to fadvise, negating the
>> (assumed) kernel's prefetch logic.
>
>
...
>
> The Linux posix_fadvise implementation never seemed like it was well liked
> by the kernel developers.  Quirky stuff like this popped up all the time
> during that period, when effective_io_concurrency was being added.  I wonder
> how far back the fadvise/read-ahead conflict goes back.

Well, to be precise it's not so much as a problem in posix_fadvise
itself, it's a problem in how it interacts with readahead. Since
readahead works at the memory mapper level, and only when actually
performing I/O (which would seem at first glance quite sensible), it
doesn't get to see fadvise activity.

FADV_WILLNEED is implemented as a forced readahead, which doesn't
update any of the readahead context structures. Again, at first
glance, this would seem sensible (explicit hints shouldn't interfere
with pattern detection logic). However, since those pages are (after
the fadvise call) under async I/O, next time the memory mapper needs
that page, instead of requesting I/O through readahead logic, it will
wait for async I/O to complete.

IOW, what was sequential in fact, became invisible to readahead,
indistinguishable from random I/O. Whatever page fadvise failed to
predict will be treated as random I/O, and here the trouble lies.

>> I've mused about the possibility to batch async_io requests, and use
>> the scatter/gather API instead of sending tons of requests to the
>
>> kernel. I think doing so would enable a zero-copy path that could very
>> possibly imply big speed improvements when memory bandwidth is the
>> bottleneck.
>
> Another possibly useful bit of history here for you.  Greg Stark wrote a
> test program that used async I/O effectively on both Linux and Solaris.
> Unfortunately, it was hard to get that to work given how Postgres does its
> buffer I/O, and using processes instead of threads.  This looks like the
> place he commented on why:
>
> http://postgresql.1045698.n5.nabble.com/Multi-CPU-Queries-Feedback-and-or-suggestions-wanted-td1993361i20.html
>
> The part I think was relevant there from him:
>
> "In the libaio view of the world you initiate io and either get a
> callback or call another syscall to test if it's complete. Either
> approach has problems for Postgres. If the process that initiated io
> is in the middle of a long query it might take a long time, or not even
> never get back to complete the io. The callbacks use threads...
>
> And polling for completion has the problem that another process could
> be waiting on the io and can't issue a read as long as the first
> process has the buffer locked and io in progress. I think aio makes a
> lot more sense if you're using threads so you can start a thread to
> wait for the io to complete."

I noticed that. I always envisioned async I/O as managed by some
dedicated process. One that could check for completion or receive
callbacks. Postmaster, for instance.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Prefetch index pages for B-Tree index scans

2012-11-01 Thread Greg Smith

On 11/1/12 6:13 PM, Claudio Freire wrote:


posix_fadvise what's the trouble there, but the fact that the kernel
stops doing read-ahead when a call to posix_fadvise comes. I noticed
the performance hit, and checked the kernel's code. It effectively
changes the prediction mode from sequential to fadvise, negating the
(assumed) kernel's prefetch logic.


That's really interesting.  There was a patch submitted at one point to 
use POSIX_FADV_SEQUENTIAL on sequential scans, and that wasn't a 
repeatable improvement either, so it was canned at

http://archives.postgresql.org/pgsql-hackers/2008-10/msg01611.php

The Linux posix_fadvise implementation never seemed like it was well 
liked by the kernel developers.  Quirky stuff like this popped up all 
the time during that period, when effective_io_concurrency was being 
added.  I wonder how far back the fadvise/read-ahead conflict goes back.


> I've mused about the possibility to batch async_io requests, and use
> the scatter/gather API instead of sending tons of requests to the
> kernel. I think doing so would enable a zero-copy path that could very
> possibly imply big speed improvements when memory bandwidth is the
> bottleneck.

Another possibly useful bit of history here for you.  Greg Stark wrote a 
test program that used async I/O effectively on both Linux and Solaris. 
 Unfortunately, it was hard to get that to work given how Postgres does 
its buffer I/O, and using processes instead of threads.  This looks like 
the place he commented on why:


http://postgresql.1045698.n5.nabble.com/Multi-CPU-Queries-Feedback-and-or-suggestions-wanted-td1993361i20.html

The part I think was relevant there from him:

"In the libaio view of the world you initiate io and either get a
callback or call another syscall to test if it's complete. Either
approach has problems for Postgres. If the process that initiated io
is in the middle of a long query it might take a long time, or not even 
never get back to complete the io. The callbacks use threads...


And polling for completion has the problem that another process could
be waiting on the io and can't issue a read as long as the first
process has the buffer locked and io in progress. I think aio makes a
lot more sense if you're using threads so you can start a thread to
wait for the io to complete."

--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


FW: [HACKERS] [PATCH] Prefetch index pages for B-Tree index scans

2012-11-01 Thread John Lumby

Claudio wrote :
>
> Check the latest patch, it contains heap page prefetching too.
>

Oh yes I see. I missed that - I was looking in the wrong place.
I do have one question about the way you did it : by placing the
prefetch heap-page calls in _bt_next, which effectively means inside
a call from the index am index_getnext_tid to btgettuple, are you sure
you are synchronizing your prefetches of heap pages with the index am's
ReadBuffer's of heap pages? I.e. are you complying with this comment
from nodeBitmapHeapscan.c for prefetching its bitmap heap pages in
the bitmap-index-scan case:

* We issue prefetch requests *after* fetching the current page to try
* to avoid having prefetching interfere with the main I/O.

I can't really tell whether your design conforms to this and nor do I
know whether it is important, but I decided to do it in the same manner,
and so implemented the heap-page fetching in index_fetch_heap

>
> async_io indeed may make that logic obsolete, but it's not redundant
> posix_fadvise what's the trouble there, but the fact that the kernel
> stops doing read-ahead when a call to posix_fadvise comes. I noticed
> the performance hit, and checked the kernel's code. It effectively
> changes the prediction mode from sequential to fadvise, negating the
> (assumed) kernel's prefetch logic.
>
I did not know that. Very interesting.


>
> I've mused about the possibility to batch async_io requests, and use
> the scatter/gather API insead of sending tons of requests to the
> kernel. I think doing so would enable a zero-copy path that could very
> possibly imply big speed improvements when memory bandwidth is the
> bottleneck.

I think you are totally correct on this point. If I recall, the 
glic (librt) aio does have an lio_listio but it is either a noop
or just loops over the list, I forget which (don't have its source right now),
but in any case I am sure there is a potential for implementing such a facility.
But to be really effective, it should be implemented in the kernel itself,
which we don't have today.

John
  

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Prefetch index pages for B-Tree index scans

2012-11-01 Thread Claudio Freire
On Thu, Nov 1, 2012 at 2:00 PM, Andres Freund  wrote:
>> > I agree. I'm a bit hesitant to subscribe to yet another mailing list
>>
>> FYI you can send messages to linux-kernel without subscribing (there's
>> no moderation either).
>>
>> Subscribing to linux-kernel is like drinking from a firehose :)
>
> linux-fsdevel is more reasonable though...

readahead logic is not at the filesystem level, but the memory mapper's.

I'll consider posting without subscribing.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Prefetch index pages for B-Tree index scans

2012-11-01 Thread Claudio Freire
On Thu, Nov 1, 2012 at 1:37 PM, John Lumby  wrote:
>
> Claudio wrote :
>>
>> Oops - forgot to effectively attach the patch.
>>
>
> I've read through your patch and the earlier posts by you and Cédric.
>
> This is very interesting.  You chose to prefetch index btree (key-ptr) 
> pages
> whereas I chose to prefetch the data pages pointed to by the key-ptr pages.
> Never mind why  --  I think they should work very well together  -  as both 
> have
> been demonstrated to produce improvements.   I will see if I can combine them,
> git permitting  (as of course their changed file lists overlap).

Check the latest patch, it contains heap page prefetching too.

> I was surprised by this design decision :
> /* start prefetch on next page, but not if we're reading sequentially 
> already, as it's counterproductive in those cases */
> Is it really?Are you assuming the it's redundant with posix_fadvise for 
> this case?
> I think possibly when async_io is also in use by the postgresql prefetcher,
> this decision could change.

async_io indeed may make that logic obsolete, but it's not redundant
posix_fadvise what's the trouble there, but the fact that the kernel
stops doing read-ahead when a call to posix_fadvise comes. I noticed
the performance hit, and checked the kernel's code. It effectively
changes the prediction mode from sequential to fadvise, negating the
(assumed) kernel's prefetch logic.

> However I think in some environments the async-io has significant benefits 
> over
> posix-fadvise,  especially (of course!)   where access is very non-sequential,
> but even also for sequential if there are many concurrent conflicting sets of 
> sequential
> command streams from different backends
> (always assuming the RAID can manage them concurrently).

I've mused about the possibility to batch async_io requests, and use
the scatter/gather API insead of sending tons of requests to the
kernel. I think doing so would enable a zero-copy path that could very
possibly imply big speed improvements when memory bandwidth is the
bottleneck.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Prefetch index pages for B-Tree index scans

2012-11-01 Thread Andres Freund
On Thursday, November 01, 2012 05:53:39 PM Marti Raudsepp wrote:
> On Tue, Oct 30, 2012 at 1:50 AM, Claudio Freire  
wrote:
> > On Mon, Oct 29, 2012 at 7:07 PM, Cédric Villemain
> > 
> >  wrote:
> >> Well, informing linux hackers may help.
> > 
> > I agree. I'm a bit hesitant to subscribe to yet another mailing list
> 
> FYI you can send messages to linux-kernel without subscribing (there's
> no moderation either).
> 
> Subscribing to linux-kernel is like drinking from a firehose :)

linux-fsdevel is more reasonable though...

Andres
-- 
Andres Freund   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Prefetch index pages for B-Tree index scans

2012-11-01 Thread Marti Raudsepp
On Tue, Oct 30, 2012 at 1:50 AM, Claudio Freire  wrote:
> On Mon, Oct 29, 2012 at 7:07 PM, Cédric Villemain
>  wrote:
>> Well, informing linux hackers may help.
>
> I agree. I'm a bit hesitant to subscribe to yet another mailing list

FYI you can send messages to linux-kernel without subscribing (there's
no moderation either).

Subscribing to linux-kernel is like drinking from a firehose :)

Regards,
Marti


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Prefetch index pages for B-Tree index scans

2012-11-01 Thread John Lumby

Claudio wrote :
>
> Oops - forgot to effectively attach the patch.
>

I've read through your patch and the earlier posts by you and Cédric.

This is very interesting.      You chose to prefetch index btree (key-ptr) pages
whereas I chose to prefetch the data pages pointed to by the key-ptr pages.
Never mind why  --  I think they should work very well together  -  as both have
been demonstrated to produce improvements.   I will see if I can combine them,
git permitting  (as of course their changed file lists overlap).

I was surprised by this design decision :
    /* start prefetch on next page, but not if we're reading sequentially 
already, as it's counterproductive in those cases */
Is it really?    Are you assuming the it's redundant with posix_fadvise for 
this case?
I think possibly when async_io is also in use by the postgresql prefetcher,
this decision could change.

Cédric wrote:
> If the gain is visible mostly for the backward and not for other access
>
> building the latest kernel with that patch included, replicating the
>

I  found improvement from forward scans. 
Actually I did not even try backward but only because I did not have time.
It should help both.

>> I don't even think windows supports posix_fadvise, but if async_io is
>> used (as hinted by the link Lumby posted), it would probably also work
>> in windows.

windows has async io and I think it would not be hard to extend my 
implementation
to windows  (although I don't plan it myself). Actually about 95% of the 
code I wrote
to implement async-io in postgresql concerns not the async io,  which is 
trivial,
but the buffer management.   With async io,   PrefetchBuffer must allocate and 
pin a
buffer,  (not too hard),   but now also every other part of buf mgr must know 
about the 
possibility that a buffer may be in async_io_in_progress state and be prepared 
to
determine the possible completion (quite hard)   -  and also if and when the 
prefetch requester
comes again with ReadBuffer,  buf mgr has to remember that this buffer was 
pinned by 
this backend during previous prefetch and must not be re-pinned a second time
(very hard without increasing size of the shared descriptor,  which was 
important
since there could be a very large number of these).
It ended up with a major change to bufmgr.c plus one new file for handling
buffer management aspects of starting, checking and terminating async io.

However I think in some environments the async-io has significant benefits over
posix-fadvise,  especially (of course!)   where access is very non-sequential,
but even also for sequential if there are many concurrent conflicting sets of 
sequential
command streams from different backends
(always assuming the RAID can manage them concurrently).

I've attached a snapshot patch of just the non-bitmap-index-scan changes I've 
made.
You can't compile it as is because I had to change the interface to 
PrefetchBuffer
and add a new DiscardBuffer which I did not include in this snapshot to avoid 
confusing.

John

  --- src/backend/executor/nodeIndexscan.c.orig	2012-10-31 15:24:12.083163547 -0400
+++ src/backend/executor/nodeIndexscan.c	2012-11-01 11:45:16.244967963 -0400
@@ -35,8 +35,13 @@
 #include "utils/rel.h"
 
 
+
 static TupleTableSlot *IndexNext(IndexScanState *node);
 
+#ifdef USE_PREFETCH
+extern unsigned int prefetch_dbOid; /*  database oid of relations on which prefetching to be done - 0 means all */
+extern unsigned int prefetch_index_scans; /*  boolean whether to prefetch bitmap heap scans */
+#endif   /* USE_PREFETCH */
 
 /* 
  *		IndexNext
@@ -418,7 +423,17 @@ ExecEndIndexScan(IndexScanState *node)
 	 * close the index relation (no-op if we didn't open it)
 	 */
 	if (indexScanDesc)
+{
 		index_endscan(indexScanDesc);
+
+#ifdef USE_PREFETCH
+if (   indexScanDesc->do_prefetch
+&& ( (struct BlockIdData*)0 != indexScanDesc->pfch_list )
+   ) {
+ pfree(indexScanDesc->pfch_list);
+}
+#endif   /* USE_PREFETCH */
+}
 	if (indexRelationDesc)
 		index_close(indexRelationDesc, NoLock);
 
@@ -609,6 +624,25 @@ ExecInitIndexScan(IndexScan *node, EStat
 			   indexstate->iss_NumScanKeys,
 			 indexstate->iss_NumOrderByKeys);
 
+#ifdef USE_PREFETCH
+/*  initialize prefetching   */
+		if (prefetch_index_scans
+ &&	(!RelationUsesLocalBuffers(indexstate->iss_ScanDesc->heapRelation)) /* I think this must always be true for an indexed heap ? */
+ && ((   (prefetch_dbOid > 0)
+   && (prefetch_dbOid == indexstate->iss_ScanDesc->heapRelation->rd_node.dbNode)
+ )
+ ||  (prefetch_dbOid == 0)
+)
+   ) {
+			indexstate->iss_ScanDesc->pfch_list = palloc( t

Re: [HACKERS] [PATCH] Prefetch index pages for B-Tree index scans

2012-10-29 Thread Claudio Freire
On Mon, Oct 29, 2012 at 7:07 PM, Cédric Villemain
 wrote:
>> But it also looks forgotten. Bringing it back to life would mean
>> building the latest kernel with that patch included, replicating the
>> benchmarks I ran here, sans pg patch, but with patched kernel, and
>> reporting the (hopefully equally dramatic) performance improvements in
>> the kernel ML. That would take me quite some time (not used to playing
>> with kernels, though it wouldn't be my first time either), though it
>> might be worth the effort.
>
> Well, informing linux hackers may help.

I agree. I'm a bit hesitant to subscribe to yet another mailing list,
but I happen to agree.

>> > I don't know how others (BSD, windows, ...) handle this case.
>>
>> I don't even think windows supports posix_fadvise, but if async_io is
>> used (as hinted by the link Lumby posted), it would probably also work
>> in windows.
>>
>> BSD probably supports it the same way linux does.
>
> I though of the opposite way: how do other kernels handle the backwards
> prefetch.

From what I saw (while reasearching that statement above), BSD's
read-ahead and fadvise implementations are way behind linux's.
Functional, though. I haven't been able to find the code responsible
for readahead in FreeBSD yet to confirm whether they have anything
supporting back-sequential patterns.

>> > Maybe the strategy to use our own prefetch is better, then I would like
>> > to use it also in places where we used to hack to make linux understand
>> > that we will benefits from prefetching.
>>
>> It would at least benefit those installations without the
>> latest-in-the-future kernel-with-backwards-readahead.
>
> We're speaking of PostgreSQL 9.3, running cutting edge PostgreSQL and old
> kernel in end 2013... Maybe it won't be so latest-in-the-future at this time.

Good point.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Prefetch index pages for B-Tree index scans

2012-10-29 Thread Cédric Villemain
> But it also looks forgotten. Bringing it back to life would mean
> building the latest kernel with that patch included, replicating the
> benchmarks I ran here, sans pg patch, but with patched kernel, and
> reporting the (hopefully equally dramatic) performance improvements in
> the kernel ML. That would take me quite some time (not used to playing
> with kernels, though it wouldn't be my first time either), though it
> might be worth the effort.

Well, informing linux hackers may help. 

> > I don't know how others (BSD, windows, ...) handle this case.
> 
> I don't even think windows supports posix_fadvise, but if async_io is
> used (as hinted by the link Lumby posted), it would probably also work
> in windows.
> 
> BSD probably supports it the same way linux does.

I though of the opposite way: how do other kernels handle the backwards 
prefetch. 

> 
> > Maybe the strategy to use our own prefetch is better, then I would like
> > to use it also in places where we used to hack to make linux understand
> > that we will benefits from prefetching.
> 
> It would at least benefit those installations without the
> latest-in-the-future kernel-with-backwards-readahead.

We're speaking of PostgreSQL 9.3, running cutting edge PostgreSQL and old 
kernel in end 2013... Maybe it won't be so latest-in-the-future at this time.

Btw the improvements you are doing looks good, I just add some information 
regarding what is achieved around us.

> 
> To which places are you referring to, btw?

Maintenance tasks.

-- 
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Développement, Expertise et Formation


signature.asc
Description: This is a digitally signed message part.


Re: [HACKERS] [PATCH] Prefetch index pages for B-Tree index scans

2012-10-29 Thread Claudio Freire
On Mon, Oct 29, 2012 at 4:17 PM, Cédric Villemain
 wrote:
>> Ok, this is the best I could come up with, without some real test hardware.
>>
>> The only improvement I see in single-disk scenarios:
>>   * Huge speedup of back-sequential index-only scans
>>   * Marginal speedup on forward index-only scans (5% or less)
>>   * No discernible difference in heap-including scans (even with heap
>> prefetch), but I'm pretty sure a real RAID setup would change this
>>   * No change in pgbench (so I guess no regression for small transactions)
>
> If the gain is visible mostly for the backward and not for other access
> patterns I suggest to check the work done in backward-prefecthing in linux.
>
> http://thread.gmane.org/gmane.linux.kernel.mm/73837 for example

That patch seems very similar (but in kernel terms) to this patch, so
I imagine it would also do the job.

But it also looks forgotten. Bringing it back to life would mean
building the latest kernel with that patch included, replicating the
benchmarks I ran here, sans pg patch, but with patched kernel, and
reporting the (hopefully equally dramatic) performance improvements in
the kernel ML. That would take me quite some time (not used to playing
with kernels, though it wouldn't be my first time either), though it
might be worth the effort.

> I don't know how others (BSD, windows, ...) handle this case.

I don't even think windows supports posix_fadvise, but if async_io is
used (as hinted by the link Lumby posted), it would probably also work
in windows.

BSD probably supports it the same way linux does.

> Maybe the strategy to use our own prefetch is better, then I would like to use
> it also in places where we used to hack to make linux understand that we will
> benefits from prefetching.

It would at least benefit those installations without the
latest-in-the-future kernel-with-backwards-readahead.

To which places are you referring to, btw?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Prefetch index pages for B-Tree index scans

2012-10-29 Thread Cédric Villemain
> Ok, this is the best I could come up with, without some real test hardware.
> 
> The only improvement I see in single-disk scenarios:
>   * Huge speedup of back-sequential index-only scans
>   * Marginal speedup on forward index-only scans (5% or less)
>   * No discernible difference in heap-including scans (even with heap
> prefetch), but I'm pretty sure a real RAID setup would change this
>   * No change in pgbench (so I guess no regression for small transactions)

If the gain is visible mostly for the backward and not for other access 
patterns I suggest to check the work done in backward-prefecthing in linux. 

http://thread.gmane.org/gmane.linux.kernel.mm/73837 for example

I don't know how others (BSD, windows, ...) handle this case.


Maybe the strategy to use our own prefetch is better, then I would like to use 
it also in places where we used to hack to make linux understand that we will 
benefits from prefetching.

-- 
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Développement, Expertise et Formation


signature.asc
Description: This is a digitally signed message part.


Re: [HACKERS] [PATCH] Prefetch index pages for B-Tree index scans

2012-10-29 Thread Claudio Freire
On Mon, Oct 29, 2012 at 12:53 PM, Claudio Freire  wrote:
>> Yes, I've seen that, though I thought it was only an improvement on
>> PrefetchBuffer. That patch would interact quite nicely with mine.
>>
>> I'm now trying to prefetch heap tuples, and I got to a really nice
>> place where I get an extra 30% speedup even on forward scans, but the
>> patch is rather green now for a review.
>>
>>> I am not where I have a proper setup this week but will reply at greater
>>> length next week.
>>
>> Great - will go on improving the patch in the meanwhile ;-)
>
> Ok, this is the best I could come up with, without some real test hardware.

Oops - forgot to effectively attach the patch.


postgresql-git-bt_fullfetch.diff
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Prefetch index pages for B-Tree index scans

2012-10-29 Thread Claudio Freire
On Tue, Oct 23, 2012 at 10:54 AM, Claudio Freire  wrote:
>> Indeed not.   I also looked into prefetching for pure index scans for
>> b-trees  (and extension to use async io).
>> http://archives.postgresql.org/message-id/BLU0-SMTP31709961D846CCF4F5EB4C2A3930%40phx.gbl
>
> Yes, I've seen that, though I thought it was only an improvement on
> PrefetchBuffer. That patch would interact quite nicely with mine.
>
> I'm now trying to prefetch heap tuples, and I got to a really nice
> place where I get an extra 30% speedup even on forward scans, but the
> patch is rather green now for a review.
>
>> I am not where I have a proper setup this week but will reply at greater
>> length next week.
>
> Great - will go on improving the patch in the meanwhile ;-)

Ok, this is the best I could come up with, without some real test hardware.

The only improvement I see in single-disk scenarios:
  * Huge speedup of back-sequential index-only scans
  * Marginal speedup on forward index-only scans (5% or less)
  * No discernible difference in heap-including scans (even with heap
prefetch), but I'm pretty sure a real RAID setup would change this
  * No change in pgbench (so I guess no regression for small transactions)

If I manage to get my hands on test hardware, I'll post results. But
we just had to bring some machines offline in our testing datacenter,
which effectively shrank my options, rather than expanding them. I
don't see that improving soon, so I'll post the patch and hope someone
else tests.

PS: should I add it to the commit fest? should we compare notes with
John Limby's patch first?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Prefetch index pages for B-Tree index scans

2012-10-23 Thread Claudio Freire
On Tue, Oct 23, 2012 at 9:44 AM, John Lumby  wrote:
>> From: Claudio Freire 
>> I hope I'm not talking to myself.
>
> Indeed not.   I also looked into prefetching for pure index scans for
> b-trees  (and extension to use async io).
> http://archives.postgresql.org/message-id/BLU0-SMTP31709961D846CCF4F5EB4C2A3930%40phx.gbl

Yes, I've seen that, though I thought it was only an improvement on
PrefetchBuffer. That patch would interact quite nicely with mine.

I'm now trying to prefetch heap tuples, and I got to a really nice
place where I get an extra 30% speedup even on forward scans, but the
patch is rather green now for a review.

> I am not where I have a proper setup this week but will reply at greater
> length next week.

Great - will go on improving the patch in the meanwhile ;-)


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Prefetch index pages for B-Tree index scans

2012-10-23 Thread John Lumby

> From: Claudio Freire 
> I hope I'm not talking to myself.

Indeed not.   I also looked into prefetching for pure index scans for b-trees  
(and extension to use async io).
http://archives.postgresql.org/message-id/BLU0-SMTP31709961D846CCF4F5EB4C2A3930%40phx.gbl

I am not where I have a proper setup this week but will reply at greater length 
next week.

John
  

Re: [HACKERS] [PATCH] Prefetch index pages for B-Tree index scans

2012-10-22 Thread Claudio Freire
On Thu, Oct 18, 2012 at 7:42 PM, Claudio Freire  wrote:
> Fun. That didn't take long.
>
> With the attached anti-sequential scan patch, and effective_io_concurrency=8:
>
>
>   QUERY PLAN
> -
>  GroupAggregate  (cost=0.00..4149039.04 rows=90257289 width=4) (actual
> time=26.964..84299.789 rows=9001 loops=1)
>->  Index Only Scan Backward using pgbench_accounts_pkey on
> pgbench_accounts  (cost=0.00..2795179.71 rows=90257289 width=4)
> (actual time=26.955..62761.774 rows=9001 loops=1)
>  Index Cond: ((aid >= 1000) AND (aid <= 2))
>  Heap Fetches: 0
>  Total runtime: 87170.355 ms
> I/O thoughput 22MB/s (twice as fast)
> I/O utilization 95% (I was expecting 100% but... hey... good enough)
>
> With e_i_c=24, it gets to 100% utilization and 30MB/s (that's 3 times
> faster). So, I'd like to know what you think, but maybe for
> back-sequential scans, prefetch should be set to a multiple (ie: x24)
> of e_i_c, in order to exploit read request merges.


Earlier patch had a regression for heap-including scans backwards with
RAID, so I made the back-sequential optimization index-only-only and
now I can find no regression. Make check runs fine, btw.

I hope I'm not talking to myself.


postgresql-git-bt_prefetch_backseq.diff
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Prefetch index pages for B-Tree index scans

2012-10-18 Thread Claudio Freire
On Thu, Oct 18, 2012 at 5:30 PM, Claudio Freire  wrote:
> Backward:
>
>QUERY PLAN
> --
>  GroupAggregate  (cost=0.00..4149039.04 rows=90257289 width=4) (actual
> time=28.190..157708.405 rows=9001 loops=1)
>->  Index Only Scan Backward using pgbench_accounts_pkey on
> pgbench_accounts  (cost=0.00..2795179.71 rows=90257289 width=4)
> (actual time=28.178..135282.317 rows=9001 loops=1)
>  Index Cond: ((aid >= 1000) AND (aid <= 2))
>  Heap Fetches: 0
>  Total runtime: 160735.539 ms
> I/O thoughput averages 12MB/s (a small increase), and the 3-second
> difference seems related to it (it's consistent).
> I/O utilization averages 88% (important increase)
>
> This last result makes me think deeper prefetching could be
> potentially beneficial (it would result in read merges), but it's
> rather hard to implement without a skiplist of leaf pages. Maybe the
> backward-sequential pattern could be detected. I'll have to tinker
> with that.

Fun. That didn't take long.

With the attached anti-sequential scan patch, and effective_io_concurrency=8:


  QUERY PLAN
-
 GroupAggregate  (cost=0.00..4149039.04 rows=90257289 width=4) (actual
time=26.964..84299.789 rows=9001 loops=1)
   ->  Index Only Scan Backward using pgbench_accounts_pkey on
pgbench_accounts  (cost=0.00..2795179.71 rows=90257289 width=4)
(actual time=26.955..62761.774 rows=9001 loops=1)
 Index Cond: ((aid >= 1000) AND (aid <= 2))
 Heap Fetches: 0
 Total runtime: 87170.355 ms
I/O thoughput 22MB/s (twice as fast)
I/O utilization 95% (I was expecting 100% but... hey... good enough)

With e_i_c=24, it gets to 100% utilization and 30MB/s (that's 3 times
faster). So, I'd like to know what you think, but maybe for
back-sequential scans, prefetch should be set to a multiple (ie: x24)
of e_i_c, in order to exploit read request merges.


postgresql-git-bt_prefetch_backseq.diff
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers