Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-08-25 Thread Heikki Linnakangas

On 08/25/2014 12:49 AM, johnlumby wrote:

On 08/19/14 18:27, Heikki Linnakangas wrote:

Please write the patch without atomic CAS operation. Just use a spinlock.


Umm, this is a new criticism I think.


Yeah. Be prepared that new issues will crop up as the patch gets slimmer 
and easier to review :-). Right now there's still so much chaff that 
it's difficult to see the wheat.



  I use CAS for things other than locking,
such as add/remove from a shared queue. I suppose a spinlock on
the entire queue could be used equivalently, but with more code
(extra confusion) and worse performance (coarser serialization).
What is your objection to using gcc's atomic ops? Portability?


Yeah, portability.

Atomic ops might make sense, but at this stage it's important to slim 
down the patch to the bare minimum, so that it's easier to review. Once 
the initial patch is in, you can improve it with additional patches.



There's a patch in the commitfest to add support for that,


Sorry, support for what? There are already spinlocks in postgresql;
do you mean some new kind? Please point me at it with a hackers msgid or
something.


Atomic ops: https://commitfest.postgresql.org/action/patch_view?id=1314

Once that's committed, you can use the new atomic ops in your patch. But 
until then, stick to spinlocks.



On 08/19/14 19:10, Claudio Freire wrote:

On Tue, Aug 19, 2014 at 7:27 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:

Also, please split prefetching of regular index scans into a separate patch. ...

That patch already happened on the list, and it wasn't a win in many
cases. I'm not sure it should be proposed independently of this one.
Maybe a separate patch, but it should be considered dependent on this.

I don't have an archive link at hand atm, but I could produce one later.

Several people have asked to split this patch into several smaller ones
and I have thought about it. It would introduce some awkward dependencies.
E.g. splitting the scanner code (index, relation-heap) into a separate
patch from the aio code would mean that the scanner patch becomes dependent
on the aio patch. They are not quite orthogonal.


Right now, please focus on the main AIO patch. That should show a 
benefit for bitmap heap scans too, so to demonstrate the benefits of 
AIO, you don't need to prefetch regular index scans. The main AIO patch 
can be written, performance tested, and reviewed without caring about 
regular index scans at all.



The reason is that the scanners call a new function, DiscardBuffer(blockid),
when aio is in use. We can get around it by providing a stub for that
function in the scanner patch, but then there is some risk of someone
getting the wrong version of that function in their build. It just adds
yet more complexity and breakage opportunities.


Regardless of the regular index scans, we'll need to discuss the new API 
of PrefetchBuffer and DiscardBuffer.


It would be simpler for the callers if you didn't require the 
DiscardBuffer calls. I think it would be totally feasible to write the 
patch that way. Just drop the buffer pin after the asynchronous read 
finishes. When the caller actually needs the page, it will call 
ReadBuffer which will pin it again. You'll get a little bit more bufmgr 
traffic that way, but I think it'll be fine in practice.
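Heikki's alternative, sketched here in pseudocode (this is an editor's reading of his suggestion, not code from the patch), removes DiscardBuffer from the API entirely:

```
PrefetchBuffer(rel, block):        # originator
    pin the buffer, start the asynchronous read

on aio completion (in the originator):
    drop the pin taken at prefetch time

ReadBuffer(rel, block):            # any backend, possibly much later
    pin the buffer again as usual
    # extra bufmgr pin/unpin traffic, but no DiscardBuffer call
    # and no obligation on callers that never re-read the block
```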



One further comment concerning these BufferAiocb and aiocb control blocks
being in shared memory:

I explained above that the BufferAiocb must be in shared memory for
wait/post.
The aiocb does not necessarily have to be in shared memory,
but since there is a one-to-one relationship between BufferAiocb and aiocb,
it makes the code much simpler, and the operation much more efficient,
if the aiocb is embedded in the BufferAiocb as I have it now.
E.g. if the aiocb is in private-process memory, then an additional
allocation scheme is needed (fixed number? palloc()'ing extra ones as
needed? ...) which adds complexity,


Yep, use palloc or a fixed pool. There's nothing wrong with that.


and probably some inefficiency, since a shared pool is usually
more efficient (allows a higher maximum per process etc), and there
would have to be some pointer de-referencing from BufferAiocb to aiocb,
adding some (small) overhead.


I think you're falling into the domain of premature optimization. A few 
pointer dereferences are totally insignificant, and the amount of memory 
you're saving pales in comparison to other similar non-shared pools and 
caches we have (catalog caches, for example). And on the other side of 
the coin, with a shared pool you'll waste memory when async I/O is not 
used (e.g because everything fits in RAM), and when it is used, you'll 
have more contention on locks and cache lines when multiple processes 
use the same objects.


The general guideline in PostgreSQL is that everything is 
backend-private, except structures used to communicate between backends.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-08-24 Thread johnlumby

Thanks for the replies and thoughts.

On 08/19/14 18:27, Heikki Linnakangas wrote:

On 08/20/2014 12:17 AM, John Lumby wrote:
I am attaching a new version of the patch for consideration in the 
current commit fest.


Thanks for working on this!

Relative to the one I submitted on 25 June in 
bay175-w412ff89303686022a9f16aa3...@phx.gbl
the method for handling aio completion using sigevent has been re-written
to use signals exclusively rather than a composite of signals and LWlocks,
and this has fixed the problem I mentioned before with the LWlock method.


ISTM the patch is still allocating stuff in shared memory that really 
doesn't belong there. Namely, the BufferAiocb structs. Or at least 
parts of it; there's also a waiter queue there which probably needs to 
live in shared memory, but the rest of it does not.


Actually the reason the BufferAiocb (the postgresql block corresponding
to the aio's aiocb) must be located in shared memory is that, as you said,
it acts as the anchor for the waiter list.

See further comment below.



At least BufAWaitAioCompletion is still calling aio_error() on an AIO 
request that might've been initiated by another backend. That's not OK.


Yes, you are right, and I agree with this one -
I will add an aio_error_return_code field in the BufferAiocb
and only the originator will set this from the real aiocb.



Please write the patch without atomic CAS operation. Just use a spinlock. 


Umm, this is a new criticism I think. I use CAS for things other
than locking, such as add/remove from a shared queue. I suppose a
spinlock on the entire queue could be used equivalently, but with more
code (extra confusion) and worse performance (coarser serialization).
What is your objection to using gcc's atomic ops? Portability?




There's a patch in the commitfest to add support for that,


Sorry, support for what? There are already spinlocks in postgresql;
do you mean some new kind? Please point me at it with a hackers msgid or
something.


but it's not committed yet, and all those 
USE_AIO_ATOMIC_BUILTIN_COMP_SWAP ifdefs make the patch more difficult 
to read. Same with all the other #ifdefs; please just pick a method 
that works.


Ok, yes, the ifdefs are unpleasant. I will do something about that.
Ideally they would be entirely confined to header files and macro
functions in C files - maybe I can do that. And eventually, when the dust
has settled, eliminate obsolete ifdef blocks altogether.

Also, please split prefetching of regular index scans into a separate 
patch. It's orthogonal to doing async I/O;


actually not completely orthogonal, see next

we could prefetch regular index scans with posix_fadvise too, and AIO 
should be useful for prefetching other stuff.


On 08/19/14 19:10, Claudio Freire wrote:

On Tue, Aug 19, 2014 at 7:27 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:

Also, please split prefetching of regular index scans into a separate patch. ...

That patch already happened on the list, and it wasn't a win in many
cases. I'm not sure it should be proposed independently of this one.
Maybe a separate patch, but it should be considered dependent on this.

I don't have an archive link at hand atm, but I could produce one later.
Several people have asked to split this patch into several smaller ones
and I have thought about it. It would introduce some awkward dependencies.
E.g. splitting the scanner code (index, relation-heap) into a separate
patch from the aio code would mean that the scanner patch becomes dependent
on the aio patch. They are not quite orthogonal.

The reason is that the scanners call a new function, DiscardBuffer(blockid),
when aio is in use. We can get around it by providing a stub for that
function in the scanner patch, but then there is some risk of someone
getting the wrong version of that function in their build. It just adds
yet more complexity and breakage opportunities.



- Heikki


One further comment concerning these BufferAiocb and aiocb control blocks
being in shared memory:

I explained above that the BufferAiocb must be in shared memory for
wait/post.
The aiocb does not necessarily have to be in shared memory,
but since there is a one-to-one relationship between BufferAiocb and aiocb,
it makes the code much simpler, and the operation much more efficient,
if the aiocb is embedded in the BufferAiocb as I have it now.
E.g. if the aiocb is in private-process memory, then an additional
allocation scheme is needed (fixed number? palloc()'ing extra ones as
needed? ...) which adds complexity, and probably some inefficiency, since
a shared pool is usually more efficient (allows a higher maximum per
process etc), and there would have to be some pointer de-referencing from
BufferAiocb to aiocb, adding some (small) overhead.


I understood your objection to use of shared memory as being that you 
don't want
a non-originator to access the originator's aiocb using 

Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-08-19 Thread Heikki Linnakangas

On 08/20/2014 12:17 AM, John Lumby wrote:

I am attaching a new version of the patch for consideration in the current 
commit fest.


Thanks for working on this!


Relative to the one I submitted on 25 June in 
bay175-w412ff89303686022a9f16aa3...@phx.gbl
the method for handling aio completion using sigevent has been re-written to use
signals exclusively rather than a composite of signals and LWlocks,
and this has fixed the problem I mentioned before with the LWlock method.


ISTM the patch is still allocating stuff in shared memory that really 
doesn't belong there. Namely, the BufferAiocb structs. Or at least parts 
of it; there's also a waiter queue there which probably needs to live in 
shared memory, but the rest of it does not.


At least BufAWaitAioCompletion is still calling aio_error() on an AIO 
request that might've been initiated by another backend. That's not OK.


Please write the patch without atomic CAS operation. Just use a 
spinlock. There's a patch in the commitfest to add support for that, but 
it's not committed yet, and all those USE_AIO_ATOMIC_BUILTIN_COMP_SWAP 
ifdefs make the patch more difficult to read. Same with all the other 
#ifdefs; please just pick a method that works.


Also, please split prefetching of regular index scans into a separate 
patch. It's orthogonal to doing async I/O; we could prefetch regular 
index scans with posix_fadvise too, and AIO should be useful for 
prefetching other stuff.


- Heikki





Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-08-19 Thread Claudio Freire
On Tue, Aug 19, 2014 at 7:27 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 Also, please split prefetching of regular index scans into a separate patch.
 It's orthogonal to doing async I/O; we could prefetch regular index scans
 with posix_fadvise too, and AIO should be useful for prefetching other
 stuff.

That patch already happened on the list, and it wasn't a win in many
cases. I'm not sure it should be proposed independently of this one.
Maybe a separate patch, but it should be considered dependent on this.

I don't have an archive link at hand atm, but I could produce one later.




Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-06-25 Thread Claudio Freire
On Tue, Jun 24, 2014 at 12:08 PM, John Lumby johnlu...@hotmail.com wrote:
 The question is, if you receive the notification of the I/O completion
 using a signal or a thread, is it safe to release the lwlock from the
 signal handler or a separate thread?

 In the forthcoming new version of the patch that uses sigevent,
 the originator locks a LWlock associated with that BAaiocb eXclusive,
 and, when signalled, in the signal handler it places that LWlock
 on a process-local queue of LWlocks awaiting release.
 (No, it cannot be safely released inside the signal handler or in a
 separate thread.) Whenever the mainline passes a CHECK_INTERRUPTS macro
 and at a few additional points in bufmgr, the backend walks this
 process-local queue and releases those LWlocks. This is also done if the
 originator itself issues a ReadBuffer, which is the most frequent case
 in which it is released.

I suggest using a semaphore instead.

Semaphores are supposed to be incremented/decremented from multiple
threads or processes already. So, in theory, the callback itself
should be able to do it.

The problem with the process-local queue is that it may take time to
be processed (the time it takes to get to a CHECK_INTERRUPTS macro,
which as it happened with regexes, it can be quite high).




Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-06-25 Thread Heikki Linnakangas

On 06/25/2014 09:20 PM, Claudio Freire wrote:

On Tue, Jun 24, 2014 at 12:08 PM, John Lumby johnlu...@hotmail.com wrote:

The question is, if you receive the notification of the I/O completion
using a signal or a thread, is it safe to release the lwlock from the
signal handler or a separate thread?


In the forthcoming new version of the patch that uses sigevent,
the originator locks a LWlock associated with that BAaiocb eXclusive,
and, when signalled, in the signal handler it places that LWlock
on a process-local queue of LWlocks awaiting release.
(No, it cannot be safely released inside the signal handler or in a
separate thread.) Whenever the mainline passes a CHECK_INTERRUPTS macro
and at a few additional points in bufmgr, the backend walks this
process-local queue and releases those LWlocks. This is also done if the
originator itself issues a ReadBuffer, which is the most frequent case in
which it is released.


I suggest using a semaphore instead.

Semaphores are supposed to be incremented/decremented from multiple
threads or processes already. So, in theory, the callback itself
should be able to do it.


LWLocks are implemented with semaphores, so if you can increment a 
semaphore in the signal handler / callback thread, then in theory you 
should be able to release a LWLock. You'll need some additional 
synchronization within the same process, to make sure you don't release 
a lock in signal handler while you're just doing the same thing in the 
main thread. I'm not sure I want to buy into the notion that 
LWLockRelease must be generally safe to call from a signal handler, but 
maybe it's possible to have a variant of it that is.


On Linux at least we use System V semaphores, which are (unsurprisingly)
not listed as safe for use in a signal handler. See the list of
async-signal-safe functions in the signal(7) man page. The function used
to increment a POSIX semaphore, sem_post(), is in the list, however.


- Heikki





Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-06-25 Thread Andres Freund
On 2014-06-26 00:08:48 +0300, Heikki Linnakangas wrote:
 LWLocks are implemented with semaphores, so if you can increment a semaphore
 in the signal handler / callback thread, then in theory you should be able
 to release a LWLock.

I don't think that's a convincing argument even if semop et al were
signal safe. LWLocks also use spinlocks and it's a fairly bad idea to do
anything with the same spinlock both inside and outside a signal
handler.
I don't think we're going to get lwlock.c/LWLockRelease to work
reasonably from a signal handler without a lot of work.

 On Linux at least we use System V semaphores, which are (unsurprisingly)
 not listed as safe for use in a signal handler. See the list of
 async-signal-safe functions in the signal(7) man page. The function used
 to increment a POSIX semaphore, sem_post(), is in the list, however.

Heh, just wrote the same after reading the beginning of your email ;)

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-06-25 Thread John Lumby
My cut'n'pasting failed me at one point; corrected below.


 discussion about what is the difference between a synchronous read
 versus an asynchronous read as far as a non-originator waiting on it is
 concerned.

 I thought a bit more about this. There are currently two differences,
 one of which can easily be changed and one not so easy.

 1) The current code, even with sigevent, still makes the non-originator
 waiter call aio_error on the originator's aiocb to get the completion
 code. For the sigevent variation, easily changed to have the originator
 always call aio_error (from its CHECK_INTERRUPTS or from FileCompleteaio)
 and store that in the BAiocb.
 My idea of why not to do that was that, by having the non-originator
 check the aiocb, this would allow the waiter to proceed sooner. But for
 a different reason it actually doesn't. (The non-originator must still
 wait for the LWlock release.)

 2) Buffer pinning and returning the BufferAiocb to the free list.
 With synchronous IO, each backend that calls a ReadBuffer must pin
 the buffer early in the process.
 With asynchronous IO, initially only the originator gets the pin
 (and that is during PrefetchBuffer, not ReadBuffer).
 When the aio completes and some backend checks that completion,
 then the backend has various responsibilities:

 . pin the buffer if it did not already have one (from prefetch)
 . if it was the last such backend to make that check
   (amongst the cohort waiting on it),
   then return the BufferAiocb to the free list
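In outline, the completion-check responsibilities above read as follows (pseudocode; an editor's restatement, with names only suggestive of the patch's flow):

```
on_aio_completion_checked(backend, baiocb):
    if backend does not already hold a pin (from its own prefetch):
        pin the buffer
    if backend is the last of the cohort waiting on baiocb:
        return baiocb to the shared free list
```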


  



Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-06-24 Thread John Lumby



 From: st...@mit.edu
 Date: Mon, 23 Jun 2014 16:04:50 -0700
 Subject: Re: Extended Prefetching using Asynchronous IO - proposal and patch
 To: johnlu...@hotmail.com
 CC: klaussfre...@gmail.com; hlinnakan...@vmware.com; 
 pgsql-hackers@postgresql.org

 On Mon, Jun 23, 2014 at 2:43 PM, John Lumby johnlu...@hotmail.com wrote:
 It is when some *other* backend gets there first with the ReadBuffer that
 things are a bit trickier. The current version of the patch did polling for 
 that case
 but that drew criticism, and so an imminent new version of the patch
 uses the sigevent mechanism. And there are other ways still.

 I'm a bit puzzled by this though. Postgres *already* has code for this
 case. When you call ReadBuffer you set the bits on the buffer

Good question. Let me explain.
Yes, postgresql has code for the case of a backend being inside a
synchronous read() or write(), performed from a ReadBuffer(), and some
other backend wanting that buffer. Asynchronous aio is initiated not from
ReadBuffer but from PrefetchBuffer, and performs its aio_read into an
allocated, pinned, postgresql buffer. This is entirely different from the
synchronous io case.
Why? Because the issuer of the aio_read (the originator) is unaware
of this buffer pinned on its behalf, and is then free to do any other
reading or writing it wishes, such as more prefetching or any other
operation.
And furthermore, it may *never* issue a ReadBuffer for the block which it
prefetched.

Therefore, asynchronous IO is different from synchronous IO, and
a new bit, BM_AIO_IN_PROGRESS, in the buf_header is required to
track this aio operation until completion.

I would encourage you to read the new 
postgresql-prefetching-asyncio.README
in the patch file where this is explained in greater detail.

 indicating I/O is in progress. If another backend does ReadBuffer for
 the same block they'll get the same buffer and then wait until the
 first backend's I/O completes. ReadBuffer goes through some hoops to
 handle this (and all the corner cases such as the other backend's I/O
 completing and the buffer being reused for another block before the
 first backend reawakens). It would be a shame to reinvent the wheel.

No re-invention! Actually some effort has been made to use the
existing functions in bufmgr.c as much as possible rather than
rewriting them.


 The problem with using the Buffers I/O in progress bit is that the I/O
 might complete while the other backend is busy doing stuff. As long as
 you can handle the I/O completion promptly -- either in callback or
 thread or signal handler then that wouldn't matter. But I'm not clear
 that any of those will work reliably.

They both work reliably, but the criticism was that backend B polling
an aiocb of an aio issued by backend A is not documented as being
supported (although it happens to work), hence the proposed change to
use sigevent.

By the way, on the "will it actually work?" question which several folks
have raised, I should mention that this patch has been in semi-production
use for almost 2 years now in different stages of completion on all
postgresql releases from 9.1.4 to 9.5 devel. I would guess it has had
around 500 hours of operation by now. I'm sure there are bugs still to be
found but I am confident it is fundamentally sound.
 

 --
 greg
  



Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-06-24 Thread Heikki Linnakangas

On 06/24/2014 04:29 PM, John Lumby wrote:

On Mon, Jun 23, 2014 at 2:43 PM, John Lumby johnlu...@hotmail.com wrote:

It is when some *other* backend gets there first with the ReadBuffer that
things are a bit trickier. The current version of the patch did polling for 
that case
but that drew criticism, and so an imminent new version of the patch
uses the sigevent mechanism. And there are other ways still.


I'm a bit puzzled by this though. Postgres *already* has code for this
case. When you call ReadBuffer you set the bits on the buffer


Good question. Let me explain.
Yes, postgresql has code for the case of a backend being inside a
synchronous read() or write(), performed from a ReadBuffer(), and some
other backend wanting that buffer. Asynchronous aio is initiated not from
ReadBuffer but from PrefetchBuffer, and performs its aio_read into an
allocated, pinned, postgresql buffer. This is entirely different from the
synchronous io case.
Why? Because the issuer of the aio_read (the originator) is unaware
of this buffer pinned on its behalf, and is then free to do any other
reading or writing it wishes, such as more prefetching or any other
operation.
And furthermore, it may *never* issue a ReadBuffer for the block which it
prefetched.


I still don't see the difference. Once an asynchronous read is initiated 
on the buffer, it can't be used for anything else until the read has 
finished. This is exactly the same situation as with a synchronous read: 
after read() is called, the buffer can't be used for anything else until 
the call finishes.


In particular, consider the situation from another backend's point of 
view. Looking from another backend (i.e. one that didn't initiate the 
read), there's no difference between a synchronous and asynchronous 
read. So why do we need a different IPC mechanism for the synchronous 
and asynchronous cases? We don't.


I understand that *within the backend*, you need to somehow track the 
I/O, and you'll need to treat synchronous and asynchronous I/Os 
differently. But that's all within the same backend, and doesn't need to 
involve the flags or locks in shared memory at all. The inter-process 
communication doesn't need any changes.



The problem with using the Buffers I/O in progress bit is that the I/O
might complete while the other backend is busy doing stuff. As long as
you can handle the I/O completion promptly -- either in callback or
thread or signal handler then that wouldn't matter. But I'm not clear
that any of those will work reliably.


They both work reliably, but the criticism was that backend B polling
an aiocb of an aio issued by backend A is not documented as being
supported (although it happens to work), hence the proposed change to
use sigevent.


You didn't understand what Greg meant. You need to handle the completion 
of the I/O in the same process that initiated it, by clearing the 
in-progress bit of the buffer and releasing the I/O in-progress lwlock 
on it. And you need to do that very quickly after the I/O has finished, 
because there might be another backend waiting for the buffer and you 
don't want him to wait longer than necessary.


The question is, if you receive the notification of the I/O completion 
using a signal or a thread, is it safe to release the lwlock from the 
signal handler or a separate thread?



By the way, on the "will it actually work?" question which several folks
have raised, I should mention that this patch has been in semi-production
use for almost 2 years now in different stages of completion on all
postgresql releases from 9.1.4 to 9.5 devel. I would guess it has had
around 500 hours of operation by now. I'm sure there are bugs still to be
found but I am confident it is fundamentally sound.


Well, a committable version of this patch is going to look quite 
different from the first version that you posted, so I don't put much 
weight on how long you've tested the first version.


- Heikki




Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-06-24 Thread John Lumby
Thanks Heikki,


 Date: Tue, 24 Jun 2014 17:02:38 +0300
 From: hlinnakan...@vmware.com
 To: johnlu...@hotmail.com; st...@mit.edu
 CC: klaussfre...@gmail.com; pgsql-hackers@postgresql.org
 Subject: Re: Extended Prefetching using Asynchronous IO - proposal and patch

 On 06/24/2014 04:29 PM, John Lumby wrote:
 On Mon, Jun 23, 2014 at 2:43 PM, John Lumby johnlu...@hotmail.com wrote:
 It is when some *other* backend gets there first with the ReadBuffer that
 things are a bit trickier. The current version of the patch did polling 
 for that case
 but that drew criticism, and so an imminent new version of the patch
 uses the sigevent mechanism. And there are other ways still.

 I'm a bit puzzled by this though. Postgres *already* has code for this
 case. When you call ReadBuffer you set the bits on the buffer

 Good question. Let me explain.
 Yes, postgresql has code for the case of a backend is inside a synchronous
 read() or write(), performed from a ReadBuffer(), and some other backend
 wants that buffer. asynchronous aio is initiated not from ReadBuffer
 but from PrefetchBuffer, and performs its aio_read into an allocated, pinned,
 postgresql buffer. This is entirely different from the synchronous io case.
 Why? Because the issuer of the aio_read (the originator) is unaware
 of this buffer pinned on its behalf, and is then free to do any other
 reading or writing it wishes, such as more prefetching or any other 
 operation.
 And furthermore, it may *never* issue a ReadBuffer for the block which it
 prefetched.

 I still don't see the difference. Once an asynchronous read is initiated
 on the buffer, it can't be used for anything else until the read has
 finished. This is exactly the same situation as with a synchronous read:
 after read() is called, the buffer can't be used for anything else until
 the call finishes.

Ah, now I see what you and Greg are asking. See my next imbed below.


 In particular, consider the situation from another backend's point of
 view. Looking from another backend (i.e. one that didn't initiate the
 read), there's no difference between a synchronous and asynchronous
 read. So why do we need a different IPC mechanism for the synchronous
 and asynchronous cases? We don't.

 I understand that *within the backend*, you need to somehow track the
 I/O, and you'll need to treat synchronous and asynchronous I/Os
 differently. But that's all within the same backend, and doesn't need to
 involve the flags or locks in shared memory at all. The inter-process
 communication doesn't need any changes.

 The problem with using the Buffers I/O in progress bit is that the I/O
 might complete while the other backend is busy doing stuff. As long as
 you can handle the I/O completion promptly -- either in callback or
 thread or signal handler then that wouldn't matter. But I'm not clear
 that any of those will work reliably.

 They both work reliably, but the criticism was that backend B polling
 an aiocb of an aio issued by backend A is not documented as
 being supported (although it happens to work), hence the proposed
 change to use sigevent.

 You didn't understand what Greg meant. You need to handle the completion
 of the I/O in the same process that initiated it, by clearing the
 in-progress bit of the buffer and releasing the I/O in-progress lwlock
 on it. And you need to do that very quickly after the I/O has finished,
 because there might be another backend waiting for the buffer and you
 don't want him to wait longer than necessary.

I think I understand the question now. I didn't spell out the details
earlier. Let me explain a little more.
With this patch, when a read is issued, it is either a synchronous IO
(as before), or an asynchronous aio_read (new, represented by
both BM_IO_IN_PROGRESS *and* BM_AIO_IN_PROGRESS).
The way other backends wait on a synchronous IO in progress is unchanged.
But if BM_AIO_IN_PROGRESS, then *any* backend which requests
ReadBuffer on this block (including the originator) follows a new path
through BufCheckAsync() which, depending on various flags and context,
sends the backend down to FileCompleteaio to check and maybe wait.
So *all* backends who are waiting for a BM_AIO_IN_PROGRESS buffer
will wait in that way.
  

 The question is, if you receive the notification of the I/O completion
 using a signal or a thread, is it safe to release the lwlock from the
 signal handler or a separate thread?

In the forthcoming new version of the patch that uses sigevent,
the originator locks a LWLock associated with that BAaiocb eXclusive,
and, when signalled, in the signal handler it places that LWLock
on a process-local queue of LWLocks awaiting release.
(No, it cannot be safely released inside the signal handler or in a
separate thread.) Whenever the mainline passes a CHECK_FOR_INTERRUPTS macro,
and at a few additional points in bufmgr, the backend walks this process-local
queue and releases those LWLocks.

Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-06-24 Thread Heikki Linnakangas

On 06/24/2014 06:08 PM, John Lumby wrote:

The question is, if you receive the notification of the I/O completion
using a signal or a thread, is it safe to release the lwlock from the
signal handler or a separate thread?


In the forthcoming new version of the patch that uses sigevent,
the originator locks a LWLock associated with that BAaiocb eXclusive,
and, when signalled, in the signal handler it places that LWLock
on a process-local queue of LWLocks awaiting release.
(No, it cannot be safely released inside the signal handler or in a
separate thread.) Whenever the mainline passes a CHECK_FOR_INTERRUPTS macro,
and at a few additional points in bufmgr, the backend walks this process-local
queue and releases those LWLocks. This is also done if the originator
itself issues a ReadBuffer, which is the most frequent case in which it
is released.

Meanwhile,  any other backend will simply acquire Shared and release.


Ok, doing the work in CHECK_FOR_INTERRUPTS sounds safe. But is that fast 
enough? We have never made any hard guarantees on how often 
CHECK_FOR_INTERRUPTS() is called. In particular, if you're running 3rd 
party C code or PL code, there might be no CHECK_FOR_INTERRUPTS() calls 
for many seconds, or even more. That's a long time to hold onto a buffer 
I/O lock. I don't think that's acceptable :-(.



I think you are right that the existing io_in_progress_lock LWLock in the
buf_header could be used for this, because if there is an aio in progress,
then that lock cannot be in use for synchronous IO. I chose not to use it
because I preferred to keep the wait/post for async IO separate,
but they could both use the same LWLock. However, the way the LWLock
is acquired and released would still be a bit different because of the need
to have the originator release it in its mainline.


It would be nice to use the same LWLock.

However, if releasing a regular LWLock in a signal handler is not safe, 
and cannot be made safe, perhaps we should, after all, invent a whole 
new mechanism. One that would make it safe to release the lock in a 
signal handler.



By the way, on the "will it actually work, though?" question which several folks
have raised, I should mention that this patch has been in semi-production
use for almost 2 years now, in different stages of completion, on all postgresql
releases from 9.1.4 to 9.5 devel. I would guess it has had around
500 hours of operation by now. I'm sure there are bugs still to be
found, but I am confident it is fundamentally sound.


Well, a committable version of this patch is going to look quite
different from the first version that you posted, so I don't put much
weight on how long you've tested the first version.


Yes, I am quite willing to change it, time permitting.
I take the words "committable version" as a positive sign ...


BTW, sorry if I sound negative; I'm actually quite excited about this 
feature. A patch like this takes a lot of work, and usually several 
rewrites, until it's ready ;-). But I'm looking forward to it.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-06-23 Thread Jim Nasby

On 6/20/14, 5:12 PM, John Lumby wrote:

I also appreciate it is not easy to review the patch.
There are really 4 (or maybe 5) parts :

  .   async io (librt functions)
  .   buffer management   (allocating, locking and pinning etc)
  .   scanners making prefetch calls
  .   statistics

 and the autoconf input program

I will see what I can do.   Maybe putting an indicator against each modified 
file?


Generally the best way to handle cases like this is to create multiple patches, 
each of which builds upon previous ones.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-06-23 Thread John Lumby



 Date: Thu, 19 Jun 2014 15:43:44 -0300
 Subject: Re: Extended Prefetching using Asynchronous IO - proposal and patch
 From: klaussfre...@gmail.com
 To: st...@mit.edu
 CC: hlinnakan...@vmware.com; johnlu...@hotmail.com; 
 pgsql-hackers@postgresql.org

 On Thu, Jun 19, 2014 at 2:49 PM, Greg Stark st...@mit.edu wrote:
 I don't think the threaded implementation on Linux is the one to use
 though.  [... ] The overhead of thread communication
 will completely outweigh any advantage over posix_fadvise's partial
 win.

 What overhead?

 The only communication is through a done bit and associated
 synchronization structure when *and only when* you want to wait on it.


Threads do cost some extra CPU, but provided the system has some
spare CPU capacity, performance improves because of reduced IO wait.
I quoted a measured improvement of 17% - 18% in the README,
along with some more explanation of when the async IO gives an improvement.

 Furthermore, posix_fadvise is braindead on this use case; been there,
 done that. What you win with threads is a better postgres-kernel
 interaction: even if you lose some CPU performance, it's gonna beat
 posix_fadvise by a large margin.

 [...]

 When I investigated this I found the buffer manager's I/O bits seemed
 to already be able to represent the state we needed (i/o initiated on
 this buffer but not completed). The problem was in ensuring that a
 backend would process the i/o completion promptly when it might be in
 the midst of handling other tasks and might even get an elog() stack
 unwinding. The interface that actually fits Postgres best might be the
 threaded interface (orthogonal to the threaded implementation
 question) which is you give aio a callback which gets called on a
 separate thread when the i/o completes. The alternative is you give
 aio a list of operation control blocks and it tells you the state of
 all the i/o operations. But it's not clear to me how you arrange to do
 that regularly, promptly, and reliably.

 Indeed we've been musing about using librt's support of completion
 callbacks for that.

For the most common case, in which a backend issues a PrefetchBuffer
and then that *same* backend issues ReadBuffer, posix aio works
ideally, since there is no need for any callback or completion signal;
we simply check whether it is complete during the ReadBuffer.

It is when some *other* backend gets there first with the ReadBuffer that 
things are a bit trickier. The current version of the patch polled in 
that case,
but that drew criticism, and so an imminent new version of the patch
uses the sigevent mechanism. And there are other ways still.

In an earlier posting I reported that, in my benchmark,
99.8% of [FileCompleteaio] calls are from the originator and only 0.2% are
not. So, from a performance perspective, only the common case really matters.



Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-06-23 Thread Greg Stark
On Mon, Jun 23, 2014 at 2:43 PM, John Lumby johnlu...@hotmail.com wrote:
 It is when some *other* backend gets there first with the ReadBuffer that
 things are a bit trickier. The current version of the patch did polling
 for that case,
 but that drew criticism, and so an imminent new version of the patch
 uses the sigevent mechanism. And there are other ways still.

I'm a bit puzzled by this though. Postgres *already* has code for this
case. When you call ReadBuffer you set the bits on the buffer
indicating I/O is in progress. If another backend does ReadBuffer for
the same block they'll get the same buffer and then wait until the
first backend's I/O completes. ReadBuffer goes through some hoops to
handle this (and all the corner cases such as the other backend's I/O
completing and the buffer being reused for another block before the
first backend reawakens). It would be a shame to reinvent the wheel.

The problem with using the buffer's I/O-in-progress bit is that the I/O
might complete while the other backend is busy doing stuff. As long as
you can handle the I/O completion promptly -- either in a callback, a
thread, or a signal handler -- then that wouldn't matter. But I'm not clear
that any of those will work reliably.

-- 
greg




Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-06-20 Thread John Lumby
Thanks Fujii, that is a bug -- an #ifdef USE_PREFETCH is missing in
heapam.c (maybe several).
I will fix it in the next patch version.

I also appreciate it is not easy to review the patch.
There are really 4 (or maybe 5) parts :
   
 .   async io (librt functions)
 .   buffer management   (allocating, locking and pinning etc)
 .   scanners making prefetch calls
 .   statistics

    and the autoconf input program

I will see what I can do.   Maybe putting an indicator against each modified 
file?

I am currently working on two things :
  .   alternative way for non-originator of an aio_read to wait on completion
 (LWlock instead of polling the aiocb)
  This was talked about in several earlier posts and Claudio is also 
working on something there
  .   package up my benchmark

Cheers    John


 Date: Fri, 20 Jun 2014 04:21:19 +0900
 Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal 
 and patch
 From: masao.fu...@gmail.com
 To: johnlu...@hotmail.com
 CC: pgsql-hackers@postgresql.org; klaussfre...@gmail.com

 On Mon, Jun 9, 2014 at 11:12 AM, johnlumby johnlu...@hotmail.com wrote:
 updated version of patch compatible with git head of 140608,
 (adjusted proc oid and a couple of minor fixes)

 Compilation of patched version on MacOS failed. The error messages were

 gcc -O0 -Wall -Wmissing-prototypes -Wpointer-arith
 -Wdeclaration-after-statement -Wendif-labels
 -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
 -fwrapv -g -I../../../../src/include -c -o heapam.o heapam.c
 heapam.c: In function 'heap_unread_add':
 heapam.c:362: error: 'struct HeapScanDescData' has no member named
 'rs_Unread_Pfetched_next'
 heapam.c:363: error: 'struct HeapScanDescData' has no member named
 'rs_Unread_Pfetched_count'
 heapam.c:369: error: 'struct HeapScanDescData' has no member named
 'rs_Unread_Pfetched_count'
 heapam.c:375: error: 'struct HeapScanDescData' has no member named
 'rs_Unread_Pfetched_base'
 heapam.c:381: error: 'struct HeapScanDescData' has no member named
 'rs_Unread_Pfetched_base'
 heapam.c:387: error: 'struct HeapScanDescData' has no member named
 'rs_Unread_Pfetched_next'
 heapam.c:405: error: 'struct HeapScanDescData' has no member named
 'rs_Unread_Pfetched_count'
 heapam.c: In function 'heap_unread_subtract':
 heapam.c:419: error: 'struct HeapScanDescData' has no member named
 'rs_Unread_Pfetched_next'
 heapam.c:425: error: 'struct HeapScanDescData' has no member named
 'rs_Unread_Pfetched_count'
 heapam.c:434: error: 'struct HeapScanDescData' has no member named
 'rs_Unread_Pfetched_base'
 heapam.c:442: error: 'struct HeapScanDescData' has no member named
 'rs_Unread_Pfetched_base'
 heapam.c:452: error: 'struct HeapScanDescData' has no member named
 'rs_Unread_Pfetched_next'
 heapam.c:453: error: 'struct HeapScanDescData' has no member named
 'rs_Unread_Pfetched_next'
 heapam.c:454: error: 'struct HeapScanDescData' has no member named
 'rs_Unread_Pfetched_next'
 heapam.c:456: error: 'struct HeapScanDescData' has no member named
 'rs_Unread_Pfetched_count'
 heapam.c: In function 'heapgettup':
 heapam.c:944: error: 'struct HeapScanDescData' has no member named
 'rs_pfchblock'
 heapam.c:716: warning: unused variable 'ix'
 heapam.c: In function 'heapgettup_pagemode':
 heapam.c:1243: error: 'struct HeapScanDescData' has no member named
 'rs_pfchblock'
 heapam.c:1029: warning: unused variable 'ix'
 heapam.c: In function 'heap_endscan':
 heapam.c:1808: error: 'struct HeapScanDescData' has no member named
 'rs_Unread_Pfetched_base'
 heapam.c:1809: error: 'struct HeapScanDescData' has no member named
 'rs_Unread_Pfetched_base'
 make[4]: *** [heapam.o] Error 1
 make[3]: *** [heap-recursive] Error 2
 make[2]: *** [access-recursive] Error 2
 make[1]: *** [install-backend-recurse] Error 2
 make: *** [install-src-recurse] Error 2


 Huge patch is basically not easy to review. What about simplifying the patch
 by excluding non-core parts like the change of pg_stat_statements, so that
 reviewers can easily read the patch? We can add such non-core parts later.

 Regards,

 --
 Fujii Masao
  



Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-06-19 Thread Greg Stark
On Wed, May 28, 2014 at 2:19 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 How portable is POSIX aio nowadays? Googling around, it still seems that on
 Linux, it's implemented using threads. Does the thread-emulation
 implementation cause problems with the rest of the backend, which assumes
 that there is only a single thread? In any case, I think we'll want to
 encapsulate the AIO implementation behind some kind of an API, to allow
 other implementations to co-exist.

I think POSIX aio is pretty damn standard and it's a pretty fiddly
interface. If we abstract it behind an i/o interface we're going to
lose a lot of the power. Abstracting it behind a set of buffer manager
operations (initiate i/o on buffer, complete i/o on buffer, abort i/o
on buffer) should be fine but that's basically what we have, no?

I don't think the threaded implementation on Linux is the one to use
though. I find this *super* confusing but the kernel definitely
supports aio syscalls, glibc also has a threaded implementation it
uses if run on a kernel that doesn't implement the syscalls, and I
think there are existing libaio and librt libraries from outside glibc
that do one or the other. Which you build against seems to make a big
difference. My instinct is that anything but the kernel native
implementation will be worthless. The overhead of thread communication
will completely outweigh any advantage over posix_fadvise's partial
win.

The main advantage of posix aio is that we can actually receive the
data out of order. With posix_fadvise we can get the i/o and cpu
overlap but we will never process the later blocks until the earlier
requests are satisfied and processed in order. With aio you could do a
sequential scan, initiating i/o on 1,000 blocks and then processing
them as they arrive, initiating new requests as those blocks are
handled.

When I investigated this I found the buffer manager's I/O bits seemed
to already be able to represent the state we needed (i/o initiated on
this buffer but not completed). The problem was in ensuring that a
backend would process the i/o completion promptly when it might be in
the midst of handling other tasks and might even get an elog() stack
unwinding. The interface that actually fits Postgres best might be the
threaded interface (orthogonal to the threaded implementation
question) which is you give aio a callback which gets called on a
separate thread when the i/o completes. The alternative is you give
aio a list of operation control blocks and it tells you the state of
all the i/o operations. But it's not clear to me how you arrange to do
that regularly, promptly, and reliably.

The other gotcha here is that the kernel implementation only does
anything useful on DIRECT_IO files. That means you have to do *all*
the prefetching and i/o scheduling yourself. You would be doing that
anyways for sequential scans and bitmap scans -- and we already do it
with things like synchronised scans and posix_fadvise -- but index
scans would need to get some intelligence for when it makes sense to
read more than one page at a time.  It might be possible to do
something fairly coarse like having our i/o operators keep track of
how often i/o on a relation falls within a certain number of blocks of
an earlier i/o and autotune number of blocks to read based on that. It
might not be hard to do better than the kernel with even basic info
like what level of the index we're reading or what type of pointer
we're following.

Finally, when I did the posix_fadvise work I wrote a synthetic
benchmark for testing the equivalent i/o pattern of a bitmap scan. It
let me simulate bitmap scans of varying densities with varying
parameters, notably how many i/o to keep in flight at once. It
supported posix_fadvise or aio. You should look it up in the archives,
it made for some nice looking graphs. IIRC I could not find any build
environment where aio offered any performance boost at all. I think
this means I just didn't know how to build it against the right
libraries or wasn't using the right kernel or there was some skew
between them at the time.

-- 
greg




Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-06-19 Thread Claudio Freire
On Thu, Jun 19, 2014 at 2:49 PM, Greg Stark st...@mit.edu wrote:
 I don't think the threaded implementation on Linux is the one to use
 though. I find this *super* confusing but the kernel definitely
 supports aio syscalls, glibc also has a threaded implementation it
 uses if run on a kernel that doesn't implement the syscalls, and I
 think there are existing libaio and librt libraries from outside glibc
 that do one or the other. Which you build against seems to make a big
 difference. My instinct is that anything but the kernel native
 implementation will be worthless. The overhead of thread communication
 will completely outweigh any advantage over posix_fadvise's partial
 win.

What overhead?

The only communication is through a done bit and associated
synchronization structure when *and only when* you want to wait on it.

Furthermore, posix_fadvise is braindead on this use case; been there,
done that. What you win with threads is a better postgres-kernel
interaction: even if you lose some CPU performance, it's gonna beat
posix_fadvise by a large margin.


 The main advantage of posix aio is that we can actually receive the
 data out of order. With posix_fadvise we can get the i/o and cpu
 overlap but we will never process the later blocks until the earlier
 requests are satisfied and processed in order. With aio you could do a
 sequential scan, initiating i/o on 1,000 blocks and then processing
 them as they arrive, initiating new requests as those blocks are
 handled.

And each and every I/O you issue with it counts as such on the kernel side.

It's not the case with posix_fadvise, mind you, and that's terribly
damaging for performance.

 When I investigated this I found the buffer manager's I/O bits seemed
 to already be able to represent the state we needed (i/o initiated on
 this buffer but not completed). The problem was in ensuring that a
 backend would process the i/o completion promptly when it might be in
 the midst of handling other tasks and might even get an elog() stack
 unwinding. The interface that actually fits Postgres best might be the
 threaded interface (orthogonal to the threaded implementation
 question) which is you give aio a callback which gets called on a
 separate thread when the i/o completes. The alternative is you give
 aio a list of operation control blocks and it tells you the state of
 all the i/o operations. But it's not clear to me how you arrange to do
 that regularly, promptly, and reliably.

Indeed we've been musing about using librt's support of completion
callbacks for that.

 The other gotcha here is that the kernel implementation only does
 anything useful on DIRECT_IO files. That means you have to do *all*
 the prefetching and i/o scheduling yourself.

If that's the case, we should discard kernel-based implementations and
stick to thread-based ones. Postgres cannot do scheduling as
efficiently as the kernel, and it shouldn't try.

 You would be doing that
 anyways for sequential scans and bitmap scans -- and we already do it
 with things like synchronised scans and posix_fadvise

That only works because the patterns are semi-sequential. If you try
to schedule random access, it becomes effectively impossible without
hardware info.

The kernel is the one with hardware info.

 Finally, when I did the posix_fadvise work I wrote a synthetic
 benchmark for testing the equivalent i/o pattern of a bitmap scan. It
 let me simulate bitmap scans of varying densities with varying
 parameters, notably how many i/o to keep in flight at once. It
 supported posix_fadvise or aio. You should look it up in the archives,
 it made for some nice looking graphs. IIRC I could not find any build
 environment where aio offered any performance boost at all. I think
 this means I just didn't know how to build it against the right
 libraries or wasn't using the right kernel or there was some skew
 between them at the time.

If it's old, it's probable you didn't hit the kernel's braindeadness,
since it was introduced somewhat recently (somewhat, i.e., before 3.x I
believe).

Even if you did hit it, bitmap heap scans are blessed with sequential
ordering. The technique doesn't work nearly as well with random I/O
that might be sorted from time to time.

When traversing an index, you do a mostly sequential pattern due to
physical correlation, but not completely sequential. Not only that,
with the assumption of random I/O, and the uncertainty of when will
the scan be aborted too, you don't read ahead as much as you could if
you knew it was sequential or a full scan. That kills performance. You
don't fetch enough ahead of time to avoid stalls, and the kernel
doesn't do read-ahead either because posix_fadvise effectively
disables it, resulting in the equivalent of direct I/O with bad
scheduling.

Solving this for index scans isn't just a little more complex. It's
insanely more complex, because you need hardware information to do it
right. How many spindles, how many sectors per cylinder if it's

Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-06-19 Thread Fujii Masao
On Mon, Jun 9, 2014 at 11:12 AM, johnlumby johnlu...@hotmail.com wrote:
 updated version of patch compatible with git head of 140608,
 (adjusted proc oid and a couple of minor fixes)

Compilation of patched version on MacOS failed. The error messages were

gcc -O0 -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
-fwrapv -g -I../../../../src/include   -c -o heapam.o heapam.c
heapam.c: In function 'heap_unread_add':
heapam.c:362: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_next'
heapam.c:363: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_count'
heapam.c:369: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_count'
heapam.c:375: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_base'
heapam.c:381: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_base'
heapam.c:387: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_next'
heapam.c:405: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_count'
heapam.c: In function 'heap_unread_subtract':
heapam.c:419: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_next'
heapam.c:425: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_count'
heapam.c:434: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_base'
heapam.c:442: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_base'
heapam.c:452: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_next'
heapam.c:453: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_next'
heapam.c:454: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_next'
heapam.c:456: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_count'
heapam.c: In function 'heapgettup':
heapam.c:944: error: 'struct HeapScanDescData' has no member named
'rs_pfchblock'
heapam.c:716: warning: unused variable 'ix'
heapam.c: In function 'heapgettup_pagemode':
heapam.c:1243: error: 'struct HeapScanDescData' has no member named
'rs_pfchblock'
heapam.c:1029: warning: unused variable 'ix'
heapam.c: In function 'heap_endscan':
heapam.c:1808: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_base'
heapam.c:1809: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_base'
make[4]: *** [heapam.o] Error 1
make[3]: *** [heap-recursive] Error 2
make[2]: *** [access-recursive] Error 2
make[1]: *** [install-backend-recurse] Error 2
make: *** [install-src-recurse] Error 2


Huge patch is basically not easy to review. What about simplifying the patch
by excluding non-core parts like the change of pg_stat_statements, so that
reviewers can easily read the patch? We can add such non-core parts later.

Regards,

-- 
Fujii Masao




Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-06-09 Thread Claudio Freire
I'm having trouble setting max_async_io_prefetchers in postgresql.conf

It says it cannot be changed.

Is that fixed at initdb time? (sounds like a bad idea if it is)

On Sun, Jun 8, 2014 at 11:12 PM, johnlumby johnlu...@hotmail.com wrote:
 updated version of patch compatible with git head of 140608,
 (adjusted proc oid and a couple of minor fixes)




Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-06-01 Thread Heikki Linnakangas

On 06/01/2014 03:44 AM, johnlumby wrote:

If you look at the new patch, you'll see that for the different-pid case,
I still call aio_suspend with a timeout.
As you or Claudio pointed out earlier, it could just as well sleep
for the same timeout, but the small advantage of calling aio_suspend
is if the io completed just between the aio_error returning EINPROGRESS
and the aio_suspend call.
Also it makes the code simpler. In fact this change is quite small,
just a few lines in backend/storage/buffer/buf_async.c and
backend/storage/file/fd.c.

Based on this, I think it is not necessary to get rid of the polling
altogether (and in any case, as far as I can see, very difficult).


That's still just as wrong as it always has been. Just get rid of it. 
Don't put aiocb structs in shared memory at all. They don't belong there.



Well, as mentioned earlier, it is not broken. Whether it is
efficient I am not sure.
I have looked at the mutex in aio_suspend that you mentioned, and I am not
quite convinced that, if the caller is not the original aio_read process,
it renders the suspend() into an instant timeout. I will see if I can
verify that.


I don't see the point of pursuing this design further. Surely we don't want
to use polling here, and you're relying on undefined behavior anyway. I'm
pretty sure aio_return/aio_error won't work from a different process on all
platforms, even if it happens to work on Linux. Even on Linux, it might stop
working if the underlying implementation changes from the glibc pthread
emulation to something kernel-based.


Good point. I have included the guts of your little test program
(modified to do polling) in the existing autoconf test program
that decides on the
#define USE_AIO_ATOMIC_BUILTIN_COMP_SWAP.
See config/c-library.m4.
I hope this goes some way to answering your concern about robustness,
as at least now, if the implementation changes in some way that
renders the polling ineffective, it will be caught in configure.


No, that does not make it robust enough.

- Heikki




Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-05-31 Thread johnlumby

On 05/31/14 20:44, johnlumby wrote:

On 05/30/14 09:36, Claudio Freire wrote:

Good point. I have included the guts of your little test program
(modified to do polling) in the existing autoconf test program
that decides on the
#define USE_AIO_ATOMIC_BUILTIN_COMP_SWAP.
See config/c-library.m4.
I hope this goes some way to answering your concern about robustness,
as at least now, if the implementation changes in some way that
renders the polling ineffective, it will be caught in configure.



I meant to add that by including this test, which involves a fork(),
in the autoconf tester, on Windows
USE_AIO_ATOMIC_BUILTIN_COMP_SWAP would always be undefined.
(But it could then be defined manually if someone wanted to give it a try.)



Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-05-31 Thread Claudio Freire
On Sat, May 31, 2014 at 9:44 PM, johnlumby johnlu...@hotmail.com wrote:
 I'll try to do some measuring of performance with:
 a) git head
 b) git head + patch as-is
 c) git head + patch without aio_suspend in foreign processes (just re-read)
 d) git head + patch with a lwlock (or whatever works) instead of aio_suspend

 a-c will be the fastest; d might take a while.

 I'll let you know of the results as I get them.


 Claudio,  I am not quite sure if what I am submitting now is
 quite the same as any of yours.   As I promised before,  but have
 not yet done,  I will package one or two of my benchmarks and
 send them in.


It's a tad different. c will not do polling in the foreign process; I
will just let PG do the read again. d will be like polling, but
without the associated CPU overhead.




Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-05-30 Thread Heikki Linnakangas

On 05/30/2014 12:53 AM, John Lumby wrote:

Date: Thu, 29 May 2014 18:00:28 -0300
Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal 
and patch
From: klaussfre...@gmail.com
To: hlinnakan...@vmware.com
CC: johnlu...@hotmail.com; pgsql-hackers@postgresql.org

On Thu, May 29, 2014 at 5:39 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:

On 05/29/2014 11:34 PM, Claudio Freire wrote:


On Thu, May 29, 2014 at 5:23 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:


On 05/29/2014 04:12 PM, John Lumby wrote:




On 05/28/2014 11:52 PM, John Lumby wrote:

The patch seems to assume that you can put the aiocb struct in shared
memory, initiate an asynchronous I/O request from one process, and wait
for its completion from another process. I'm pretty surprised if that
works on any platform.



It works on linux.  Actually this ability allows the asyncio
implementation to reduce complexity in one respect
(yes I know it looks complex enough): it makes waiting for completion
of an in-progress IO simpler than for the existing synchronous IO case,
since librt takes care of the waiting.
Specifically, no need for extra wait-for-io control blocks
such as in bufmgr's WaitIO().



[checks]. No, it doesn't work. See attached test program.


Thanks for checking, and thanks for coming up with that test program.
However,  yes,  it really does work  --  always  (on linux).
Your test program is doing things in the wrong order -
it calls aio_suspend *before* aio_error.
However,  the rule is,  call aio_suspend *after* aio_error
and *only* if aio_error returns EINPROGRESS.


I see no such rule in the man pages of any of the functions involved. 
And it wouldn't matter anyway; the behavior is exactly the same if you 
call aio_error() first.



See the code changes to fd.c function FileCompleteaio()
to see how we have done it.   And I am attaching a corrected version
of your test program, which runs just fine.


As Claudio mentioned earlier, the way FileCompleteaio() uses aio_suspend 
is just a complicated way of polling. You might as well replace the 
aio_suspend() calls with pg_usleep().



It kinda seems to work sometimes, because of the way it's implemented in
glibc. The aiocb struct has a field for the result value and errno, and
when
the I/O is finished, the worker thread fills them in. aio_error() and
aio_return() just return the values of those fields, so calling
aio_error()
or aio_return() do in fact happen to work from a different process.
aio_suspend(), however, is implemented by sleeping on a process-local
mutex,
which does not work from a different process.

Even if it worked on Linux today, it would be a bad idea to rely on it
from
a portability point of view. No, the only sane way to make this work is
that
the process that initiates an I/O request is responsible for completing
it.
If another process needs to wait for an async I/O to complete, we must
use
some other means to do the waiting. Like the io_in_progress_lock that we
already have, for the same purpose.


But calls to it are timeouted by 10us, effectively turning the thing
into polling mode.


We don't want polling... And even if we did, calling aio_suspend() in a way
that's known to be broken, in a loop, is a pretty crappy way of polling.


Well,  as mentioned earlier,  it is not broken. Whether it is efficient I 
am not sure.
I have looked at the mutex in aio_suspend that you mentioned and I am not
quite convinced that,  if caller is not the original aio_read process,
it renders the suspend() into an instant timeout.  I will see if I can 
verify that.


I don't see the point of pursuing this design further. Surely we don't 
want to use polling here, and you're relying on undefined behavior 
anyway. I'm pretty sure aio_return/aio_error won't work from a different 
process on all platforms, even if it happens to work on Linux. Even on 
Linux, it might stop working if the underlying implementation changes 
from the glibc pthread emulation to something kernel-based.


- Heikki




Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-05-30 Thread Claudio Freire
On Fri, May 30, 2014 at 4:15 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 We don't want polling... And even if we did, calling aio_suspend() in a
 way
 that's known to be broken, in a loop, is a pretty crappy way of polling.


 Well,  as mentioned earlier,  it is not broken. Whether it is
 efficient I am not sure.
 I have looked at the mutex in aio_suspend that you mentioned and I am not
 quite convinced that,  if caller is not the original aio_read process,
 it renders the suspend() into an instant timeout.  I will see if I can
 verify that.


 I don't see the point of pursuing this design further. Surely we don't want
 to use polling here, and you're relying on undefined behavior anyway. I'm
 pretty sure aio_return/aio_error won't work from a different process on all
 platforms, even if it happens to work on Linux. Even on Linux, it might stop
 working if the underlying implementation changes from the glibc pthread
 emulation to something kernel-based.

I'll try to do some measuring of performance with:
a) git head
b) git head + patch as-is
c) git head + patch without aio_suspend in foreign processes (just re-read)
d) git head + patch with a lwlock (or whatever works) instead of aio_suspend

a-c will be the fastest to produce, d might take a while.

I'll let you know of the results as I get them.




Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-05-29 Thread John Lumby
Thanks for looking at it!

 Date: Thu, 29 May 2014 00:19:33 +0300
 From: hlinnakan...@vmware.com
 To: johnlu...@hotmail.com; pgsql-hackers@postgresql.org
 CC: klaussfre...@gmail.com
 Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal 
 and patch
 
 On 05/28/2014 11:52 PM, John Lumby wrote:
 
 
 The patch seems to assume that you can put the aiocb struct in shared 
 memory, initiate an asynchronous I/O request from one process, and wait 
 for its completion from another process. I'm pretty surprised if that 
 works on any platform.

It works on linux.  Actually this ability allows the asyncio implementation to
reduce complexity in one respect (yes I know it looks complex enough):
it makes waiting for completion of an in-progress IO simpler than for
the existing synchronous IO case, since librt takes care of the waiting.
Specifically, no need for extra wait-for-io control blocks
such as in bufmgr's WaitIO().

This also plays very nicely with the syncscan where the cohorts run close 
together.

If anyone would like to advise whether this also works on MacOS/BSD or Windows,
that would be good, as I can't verify it myself.

 
 How portable is POSIX aio nowadays? Googling around, it still seems that 
 on Linux, it's implemented using threads. Does the thread-emulation 
 implementation cause problems with the rest of the backend, which 
 assumes that there is only a single thread? In any case, I think we'll 

Good question   --   I am not aware of any dependency on a backend having only
a single thread.  Can you please point me to such dependencies?


 want to encapsulate the AIO implementation behind some kind of an API, 
 to allow other implementations to co-exist.

It is already  --  it follows the smgr (storage manager) -> md (magnetic disk) -> fd (filesystem)
layering used for the existing filesystem ops, including posix_fadvise.

 
 Benchmarks?

I will see if I can package mine up somehow.
Would be great if anyone else would like to benchmark it on theirs   ...


 - Heikki
  

Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-05-29 Thread John Lumby
I have pasted  below the EXPLAIN of one of my benchmark queries
(the one I reference in the README).
Plenty of nested loop joins.
However I think I understand your question as to how effective it would be
if the outer is not sorted, and I will see if I can dig into that
if I get time  (and it sounds as though Claudio is on it too).

The area of exactly what the best prefetch strategy should be for
each particular type of scan and context is a good one to work on.

John

 Date: Wed, 28 May 2014 18:12:23 -0700
 Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal 
 and patch
 From: p...@heroku.com
 To: klaussfre...@gmail.com
 CC: johnlu...@hotmail.com; pgsql-hackers@postgresql.org
 
 On Wed, May 28, 2014 at 5:59 PM, Claudio Freire klaussfre...@gmail.com 
 wrote:
  For nestloop, correct me if I'm wrong, but index scan nodes don't have
  visibility of the next tuple to be searched for.
 
 Nested loop joins are considered a particularly compelling case for
 prefetching, actually.
 
 -- 
 Peter Geoghegan

                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=801294.81..801294.81 rows=2 width=532)
   CTE deploy_zone_down
     ->  Recursive Union  (cost=1061.25..2687.40 rows=11 width=573)
           ->  Nested Loop  (cost=1061.25..1423.74 rows=1 width=41)
                 ->  Nested Loop  (cost=1061.11..1421.22 rows=14 width=49)
                       ->  Bitmap Heap Scan on entity zone_tree  (cost=1060.67..1175.80 rows=29 width=40)
                             Recheck Cond: ((name >= 'testZone-4375'::text) AND (name <= 'testZone-5499'::text) AND ((discriminator)::text = 'ZONE'::text))
                             ->  BitmapAnd  (cost=1060.67..1060.67 rows=29 width=0)
                                   ->  Bitmap Index Scan on entity_name  (cost=0.00..139.71 rows=5927 width=0)
                                         Index Cond: ((name >= 'testZone-4375'::text) AND (name <= 'testZone-5499'::text))
                                   ->  Bitmap Index Scan on entity_discriminatorx  (cost=0.00..920.70 rows=49636 width=0)
                                         Index Cond: ((discriminator)::text = 'ZONE'::text)
                       ->  Index Scan using metadata_value_owner_id on metadata_value mv  (cost=0.43..8.45 rows=1 width=17)
                             Index Cond: (owner_id = zone_tree.id)
                 ->  Index Scan using metadata_field_pkey on metadata_field mf  (cost=0.15..0.17 rows=1 width=8)
                       Index Cond: (id = mv.field_id)
                       Filter: ((name)::text = 'deployable'::text)
           ->  Nested Loop  (cost=0.87..126.34 rows=1 width=573)
                 ->  Nested Loop  (cost=0.72..125.44 rows=5 width=581)
                       ->  Nested Loop  (cost=0.29..83.42 rows=10 width=572)
                             ->  WorkTable Scan on deploy_zone_down dzd  (cost=0.00..0.20 rows=10 width=540)
                             ->  Index Scan using entity_discriminator_parent_zone on entity ch  (cost=0.29..8.31 rows=1 width=40)
                                   Index Cond: ((parent_id = dzd.zone_tree_id) AND ((discriminator)::text = 'ZONE'::text))
                       ->  Index Scan using metadata_value_owner_id on metadata_value mv_1  (cost=0.43..4.19 rows=1 width=17)
                             Index Cond: (owner_id = ch.id)
                 ->  Index Scan using metadata_field_pkey on metadata_field mf_1  (cost=0.15..0.17 rows=1 width=8)
                       Index Cond: (id = mv_1.field_id)
                       Filter: ((name)::text = 'deployable'::text)
   CTE deploy_zone_tree
     ->  Recursive Union  (cost=0.00..9336.89 rows=21 width=1105)
           ->  CTE Scan on deploy_zone_down dzd_1  (cost=0.00..0.22 rows=11 width=1105)
           ->  Nested Loop  (cost=0.43..933.63 rows=1 width=594)
                 ->  WorkTable Scan on deploy_zone_tree dzt  (cost=0.00..2.20 rows=110 width=581)
                 ->  Index Scan using entity_pkey on entity pt  (cost=0.43..8.46 rows=1 width=21)
                       Index Cond: (id = dzt.zone_tree_ancestor_parent_id)
                       Filter: ((discriminator)::text = ANY ('{VIEW,ZONE}'::text[]))
   CTE forward_host_ip
     ->  Nested Loop  (cost=1.30..149.65 rows=24 width=88)
           ->  Nested Loop  (cost=0.87..121.69 rows=48 width=56)
                 ->  Nested Loop  (cost=0.43..71.61 rows

Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-05-29 Thread Claudio Freire
On 28/05/2014 22:12, Peter Geoghegan p...@heroku.com wrote:

 On Wed, May 28, 2014 at 5:59 PM, Claudio Freire klaussfre...@gmail.com
wrote:
  For nestloop, correct me if I'm wrong, but index scan nodes don't have
  visibility of the next tuple to be searched for.

 Nested loop joins are considered a particularly compelling case for
 prefetching, actually.

Of course. I only doubt it can be implemented without fairly substantial changes
to all execution nodes.

I'll look into it


 --
 Peter Geoghegan


Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-05-29 Thread Heikki Linnakangas

On 05/29/2014 04:12 PM, John Lumby wrote:

Thanks for looking at it!


Date: Thu, 29 May 2014 00:19:33 +0300
From: hlinnakan...@vmware.com
To: johnlu...@hotmail.com; pgsql-hackers@postgresql.org
CC: klaussfre...@gmail.com
Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal 
and patch

On 05/28/2014 11:52 PM, John Lumby wrote:




The patch seems to assume that you can put the aiocb struct in shared
memory, initiate an asynchronous I/O request from one process, and wait
for its completion from another process. I'm pretty surprised if that
works on any platform.


It works on linux.  Actually this ability allows the asyncio implementation to
reduce complexity in one respect (yes I know it looks complex enough):
it makes waiting for completion of an in-progress IO simpler than for
the existing synchronous IO case, since librt takes care of the waiting.
Specifically, no need for extra wait-for-io control blocks
such as in bufmgr's WaitIO().


[checks]. No, it doesn't work. See attached test program.

It kinda seems to work sometimes, because of the way it's implemented in 
glibc. The aiocb struct has a field for the result value and errno, and 
when the I/O is finished, the worker thread fills them in. aio_error() 
and aio_return() just return the values of those fields, so calling 
aio_error() or aio_return() do in fact happen to work from a different 
process. aio_suspend(), however, is implemented by sleeping on a 
process-local mutex, which does not work from a different process.


Even if it worked on Linux today, it would be a bad idea to rely on it 
from a portability point of view. No, the only sane way to make this 
work is that the process that initiates an I/O request is responsible 
for completing it. If another process needs to wait for an async I/O to 
complete, we must use some other means to do the waiting. Like the 
io_in_progress_lock that we already have, for the same purpose.


- Heikki

/*
 * Test program to test if POSIX aio functions work across processes
 */

#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <aio.h>


char *shmem;

void
processA(void)
{
	int fd;
	struct aiocb *aiocbp = (struct aiocb *) shmem;
	char *buf = shmem + sizeof(struct aiocb);

	fd = open("aio-shmem-test-file", O_CREAT | O_WRONLY | O_SYNC, S_IRWXU);
	if (fd == -1)
	{
		fprintf(stderr, "open() failed\n");
		exit(1);
	}
	printf("processA starting AIO\n");

	strcpy(buf, "foobar");

	memset(aiocbp, 0, sizeof(struct aiocb));
	aiocbp->aio_fildes = fd;
	aiocbp->aio_offset = 0;
	aiocbp->aio_buf = buf;
	aiocbp->aio_nbytes = strlen(buf);
	aiocbp->aio_reqprio = 0;
	aiocbp->aio_sigevent.sigev_notify = SIGEV_NONE;

	if (aio_write(aiocbp) != 0)
	{
		fprintf(stderr, "aio_write() failed\n");
		exit(1);
	}
}

void
processB(void)
{
	struct aiocb *aiocbp = (struct aiocb *) shmem;
	const struct aiocb * const pl[1] = { aiocbp };
	int rv;


	printf("waiting for the write to finish in process B\n");

	if (aio_suspend(pl, 1, NULL) != 0)
	{
		fprintf(stderr, "aio_suspend() failed\n");
		exit(1);
	}
	printf("aio_suspend() returned 0\n");

	rv = aio_error(aiocbp);
	if (rv != 0)
	{
		fprintf(stderr, "aio_error returned %d: %s\n", rv, strerror(rv));
		exit(1);
	}
	rv = aio_return(aiocbp);
	printf("aio_return returned %d\n", rv);
}



int main(int argc, char **argv)
{
	int pidB;

	shmem = mmap(NULL, sizeof(struct aiocb) + 1000,
				 PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS,
				 -1, 0);
	if (shmem == MAP_FAILED)
	{
		fprintf(stderr, "mmap() failed\n");
		exit(1);
	}

#ifdef SINGLE_PROCESS
	/* this works */
	processA();
	processB();
#else
	/*
	 * Start the I/O request in parent process, then fork and try to wait
	 * for it to finish from the child process. (doesn't work, it will hang
	 * forever)
	 */
	processA();

	pidB = fork();
	if (pidB == -1)
	{
		fprintf(stderr, "fork() failed\n");
		exit(1);
	}
	if (pidB != 0)
	{
		/* parent */
		waitpid(pidB, NULL, 0);
	}
	else
	{
		/* child */
		processB();
	}
#endif
}



Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-05-29 Thread Claudio Freire
On Thu, May 29, 2014 at 5:23 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 On 05/29/2014 04:12 PM, John Lumby wrote:

 Thanks for looking at it!

 Date: Thu, 29 May 2014 00:19:33 +0300
 From: hlinnakan...@vmware.com
 To: johnlu...@hotmail.com; pgsql-hackers@postgresql.org
 CC: klaussfre...@gmail.com
 Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO -
 proposal and patch

 On 05/28/2014 11:52 PM, John Lumby wrote:



 The patch seems to assume that you can put the aiocb struct in shared
 memory, initiate an asynchronous I/O request from one process, and wait
 for its completion from another process. I'm pretty surprised if that
 works on any platform.


 It works on linux.  Actually this ability allows the asyncio
 implementation to reduce complexity in one respect
 (yes I know it looks complex enough): it makes waiting for completion
 of an in-progress IO simpler than for the existing synchronous IO case,
 since librt takes care of the waiting.
 Specifically, no need for extra wait-for-io control blocks
 such as in bufmgr's WaitIO().


 [checks]. No, it doesn't work. See attached test program.

 It kinda seems to work sometimes, because of the way it's implemented in
 glibc. The aiocb struct has a field for the result value and errno, and when
 the I/O is finished, the worker thread fills them in. aio_error() and
 aio_return() just return the values of those fields, so calling aio_error()
 or aio_return() do in fact happen to work from a different process.
 aio_suspend(), however, is implemented by sleeping on a process-local mutex,
 which does not work from a different process.

 Even if it worked on Linux today, it would be a bad idea to rely on it from
 a portability point of view. No, the only sane way to make this work is that
 the process that initiates an I/O request is responsible for completing it.
 If another process needs to wait for an async I/O to complete, we must use
 some other means to do the waiting. Like the io_in_progress_lock that we
 already have, for the same purpose.

But calls to it are given a 10us timeout, effectively turning the thing
into polling mode.

What is odd, now that I read it carefully, is how 256 waits of 10us
amount to anything. That's just 2.56ms, not enough IIUC for any normal
I/O to complete.




Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-05-29 Thread Heikki Linnakangas

On 05/29/2014 11:34 PM, Claudio Freire wrote:

On Thu, May 29, 2014 at 5:23 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:

On 05/29/2014 04:12 PM, John Lumby wrote:



On 05/28/2014 11:52 PM, John Lumby wrote:

The patch seems to assume that you can put the aiocb struct in shared
memory, initiate an asynchronous I/O request from one process, and wait
for its completion from another process. I'm pretty surprised if that
works on any platform.


It works on linux.  Actually this ability allows the asyncio
implementation to reduce complexity in one respect
(yes I know it looks complex enough): it makes waiting for completion
of an in-progress IO simpler than for the existing synchronous IO case,
since librt takes care of the waiting.
Specifically, no need for extra wait-for-io control blocks
such as in bufmgr's WaitIO().


[checks]. No, it doesn't work. See attached test program.

It kinda seems to work sometimes, because of the way it's implemented in
glibc. The aiocb struct has a field for the result value and errno, and when
the I/O is finished, the worker thread fills them in. aio_error() and
aio_return() just return the values of those fields, so calling aio_error()
or aio_return() do in fact happen to work from a different process.
aio_suspend(), however, is implemented by sleeping on a process-local mutex,
which does not work from a different process.

Even if it worked on Linux today, it would be a bad idea to rely on it from
a portability point of view. No, the only sane way to make this work is that
the process that initiates an I/O request is responsible for completing it.
If another process needs to wait for an async I/O to complete, we must use
some other means to do the waiting. Like the io_in_progress_lock that we
already have, for the same purpose.


But calls to it are timeouted by 10us, effectively turning the thing
into polling mode.


We don't want polling... And even if we did, calling aio_suspend() in a 
way that's known to be broken, in a loop, is a pretty crappy way of polling.



Which is odd now that I read carefully, is how come 256 waits of 10us
amounts to anything? That's just 2.5ms, not enough IIUC for any normal
I/O to complete


Wild guess: the buffer you're reading is already in OS cache.

- Heikki




Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-05-29 Thread Claudio Freire
On Thu, May 29, 2014 at 5:39 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 On 05/29/2014 11:34 PM, Claudio Freire wrote:

 On Thu, May 29, 2014 at 5:23 PM, Heikki Linnakangas
 hlinnakan...@vmware.com wrote:

 On 05/29/2014 04:12 PM, John Lumby wrote:


 On 05/28/2014 11:52 PM, John Lumby wrote:

 The patch seems to assume that you can put the aiocb struct in shared
 memory, initiate an asynchronous I/O request from one process, and wait
 for its completion from another process. I'm pretty surprised if that
 works on any platform.


 It works on linux.  Actually this ability allows the asyncio
 implementation to reduce complexity in one respect
 (yes I know it looks complex enough): it makes waiting for completion
 of an in-progress IO simpler than for the existing synchronous IO case,
 since librt takes care of the waiting.
 Specifically, no need for extra wait-for-io control blocks
 such as in bufmgr's WaitIO().


 [checks]. No, it doesn't work. See attached test program.

 It kinda seems to work sometimes, because of the way it's implemented in
 glibc. The aiocb struct has a field for the result value and errno, and
 when
 the I/O is finished, the worker thread fills them in. aio_error() and
 aio_return() just return the values of those fields, so calling
 aio_error()
 or aio_return() do in fact happen to work from a different process.
 aio_suspend(), however, is implemented by sleeping on a process-local
 mutex,
 which does not work from a different process.

 Even if it worked on Linux today, it would be a bad idea to rely on it
 from
 a portability point of view. No, the only sane way to make this work is
 that
 the process that initiates an I/O request is responsible for completing
 it.
 If another process needs to wait for an async I/O to complete, we must
 use
 some other means to do the waiting. Like the io_in_progress_lock that we
 already have, for the same purpose.


 But calls to it are timeouted by 10us, effectively turning the thing
 into polling mode.


 We don't want polling... And even if we did, calling aio_suspend() in a way
 that's known to be broken, in a loop, is a pretty crappy way of polling.

Agreed. Just saying that that's why it works.

The PID of the process responsible for the aio should be there
somewhere, and other processes should explicitly synchronize (or
initiate their own I/O and let the OS merge them - which also works).



 Which is odd now that I read carefully, is how come 256 waits of 10us
 amounts to anything? That's just 2.5ms, not enough IIUC for any normal
 I/O to complete


 Wild guess: the buffer you're reading is already in OS cache.

My wild guess as well.




Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-05-29 Thread Tom Lane
Claudio Freire klaussfre...@gmail.com writes:
 Didn't fix that, but the attached patch does fix regression tests when
 scanning over index types other than btree (was invoking elog when the
 index am didn't have ampeeknexttuple)

ampeeknexttuple?  That's a bit scary.  It would certainly be unsafe
for non-MVCC snapshots (read about vacuum vs indexscan interlocks in
nbtree/README).

regards, tom lane




Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-05-29 Thread Claudio Freire
On Thu, May 29, 2014 at 6:19 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Claudio Freire klaussfre...@gmail.com writes:
 Didn't fix that, but the attached patch does fix regression tests when
 scanning over index types other than btree (was invoking elog when the
 index am didn't have ampeeknexttuple)

 ampeeknexttuple?  That's a bit scary.  It would certainly be unsafe
 for non-MVCC snapshots (read about vacuum vs indexscan interlocks in
 nbtree/README).


It's not really the tuple, just the tid




Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-05-29 Thread Claudio Freire
On Thu, May 29, 2014 at 6:43 PM, Claudio Freire klaussfre...@gmail.com wrote:
 On Thu, May 29, 2014 at 6:19 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Claudio Freire klaussfre...@gmail.com writes:
 Didn't fix that, but the attached patch does fix regression tests when
 scanning over index types other than btree (was invoking elog when the
 index am didn't have ampeeknexttuple)

 ampeeknexttuple?  That's a bit scary.  It would certainly be unsafe
 for non-MVCC snapshots (read about vacuum vs indexscan interlocks in
 nbtree/README).


 It's not really the tuple, just the tid

And, furthermore, it's used only to do prefetching, so even if the tid
was invalid when the tuple needs to be accessed, it wouldn't matter,
because the indexam wouldn't use the result of ampeeknexttuple to do
anything at that time.

Though, your comment does illustrate the need to document that on
ampeeknexttuple, for future users.




Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-05-29 Thread John Lumby


 Date: Thu, 29 May 2014 18:00:28 -0300
 Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal 
 and patch
 From: klaussfre...@gmail.com
 To: hlinnakan...@vmware.com
 CC: johnlu...@hotmail.com; pgsql-hackers@postgresql.org
 
 On Thu, May 29, 2014 at 5:39 PM, Heikki Linnakangas
 hlinnakan...@vmware.com wrote:
  On 05/29/2014 11:34 PM, Claudio Freire wrote:
 
  On Thu, May 29, 2014 at 5:23 PM, Heikki Linnakangas
  hlinnakan...@vmware.com wrote:
 
  On 05/29/2014 04:12 PM, John Lumby wrote:
 
 
  On 05/28/2014 11:52 PM, John Lumby wrote:
 
  The patch seems to assume that you can put the aiocb struct in shared
  memory, initiate an asynchronous I/O request from one process, and wait
  for its completion from another process. I'm pretty surprised if that
  works on any platform.
 
 
  It works on linux.  Actually this ability allows the asyncio
  implementation to reduce complexity in one respect
  (yes I know it looks complex enough): it makes waiting for completion
  of an in-progress IO simpler than for the existing synchronous IO case,
  since librt takes care of the waiting.
  Specifically, no need for extra wait-for-io control blocks
  such as in bufmgr's WaitIO()
 
 
  [checks]. No, it doesn't work. See attached test program.

Thanks for checking, and thanks for coming up with that test program.
However,  yes,  it really does work  --  always  (on linux).
Your test program is doing things in the wrong order -
it calls aio_suspend *before* aio_error.
However,  the rule is,  call aio_suspend *after* aio_error
and *only* if aio_error returns EINPROGRESS.

See the code changes to fd.c function FileCompleteaio()
to see how we have done it.   And I am attaching a corrected version
of your test program, which runs just fine.


 
  It kinda seems to work sometimes, because of the way it's implemented in
  glibc. The aiocb struct has a field for the result value and errno, and
  when
  the I/O is finished, the worker thread fills them in. aio_error() and
  aio_return() just return the values of those fields, so calling
  aio_error()
  or aio_return() do in fact happen to work from a different process.
  aio_suspend(), however, is implemented by sleeping on a process-local
  mutex,
  which does not work from a different process.
 
  Even if it worked on Linux today, it would be a bad idea to rely on it
  from
  a portability point of view. No, the only sane way to make this work is
  that
  the process that initiates an I/O request is responsible for completing
  it.
  If another process needs to wait for an async I/O to complete, we must
  use
  some other means to do the waiting. Like the io_in_progress_lock that we
  already have, for the same purpose.
 
 
  But calls to it are timeouted by 10us, effectively turning the thing
  into polling mode.
 
 
  We don't want polling... And even if we did, calling aio_suspend() in a way
  that's known to be broken, in a loop, is a pretty crappy way of polling.

Well,  as mentioned earlier,  it is not broken. Whether it is efficient I 
am not sure.
I have looked at the mutex in aio_suspend that you mentioned and I am not
quite convinced that,  if caller is not the original aio_read process,
it renders the suspend() into an instant timeout.  I will see if I can 
verify that.
Where are you (Claudio) seeing 10us?

 
 
 Didn't fix that, but the attached patch does fix regression tests when
 scanning over index types other than btree (was invoking elog when the
 index am didn't have ampeeknexttuple)
  /*
 * Test program to test if POSIX aio functions work across processes
 */

#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <aio.h>
#include <errno.h>

char *shmem;

void
processA(void)
{
    int fd;
    struct aiocb *aiocbp = (struct aiocb *) shmem;
    char *buf = shmem + sizeof(struct aiocb);

    fd = open("aio-shmem-test-file", O_CREAT | O_WRONLY | O_SYNC, S_IRWXU);
    if (fd == -1)
    {
        fprintf(stderr, "open() failed\n");
        exit(1);
    }
    printf("processA starting AIO\n");

    strcpy(buf, "foobar");

    memset(aiocbp, 0, sizeof(struct aiocb));
    aiocbp->aio_fildes = fd;
    aiocbp->aio_offset = 0;
    aiocbp->aio_buf = buf;
    aiocbp->aio_nbytes = strlen(buf);
    aiocbp->aio_reqprio = 0;
    aiocbp->aio_sigevent.sigev_notify = SIGEV_NONE;

    if (aio_write(aiocbp) != 0)
    {
        fprintf(stderr, "aio_write() failed\n");
        exit(1);
    }
}

void
processB(void)
{
    struct aiocb *aiocbp = (struct aiocb *) shmem;
    const struct aiocb * const pl[1] = { aiocbp };
    int rv;
    int returnCode;
    struct timespec my_timeout = { 0 , 1 };
    int max_polls;

printf(waiting for the write to finish in process B\n

Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-05-29 Thread Tom Lane
Claudio Freire klaussfre...@gmail.com writes:
 On Thu, May 29, 2014 at 6:43 PM, Claudio Freire klaussfre...@gmail.com 
 wrote:
 On Thu, May 29, 2014 at 6:19 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 ampeeknexttuple?  That's a bit scary.  It would certainly be unsafe
 for non-MVCC snapshots (read about vacuum vs indexscan interlocks in
 nbtree/README).

 It's not really the tuple, just the tid

 And, furthermore, it's used only to do prefetching, so even if the tid
 was invalid when the tuple needs to be accessed, it wouldn't matter,
 because the indexam wouldn't use the result of ampeeknexttuple to do
 anything at that time.

Nonetheless, getting the next tid out of the index may involve stepping
to the next index page, at which point you've lost your interlock
guaranteeing that the *previous* tid will still mean something by the time
you arrive at its heap page.  I presume that the ampeeknexttuple call is
issued before trying to visit the heap (otherwise you're not actually
getting much I/O overlap), so I think there's a real risk here.

Having said that, it's probably OK as long as this mode is only invoked
for user queries (with MVCC snapshots) and not for system indexscans.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-05-29 Thread John Lumby


 From: t...@sss.pgh.pa.us
 To: klaussfre...@gmail.com
 CC: hlinnakan...@vmware.com; johnlu...@hotmail.com; 
 pgsql-hackers@postgresql.org
 Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal 
 and patch
 Date: Thu, 29 May 2014 17:56:57 -0400
 
 Claudio Freire klaussfre...@gmail.com writes:
  On Thu, May 29, 2014 at 6:43 PM, Claudio Freire klaussfre...@gmail.com 
  wrote:
  On Thu, May 29, 2014 at 6:19 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  ampeeknexttuple?  That's a bit scary.  It would certainly be unsafe
  for non-MVCC snapshots (read about vacuum vs indexscan interlocks in
  nbtree/README).
 
  It's not really the tuple, just the tid
 
  And, furthermore, it's used only to do prefetching, so even if the tid
  was invalid when the tuple needs to be accessed, it wouldn't matter,
  because the indexam wouldn't use the result of ampeeknexttuple to do
  anything at that time.
 
 Nonetheless, getting the next tid out of the index may involve stepping
 to the next index page, at which point you've lost your interlock

I think we are ok, as peeknexttuple (yes, bad name, sorry, can change it ...)
never advances to another page:

 *	btpeeknexttuple() -- peek at the next tuple different from any blocknum in pfch_list
 *		without reading a new index page
 *		and without causing any side-effects such as altering values in control blocks
 *		if found, store blocknum in next element of pfch_list


 guaranteeing that the *previous* tid will still mean something by the time
 you arrive at its heap page.  I presume that the ampeeknexttuple call is
 issued before trying to visit the heap (otherwise you're not actually
 getting much I/O overlap), so I think there's a real risk here.
 
 Having said that, it's probably OK as long as this mode is only invoked
 for user queries (with MVCC snapshots) and not for system indexscans.
 
   regards, tom lane
  

Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-05-29 Thread Claudio Freire
On Thu, May 29, 2014 at 6:56 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Claudio Freire klaussfre...@gmail.com writes:
 On Thu, May 29, 2014 at 6:43 PM, Claudio Freire klaussfre...@gmail.com 
 wrote:
 On Thu, May 29, 2014 at 6:19 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 ampeeknexttuple?  That's a bit scary.  It would certainly be unsafe
 for non-MVCC snapshots (read about vacuum vs indexscan interlocks in
 nbtree/README).

 It's not really the tuple, just the tid

 And, furthermore, it's used only to do prefetching, so even if the tid
 was invalid when the tuple needs to be accessed, it wouldn't matter,
 because the indexam wouldn't use the result of ampeeknexttuple to do
 anything at that time.

 Nonetheless, getting the next tid out of the index may involve stepping
 to the next index page, at which point you've lost your interlock
 guaranteeing that the *previous* tid will still mean something by the time

No, no... that's exactly why a new regproc is needed, because for
prefetching, we need to get the next tid that satisfies some
conditions *without* walking the index.

This, in nbtree, only looks through the tid array to find the suitable
tid, or just return false if the array is exhausted.

 Having said that, it's probably OK as long as this mode is only invoked
 for user queries (with MVCC snapshots) and not for system indexscans.

I think system index scans will also invoke this. There's no rule
excluding that possibility.




Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-05-29 Thread Claudio Freire
On Thu, May 29, 2014 at 7:11 PM, John Lumby johnlu...@hotmail.com wrote:
 Nonetheless, getting the next tid out of the index may involve stepping
 to the next index page, at which point you've lost your interlock

 I think we are ok as peeknexttuple (yes  bad name,  sorry,  can change it
 ...
 never advances to another page  :

  *btpeeknexttuple() -- peek at the next tuple different from any
 blocknum in pfch_list
  *   without reading a new index page
  *   and without causing any side-effects such as
 altering values in control blocks
  *   if found, store blocknum in next element of pfch_list


Yeah, I was getting to that conclusion myself too.

We could call it amprefetchnextheap, since it does just prefetch, and
is good for nothing *but* prefetch.




Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-05-29 Thread Andres Freund
Hi,

On 2014-05-29 17:53:51 -0400, John Lumby wrote:
 to see how we have done it.   And I am attaching corrected version
 of your test program which runs just fine.

It's perfectly fine to not be up to the coding style at this point, but
trying to adhere to it to some degree will make code review later less
painful...
* comments with **
* line length
* tabs vs spaces
* ...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-05-29 Thread John Lumby


 Date: Thu, 29 May 2014 18:00:28 -0300
 Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal 
 and patch
 From: klaussfre...@gmail.com
 To: hlinnakan...@vmware.com
 CC: johnlu...@hotmail.com; pgsql-hackers@postgresql.org
 

 
  Even if it worked on Linux today, it would be a bad idea to rely on it
  from a portability point of view. No, the only sane way to make this work
  is that the process that initiates an I/O request is responsible for
  completing it.


I meant to add - it is really a significant benefit that a bkend
can wait on the aio of a different bkend's original prefetching aio_read.
Remember that we check completion only when the bkend decides it really
wants the block in a buffer, i.e. ReadBuffer and friends,
which might be a very long time after it had issued the prefetch request,
or even never (see below). We don't want other bkends which want that
block to have to wait for the originator to get around to reading it.
*Especially* since the originator may *never* read it if it quits its scan early,
leaving prefetched but unread blocks behind. (Which is also taken
care of in the patch.)


  If another process needs to wait for an async I/O to complete, we must
  use some other means to do the waiting. Like the io_in_progress_lock
  that we already have, for the same purpose.


  

Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-05-29 Thread Claudio Freire
On Thu, May 29, 2014 at 6:53 PM, John Lumby johnlu...@hotmail.com wrote:
 Well,  as mentioned earlier,  it is not broken. Whether it is efficient
 I am not sure.
 I have looked at the mutex in aio_suspend that you mentioned and I am not
 quite convinced that,  if caller is not the original aio_read process,
 it renders the suspend() into an instant timeout.  I will see if I can
 verify that.
 Where are you (Claudio) seeing 10us?


fd.c, in FileCompleteaio, sets timeout to:

my_timeout.tv_sec = 0; my_timeout.tv_nsec = 1;

Which is 10k ns, which is 10 us.

It loops 256 times at most, so it's polling 256 times with a 10 us
timeout. Sounds wasteful.

I'd:

1) If it's the same process, wait for the full timeout (no looping).
If you have to loop (EAGAIN or EINTR - which I just noticed you don't
check for), that's ok.

2) If it's not, just fall through, don't wait, issue the I/O. The
kernel will merge the requests.




Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-05-29 Thread Claudio Freire
On Thu, May 29, 2014 at 7:26 PM, Claudio Freire klaussfre...@gmail.com wrote:
 1) If it's the same process, wait for the full timeout (no looping).
 If you have to loop (EAGAIN or EINTR - which I just noticed you don't
 check for), that's ok.

Sorry, meant to say just looping on EINTR.

About the style guidelines, no, I just copy the style of surrounding
code usually.




Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-05-28 Thread Heikki Linnakangas

On 05/28/2014 11:52 PM, John Lumby wrote:


The patch is attached.
It is based on clone of today's 9.4dev source.
I have noticed that this source is
(not surprisingly) quite a moving target at present,
meaning that this patch becomes stale quite quickly.
So although this copy is fine for reviewing,
it may quite probably soon not be correct
for the current source tree.

As mentioned before,  if anyone wishes to try this feature out
on 9.3.4,   I will be making a patch for that soon
which I can supply on request.


Wow, that's a huge patch. I took a very brief look, focusing on the
basic design, ignoring the style & other minor things for now:


The patch seems to assume that you can put the aiocb struct in shared 
memory, initiate an asynchronous I/O request from one process, and wait 
for its completion from another process. I'm pretty surprised if that 
works on any platform.


How portable is POSIX aio nowadays? Googling around, it still seems that 
on Linux, it's implemented using threads. Does the thread-emulation 
implementation cause problems with the rest of the backend, which 
assumes that there is only a single thread? In any case, I think we'll 
want to encapsulate the AIO implementation behind some kind of an API, 
to allow other implementations to co-exist.


Benchmarks?
- Heikki




Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-05-28 Thread Peter Geoghegan
On Tue, May 27, 2014 at 3:17 PM, John Lumby johnlu...@hotmail.com wrote:
 Below I am pasting the README we have written for this new functionality
 which mentions some of the measurements, advantages (and disadvantages)
 and we welcome all and any comments on this.

I think that this is likely to be a useful area to work on, but I
wonder: can you suggest a useful test-case or benchmark to show the
advantages of the patch you posted? You mention a testcase already,
but it's a little short on details. I think it's always a good idea to
start with that when pursuing a performance feature.

Have you thought about things like specialized prefetching for nested
loop joins?

-- 
Peter Geoghegan




Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-05-28 Thread Claudio Freire
On Wed, May 28, 2014 at 6:51 PM, Peter Geoghegan p...@heroku.com wrote:
 Have you thought about things like specialized prefetching for nested
 loop joins?

Currently, such a thing would need some non-trivial changes to the
execution nodes, I believe.

For nestloop, correct me if I'm wrong, but index scan nodes don't have
visibility of the next tuple to be searched for.




Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-05-28 Thread Peter Geoghegan
On Wed, May 28, 2014 at 5:59 PM, Claudio Freire klaussfre...@gmail.com wrote:
 For nestloop, correct me if I'm wrong, but index scan nodes don't have
 visibility of the next tuple to be searched for.

Nested loop joins are considered a particularly compelling case for
prefetching, actually.

-- 
Peter Geoghegan

