[HACKERS] legitimacy of using PG_TRY , PG_CATCH , PG_END_TRY in C function

2017-10-22 Thread John Lumby
I have a C function (a trigger function) which uses the PG_TRY()
construct to handle certain ERROR conditions.
One example is where it is invoked as an INSTEAD OF INSERT trigger on a view.
It PG_TRYs an INSERT into the real base table, but this table may not yet
exist (it is a partitioned child of an inheritance parent).
If the error is ERRCODE_UNDEFINED_TABLE, then the CATCH issues
FlushErrorState() and returns to the caller, who CREATEs the table and
re-issues the insert.
All works perfectly (on every release of 9.x).

But in a different part of the same trigger function, there is a
PG_TRY, PG_CATCH of a CREATE INDEX, which sometimes hits the error
"relation already exists" (i.e. an index with the same name indexing a
different table).
The CATCH issues FlushErrorState() and returns to the caller, who ignores
the error.
All works, but not perfectly -- at COMMIT, the resource owner issues
relcache reference leak messages about relation scans not closed,
and also about a snapshot still active.  I guess that the CREATE has
switched resource owner and pushed a snapshot, but I did not
debug in detail.

Those are just examples, but my question is whether use of this
construct is legal in a C function
(always?  sometimes, e.g. only for DML but not DDL?  never?),
and if not, then how can one legally catch ERRORs like this in a C
function?
(Yes, I realise I could write selects against pg_class to avoid these two
particular cases rather than the try-and-fail approach,
but I would like to understand the general rules where avoidance may be
much harder.)
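
For reference, a minimal sketch of the pattern I am describing, assuming the
INSERT is issued through SPI (already connected); the helper name and error
handling details are illustrative only, not my actual trigger code:

#include "postgres.h"
#include "executor/spi.h"

/*
 * Run an INSERT via SPI inside PG_TRY; swallow only ERRCODE_UNDEFINED_TABLE
 * (the caller will CREATE the table and retry) and re-throw anything else.
 * Illustrative sketch only; assumes SPI_connect() was already done.
 */
static bool
try_insert_or_flag_missing_table(const char *sql)
{
    MemoryContext oldcxt = CurrentMemoryContext;
    bool        ok = true;

    PG_TRY();
    {
        if (SPI_execute(sql, false, 0) < 0)
            elog(ERROR, "SPI_execute failed");
    }
    PG_CATCH();
    {
        ErrorData  *edata;

        /* restore the caller's memory context before copying the error data */
        MemoryContextSwitchTo(oldcxt);
        edata = CopyErrorData();
        FlushErrorState();

        if (edata->sqlerrcode != ERRCODE_UNDEFINED_TABLE)
            ReThrowError(edata);    /* not an error we handle here */

        FreeErrorData(edata);
        ok = false;                 /* caller creates the table and retries */
    }
    PG_END_TRY();

    return ok;
}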

Cheers,  John




Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-06-25 Thread John Lumby
My cut'n'pasting failed me at one point; corrected below.


> discussion about what is the difference between a synchronous read
> versus an asynchronous read as far as the non-originator waiting on it is
> concerned.
>
> I thought a bit more about this.  There are currently two differences,
> one of which can easily be changed and one not so easy.
>
> 1)  The current code, even with sigevent, still makes the non-originator
>     waiter call aio_error on the originator's aiocb to get the completion
>     code.  For the sigevent variation, this is easily changed to have the
>     originator always call aio_error (from its CHECK_FOR_INTERRUPTS or
>     from FileCompleteaio) and store that in the BAiocb.
>     My idea of why not to do that was that, by having the non-originator
>     check the aiocb, this would allow the waiter to proceed sooner.
>     But for a different reason it actually doesn't.  (The non-originator
>     must still wait for the LWlock release.)
>
>  2)  Buffer pinning and returning the BufferAiocb to the free list
>      With synchronous IO, each backend that calls ReadBuffer must pin the
>      buffer early in the process.
>      With asynchronous IO, initially only the originator gets the pin
>      (and that is during PrefetchBuffer, not ReadBuffer).
>      When the aio completes and some backend checks that completion,
>      then that backend has various responsibilities:
>
>      .  pin the buffer if it did not already have one (from prefetch)
>      .  if it was the last such backend to make that check
>         (amongst the cohort waiting on it)
>         then XX pin the buffer if it did not already have one (from prefetch)

then return the BufferAiocb to the free list
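
To make those two responsibilities concrete, here is a self-contained sketch;
all type, field and helper names below are illustrative assumptions, not the
patch's actual identifiers:

#include <stdatomic.h>
#include <stdbool.h>

typedef struct BufferAiocbSketch
{
    int         buf_id;         /* which shared buffer the aio targets */
    atomic_int  waiter_count;   /* backends still to check this completion */
} BufferAiocbSketch;

/* assumed stand-ins for the real bufmgr pin and free-list calls */
extern void pin_buffer(int buf_id);
extern void return_to_free_list(BufferAiocbSketch *baiocb);

/* called by each backend (originator or not) that checks the completion */
static void
on_aio_completion(BufferAiocbSketch *baiocb, bool pinned_at_prefetch)
{
    /* every backend consuming the completion ends up holding a pin */
    if (!pinned_at_prefetch)
        pin_buffer(baiocb->buf_id);

    /* the last member of the waiting cohort recycles the control block */
    if (atomic_fetch_sub(&baiocb->waiter_count, 1) == 1)
        return_to_free_list(baiocb);
}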


  



Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-06-24 Thread John Lumby
Thanks Heikki,


> Date: Tue, 24 Jun 2014 17:02:38 +0300
> From: hlinnakan...@vmware.com
> To: johnlu...@hotmail.com; st...@mit.edu
> CC: klaussfre...@gmail.com; pgsql-hackers@postgresql.org
> Subject: Re: Extended Prefetching using Asynchronous IO - proposal and patch
>
> On 06/24/2014 04:29 PM, John Lumby wrote:
>>> On Mon, Jun 23, 2014 at 2:43 PM, John Lumby  wrote:
>>>> It is when some *other* backend gets there first with the ReadBuffer that
>>>> things are a bit trickier. The current version of the patch did polling 
>>>> for that case
>>>> but that drew criticism, and so an imminent new version of the patch
>>>> uses the sigevent mechanism. And there are other ways still.
>>>
>>> I'm a bit puzzled by this though. Postgres *already* has code for this
>>> case. When you call ReadBuffer you set the bits on the buffer
>>
>> Good question. Let me explain.
>> Yes, postgresql has code for the case of a backend is inside a synchronous
>> read() or write(), performed from a ReadBuffer(), and some other backend
>> wants that buffer. asynchronous aio is initiated not from ReadBuffer
>> but from PrefetchBuffer, and performs its aio_read into an allocated, pinned,
>> postgresql buffer. This is entirely different from the synchronous io case.
>> Why? Because the issuer of the aio_read (the "originator") is unaware
>> of this buffer pinned on its behalf, and is then free to do any other
>> reading or writing it wishes, such as more prefetching or any other 
>> operation.
>> And furthermore, it may *never* issue a ReadBuffer for the block which it
>> prefetched.
>
> I still don't see the difference. Once an asynchronous read is initiated
> on the buffer, it can't be used for anything else until the read has
> finished. This is exactly the same situation as with a synchronous read:
> after read() is called, the buffer can't be used for anything else until
> the call finishes.

Ah, now I see what you and Greg are asking.  See my next reply embedded below.

>
> In particular, consider the situation from another backend's point of
> view. Looking from another backend (i.e. one that didn't initiate the
> read), there's no difference between a synchronous and asynchronous
> read. So why do we need a different IPC mechanism for the synchronous
> and asynchronous cases? We don't.
>
> I understand that *within the backend*, you need to somehow track the
> I/O, and you'll need to treat synchronous and asynchronous I/Os
> differently. But that's all within the same backend, and doesn't need to
> involve the flags or locks in shared memory at all. The inter-process
> communication doesn't need any changes.
>
>>> The problem with using the Buffers I/O in progress bit is that the I/O
>>> might complete while the other backend is busy doing stuff. As long as
>>> you can handle the I/O completion promptly -- either in callback or
>>> thread or signal handler then that wouldn't matter. But I'm not clear
>>> that any of those will work reliably.
>>
>> They both work reliably, but the criticism was that backend B polling
>> an aiocb of an aio issued by backend A is not documented as
>> being supported (although it happens to work), hence the proposed
>> change to use sigevent.
>
> You didn't understand what Greg meant. You need to handle the completion
> of the I/O in the same process that initiated it, by clearing the
> in-progress bit of the buffer and releasing the I/O in-progress lwlock
> on it. And you need to do that very quickly after the I/O has finished,
> because there might be another backend waiting for the buffer and you
> don't want him to wait longer than necessary.

I think I understand the question now.  I didn't spell out the details
earlier.  Let me explain a little more.
With this patch, when a read is issued, it is either a synchronous IO
(as before), or an asynchronous aio_read (new, represented by
both BM_IO_IN_PROGRESS *and* BM_AIO_IN_PROGRESS).
The way other backends wait on a synchronous IO in progress is unchanged.
But if BM_AIO_IN_PROGRESS is set, then *any* backend which requests
ReadBuffer on this block (including the originator) follows a new path
through BufCheckAsync() which, depending on various flags and context,
sends the backend down to FileCompleteaio to check and maybe wait.
So *all* backends that are waiting for a BM_AIO_IN_PROGRESS buffer
will wait in that way.
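
To illustrate the dispatch just described, a self-contained sketch; only the
two flag names come from the patch, the types and helper names are
illustrative stand-ins:

#include <stdint.h>

#define BM_IO_IN_PROGRESS   (1u << 0)
#define BM_AIO_IN_PROGRESS  (1u << 1)

typedef struct BufSketch
{
    uint32_t    flags;
} BufSketch;

/* assumed stand-ins for the patch's BufCheckAsync()/FileCompleteaio() path
 * and for the existing synchronous WaitIO() */
extern void check_or_wait_for_aio(BufSketch *buf);
extern void wait_for_sync_io(BufSketch *buf);

static void
wait_if_io_in_progress(BufSketch *buf)
{
    if (buf->flags & BM_AIO_IN_PROGRESS)
    {
        /* all backends, originator included, take the new aio path */
        check_or_wait_for_aio(buf);
    }
    else if (buf->flags & BM_IO_IN_PROGRESS)
    {
        /* unchanged: wait on the synchronous IO as before */
        wait_for_sync_io(buf);
    }
}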
  
>
> The question is, if you receive the notification of the I/O completion
> using a signal or a thread, is it safe to release the lwlock from t

Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-06-24 Thread John Lumby



> From: st...@mit.edu
> Date: Mon, 23 Jun 2014 16:04:50 -0700
> Subject: Re: Extended Prefetching using Asynchronous IO - proposal and patch
> To: johnlu...@hotmail.com
> CC: klaussfre...@gmail.com; hlinnakan...@vmware.com; 
> pgsql-hackers@postgresql.org
>
> On Mon, Jun 23, 2014 at 2:43 PM, John Lumby  wrote:
>> It is when some *other* backend gets there first with the ReadBuffer that
>> things are a bit trickier. The current version of the patch did polling for 
>> that case
>> but that drew criticism, and so an imminent new version of the patch
>> uses the sigevent mechanism. And there are other ways still.
>
> I'm a bit puzzled by this though. Postgres *already* has code for this
> case. When you call ReadBuffer you set the bits on the buffer

Good question.  Let me explain.
Yes, postgresql has code for the case where a backend is inside a synchronous
read() or write(), performed from a ReadBuffer(), and some other backend
wants that buffer.  Asynchronous aio is initiated not from ReadBuffer
but from PrefetchBuffer, and performs its aio_read into an allocated, pinned,
postgresql buffer.  This is entirely different from the synchronous io case.
Why?  Because the issuer of the aio_read (the "originator") is unaware
of this buffer pinned on its behalf, and is then free to do any other
reading or writing it wishes, such as more prefetching or any other
operation.
And furthermore, it may *never* issue a ReadBuffer for the block which it
prefetched.

Therefore, asynchronous IO is different from synchronous IO, and
a new bit, BM_AIO_IN_PROGRESS, in the buf_header is required to
track this aio operation until completion.

I would encourage you to read the new postgresql-prefetching-asyncio.README
in the patch file, where this is explained in greater detail.

> indicating I/O is in progress. If another backend does ReadBuffer for
> the same block they'll get the same buffer and then wait until the
> first backend's I/O completes. ReadBuffer goes through some hoops to
> handle this (and all the corner cases such as the other backend's I/O
> completing and the buffer being reused for another block before the
> first backend reawakens). It would be a shame to reinvent the wheel.

No re-invention!   Actually some effort has been made to use the
existing functions in bufmgr.c as much as possible rather than
rewriting them.

>
> The problem with using the Buffers I/O in progress bit is that the I/O
> might complete while the other backend is busy doing stuff. As long as
> you can handle the I/O completion promptly -- either in callback or
> thread or signal handler then that wouldn't matter. But I'm not clear
> that any of those will work reliably.

They both work reliably,  but the criticism was that backend B polling 
an aiocb of an aio issued by backend A is not documented as 
being supported  (although it happens to work),  hence the proposed
change to use sigevent.
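
For illustration, a minimal sketch of what the sigevent variation looks like
at the point the originator issues the aio_read; the signal number and field
settings here are illustrative assumptions, not the patch's actual choices:

#include <aio.h>
#include <signal.h>
#include <string.h>

/*
 * Start an asynchronous read and ask for a signal on completion, so that
 * no other process ever needs to poll this aiocb.  Sketch only.
 */
static int
start_aio_read_with_signal(struct aiocb *cb, int fd, void *buf,
                           size_t len, off_t off)
{
    memset(cb, 0, sizeof(*cb));
    cb->aio_fildes = fd;
    cb->aio_buf    = buf;
    cb->aio_nbytes = len;
    cb->aio_offset = off;

    /* deliver SIGUSR2 (say) to the originating process when the read completes */
    cb->aio_sigevent.sigev_notify          = SIGEV_SIGNAL;
    cb->aio_sigevent.sigev_signo           = SIGUSR2;
    cb->aio_sigevent.sigev_value.sival_ptr = cb;   /* handler can find the aiocb */

    return aio_read(cb);
}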

By the way, on the "will it actually work though?" question which several folks
have raised, I should mention that this patch has been in semi-production
use for almost 2 years now, in different stages of completion, on all postgresql
releases from 9.1.4 to 9.5 devel.  I would guess it has had around
500 hours of operation by now.  I'm sure there are bugs still to be
found but I am confident it is fundamentally sound.
 
>
> --
> greg
  



Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-06-23 Thread John Lumby



> Date: Thu, 19 Jun 2014 15:43:44 -0300
> Subject: Re: Extended Prefetching using Asynchronous IO - proposal and patch
> From: klaussfre...@gmail.com
> To: st...@mit.edu
> CC: hlinnakan...@vmware.com; johnlu...@hotmail.com; 
> pgsql-hackers@postgresql.org
>
> On Thu, Jun 19, 2014 at 2:49 PM, Greg Stark  wrote:
>> I don't think the threaded implementation on Linux is the one to use
>> though.  [... ] The overhead of thread communication
>> will completely outweigh any advantage over posix_fadvise's partial
>> win.
>
> What overhead?
>
> The only communication is through a "done" bit and associated
> synchronization structure when *and only when* you want to wait on it.
>

Threads do cost some extra CPU, but provided the system has some
spare CPU capacity, performance improves because of reduced IO wait.
I quoted a measured improvement of 17% - 18% in the README,
along with some more explanation of when the async IO gives an improvement.

> Furthermore, posix_fadvise is braindead on this use case, been there,
> done that. What you win with threads is a better postgres-kernel
> interaction, even if you loose some CPU performance it's gonna beat
> posix_fadvise by a large margin.
>
> [...]
>
>> When I investigated this I found the buffer manager's I/O bits seemed
>> to already be able to represent the state we needed (i/o initiated on
>> this buffer but not completed). The problem was in ensuring that a
>> backend would process the i/o completion promptly when it might be in
>> the midst of handling other tasks and might even get an elog() stack
>> unwinding. The interface that actually fits Postgres best might be the
>> threaded interface (orthogonal to the threaded implementation
>> question) which is you give aio a callback which gets called on a
>> separate thread when the i/o completes. The alternative is you give
>> aio a list of operation control blocks and it tells you the state of
>> all the i/o operations. But it's not clear to me how you arrange to do
>> that regularly, promptly, and reliably.
>
> Indeed we've been musing about using librt's support of completion
> callbacks for that.

For the most common case, where a backend issues a PrefetchBuffer
and then that *same* backend issues ReadBuffer, the posix aio works
ideally, since there is no need for any callback or completion signal;
we simply check "is it complete?" during the ReadBuffer.

It is when some *other* backend gets there first with the ReadBuffer that
things are a bit trickier.  The current version of the patch did polling for
that case, but that drew criticism, and so an imminent new version of the patch
uses the sigevent mechanism.  And there are other ways still.

In an earlier posting I reported that, in my benchmark,
99.8% of [FileCompleteaio] calls are from the originator and only < 0.2% are
not.  So, from a performance perspective, only the common case really matters.
  



Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-06-20 Thread John Lumby
Thanks Fujii, that is a bug -- an #ifdef USE_PREFETCH is missing in
heapam.c (maybe several).
I will fix it in the next patch version.

I also appreciate it is not easy to review the patch.
There are really 4 (or maybe 5) parts:

 .   async io (librt functions)
 .   buffer management (allocating, locking and pinning etc)
 .   scanners making prefetch calls
 .   statistics

     and the autoconf input program

I will see what I can do.  Maybe putting an indicator against each modified
file?

I am currently working on two things:
  .   an alternative way for the non-originator of an aio_read to wait on
      completion (LWlock instead of polling the aiocb).
      This was talked about in several earlier posts and Claudio is also
      working on something there.
  .   packaging up my benchmark

Cheers,  John


> Date: Fri, 20 Jun 2014 04:21:19 +0900
> Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal 
> and patch
> From: masao.fu...@gmail.com
> To: johnlu...@hotmail.com
> CC: pgsql-hackers@postgresql.org; klaussfre...@gmail.com
>
> On Mon, Jun 9, 2014 at 11:12 AM, johnlumby  wrote:
>> updated version of patch compatible with git head of 140608,
>> (adjusted proc oid and a couple of minor fixes)
>
> Compilation of patched version on MacOS failed. The error messages were
>
> gcc -O0 -Wall -Wmissing-prototypes -Wpointer-arith
> -Wdeclaration-after-statement -Wendif-labels
> -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
> -fwrapv -g -I../../../../src/include -c -o heapam.o heapam.c
> heapam.c: In function 'heap_unread_add':
> heapam.c:362: error: 'struct HeapScanDescData' has no member named
> 'rs_Unread_Pfetched_next'
> heapam.c:363: error: 'struct HeapScanDescData' has no member named
> 'rs_Unread_Pfetched_count'
> heapam.c:369: error: 'struct HeapScanDescData' has no member named
> 'rs_Unread_Pfetched_count'
> heapam.c:375: error: 'struct HeapScanDescData' has no member named
> 'rs_Unread_Pfetched_base'
> heapam.c:381: error: 'struct HeapScanDescData' has no member named
> 'rs_Unread_Pfetched_base'
> heapam.c:387: error: 'struct HeapScanDescData' has no member named
> 'rs_Unread_Pfetched_next'
> heapam.c:405: error: 'struct HeapScanDescData' has no member named
> 'rs_Unread_Pfetched_count'
> heapam.c: In function 'heap_unread_subtract':
> heapam.c:419: error: 'struct HeapScanDescData' has no member named
> 'rs_Unread_Pfetched_next'
> heapam.c:425: error: 'struct HeapScanDescData' has no member named
> 'rs_Unread_Pfetched_count'
> heapam.c:434: error: 'struct HeapScanDescData' has no member named
> 'rs_Unread_Pfetched_base'
> heapam.c:442: error: 'struct HeapScanDescData' has no member named
> 'rs_Unread_Pfetched_base'
> heapam.c:452: error: 'struct HeapScanDescData' has no member named
> 'rs_Unread_Pfetched_next'
> heapam.c:453: error: 'struct HeapScanDescData' has no member named
> 'rs_Unread_Pfetched_next'
> heapam.c:454: error: 'struct HeapScanDescData' has no member named
> 'rs_Unread_Pfetched_next'
> heapam.c:456: error: 'struct HeapScanDescData' has no member named
> 'rs_Unread_Pfetched_count'
> heapam.c: In function 'heapgettup':
> heapam.c:944: error: 'struct HeapScanDescData' has no member named
> 'rs_pfchblock'
> heapam.c:716: warning: unused variable 'ix'
> heapam.c: In function 'heapgettup_pagemode':
> heapam.c:1243: error: 'struct HeapScanDescData' has no member named
> 'rs_pfchblock'
> heapam.c:1029: warning: unused variable 'ix'
> heapam.c: In function 'heap_endscan':
> heapam.c:1808: error: 'struct HeapScanDescData' has no member named
> 'rs_Unread_Pfetched_base'
> heapam.c:1809: error: 'struct HeapScanDescData' has no member named
> 'rs_Unread_Pfetched_base'
> make[4]: *** [heapam.o] Error 1
> make[3]: *** [heap-recursive] Error 2
> make[2]: *** [access-recursive] Error 2
> make[1]: *** [install-backend-recurse] Error 2
> make: *** [install-src-recurse] Error 2
>
>
> Huge patch is basically not easy to review. What about simplifying the patch
> by excluding non-core parts like the change of pg_stat_statements, so that
> reviewers can easily read the patch? We can add such non-core parts later.
>
> Regards,
>
> --
> Fujii Masao
  



Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-05-29 Thread John Lumby


> Date: Thu, 29 May 2014 18:00:28 -0300
> Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal 
> and patch
> From: klaussfre...@gmail.com
> To: hlinnakan...@vmware.com
> CC: johnlu...@hotmail.com; pgsql-hackers@postgresql.org
> 

> >>>
> >>> Even if it worked on Linux today, it would be a bad idea to rely on it
> >>> from
> >>> a portability point of view. No, the only sane way to make this work is
> >>> that
> >>> the process that initiates an I/O request is responsible for completing
> >>> it.


I meant to add - it is really a significant benefit that a bkend
can wait on the aio of a different bkend's original prefetching aio_read.
Remember that we check completion only when the bkend decides it really
wants the block in a buffer, i.e. ReadBuffer and friends,
which might be a very long time after it had issued the prefetch request,
or even never (see below).  We don't want other bkends which want that
block to have to wait for the originator to get around to reading it.
*Especially* since the originator may *never* read it if it quits its scan early,
leaving prefetched but unread blocks behind.  (Which is also taken
care of in the patch.)


> >>> If another process needs to wait for an async I/O to complete, we must
> >>> use
> >>> some other means to do the waiting. Like the io_in_progress_lock that we
> >>> already have, for the same purpose.


  

Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-05-29 Thread John Lumby


> From: t...@sss.pgh.pa.us
> To: klaussfre...@gmail.com
> CC: hlinnakan...@vmware.com; johnlu...@hotmail.com; 
> pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal 
> and patch
> Date: Thu, 29 May 2014 17:56:57 -0400
> 
> Claudio Freire  writes:
> > On Thu, May 29, 2014 at 6:43 PM, Claudio Freire  
> > wrote:
> >> On Thu, May 29, 2014 at 6:19 PM, Tom Lane  wrote:
> >>> "ampeeknexttuple"?  That's a bit scary.  It would certainly be unsafe
> >>> for non-MVCC snapshots (read about vacuum vs indexscan interlocks in
> >>> nbtree/README).
> 
> >> It's not really the tuple, just the tid
> 
> > And, furthermore, it's used only to do prefetching, so even if the tid
> > was invalid when the tuple needs to be accessed, it wouldn't matter,
> > because the indexam wouldn't use the result of ampeeknexttuple to do
> > anything at that time.
> 
> Nonetheless, getting the next tid out of the index may involve stepping
> to the next index page, at which point you've lost your interlock

I think we are ok, as peeknexttuple (yes, bad name, sorry, can change it ...)
never advances to another page:

 *  btpeeknexttuple() -- peek at the next tuple different from any blocknum
 *                       in pfch_list
 *      without reading a new index page
 *      and without causing any side-effects such as altering values in
 *      control blocks
 *      if found, store blocknum in next element of pfch_list



> guaranteeing that the *previous* tid will still mean something by the time
> you arrive at its heap page.  I presume that the ampeeknexttuple call is
> issued before trying to visit the heap (otherwise you're not actually
> getting much I/O overlap), so I think there's a real risk here.
> 
> Having said that, it's probably OK as long as this mode is only invoked
> for user queries (with MVCC snapshots) and not for system indexscans.
> 
>   regards, tom lane
  

Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-05-29 Thread John Lumby


> Date: Thu, 29 May 2014 18:00:28 -0300
> Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal 
> and patch
> From: klaussfre...@gmail.com
> To: hlinnakan...@vmware.com
> CC: johnlu...@hotmail.com; pgsql-hackers@postgresql.org
> 
> On Thu, May 29, 2014 at 5:39 PM, Heikki Linnakangas
>  wrote:
> > On 05/29/2014 11:34 PM, Claudio Freire wrote:
> >>
> >> On Thu, May 29, 2014 at 5:23 PM, Heikki Linnakangas
> >>  wrote:
> >>>
> >>> On 05/29/2014 04:12 PM, John Lumby wrote:
> >>>>
> >>>>
> >>>>> On 05/28/2014 11:52 PM, John Lumby wrote:
> >>>>>
> >>>>> The patch seems to assume that you can put the aiocb struct in shared
> >>>>> memory, initiate an asynchronous I/O request from one process, and wait
> >>>>> for its completion from another process. I'm pretty surprised if that
> >>>>> works on any platform.
> >>>>
> >>>>
> >>>> It works on linux.Actually this ability allows the asyncio
> >>>> implementation to
> >>>> reduce complexity in one respect (yes I know it looks complex enough) :
> >>>> it makes waiting for completion of an in-progress IO simpler than for
> >>>> the existing synchronous IO case,.   since librt takes care of the
> >>>> waiting.
> >>>> specifically,   no need for extra wait-for-io control blocks
> >>>> such as in bufmgr's  WaitIO()
> >>>
> >>>
> >>> [checks]. No, it doesn't work. See attached test program.

Thanks for checking, and thanks for coming up with that test program.
However, yes, it really does work -- always (on linux).
Your test program is doing things in the wrong order -
it calls aio_suspend *before* aio_error.
However, the rule is: call aio_suspend *after* aio_error,
and *only* if aio_error returns EINPROGRESS.

See the code changes to fd.c function FileCompleteaio()
to see how we have done it.  And I am attaching a corrected version
of your test program, which runs just fine.
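
In outline, the ordering is as in this sketch (illustration only -- not the
patch's actual FileCompleteaio() code):

#include <aio.h>
#include <errno.h>

/*
 * Ask aio_error first, and call aio_suspend only while the aio is still
 * EINPROGRESS; then collect the final status with aio_return.
 */
static ssize_t
wait_for_aio(struct aiocb *cb)
{
    const struct aiocb *list[1];

    list[0] = cb;
    while (aio_error(cb) == EINPROGRESS)
    {
        /* block until completion; retry if interrupted by a signal */
        if (aio_suspend(list, 1, NULL) != 0 && errno != EINTR && errno != EAGAIN)
            return -1;
    }

    return aio_return(cb);      /* bytes transferred, or -1 with errno set */
}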


> >>>
> >>> It kinda seems to work sometimes, because of the way it's implemented in
> >>> glibc. The aiocb struct has a field for the result value and errno, and
> >>> when
> >>> the I/O is finished, the worker thread fills them in. aio_error() and
> >>> aio_return() just return the values of those fields, so calling
> >>> aio_error()
> >>> or aio_return() do in fact happen to work from a different process.
> >>> aio_suspend(), however, is implemented by sleeping on a process-local
> >>> mutex,
> >>> which does not work from a different process.
> >>>
> >>> Even if it worked on Linux today, it would be a bad idea to rely on it
> >>> from
> >>> a portability point of view. No, the only sane way to make this work is
> >>> that
> >>> the process that initiates an I/O request is responsible for completing
> >>> it.
> >>> If another process needs to wait for an async I/O to complete, we must
> >>> use
> >>> some other means to do the waiting. Like the io_in_progress_lock that we
> >>> already have, for the same purpose.
> >>
> >>
> >> But calls to it are timeouted by 10us, effectively turning the thing
> >> into polling mode.
> >
> >
> > We don't want polling... And even if we did, calling aio_suspend() in a way
> > that's known to be broken, in a loop, is a pretty crappy way of polling.

Well, as mentioned earlier, it is not broken.  Whether it is efficient I
am not sure.
I have looked at the mutex in aio_suspend that you mentioned and I am not
quite convinced that, if the caller is not the original aio_read process,
it renders the suspend() into an instant timeout.  I will see if I can
verify that.
Where are you (Claudio) seeing 10us?

> 
> 
> Didn't fix that, but the attached patch does fix regression tests when
> scanning over index types other than btree (was invoking elog when the
> index am didn't have ampeeknexttuple)
  /*
 * Test program to test if POSIX aio functions work across processes
 */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <aio.h>
#include <sys/mman.h>
#include <sys/wait.h>

char *shmem;

void
processA(void)
{
	int fd;
	struct aiocb *aiocbp = (struct aiocb *) shmem;
	char *buf = shmem + sizeof(struct aiocb);

	fd = open("aio-shmem-test-file", O_CREAT | O_WRONLY | O_SYNC, 

Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-05-29 Thread John Lumby
I have pasted  below the EXPLAIN of one of my benchmark queries
(the one I reference in the README).
Plenty of nested loop joins.
However I think I understand your question as to how effective it would be
if the outer is not sorted, and I will see if I can dig into that
if I get time  (and it sounds as though Claudio is on it too).

The area of exactly what the best prefetch strategy should be for
each particular type of scan and context is a good one to work on.

John

> Date: Wed, 28 May 2014 18:12:23 -0700
> Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal 
> and patch
> From: p...@heroku.com
> To: klaussfre...@gmail.com
> CC: johnlu...@hotmail.com; pgsql-hackers@postgresql.org
> 
> On Wed, May 28, 2014 at 5:59 PM, Claudio Freire  
> wrote:
> > For nestloop, correct me if I'm wrong, but index scan nodes don't have
> > visibility of the next tuple to be searched for.
> 
> Nested loop joins are considered a particularly compelling case for
> prefetching, actually.
> 
> -- 
> Peter Geoghegan

                                   QUERY PLAN
--------------------------------------------------------------------------------
 Limit  (cost=801294.81..801294.81 rows=2 width=532)
   CTE deploy_zone_down
     ->  Recursive Union  (cost=1061.25..2687.40 rows=11 width=573)
           ->  Nested Loop  (cost=1061.25..1423.74 rows=1 width=41)
                 ->  Nested Loop  (cost=1061.11..1421.22 rows=14 width=49)
                       ->  Bitmap Heap Scan on entity zone_tree  (cost=1060.67..1175.80 rows=29 width=40)
                             Recheck Cond: ((name >= 'testZone-4375'::text) AND (name <= 'testZone-5499'::text) AND ((discriminator)::text = 'ZONE'::text))
                             ->  BitmapAnd  (cost=1060.67..1060.67 rows=29 width=0)
                                   ->  Bitmap Index Scan on entity_name  (cost=0.00..139.71 rows=5927 width=0)
                                         Index Cond: ((name >= 'testZone-4375'::text) AND (name <= 'testZone-5499'::text))
                                   ->  Bitmap Index Scan on entity_discriminatorx  (cost=0.00..920.70 rows=49636 width=0)
                                         Index Cond: ((discriminator)::text = 'ZONE'::text)
                       ->  Index Scan using metadata_value_owner_id on metadata_value mv  (cost=0.43..8.45 rows=1 width=17)
                             Index Cond: (owner_id = zone_tree.id)
                 ->  Index Scan using metadata_field_pkey on metadata_field mf  (cost=0.15..0.17 rows=1 width=8)
                       Index Cond: (id = mv.field_id)
                       Filter: ((name)::text = 'deployable'::text)
           ->  Nested Loop  (cost=0.87..126.34 rows=1 width=573)
                 ->  Nested Loop  (cost=0.72..125.44 rows=5 width=581)
                       ->  Nested Loop  (cost=0.29..83.42 rows=10 width=572)
                             ->  WorkTable Scan on deploy_zone_down dzd  (cost=0.00..0.20 rows=10 width=540)
                             ->  Index Scan using entity_discriminator_parent_zone on entity ch  (cost=0.29..8.31 rows=1 width=40)
                                   Index Cond: ((parent_id = dzd.zone_tree_id) AND ((discriminator)::text = 'ZONE'::text))
                       ->  Index Scan using metadata_value_owner_id on metadata_value mv_1  (cost=0.43..4.19 rows=1 width=17)
                             Index Cond: (owner_id = ch.id)
                 ->  Index Scan using metadata_field_pkey on metadata_field mf_1  (cost=0.15..0.17 rows=1 width=8)
                       Index Cond: (id = mv_1.field_id)
                       Filter: ((name)::text = 'deployable'::text)
   CTE deploy_zone_tree
     ->  Recursive Union  (cost=0.00..9336.89 rows=21 width=1105)
           ->  CTE Scan on deploy_zone_down dzd_1  (cost=0.00..0.22 rows=11 width=1105)
           ->  Nested Loop  (cost=0.43..933.63 rows=1 width=594)
                 ->  WorkTable Scan on deploy_zone_tree dzt  (cost=0.00..2.20 rows=110 width=581)
                 ->  Index Scan using entity_pkey on entity pt  (cost=0.43..8.46 rows=1 width=21)
                       Index Cond: (id = dzt.zone_tree_ancestor_parent_id)
                       Filter: ((discriminator)::text = ANY ('{VIEW,ZONE}'::text[]))
   CTE forward_host_ip
     ->  Nested Loop  (cost=1.30..149.65 rows=24 width=88)
           ->  Nested Loop  (cost=0.87..121.69 rows=48 width=56)
                 ->  Nested Loop

Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-05-29 Thread John Lumby
Thanks for looking at it!

> Date: Thu, 29 May 2014 00:19:33 +0300
> From: hlinnakan...@vmware.com
> To: johnlu...@hotmail.com; pgsql-hackers@postgresql.org
> CC: klaussfre...@gmail.com
> Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal 
> and patch
> 
> On 05/28/2014 11:52 PM, John Lumby wrote:
> >
> 
> The patch seems to assume that you can put the aiocb struct in shared 
> memory, initiate an asynchronous I/O request from one process, and wait 
> for its completion from another process. I'm pretty surprised if that 
> works on any platform.

It works on linux.  Actually this ability allows the asyncio implementation to
reduce complexity in one respect (yes I know it looks complex enough):
it makes waiting for completion of an in-progress IO simpler than for
the existing synchronous IO case, since librt takes care of the waiting;
specifically, no need for extra wait-for-io control blocks
such as in bufmgr's WaitIO().

This also plays very nicely with the syncscan case, where the cohorts run
close together.

If anyone would like to advise whether this also works on MacOS/BSD, windows,
that would be good, as I can't verify it myself.

> 
> How portable is POSIX aio nowadays? Googling around, it still seems that 
> on Linux, it's implemented using threads. Does the thread-emulation 
> implementation cause problems with the rest of the backend, which 
> assumes that there is only a single thread? In any case, I think we'll 

Good question, I am not aware of any dependency on a backend having only
a single thread.  Can you please point me to such dependencies?


> want to encapsulate the AIO implementation behind some kind of an API, 
> to allow other implementations to co-exist.

It is already -- it follows the smgr (storage manager) -> md (magnetic disk) -> fd (filesystem)
layering used for the existing filesystem ops, including posix_fadvise.

> 
> Benchmarks?

I will see if I can package mine up somehow.
Would be great if anyone else would like to benchmark it on theirs   ...


> - Heikki
  

Re: [HACKERS] [PATCH] Prefetch index pages for B-Tree index scans

2012-11-06 Thread John Lumby

Bruce Momjian wrote:
>
> On Fri, Nov 2, 2012 at 09:59:08AM -0400, John Lumby wrote:
> > However,the OP describes an implementation based on libaio.
> > Today what we have (for linux) is librt,  which is quite different.
>
> Well, good thing we didn't switch to using libaio, now that it is gone.
>
Yes, I think you are correct.  Although I should correct myself about the
status of libaio - it seems many distros continue to provide it, and at least
one other popular database (MySQL) uses it, but as far as I can tell
the content has not been updated by the original authors for around 10 years.
That is perhaps not surprising, since it does very little other than wrap
the linux kernel syscalls.

Set against the CPU-overhead disadvantage of librt, I think the three
main advantages of librt vs libaio/kernel-aio for postgresql are:
  .   POSIX standard, and probably easier to provide a very similar
      implementation on windows (I see at least one posix aio lib for windows)
  .   no restrictions on the way files are accessed (kernel-aio imposes
      restrictions on open() flags and buffer alignment etc)
  .   it seems (from the recent postings about the earlier attempt to
      implement async io using libaio) that the posix threads style lends
      itself better to fitting in with the postgresql backend model.

John

  



Re: [HACKERS] [PATCH] Prefetch index pages for B-Tree index scans

2012-11-02 Thread John Lumby

Claudio Freire wrote:
>
> On Thu, Nov 1, 2012 at 10:59 PM, Greg Smith  wrote:
> > On 11/1/12 6:13 PM, Claudio Freire wrote:
> >
> >> posix_fadvise what's the trouble there, but the fact that the kernel
> >> stops doing read-ahead when a call to posix_fadvise comes. I noticed
> >> the performance hit, and checked the kernel's code. It effectively
> >> changes the prediction mode from sequential to fadvise, negating the
> >> (assumed) kernel's prefetch logic.
> >
> > The Linux posix_fadvise implementation never seemed like it was well liked
> > by the kernel developers. Quirky stuff like this popped up all the time
> > during that period, when effective_io_concurrency was being added. I wonder
> > how far back the fadvise/read-ahead conflict goes back.
>
> Well, to be precise it's not so much as a problem in posix_fadvise
> itself, it's a problem in how it interacts with readahead. Since
> readahead works at the memory mapper level, and only when actually
> performing I/O (which would seem at first glance quite sensible), it
> doesn't get to see fadvise activity.
>
> FADV_WILLNEED is implemented as a forced readahead, which doesn't
> update any of the readahead context structures. Again, at first
> glance, this would seem sensible (explicit hints shouldn't interfere
> with pattern detection logic). However, since those pages are (after
> the fadvise call) under async I/O, next time the memory mapper needs
> that page, instead of requesting I/O through readahead logic, it will
> wait for async I/O to complete.
>
> IOW, what was sequential in fact, became invisible to readahead,
> indistinguishable from random I/O. Whatever page fadvise failed to
> predict will be treated as random I/O, and here the trouble lies.

And this may be one other advantage of async io over posix_fadvise in
the linux environment (with the present mmap behaviour):
that async io achieves the same objective of improving disk/processing overlap
without the mentioned interference with read-ahead.
Although to confirm this would ideally require a 3-way comparison of
   posix_fadvise + existing readahead behaviour
   posix_fadvise + modified readahead behaviour that does not force waiting
                   for the current async io (i.e. just check the aio and
                   continue normal readahead if in progress)
   async io with no posix_fadvise

It seems in general to be preferable to have an implementation that is
less dependent on specific behaviour of the OS read-ahead mechanism.

>
> >> I've mused about the possibility to batch async_io requests, and use
> >> the scatter/gather API instead of sending tons of requests to the
> >
> >> kernel. I think doing so would enable a zero-copy path that could very
> >> possibly imply big speed improvements when memory bandwidth is the
> >> bottleneck.
> >
> > Another possibly useful bit of history here for you. Greg Stark wrote a
> > test program that used async I/O effectively on both Linux and Solaris.
> > Unfortunately, it was hard to get that to work given how Postgres does its
> > buffer I/O, and using processes instead of threads. This looks like the
> > place he commented on why:
> >
> > http://postgresql.1045698.n5.nabble.com/Multi-CPU-Queries-Feedback-and-or-suggestions-wanted-td1993361i20.html
> >
> > The part I think was relevant there from him:
> >
> > "In the libaio view of the world you initiate io and either get a
> > callback or call another syscall to test if it's complete. Either
> > approach has problems for Postgres. If the process that initiated io
> > is in the middle of a long query it might take a long time, or not even
> > never get back to complete the io. The callbacks use threads...
> >
> > And polling for completion has the problem that another process could
> > be waiting on the io and can't issue a read as long as the first
> > process has the buffer locked and io in progress. I think aio makes a
> > lot more sense if you're using threads so you can start a thread to
> > wait for the io to complete."
>
> I noticed that. I always envisioned async I/O as managed by some
> dedicated process. One that could check for completion or receive
> callbacks. Postmaster, for instance.

Thanks for mentioning this posting.  Interesting.
However, the OP describes an implementation based on libaio.
Today what we have (for linux) is librt, which is quite different.
It is arguably worse than libaio (well, actually I am sure it is worse)
since it is essentially just an encapsulation of using threads to do
synchronous ios -- you can look at it as making it easier to do what the
application could do itself if it set up its own pthreads.  The linux
kernel does not know about it, and so the CPU overhead of checking for
completion is higher.
But if async io is used *only* for prefetching, and not for the actual
ReadBuffer itself (which is what I did), then the problem
mentioned by the OP
 "If the process that initiated io is in the middle of a long 

FW: [HACKERS] [PATCH] Prefetch index pages for B-Tree index scans

2012-11-01 Thread John Lumby

Claudio wrote :
>
> Check the latest patch, it contains heap page prefetching too.
>

Oh yes I see.  I missed that - I was looking in the wrong place.
I do have one question about the way you did it: by placing the
prefetch heap-page calls in _bt_next, which effectively means inside
a call from the index am's index_getnext_tid to btgettuple, are you sure
you are synchronizing your prefetches of heap pages with the index am's
ReadBuffers of heap pages?  I.e. are you complying with this comment
from nodeBitmapHeapscan.c for prefetching its bitmap heap pages in
the bitmap-index-scan case:

 * We issue prefetch requests *after* fetching the current page to try
 * to avoid having prefetching interfere with the main I/O.

I can't really tell whether your design conforms to this, nor do I
know whether it is important, but I decided to do it in the same manner,
and so implemented the heap-page fetching in index_fetch_heap.

>
> async_io indeed may make that logic obsolete, but it's not redundant
> posix_fadvise what's the trouble there, but the fact that the kernel
> stops doing read-ahead when a call to posix_fadvise comes. I noticed
> the performance hit, and checked the kernel's code. It effectively
> changes the prediction mode from sequential to fadvise, negating the
> (assumed) kernel's prefetch logic.
>
I did not know that. Very interesting.


>
> I've mused about the possibility to batch async_io requests, and use
> the scatter/gather API insead of sending tons of requests to the
> kernel. I think doing so would enable a zero-copy path that could very
> possibly imply big speed improvements when memory bandwidth is the
> bottleneck.

I think you are totally correct on this point.  If I recall, the
glibc (librt) aio does have an lio_listio, but it is either a no-op
or just loops over the list, I forget which (I don't have its source right now),
but in any case I am sure there is potential for implementing such a facility.
But to be really effective, it should be implemented in the kernel itself,
which we don't have today.
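
For what it's worth, here is a minimal sketch of what batching reads through
the POSIX listio interface looks like; the request count, block size and
helper name are illustrative:

#include <aio.h>
#include <string.h>

#define NREQ   16
#define BLKSZ  8192

/*
 * Queue NREQ reads in one lio_listio() call and return immediately
 * (LIO_NOWAIT); completion is checked later with aio_error()/aio_return().
 */
static int
prefetch_batch(int fd, const off_t offsets[NREQ], char bufs[NREQ][BLKSZ],
               struct aiocb cbs[NREQ])
{
    struct aiocb *list[NREQ];

    for (int i = 0; i < NREQ; i++)
    {
        memset(&cbs[i], 0, sizeof(cbs[i]));
        cbs[i].aio_fildes     = fd;
        cbs[i].aio_buf        = bufs[i];
        cbs[i].aio_nbytes     = BLKSZ;
        cbs[i].aio_offset     = offsets[i];
        cbs[i].aio_lio_opcode = LIO_READ;
        list[i] = &cbs[i];
    }

    return lio_listio(LIO_NOWAIT, list, NREQ, NULL);
}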

John
  



Re: [HACKERS] [PATCH] Prefetch index pages for B-Tree index scans

2012-11-01 Thread John Lumby

Claudio wrote :
>
> Oops - forgot to effectively attach the patch.
>

I've read through your patch and the earlier posts by you and Cédric.

This is very interesting.  You chose to prefetch index btree (key-ptr) pages,
whereas I chose to prefetch the data pages pointed to by the key-ptr pages.
Never mind why -- I think they should work very well together, as both have
been demonstrated to produce improvements.  I will see if I can combine them,
git permitting (as of course their changed file lists overlap).

I was surprised by this design decision:
    /* start prefetch on next page, but not if we're reading sequentially
       already, as it's counterproductive in those cases */
Is it really?  Are you assuming that it's redundant with posix_fadvise for
this case?
I think possibly when async_io is also in use by the postgresql prefetcher,
this decision could change.

Cédric wrote:
> If the gain is visible mostly for the backward and not for other access
>
> building the latest kernel with that patch included, replicating the
>

I found improvement from forward scans.
Actually I did not even try backward, but only because I did not have time.
It should help both.

>> I don't even think windows supports posix_fadvise, but if async_io is
>> used (as hinted by the link Lumby posted), it would probably also work
>> in windows.

windows has async io and I think it would not be hard to extend my
implementation to windows (although I don't plan to do it myself).
Actually, about 95% of the code I wrote to implement async-io in postgresql
concerns not the async io, which is trivial, but the buffer management.
With async io, PrefetchBuffer must allocate and pin a buffer (not too hard),
but now also every other part of buf mgr must know about the possibility
that a buffer may be in async_io_in_progress state and be prepared to
determine the possible completion (quite hard) -- and also, if and when the
prefetch requester comes again with ReadBuffer, buf mgr has to remember that
this buffer was pinned by this backend during the previous prefetch and must
not be re-pinned a second time (very hard without increasing the size of the
shared descriptor, which was important since there could be a very large
number of these).
It ended up with a major change to bufmgr.c plus one new file for handling
buffer management aspects of starting, checking and terminating async io.

However I think in some environments the async-io has significant benefits
over posix-fadvise, especially (of course!) where access is very
non-sequential, but also for sequential access if there are many concurrent
conflicting sets of sequential command streams from different backends
(always assuming the RAID can manage them concurrently).

I've attached a snapshot patch of just the non-bitmap-index-scan changes I've made.
You can't compile it as is, because I had to change the interface to PrefetchBuffer
and add a new DiscardBuffer, which I did not include in this snapshot to avoid
confusion.

John

  --- src/backend/executor/nodeIndexscan.c.orig	2012-10-31 15:24:12.083163547 -0400
+++ src/backend/executor/nodeIndexscan.c	2012-11-01 11:45:16.244967963 -0400
@@ -35,8 +35,13 @@
 #include "utils/rel.h"
 
 
+
 static TupleTableSlot *IndexNext(IndexScanState *node);
 
+#ifdef USE_PREFETCH
+extern unsigned int prefetch_dbOid; /*  database oid of relations on which prefetching to be done - 0 means all */
+extern unsigned int prefetch_index_scans; /*  boolean whether to prefetch bitmap heap scans */
+#endif   /* USE_PREFETCH */
 
 /* 
  *		IndexNext
@@ -418,7 +423,17 @@ ExecEndIndexScan(IndexScanState *node)
 	 * close the index relation (no-op if we didn't open it)
 	 */
 	if (indexScanDesc)
+{
 		index_endscan(indexScanDesc);
+
+#ifdef USE_PREFETCH
+if (   indexScanDesc->do_prefetch
+&& ( (struct BlockIdData*)0 != indexScanDesc->pfch_list )
+   ) {
+ pfree(indexScanDesc->pfch_list);
+}
+#endif   /* USE_PREFETCH */
+}
 	if (indexRelationDesc)
 		index_close(indexRelationDesc, NoLock);
 
@@ -609,6 +624,25 @@ ExecInitIndexScan(IndexScan *node, EStat
 			   indexstate->iss_NumScanKeys,
 			 indexstate->iss_NumOrderByKeys);
 
+#ifdef USE_PREFETCH
+/*  initialize prefetching   */
+		if (prefetch_index_scans
+ &&	(!RelationUsesLocalBuffers(indexstate->iss_ScanDesc->heapRelation)) /* I think this must always be true for an indexed heap ? */
+ && ((   (prefetch_dbOid > 0)
+   && (prefetch_dbOid == indexstate->iss_ScanDesc->heapRelation->rd_node.dbNode)
+ )
+ ||  (prefetch_dbOid == 0)
+)
+   ) {
+			indexstate->iss_ScanDesc->pfch_list = palloc( t

Re: [HACKERS] [PATCH] Prefetch index pages for B-Tree index scans

2012-10-23 Thread John Lumby

> From: Claudio Freire 
> I hope I'm not talking to myself.

Indeed not.  I also looked into prefetching for pure index scans for b-trees
(and an extension to use async io).
http://archives.postgresql.org/message-id/BLU0-SMTP31709961D846CCF4F5EB4C2A3930%40phx.gbl

I am not somewhere with a proper setup this week, but will reply at greater
length next week.

John
  

[HACKERS] Re: proposal and patch : support INSERT INTO...RETURNING with partitioned table using rule

2012-07-05 Thread John Lumby

First,  apologies for taking so long to reply to your post.

On Fri, 22 Jun 2012 09:55:13, Robert Haas wrote:

> On Wed, Jun 20, 2012 at 12:24 PM, John Lumby  
> wrote:
> > An INSERT which has a RETURNING clause and which is to be rewritten 
> >based on
> > a rule will be accepted if the rule is an "unconditional DO INSTEAD".
> > In general I believe "unconditional" means "no WHERE clause", but in 
> >practice
> > if the rule is of the form
> >    CREATE RULE insert_part_history as ON INSERT to history \
> >  DO INSTEAD SELECT history_insert_partitioned(NEW) returning NEW.id
> > this is treated as conditional and the query is rejected.
> 
> This isn't rejected because the query is treated as condition; it's
> rejected because it's not valid syntax.  A SELECT query can't have a
> RETURNING clause, because the target list (i.e. the part that
> immediately follows the SELECT) already serves that purpose. The fact
> that it's in a CREATE RULE statement is irrelevant.

Thanks for correcting me.   At the time,  it wasn't clear to me whether the 
RETURNING clause
in a CREATE RULE statement belonged to the CREATE RULE statement or the rule 
being created,
but now I see it's the latter.

> >   .  I propose to extend the rule system to recognize cases where the 
> >INSERT query specifies
> >  RETURNING and the rule promises to return a row,  and to then permit 
> >this query to run
> >  and return the expected row.   In effect,  to widen the definition of 
> >"unconditional"
> >  to handle cases such as my testcase.
> 
> That already (kind of) works:
> 
>  [...]
> 
> I do notice that the RETURNING clause of the INSERT can't reference
> NEW, which seems like a restriction that we probably ought to lift,
> but it doesn't seem to have much to do with your patch.
> 

The main use of my proposal is to be able to return the value of the sequence
assigned to the NEW.id column, so yes, that is a serious restriction.

However, even if that restriction is lifted, it will not help with the case
where the rule is an invocation of a function, which is the case I need.
For example, in my testcase, the function is building the SQL statement to be
executed, including the table name of the partitioned child table, based on
the timestamp in the NEW record.

Your comments have helped me realize that my original subject line did not
accurately state my requirement, which should read
 "support INSERT INTO...RETURNING with partitioned table using rule which
invokes a function"

And for this requirement as re-stated, as far as I can tell, current
postgresql has no solution:
  .  the only way to invoke a function in a rewrite rule is with SELECT function()
  .  SELECT does not permit a RETURNING clause
  .  my proposal provides a way of relaxing this restriction for the case of
     SELECT in a rewrite rule.

I hope this describes the proposal a bit better.

John
  


[HACKERS] proposal and patch : support INSERT INTO...RETURNING with partitioned table using rule

2012-06-20 Thread John Lumby

---
Problem I'm trying to solve:
    For partitioned tables, make it possible to use the RETURNING clause on
    INSERT INTO together with a DO INSTEAD rule.

[  Note  -  wherever I say INSERT I also mean UPDATE and DELETE ]

---
Current behaviour :

    An INSERT which has a RETURNING clause and which is to be rewritten based on
    a rule will be accepted if the rule is an "unconditional DO INSTEAD".
    In general I believe "unconditional" means "no WHERE clause", but in 
practice
    if the rule is of the form 
   CREATE RULE insert_part_history as ON INSERT to history \
 DO INSTEAD SELECT history_insert_partitioned(NEW) returning NEW.id
    this is treated as conditional and the query is rejected.    

    Testcase:
    A table T is partitioned and has two or more columns,  one of which 
    is an id column declared as
 id bigint DEFAULT nextval('history_id_seq'::regclass) NOT NULL
    and the application issues
  "INSERT into history  (column-list which excludes id)  values () 
RETURNING id"

    I can get the re-direction of the INSERT *without* RETURNING to work using
    either trigger or rule, in which the trigger/rule invokes a procedure, but
    whichever way I do it, I could not get this RETURNING clause to work.

    For a trigger,
  the INSERT ... RETURNING was accepted but returned no rows, (as I would
  expect), and for the RULE, the INSERT ... RETURNING was rejected with :
 
  ERROR:  cannot perform INSERT RETURNING on relation "history"
  HINT:  You need an unconditional ON INSERT DO INSTEAD rule with a 
RETURNING clause.

  but this hint was not much help,  since :

   For a rule,
 CREATE RULE insert_part_history as ON INSERT to history \
 DO INSTEAD SELECT history_insert_partitioned(NEW) returning NEW.id
    ERROR:  syntax error at or near "returning"
    LINE 1: ...DO INSTEAD SELECT history_insert_partitioned(NEW) returning ...

    Here the function history_insert_partitioned is something like
    CREATE FUNCTION history_insert_partitioned( NEW public.history) RETURNS BIGINT AS $$
    DECLARE
       ...
    BEGIN
       ...
       < access NEW fields e.g. timestamp >
       < construct partitioned table name >
       < EXECUTE 'INSERT INTO ' partitioned-table ... >
       ...
       RETURN history_id;
    END;
    $$
    LANGUAGE plpgsql
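
    For concreteness, a purely hypothetical filling-in of that skeleton
    (the column names id and ts, and the monthly child-table naming scheme,
    are illustrative only):

    -- assumes history has columns (id, ts, ...) and monthly children history_YYYYMM
    CREATE FUNCTION history_insert_partitioned( NEW public.history ) RETURNS BIGINT AS $$
    DECLARE
       child_name text;
       history_id bigint;
    BEGIN
       -- pick the child table from the row's timestamp (assumed column "ts")
       child_name := 'history_' || to_char(NEW.ts, 'YYYYMM');
       -- insert the whole row into the child and capture its generated id
       EXECUTE 'INSERT INTO ' || quote_ident(child_name)
            || ' SELECT ($1).* RETURNING id'
          INTO history_id
          USING NEW;
       RETURN history_id;
    END;
    $$
    LANGUAGE plpgsql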

---
Some references to discussion of this requirement :
  .  http://wiki.postgresql.org/wiki/Todo
 item "Make it possible to use RETURNING together with conditional DO 
INSTEAD rules,
   such as for partitioning setups"
  .  http://archives.postgresql.org/pgsql-general/2012-06/msg00377.php
  .  http://archives.postgresql.org/pgsql-general/2010-12/msg00542.php
  .  
http://acodapella.blogspot.it/2011/06/hibernate-postgresql-table-partitioning.html

---
Proposal:
  .  I propose to extend the rule system to recognize cases where the INSERT
     query specifies RETURNING and the rule promises to return a row, and to
     then permit this query to run and return the expected row.  In effect,
     to widen the definition of "unconditional" to handle cases such as my
     testcase.
  .  One comment is that all the infrastructure for returning one row from
     the re-written query is already present in the code, and the non-trivial
     question is how to ensure the new design is safe in preventing any
     rewrite that actually would not return a row.
  .  In this patch, I have chosen to make use of the LIMIT clause -
     I add a side-effect implication to a LIMIT clause when it occurs in a
     rewrite of an INSERT to mean "this rule will return a row".
     So, with my patch, same testcase, same function history_insert_partitioned
     and new rule
        CREATE RULE insert_part_history as ON INSERT to history \
            DO INSTEAD SELECT history_insert_partitioned(NEW) LIMIT 1
     the INSERT is accepted and returns the id.
     This use of the LIMIT clause is probably contentious, but I wished to
     avoid introducing new SQL syntax, and the LIMIT clause does have a
     connotation of returning rows.

---
I attach patch based on clone of postgresql.git as of yesterday (120619-145751 
EST)
I have tested the patch with INSERT and UPDATE  (not tested with DELETE but 
should work).
The patch is not expected to be final but just to show how I did it.

John
  --- src/backend/optimizer/plan/planner.c.orig	2012-06-19 14:59:21.264574275 -0400
+++ src/backend/optimizer/plan/planner.c	2012-06-19 15:10:54.776590758 -0400
@@ -226,7 +226,8 @@ standard_planner(Query *parse, int curso
 
 	result->commandType = parse->commandType;
 	result->queryId = parse->queryId;
-	result->hasReturning = (