Re: [HACKERS] drop/truncate table sucks for large values of shared buffers

2015-07-12 Thread Amit Kapila
On Thu, Jul 2, 2015 at 7:33 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Simon Riggs si...@2ndquadrant.com writes:
  On 2 July 2015 at 14:08, Heikki Linnakangas hlinn...@iki.fi wrote:
  I'm marking this as returned with feedback in the commitfest.

  There are no unresolved issues with the approach, nor is it true it is
  slower. If you think there are some, you should say what they are, not
  act high handedly to reject a patch without reason.

 Have you read the thread?

I have read the thread again with a focus on the various problems
(objections) raised against this patch, and here is a summary of my
reading.

1. lseek can lie about EOF, which could lead to a serious problem
during checkpoint if DROP TABLE misses removing a shared buffer
belonging to the table being dropped.

This problem can be solved by maintaining a list of dropped relations
and then cleaning such buffers during the checkpoint's buffer-pool scan.
Something similar is already done to avoid comparable problems during
fsync.
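
To make that concrete, here is a minimal sketch of the check inside the
checkpointer's existing buffer scan.  PendingDropContains() and the wrapper
function name are assumptions for illustration only, and
InvalidateBuffer()/FlushBuffer() are bufmgr-internal routines; this is not
code from the patch.

    /*
     * Illustrative sketch: while the checkpointer walks shared_buffers
     * (which it already does), consult a list of relations dropped since
     * the last cycle and discard their buffers instead of writing them out.
     */
    static void
    SyncOneBufferForCheckpoint(BufferDesc *bufHdr)
    {
        if (PendingDropContains(bufHdr->tag.rnode))     /* assumed helper */
        {
            /* relation was dropped: drop the buffer rather than flush it */
            InvalidateBuffer(bufHdr);                   /* bufmgr-internal */
            return;
        }

        FlushBuffer(bufHdr, NULL);                      /* normal checkpoint write */
    }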

2. The patch doesn't cater to DROP TABLESPACE and DROP DATABASE
operations.

It would have been better if there were a simpler solution for these
operations as well, but even if we come up with something generic enough
to avoid problems for these operations, I think there is no reason why it
couldn't easily be adopted for the DROP TABLE operation, as the changes
proposed by this patch are very much localized to one function (which
we would have to change anyway if we come up with a generic
mechanism).  The other changes required to avoid problem 1 (the lseek
problem) would still be needed even once a patch for the generic
approach is ready.  As mentioned by Andrew, another thing to note is
that these operations are used much less often than DROP/TRUNCATE
TABLE, so I think optimizing them is of somewhat lower priority.


3. We can't close and reopen a file (to avoid lseek lying about EOF, or
otherwise), as that might lead to a failure if there is a flush operation
on the file in parallel.

I haven't verified this yet, but I think we can find some way to check
whether the VM and FSM files exist before asking for the number of
blocks in those files.
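
A small sketch of that check using the existing smgr API (rel is assumed
to be an already-opened Relation):

    /*
     * Only ask for the block counts of the FSM and VM forks if those forks
     * actually exist on disk, so we never probe a file that was never created.
     */
    BlockNumber fsm_blocks = 0;
    BlockNumber vm_blocks = 0;

    RelationOpenSmgr(rel);

    if (smgrexists(rel->rd_smgr, FSM_FORKNUM))
        fsm_blocks = smgrnblocks(rel->rd_smgr, FSM_FORKNUM);

    if (smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM))
        vm_blocks = smgrnblocks(rel->rd_smgr, VISIBILITYMAP_FORKNUM);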


Apart from the above, Heikki mentioned the potential overflow in the
total-number-of-blocks calculation, which I think is a relatively simple
problem to fix.
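
For example, accumulating the per-fork counts in a 64-bit variable avoids
the overflow (the variable names below are illustrative, not the patch's):

    /*
     * BlockNumber is uint32, and a relation approaching the 32 TB limit has
     * close to 2^32 blocks in its main fork alone, so sum the forks in 64
     * bits before comparing against the NBuffers / 4 threshold.
     */
    uint64      total_blocks;

    total_blocks = (uint64) main_blocks + fsm_blocks + vm_blocks;

    if (total_blocks < (uint64) NBuffers / 4)
        use_targeted_invalidation = true;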


  There were plenty of objections, as well as
 a design for a better solution.

I think by better solution here you mean the radix-tree-based approach
or something similar.  First, I don't see any clear design for it, and
second, even if we had a patch for it ready tomorrow, it would not be
very difficult to replace the proposed solution even after it is
committed, as the changes are very much localized to one function.

  In addition, we should think about
 more holistic solutions such as Andres' nearby proposal (which I doubt
 will work as-is, but maybe somebody will think of how to fix it).
 Committing a band-aid will just make it harder to pursue real fixes.


Right, but OTOH waiting a long time for something much more generic
doesn't sound wise either, especially when this solution could later be
replaced by the generic one without much difficulty.

Having said that, I am not wedded to working on this idea, so if you
and/or others have no inclination for work in this direction, then I will
stop and we can wait for the day when we have a clear idea for some
generic approach.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] drop/truncate table sucks for large values of shared buffers

2015-07-02 Thread Heikki Linnakangas

On 07/02/2015 04:33 PM, Simon Riggs wrote:

On 2 July 2015 at 14:08, Heikki Linnakangas hlinn...@iki.fi wrote:


On 06/27/2015 07:45 AM, Amit Kapila wrote:


Sometime back on one of the PostgreSQL blog [1], there was
discussion about the performance of drop/truncate table for
large values of shared_buffers and it seems that as the value
of shared_buffers increase the performance of drop/truncate
table becomes worse.  I think those are not often used operations,
so it never became priority to look into improving them if possible.

I have looked into it and found that the main reason for such
a behaviour is that for those operations it traverses whole
shared_buffers and it seems to me that we don't need that
especially for not-so-big tables.  We can optimize that path
by looking into buff mapping table for the pages that exist in
shared_buffers for the case when table size is less than some
threshold (say 25%) of shared buffers.

Attached patch implements the above idea and I found that
performance doesn't dip much with patch even with large value
of shared_buffers.  I have also attached script and sql file used
to take performance data.
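
To make the proposed path concrete, here is a minimal sketch of such a
targeted lookup.  The function name follows the patch, but the body is an
illustration using bufmgr internals (INIT_BUFFERTAG, BufTableLookup,
InvalidateBuffer), not the attached patch itself; a real implementation
must also re-check the tag under the buffer header lock before invalidating,
since the mapping can change once the partition lock is released.

    /*
     * Illustrative sketch: probe the buffer mapping hash table once per
     * block of the (small) relation fork instead of scanning all of
     * shared_buffers.
     */
    static void
    DropForkSpecificBuffers(RelFileNode rnode, ForkNumber forkNum,
                            BlockNumber nblocks)
    {
        BlockNumber blkno;

        for (blkno = 0; blkno < nblocks; blkno++)
        {
            BufferTag   tag;
            uint32      hash;
            LWLock     *partitionLock;
            int         buf_id;

            INIT_BUFFERTAG(tag, rnode, forkNum, blkno);
            hash = BufTableHashCode(&tag);
            partitionLock = BufMappingPartitionLock(hash);

            LWLockAcquire(partitionLock, LW_SHARED);
            buf_id = BufTableLookup(&tag, hash);
            LWLockRelease(partitionLock);

            if (buf_id >= 0)
                InvalidateBuffer(GetBufferDescriptor(buf_id));  /* bufmgr-internal */
        }
    }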



I'm marking this as returned with feedback in the commitfest. In
addition to the issues raised so far, ISTM that the patch makes dropping a
very large table with small shared_buffers slower
(DropForkSpecificBuffers() is O(n) where n is the size of the relation,
while the current method is O(shared_buffers))


There are no unresolved issues with the approach, nor is it true it is
slower. If you think there are some, you should say what they are, not act
high handedly to reject a patch without reason.


Oh, I missed the NBuffers / 4 threshold.

(The total_blocks calculation is prone to overflowing, btw, if you have 
a table close to 32 TB in size. But that's trivial to fix)



I concur that we should explore using a radix tree or something else that
would naturally allow you to find all buffers for relation/database X
quickly.


I doubt that it can be managed efficiently while retaining optimal memory
management. If it can its going to be very low priority (or should be).

This approach works, so lets do it, now. If someone comes up with a better
way later, great.


*shrug*. I don't think this is ready to be committed. I can't stop you 
from working on this, but as far as this commitfest is concerned, case 
closed.


- Heikki





Re: [HACKERS] drop/truncate table sucks for large values of shared buffers

2015-07-02 Thread Heikki Linnakangas

On 07/02/2015 04:18 PM, Amit Kapila wrote:

On Thu, Jul 2, 2015 at 6:38 PM, Heikki Linnakangas hlinn...@iki.fi wrote:


On 06/27/2015 07:45 AM, Amit Kapila wrote:


Sometime back on one of the PostgreSQL blog [1], there was
discussion about the performance of drop/truncate table for
large values of shared_buffers and it seems that as the value
of shared_buffers increase the performance of drop/truncate
table becomes worse.  I think those are not often used operations,
so it never became priority to look into improving them if possible.

I have looked into it and found that the main reason for such
a behaviour is that for those operations it traverses whole
shared_buffers and it seems to me that we don't need that
especially for not-so-big tables.  We can optimize that path
by looking into buff mapping table for the pages that exist in
shared_buffers for the case when table size is less than some
threshold (say 25%) of shared buffers.

Attached patch implements the above idea and I found that
performance doesn't dip much with patch even with large value
of shared_buffers.  I have also attached script and sql file used
to take performance data.


I'm marking this as returned with feedback in the commitfest. In
addition to the issues raised so far,

Don't you think the approach discussed (delayed cleanup of buffers
during the checkpoint scan) is sufficient to fix the issues raised?


Dunno, but there is no patch for that.

- Heikki





Re: [HACKERS] drop/truncate table sucks for large values of shared buffers

2015-07-02 Thread Andres Freund
On 2015-07-02 16:08:03 +0300, Heikki Linnakangas wrote:
 I'm marking this as returned with feedback in the commitfest. In addition
 to the issues raised so far, ISTM that the patch makes dropping a very large
 table with small shared_buffers slower (DropForkSpecificBuffers() is O(n)
 where n is the size of the relation, while the current method is
 O(shared_buffers))

I think upthread there was talk about only using the O(relsize) approach
if relsize < NBuffers. That's actually a pretty common scenario,
especially in testsuites.

 I concur that we should explore using a radix tree or something else that
 would naturally allow you to find all buffers for relation/database X
 quickly.

I unsurprisingly think that's the right way to go. But I'm not sure that
it isn't worth adding another layer of band-aid till we're there...

Greetings,

Andres Freund




Re: [HACKERS] drop/truncate table sucks for large values of shared buffers

2015-07-02 Thread Amit Kapila
On Thu, Jul 2, 2015 at 6:38 PM, Heikki Linnakangas hlinn...@iki.fi wrote:

 On 06/27/2015 07:45 AM, Amit Kapila wrote:

 Sometime back on one of the PostgreSQL blog [1], there was
 discussion about the performance of drop/truncate table for
 large values of shared_buffers and it seems that as the value
 of shared_buffers increase the performance of drop/truncate
 table becomes worse.  I think those are not often used operations,
 so it never became priority to look into improving them if possible.

 I have looked into it and found that the main reason for such
 a behaviour is that for those operations it traverses whole
 shared_buffers and it seems to me that we don't need that
 especially for not-so-big tables.  We can optimize that path
 by looking into buff mapping table for the pages that exist in
 shared_buffers for the case when table size is less than some
 threshold (say 25%) of shared buffers.

 Attached patch implements the above idea and I found that
 performance doesn't dip much with patch even with large value
 of shared_buffers.  I have also attached script and sql file used
 to take performance data.


 I'm marking this as returned with feedback in the commitfest. In
addition to the issues raised so far,

Don't you think the approach discussed (delayed cleanup of buffers
during the checkpoint scan) is sufficient to fix the issues raised?

 ISTM that the patch makes dropping a very large table with small
shared_buffers slower (DropForkSpecificBuffers() is O(n) where n is the
size of the relation, while the current method is O(shared_buffers))


Can you please explain how you reached that conclusion?
It does that only when the relation size is less than 25% of
shared_buffers.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] drop/truncate table sucks for large values of shared buffers

2015-07-02 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 On 2 July 2015 at 14:08, Heikki Linnakangas hlinn...@iki.fi wrote:
 I'm marking this as returned with feedback in the commitfest.

 There are no unresolved issues with the approach, nor is it true it is
 slower. If you think there are some, you should say what they are, not act
 high handedly to reject a patch without reason.

Have you read the thread?  There were plenty of objections, as well as
a design for a better solution.  In addition, we should think about
more holistic solutions such as Andres' nearby proposal (which I doubt
will work as-is, but maybe somebody will think of how to fix it).
Committing a band-aid will just make it harder to pursue real fixes.

regards, tom lane




Re: [HACKERS] drop/truncate table sucks for large values of shared buffers

2015-07-02 Thread Simon Riggs
On 2 July 2015 at 14:08, Heikki Linnakangas hlinn...@iki.fi wrote:

 On 06/27/2015 07:45 AM, Amit Kapila wrote:

 Sometime back on one of the PostgreSQL blog [1], there was
 discussion about the performance of drop/truncate table for
 large values of shared_buffers and it seems that as the value
 of shared_buffers increase the performance of drop/truncate
 table becomes worse.  I think those are not often used operations,
 so it never became priority to look into improving them if possible.

 I have looked into it and found that the main reason for such
 a behaviour is that for those operations it traverses whole
 shared_buffers and it seems to me that we don't need that
 especially for not-so-big tables.  We can optimize that path
 by looking into buff mapping table for the pages that exist in
 shared_buffers for the case when table size is less than some
 threshold (say 25%) of shared buffers.

 Attached patch implements the above idea and I found that
 performance doesn't dip much with patch even with large value
 of shared_buffers.  I have also attached script and sql file used
 to take performance data.


 I'm marking this as returned with feedback in the commitfest. In
 addition to the issues raised so far, ISTM that the patch makes dropping a
 very large table with small shared_buffers slower
 (DropForkSpecificBuffers() is O(n) where n is the size of the relation,
 while the current method is O(shared_buffers))


There are no unresolved issues with the approach, nor is it true it is
slower. If you think there are some, you should say what they are, not act
high handedly to reject a patch without reason.

I concur that we should explore using a radix tree or something else that
 would naturally allow you to find all buffers for relation/database X
 quickly.


I doubt that it can be managed efficiently while retaining optimal memory
management. If it can, it's going to be very low priority (or should be).

This approach works, so let's do it, now. If someone comes up with a better
way later, great.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] drop/truncate table sucks for large values of shared buffers

2015-07-02 Thread Amit Kapila
On Thu, Jul 2, 2015 at 7:27 PM, Heikki Linnakangas hlinn...@iki.fi wrote:

 On 07/02/2015 04:18 PM, Amit Kapila wrote:


 Don't you think the approach discussed (delayed cleanup of buffers
 during checkpoint scan) is sufficient to fix the issues raised.


 Dunno, but there is no patch for that.


That's fair enough; however, your reply sounded to me like there is no
further need to explore this idea unless we have something like the
radix-tree-based approach.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] drop/truncate table sucks for large values of shared buffers

2015-07-02 Thread Heikki Linnakangas

On 06/27/2015 07:45 AM, Amit Kapila wrote:

Sometime back on one of the PostgreSQL blog [1], there was
discussion about the performance of drop/truncate table for
large values of shared_buffers and it seems that as the value
of shared_buffers increase the performance of drop/truncate
table becomes worse.  I think those are not often used operations,
so it never became priority to look into improving them if possible.

I have looked into it and found that the main reason for such
a behaviour is that for those operations it traverses whole
shared_buffers and it seems to me that we don't need that
especially for not-so-big tables.  We can optimize that path
by looking into buff mapping table for the pages that exist in
shared_buffers for the case when table size is less than some
threshold (say 25%) of shared buffers.

Attached patch implements the above idea and I found that
performance doesn't dip much with patch even with large value
of shared_buffers.  I have also attached script and sql file used
to take performance data.


I'm marking this as returned with feedback in the commitfest. In 
addition to the issues raised so far, ISTM that the patch makes dropping 
a very large table with small shared_buffers slower 
(DropForkSpecificBuffers() is O(n) where n is the size of the relation, 
while the current method is O(shared_buffers))


I concur that we should explore using a radix tree or something else 
that would naturally allow you to find all buffers for relation/database 
X quickly.


- Heikki





Re: [HACKERS] drop/truncate table sucks for large values of shared buffers

2015-07-01 Thread Simon Riggs
On 1 July 2015 at 15:39, Amit Kapila amit.kapil...@gmail.com wrote:


 Okay. I think we can maintain the list in similar way as we do for
 UNLINK_RELATION_REQUEST in RememberFsyncRequest(), but
 why to wait till 64 tables?


I meant once per checkpoint cycle OR every N tables, whichever is sooner.
N would need to vary according to the size of NBuffers.


 We already scan whole buffer list in each
 checkpoint cycle, so during that scan we can refer this dropped relation
 list and avoid syncing such buffer contents.  Also for ENOENT error
 handling for FileWrite, we can use this list to refer relations for which
 we need to ignore the error.  I think we are already doing something
 similar in mdsync to avoid the problem of Dropped tables, so it seems
 okay to have it in mdwrite as well.

 The crucial thing in this idea to think about is avoiding reassignment of
 relfilenode (due to wrapped OID's) before we have ensured that none of
 the buffers contains tag for that relfilenode.  Currently we avoid this for
 Fsync case by retaining the first segment of relation (which will avoid
 reassignment of relfilenode) till checkpoint ends, I think if we just
 postpone it till we have validated it in shared_buffers, then we can avoid
 this problem in new scheme and it should be delay of maximum one
 checkpoint cycle for unlinking such file assuming we refer dropped
 relation list in each checkpoint cycle during buffer scan.


Yes

So you are keeping more data around for longer, right? I think we would
need some way to trigger a scan when the amount of deferred dropped data
files hits a certain size.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] drop/truncate table sucks for large values of shared buffers

2015-07-01 Thread Amit Kapila
On Tue, Jun 30, 2015 at 12:10 PM, Simon Riggs si...@2ndquadrant.com wrote:

 On 30 June 2015 at 07:34, Amit Kapila amit.kapil...@gmail.com wrote:

 On Tue, Jun 30, 2015 at 11:00 AM, Simon Riggs si...@2ndquadrant.com
wrote:
 
  On 30 June 2015 at 05:02, Amit Kapila amit.kapil...@gmail.com wrote:
 
  On Mon, Jun 29, 2015 at 7:18 PM, Simon Riggs si...@2ndquadrant.com
wrote:
  
   On 28 June 2015 at 17:17, Tom Lane t...@sss.pgh.pa.us wrote:
  
   If lseek fails badly then SeqScans would give *silent* data loss,
which in my view is worse. Just added pages aren't the only thing we might
miss if lseek is badly wrong.
  
 
  So for the purpose of this patch, do we need to assume that
  lseek can give us wrong size of file and we should add preventive
  checks and other handling for the same?
  I am okay to change that way, if we are going to have that as assumption
  in out code wherever we are using it or will use it in-future, otherwise
  we will end with some preventive checks which are actually not required.
 
 
  They're preventative checks. You always hope it is wasted effort.
 

 I am not sure if Preventative checks (without the real need) are okay if
 they are not-cheap which could happen in this case.  I think Validating
 buffer-tag would require rel or sys cache lookup.


 True, so don't do that.

 Keep a list of dropped relations and have the checkpoint process scan the
buffer pool every 64 tables, kinda like AbsorbFsync


Okay. I think we can maintain the list in a similar way to what we do for
UNLINK_RELATION_REQUEST in RememberFsyncRequest(), but why
wait till 64 tables?  We already scan the whole buffer list in each
checkpoint cycle, so during that scan we can consult this dropped-relation
list and avoid syncing such buffer contents.  Also, for ENOENT error
handling in FileWrite, we can use this list to identify relations for which
we need to ignore the error.  I think we are already doing something
similar in mdsync to cope with dropped tables, so it seems okay to
have it in mdwrite as well.

The crucial thing to think about in this idea is avoiding reassignment of a
relfilenode (due to wrapped OIDs) before we have ensured that none of
the buffers contain a tag for that relfilenode.  Currently we avoid this for
the fsync case by retaining the first segment of the relation (which
prevents reassignment of the relfilenode) till the checkpoint ends.  I
think that if we just postpone the unlink until we have validated it in
shared_buffers, then we can avoid this problem in the new scheme, and it
should delay unlinking such a file by at most one checkpoint cycle,
assuming we consult the dropped-relation list in each checkpoint cycle
during the buffer scan.

Does that make sense?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] drop/truncate table sucks for large values of shared buffers

2015-07-01 Thread Amit Kapila
On Wed, Jul 1, 2015 at 8:26 PM, Simon Riggs si...@2ndquadrant.com wrote:

 On 1 July 2015 at 15:39, Amit Kapila amit.kapil...@gmail.com wrote:


 Okay. I think we can maintain the list in similar way as we do for
 UNLINK_RELATION_REQUEST in RememberFsyncRequest(), but
 why to wait till 64 tables?


 I meant once per checkpoint cycle OR every N tables, whichever is sooner.
N would need to vary according to size of Nbuffers.


That sounds sensible to me; see my reply below for what I think we can do
to make this work.


 We already scan whole buffer list in each
 checkpoint cycle, so during that scan we can refer this dropped relation
 list and avoid syncing such buffer contents.  Also for ENOENT error
 handling for FileWrite, we can use this list to refer relations for which
 we need to ignore the error.  I think we are already doing something
 similar in mdsync to avoid the problem of Dropped tables, so it seems
 okay to have it in mdwrite as well.

 The crucial thing in this idea to think about is avoiding reassignment of
 relfilenode (due to wrapped OID's) before we have ensured that none of
 the buffers contains tag for that relfilenode.  Currently we avoid this for
 Fsync case by retaining the first segment of relation (which will avoid
 reassignment of relfilenode) till checkpoint ends, I think if we just
 postpone it till we have validated it in shared_buffers, then we can avoid
 this problem in new scheme and it should be delay of maximum one
 checkpoint cycle for unlinking such file assuming we refer dropped
 relation list in each checkpoint cycle during buffer scan.


 Yes

 So you are keeping more data around for longer, right?

Yes, and we already do it for the sake of fsyncs.

 I think we would need some way to trigger a scan when the amount of
deferred dropped data files hits a certain size.


Okay, I think we can use the point at which the number of dropped
relations reaches 64 or 0.01% (or 0.1%) of shared_buffers, whichever
is smaller, as the trigger for the checkpoint's buffer scan.
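
In code form, just to pin down the arithmetic (nPendingDrops and the
signalling function are assumed names; the actual mechanism is left open
in this thread):

    /* 64 relations, or 0.01% of shared_buffers, whichever is smaller */
    int         threshold = Min(64, Max(1, NBuffers / 10000));

    if (nPendingDrops >= threshold)
        SignalCheckpointerToScanBuffers();      /* assumed helper */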


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] drop/truncate table sucks for large values of shared buffers

2015-06-30 Thread Amit Kapila
On Tue, Jun 30, 2015 at 11:00 AM, Simon Riggs si...@2ndquadrant.com wrote:

 On 30 June 2015 at 05:02, Amit Kapila amit.kapil...@gmail.com wrote:

 On Mon, Jun 29, 2015 at 7:18 PM, Simon Riggs si...@2ndquadrant.com
wrote:
 
  On 28 June 2015 at 17:17, Tom Lane t...@sss.pgh.pa.us wrote:
 
  If lseek fails badly then SeqScans would give *silent* data loss,
which in my view is worse. Just added pages aren't the only thing we might
miss if lseek is badly wrong.
 

 So for the purpose of this patch, do we need to assume that
 lseek can give us wrong size of file and we should add preventive
 checks and other handling for the same?
 I am okay to change that way, if we are going to have that as assumption
 in out code wherever we are using it or will use it in-future, otherwise
 we will end with some preventive checks which are actually not required.


 They're preventative checks. You always hope it is wasted effort.


I am not sure that preventative checks (without a real need) are okay if
they are not cheap, which could happen in this case.  I think validating
a buffer tag would require a relcache or syscache lookup.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] drop/truncate table sucks for large values of shared buffers

2015-06-30 Thread Simon Riggs
On 30 June 2015 at 07:34, Amit Kapila amit.kapil...@gmail.com wrote:

 On Tue, Jun 30, 2015 at 11:00 AM, Simon Riggs si...@2ndquadrant.com
 wrote:
 
  On 30 June 2015 at 05:02, Amit Kapila amit.kapil...@gmail.com wrote:
 
  On Mon, Jun 29, 2015 at 7:18 PM, Simon Riggs si...@2ndquadrant.com
 wrote:
  
   On 28 June 2015 at 17:17, Tom Lane t...@sss.pgh.pa.us wrote:
  
   If lseek fails badly then SeqScans would give *silent* data loss,
 which in my view is worse. Just added pages aren't the only thing we might
 miss if lseek is badly wrong.
  
 
  So for the purpose of this patch, do we need to assume that
  lseek can give us wrong size of file and we should add preventive
  checks and other handling for the same?
  I am okay to change that way, if we are going to have that as assumption
  in out code wherever we are using it or will use it in-future, otherwise
  we will end with some preventive checks which are actually not required.
 
 
  They're preventative checks. You always hope it is wasted effort.
 

 I am not sure if Preventative checks (without the real need) are okay if
 they
 are not-cheap which could happen in this case.  I think Validating
 buffer-tag
 would require rel or sys cache lookup.


True, so don't do that.

Keep a list of dropped relations and have the checkpoint process scan the
buffer pool every 64 tables, kinda like AbsorbFsync.

All the heavy lifting gets done in a background process and we know we're
safe.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] drop/truncate table sucks for large values of shared buffers

2015-06-29 Thread Simon Riggs
On 28 June 2015 at 17:17, Tom Lane t...@sss.pgh.pa.us wrote:

 Simon Riggs si...@2ndquadrant.com writes:
  On 27 June 2015 at 15:10, Tom Lane t...@sss.pgh.pa.us wrote:
  I don't like this too much because it will fail badly if the caller
  is wrong about the maximum possible page number for the table, which
  seems not exactly far-fetched.  (For instance, remember those kernel
 bugs
  we've seen that cause lseek to lie about the EOF position?)

  If that is true, then our reliance on lseek elsewhere could also cause
 data
  loss, for example by failing to scan data during a seq scan.

 The lseek point was a for-example, not the entire universe of possible
 problem sources for this patch.  (Also, underestimating the EOF point in
 a seqscan is normally not an issue since any rows in a just-added page
 are by definition not visible to the scan's snapshot.  But I digress.)

  The consequences of failure of lseek in this case are nowhere near as
 dire,
  since by definition the data is being destroyed by the user.

 I'm not sure what you consider dire, but missing a dirty buffer
 belonging to the to-be-destroyed table would result in the system being
 permanently unable to checkpoint, because attempts to write out the buffer
 to the no-longer-extant file would fail.  You could only get out of the
 situation via a forced database crash (immediate shutdown), followed by
 replaying all the WAL since the time of the problem.  In production
 contexts that could be pretty dire.


Yes, it's bad, but we do notice that it has happened. We can also put in
code to specifically avoid this error at checkpoint time.

If lseek fails badly then SeqScans would give *silent* data loss, which in
my view is worse. Just-added pages aren't the only thing we might miss if
lseek is badly wrong.

So, I think this patch still has legs. We can check that the cleanup has
been 100% complete when we do the buffer scan at the start of the
checkpoint - that way we do just one scan of the buffer pool and move a
time-consuming operation into a background process.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] drop/truncate table sucks for large values of shared buffers

2015-06-29 Thread Amit Kapila
On Mon, Jun 29, 2015 at 7:18 PM, Simon Riggs si...@2ndquadrant.com wrote:

 On 28 June 2015 at 17:17, Tom Lane t...@sss.pgh.pa.us wrote:

 I'm not sure what you consider dire, but missing a dirty buffer
 belonging to the to-be-destroyed table would result in the system being
 permanently unable to checkpoint, because attempts to write out the
buffer
 to the no-longer-extant file would fail.  You could only get out of the
 situation via a forced database crash (immediate shutdown), followed by
 replaying all the WAL since the time of the problem.  In production
 contexts that could be pretty dire.


 Yes, its bad, but we do notice that has happened. We can also put in code
to specifically avoid this error at checkpoint time.

 If lseek fails badly then SeqScans would give *silent* data loss, which
in my view is worse. Just added pages aren't the only thing we might miss
if lseek is badly wrong.


So for the purposes of this patch, do we need to assume that lseek can
give us the wrong size for a file, and should we add preventive checks
and other handling for that?
I am okay with changing it that way if we are going to have that as an
assumption in our code wherever we use it (or will use it in the future);
otherwise we will end up with some preventive checks that are not
actually required.

Another idea here is to use some way other than lseek to know the size
of the file.  Do you think we can use stat() for this purpose?  We are
already using it in fd.c.
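
For example, a minimal sketch of what that could look like in md.c-style
code (the helper name is made up; error handling follows the usual
ereport pattern):

    #include <sys/stat.h>

    /*
     * Illustrative only: derive the number of blocks in a segment file
     * from stat() rather than lseek(..., SEEK_END).
     */
    static BlockNumber
    _mdnblocks_via_stat(const char *path)
    {
        struct stat st;

        if (stat(path, &st) < 0)
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not stat file \"%s\": %m", path)));

        return (BlockNumber) (st.st_size / BLCKSZ);
    }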

 So, I think this patch still has legs. We can check that the clean up has
been 100% when we do the buffer scan at the start of the checkpoint


One way to ensure that is to verify that each buffer header tag is
valid (which essentially means checking whether the object exists).  Do
you have something else in mind to accomplish this part if we
decide to go this route?

 - that way we do just one scan of the buffer pool and move a
time-consuming operation into a background process.


Agreed, and if that turns out to be cheap, then we might want to
optimize DROP DATABASE and others in the same way.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] drop/truncate table sucks for large values of shared buffers

2015-06-29 Thread Amit Kapila
On Mon, Jun 29, 2015 at 5:41 AM, Amit Kapila amit.kapil...@gmail.com
wrote:

 On Sun, Jun 28, 2015 at 9:05 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 
  Amit Kapila amit.kapil...@gmail.com writes:
   On Sat, Jun 27, 2015 at 7:40 PM, Tom Lane t...@sss.pgh.pa.us wrote:
   I don't like this too much because it will fail badly if the caller
   is wrong about the maximum possible page number for the table, which
   seems not exactly far-fetched.  (For instance, remember those kernel
bugs
   we've seen that cause lseek to lie about the EOF position?)
 
   Considering we already have exclusive lock while doing this operation
   and nobody else can perform write on this file, won't closing and
   opening it again would avoid such problems.
 
  On what grounds do you base that touching faith?

 Sorry, but I don't get what problem do you see in this touching?


Thinking about it again, I think you are worried that if we close
and reopen the file, it would break any flush operation that could
happen on it in parallel via checkpoint.  Still, I am not clear whether
we want to assume that we can't rely on lseek for the size of a file
if there can be parallel write activity on the file (even if that write
doesn't increase the size of the file).

If yes, then we have the options below:

a. Have some protection mechanism for file access (ignore the error if
the file is not present or is being accessed during a flush) and clean
the buffers containing invalid objects, as is being discussed up-thread.

b. Use some other API, like stat(), to get the size of the file.

Do you prefer either of these, or if you have a better idea, please do
share it.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] drop/truncate table sucks for large values of shared buffers

2015-06-29 Thread Simon Riggs
On 30 June 2015 at 05:02, Amit Kapila amit.kapil...@gmail.com wrote:

 On Mon, Jun 29, 2015 at 7:18 PM, Simon Riggs si...@2ndquadrant.com
 wrote:
 
  On 28 June 2015 at 17:17, Tom Lane t...@sss.pgh.pa.us wrote:
 
  I'm not sure what you consider dire, but missing a dirty buffer
  belonging to the to-be-destroyed table would result in the system being
  permanently unable to checkpoint, because attempts to write out the
 buffer
  to the no-longer-extant file would fail.  You could only get out of the
  situation via a forced database crash (immediate shutdown), followed by
  replaying all the WAL since the time of the problem.  In production
  contexts that could be pretty dire.
 
 
  Yes, its bad, but we do notice that has happened. We can also put in
 code to specifically avoid this error at checkpoint time.
 
  If lseek fails badly then SeqScans would give *silent* data loss, which
 in my view is worse. Just added pages aren't the only thing we might miss
 if lseek is badly wrong.
 

 So for the purpose of this patch, do we need to assume that
 lseek can give us wrong size of file and we should add preventive
 checks and other handling for the same?
 I am okay to change that way, if we are going to have that as assumption
 in out code wherever we are using it or will use it in-future, otherwise
 we will end with some preventive checks which are actually not required.


They're preventative checks. You always hope it is wasted effort.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] drop/truncate table sucks for large values of shared buffers

2015-06-28 Thread Robert Haas
On Sat, Jun 27, 2015 at 11:38 AM, Andres Freund and...@anarazel.de wrote:
 On 2015-06-27 10:10:04 -0400, Tom Lane wrote:
 In the past we've speculated about fixing the performance of these things
  by complicating the buffer lookup mechanism enough so that it could do
  "find any page for this table/tablespace/database" efficiently.
 Nobody's had ideas that seemed sane performance-wise though.

 I've started to play around with doing that a year or three back. My
 approach was to use a linux style radix tree for the buffer mapping
 table.  Besides lack of time what made it hard to be efficient was the
 size of our buffer tags requiring rather deep trees.

 I think I was considering playing around with a two-level tree (so we
 could cache a pointer in Relation or such), but the memory management
 requirements for that made my head hurt too much. The other alternative
 is to work on having a much simpler buffer tag

Wouldn't even a two-level tree have the same problem you complained
about vis-a-vis chash?  In that case, you were of the opinion that
even ONE extra level of indirection was enough to pinch.  (I'm still
hoping there is a way to fix that, but even so.)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] drop/truncate table sucks for large values of shared buffers

2015-06-28 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 On 27 June 2015 at 15:10, Tom Lane t...@sss.pgh.pa.us wrote:
 I don't like this too much because it will fail badly if the caller
 is wrong about the maximum possible page number for the table, which
 seems not exactly far-fetched.  (For instance, remember those kernel bugs
 we've seen that cause lseek to lie about the EOF position?)

 If that is true, then our reliance on lseek elsewhere could also cause data
 loss, for example by failing to scan data during a seq scan.

The lseek point was a for-example, not the entire universe of possible
problem sources for this patch.  (Also, underestimating the EOF point in
a seqscan is normally not an issue since any rows in a just-added page
are by definition not visible to the scan's snapshot.  But I digress.)

 The consequences of failure of lseek in this case are nowhere near as dire,
 since by definition the data is being destroyed by the user.

I'm not sure what you consider dire, but missing a dirty buffer
belonging to the to-be-destroyed table would result in the system being
permanently unable to checkpoint, because attempts to write out the buffer
to the no-longer-extant file would fail.  You could only get out of the
situation via a forced database crash (immediate shutdown), followed by
replaying all the WAL since the time of the problem.  In production
contexts that could be pretty dire.

regards, tom lane




Re: [HACKERS] drop/truncate table sucks for large values of shared buffers

2015-06-28 Thread Andrew Dunstan


On 06/27/2015 10:10 AM, Tom Lane wrote:

It also
offers no hope of a fix for the other operations that scan the whole
buffer pool, such as DROP TABLESPACE and DROP DATABASE.



Improving DROP TABLE / TRUNCATE would still be a significant advance. 
These cases cause far more real world pain than the ones you mention, in 
my experience.


cheers

andrew






Re: [HACKERS] drop/truncate table sucks for large values of shared buffers

2015-06-28 Thread Tom Lane
Amit Kapila amit.kapil...@gmail.com writes:
 On Sat, Jun 27, 2015 at 7:40 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I don't like this too much because it will fail badly if the caller
 is wrong about the maximum possible page number for the table, which
 seems not exactly far-fetched.  (For instance, remember those kernel bugs
 we've seen that cause lseek to lie about the EOF position?)

 Considering we already have exclusive lock while doing this operation
 and nobody else can perform write on this file, won't closing and
 opening it again would avoid such problems.

On what grounds do you base that touching faith?  Quite aside from
outright bugs, having lock on a table has nothing to do with whether
low-level processes such as the checkpointer can touch it.

regards, tom lane




Re: [HACKERS] drop/truncate table sucks for large values of shared buffers

2015-06-28 Thread Andres Freund
On 2015-06-28 09:11:29 -0400, Robert Haas wrote:
 On Sat, Jun 27, 2015 at 11:38 AM, Andres Freund and...@anarazel.de wrote:
  I've started to play around with doing that a year or three back. My
  approach was to use a linux style radix tree for the buffer mapping
  table.  Besides lack of time what made it hard to be efficient was the
  size of our buffer tags requiring rather deep trees.
 
  I think I was considering playing around with a two-level tree (so we
  could cache a pointer in Relation or such), but the memory management
  requirements for that made my head hurt too much. The other alternative
  is to work on having a much simpler buffer tag
 
 Wouldn't even a two-level tree have the same problem you complained
 about vis-a-vis chash?

I was hoping to avoid the upper tree (mapping relfilenodes to the block
tree) in the majority of the cases by caching that mapping in struct
Relation or so.

But generally, yes, a tree will have more indirections. But they're
often of a different quality than with a hash table. There's a high
amount of spatial locality when looking up blocks: you're much more
likely to look up a block close to one recently looked up than just a
randomly different one. Hash tables don't have a way to benefit from that
- tree structures sometimes do.

I don't think using a radix tree in itself will have significant
performance benefits over the hash table. But it allows for a bunch of
cool further optimizations like the aforementioned 'intelligent'
readahead, combining writes when flushing buffers, and - which made me
look into it originally - tagging inner nodes in the tree with
information about dirtiness to avoid scanning large amounts of non-dirty
buffers.
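
Purely as an illustration of the two-level idea (not Andres' actual
design): key the upper level by relfilenode and the lower level by block
number, and cache the per-relation pointer so most lookups skip the upper
level.  rd_blockmap is an invented field and the hash-based lower level is
a stand-in for the radix tree:

    /*
     * Conceptual sketch: upper level maps a RelFileNode to a per-relation
     * structure; the lower level (a radix tree in the real proposal, a
     * plain hash table in this stand-in) maps block numbers to buffer ids.
     */
    typedef struct BlockMapEntry
    {
        BlockNumber blkno;          /* hash key */
        int         buf_id;
    } BlockMapEntry;

    typedef struct RelBlockMap
    {
        RelFileNode rnode;          /* hash key */
        HTAB       *blocks;         /* BlockNumber -> buffer id */
    } RelBlockMap;

    static HTAB *relBlockMaps = NULL;   /* upper level */

    static int
    LookupBufferId(Relation rel, BlockNumber blkno)
    {
        RelBlockMap *map = rel->rd_blockmap;    /* invented cached pointer */
        BlockMapEntry *entry;
        bool        found;

        if (map == NULL)
        {
            map = hash_search(relBlockMaps, &rel->rd_node, HASH_FIND, &found);
            if (!found)
                return -1;
            rel->rd_blockmap = map;             /* cache for the next lookup */
        }

        entry = hash_search(map->blocks, &blkno, HASH_FIND, &found);
        return found ? entry->buf_id : -1;
    }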

 In that case, you were of the opinion that even ONE extra level of
 indirection was enough to pinch.  (I'm still hoping there is a way to
 fix that, but even so.)

Me too.


Andres




Re: [HACKERS] drop/truncate table sucks for large values of shared buffers

2015-06-28 Thread Simon Riggs
On 27 June 2015 at 15:10, Tom Lane t...@sss.pgh.pa.us wrote:

 Amit Kapila amit.kapil...@gmail.com writes:
  I have looked into it and found that the main reason for such
  a behaviour is that for those operations it traverses whole
  shared_buffers and it seems to me that we don't need that
  especially for not-so-big tables.  We can optimize that path
  by looking into buff mapping table for the pages that exist in
  shared_buffers for the case when table size is less than some
  threshold (say 25%) of shared buffers.

 I don't like this too much because it will fail badly if the caller
 is wrong about the maximum possible page number for the table, which
 seems not exactly far-fetched.  (For instance, remember those kernel bugs
 we've seen that cause lseek to lie about the EOF position?)  It also
 offers no hope of a fix for the other operations that scan the whole
 buffer pool, such as DROP TABLESPACE and DROP DATABASE.


If that is true, then our reliance on lseek elsewhere could also cause data
loss, for example by failing to scan data during a seq scan.

The consequences of failure of lseek in this case are nowhere near as dire,
since by definition the data is being destroyed by the user.

So ISTM that we should be able to use this technique.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] drop/truncate table sucks for large values of shared buffers

2015-06-28 Thread Amit Kapila
On Sun, Jun 28, 2015 at 9:05 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Amit Kapila amit.kapil...@gmail.com writes:
  On Sat, Jun 27, 2015 at 7:40 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  I don't like this too much because it will fail badly if the caller
  is wrong about the maximum possible page number for the table, which
  seems not exactly far-fetched.  (For instance, remember those kernel
bugs
  we've seen that cause lseek to lie about the EOF position?)

  Considering we already have exclusive lock while doing this operation
  and nobody else can perform write on this file, won't closing and
  opening it again would avoid such problems.

 On what grounds do you base that touching faith?

Sorry, but I don't get what problem you see with this?

 Quite aside from
 outright bugs, having lock on a table has nothing to do with whether
 low-level processes such as the checkpointer can touch it.


I thought that this problem (lseek lying about EOF) would only occur
if there has been a recent extension of the file; could mere writes to
existing pages also cause this problem?

Though we should ideally take care of any failures of OS APIs, especially
if they could lead to data loss, I am not sure we can do that reliably
in all possible cases.  Can we safely assume that, since we haven't
encountered such a problem with any other API, everything is good?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] drop/truncate table sucks for large values of shared buffers

2015-06-28 Thread Amit Kapila
On Sun, Jun 28, 2015 at 9:47 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Simon Riggs si...@2ndquadrant.com writes:
  On 27 June 2015 at 15:10, Tom Lane t...@sss.pgh.pa.us wrote:
  I don't like this too much because it will fail badly if the caller
  is wrong about the maximum possible page number for the table, which
  seems not exactly far-fetched.  (For instance, remember those kernel
bugs
  we've seen that cause lseek to lie about the EOF position?)

  If that is true, then our reliance on lseek elsewhere could also cause
data
  loss, for example by failing to scan data during a seq scan.

 The lseek point was a for-example, not the entire universe of possible
 problem sources for this patch.  (Also, underestimating the EOF point in
 a seqscan is normally not an issue since any rows in a just-added page
 are by definition not visible to the scan's snapshot.

How do we ensure that a just-added page is before or after the scan's
snapshot?
If it is before, then the point mentioned by Simon above is valid.  Does
this mean that all other usages of smgrnblocks()/mdnblocks() are safe with
respect to this issue, or just that the consequences will not be as bad as
for this usage?

  But I digress.)

  The consequences of failure of lseek in this case are nowhere near as
dire,
  since by definition the data is being destroyed by the user.

 I'm not sure what you consider dire, but missing a dirty buffer
 belonging to the to-be-destroyed table would result in the system being
 permanently unable to checkpoint, because attempts to write out the buffer
 to the no-longer-extant file would fail.

So another idea here could be that, instead of failing, we just ignore the
error when the object (to which that page belongs) doesn't exist; then we
could make DROP free by not invalidating shared_buffers at all for
DROP/TRUNCATE.  I think this might not be a sane idea, as we would need a
way to look up objects from the checkpointer and would need to handle the
case where the same OID could be assigned to a new object (after
wraparound?).



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] drop/truncate table sucks for large values of shared buffers

2015-06-28 Thread Robert Haas
On Sun, Jun 28, 2015 at 12:17 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I'm not sure what you consider dire, but missing a dirty buffer
 belonging to the to-be-destroyed table would result in the system being
 permanently unable to checkpoint, because attempts to write out the buffer
 to the no-longer-extant file would fail.  You could only get out of the
 situation via a forced database crash (immediate shutdown), followed by
 replaying all the WAL since the time of the problem.  In production
 contexts that could be pretty dire.

Hmm, that is kind of ugly.  Is the write actually going to fail,
though, or is it just going to create a sparse file?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] drop/truncate table sucks for large values of shared buffers

2015-06-27 Thread Tom Lane
Amit Kapila amit.kapil...@gmail.com writes:
 I have looked into it and found that the main reason for such
 a behaviour is that for those operations it traverses whole
 shared_buffers and it seems to me that we don't need that
 especially for not-so-big tables.  We can optimize that path
 by looking into buff mapping table for the pages that exist in
 shared_buffers for the case when table size is less than some
 threshold (say 25%) of shared buffers.

I don't like this too much because it will fail badly if the caller
is wrong about the maximum possible page number for the table, which
seems not exactly far-fetched.  (For instance, remember those kernel bugs
we've seen that cause lseek to lie about the EOF position?)  It also
offers no hope of a fix for the other operations that scan the whole
buffer pool, such as DROP TABLESPACE and DROP DATABASE.

In the past we've speculated about fixing the performance of these things
by complicating the buffer lookup mechanism enough so that it could do
"find any page for this table/tablespace/database" efficiently.
Nobody's had ideas that seemed sane performance-wise though.

regards, tom lane




Re: [HACKERS] drop/truncate table sucks for large values of shared buffers

2015-06-27 Thread Gurjeet Singh
On Fri, Jun 26, 2015 at 9:45 PM, Amit Kapila amit.kapil...@gmail.com
wrote:

 Sometime back on one of the PostgreSQL blog [1], there was
 discussion about the performance of drop/truncate table for
 large values of shared_buffers and it seems that as the value
 of shared_buffers increase the performance of drop/truncate
 table becomes worse.  I think those are not often used operations,
 so it never became priority to look into improving them if possible.

 I have looked into it and found that the main reason for such
 a behaviour is that for those operations it traverses whole
 shared_buffers and it seems to me that we don't need that
 especially for not-so-big tables.  We can optimize that path
 by looking into buff mapping table for the pages that exist in
 shared_buffers for the case when table size is less than some
 threshold (say 25%) of shared buffers.

 Attached patch implements the above idea and I found that
 performance doesn't dip much with patch even with large value
 of shared_buffers.  I have also attached script and sql file used
 to take performance data.


+1 for the effort to improve this.

With your technique added, there are 3 possible ways the search can
happen: a) scan NBuffers and scan the list of relations, b) scan NBuffers
and bsearch the list of relations, and c) scan the list of relations and
then invalidate the blocks of each fork from shared buffers. Would it be
worth finding one technique that can serve decently from the low end of
shared_buffers to the high end?

On patch:

There are multiple naming styles being used in DropForkSpecificBuffers();
my_name and myName. Given this is a new function, it'd help to be
consistent.

s/blk_count/blockNum/

s/new//, for eg. newTag, because there's no corresponding tag/oldTag
variable in the function.

s/blocksToDel/blocksToDrop/. BTW, we never pass anything other than the
total number of blocks in the fork, so we may as well call it just
numBlocks.

s/traverse_buf_freelist/scan_shared_buffers/, because when it is true, we
scan the whole shared_buffers.

s/rel_count/rel_num/

Reduce indentation/tab in header-comments of DropForkSpecificBuffers(). But
I see there's precedent in neighboring functions, so this may be okay.

Doing pfree() of num_blocks, num_fsm_blocks and num_vm_blocks in one place
(instead of two, at different indentation levels) would help readability.

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/


Re: [HACKERS] drop/truncate table sucks for large values of shared buffers

2015-06-27 Thread Amit Kapila
On Sat, Jun 27, 2015 at 8:06 PM, Gurjeet Singh gurj...@singh.im wrote:

 On Fri, Jun 26, 2015 at 9:45 PM, Amit Kapila amit.kapil...@gmail.com
wrote:

 Attached patch implements the above idea and I found that
 performance doesn't dip much with patch even with large value
 of shared_buffers.  I have also attached script and sql file used
 to take performance data.


 +1 for the effort to improve this.


Thanks.

 With your technique added, there are 3 possible ways the search can
happen a) Scan NBuffers and scan list of relations, b) Scan NBuffers and
bsearch list of relations, and c) Scan list of relations and then
invalidate blocks of each fork from shared buffers. Would it be worth it
finding one technique that can serve decently from the low-end
shared_buffers to the high-end.


Yes, it would be great if we could have such a technique, but in the
absence of one the current optimization should suffice, unless there are
loopholes in it.

 On patch:

 There are multiple naming styles being used in DropForkSpecificBuffers();
my_name and myName. Given this is a new function, it'd help to be
consistent.


Thanks for the suggestions; I will improve the patch along those lines if
there is no problem with the idea at a broad level.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] drop/truncate table sucks for large values of shared buffers

2015-06-27 Thread Amit Kapila
On Sat, Jun 27, 2015 at 7:40 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Amit Kapila amit.kapil...@gmail.com writes:
  I have looked into it and found that the main reason for such
  a behaviour is that for those operations it traverses whole
  shared_buffers and it seems to me that we don't need that
  especially for not-so-big tables.  We can optimize that path
  by looking into buff mapping table for the pages that exist in
  shared_buffers for the case when table size is less than some
  threshold (say 25%) of shared buffers.

 I don't like this too much because it will fail badly if the caller
 is wrong about the maximum possible page number for the table, which
 seems not exactly far-fetched.  (For instance, remember those kernel bugs
 we've seen that cause lseek to lie about the EOF position?)

Considering we already have an exclusive lock while doing this operation
and nobody else can write to this file, won't closing and opening it
again avoid such problems?  In the patch that is already done
(smgrexists()) for two kinds of forks and can be done for the third kind
as well.

  It also
 offers no hope of a fix for the other operations that scan the whole
 buffer pool, such as DROP TABLESPACE and DROP DATABASE.


True, but it does not foreclose on anybody's ideas for optimizing those
paths.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com