Re: [HACKERS] logical changeset generation v3

2012-11-20 Thread Andres Freund
On 2012-11-21 16:47:11 +0900, Michael Paquier wrote:
> On Wed, Nov 21, 2012 at 4:30 PM, Andres Freund wrote:
> 
> > On 2012-11-21 15:28:30 +0900, Michael Paquier wrote:
> > > On Tue, Nov 20, 2012 at 8:22 PM, Andres Freund wrote:
> > >
> > > > On 2012-11-20 09:30:40 +0900, Michael Paquier wrote:
> > > > > Btw, here are some extra comments based on my progress; hope it will
> > > > > be useful for other people playing around with your patches.
> > > > > 1) It is necessary to install the contrib module test_decoding on the
> > > > > server side or the test case will not work.
> > > > > 2) I get the following logs on the server:
> > > > > LOG:  forced to assume catalog changes for xid 1370 because it was
> > > > > running too early
> > > > > WARNING:  ABORT 1370
> > > > > Actually I saw that there are many warnings like this.
> > > >
> > > > Those aren't unexpected. Perhaps I should not make it a warning then...
> > > >
> > > A NOTICE would be more appropriate; a WARNING means that something that
> > > may endanger the system has happened, but as far as I understand from
> > > your explanation this is not the case.
> >
> > I think it should go to DEBUG2 or so once we're a bit more confident about
> > the code.
> >
> > > > A short explanation:
> > > >
> > > > We can only decode tuples we see in the WAL when we already have a
> > > > timetravel catalog snapshot before that transaction started. To build
> > > > such a snapshot we need to collect information about committed
> > > > transactions which changed the catalog. Unfortunately we can't diagnose
> > > > whether a txn changed the catalog without a snapshot so we just assume
> > > > all committed ones do - it just costs a bit of memory. That's the
> > > > background of the
> > > > "forced to assume catalog changes for ..." message.
> > > >
> > > OK, so this snapshot only needs to include the XIDs of transactions that
> > > have modified the catalogs. Do I get it right? This way you are able to
> > > fetch the correct relation definition for replication decoding.
> >
> > Yes. We only carry those between (recentXmin, newestCatalogModifyingTxn),
> > so it's not all of them. Normal snapshots carry all in-progress
> > transactionids instead of the committed ones, but that would have been
> > far more in our case (only a minority of txn's touch the catalog) and it
> > has problems with subtransaction tracking.
> >
> Hmm. I might have missed something, but what is the variable tracking the
> newest XID that modified catalogs?
> I can see of course recentXmin in snapmgr.c, but nothing related to what you
> describe.

We determine that ourselves.

SnapBuildCommitTxn(Snapstate *snapstate, ReorderBuffer *reorder,
                   XLogRecPtr lsn, TransactionId xid,
                   int nsubxacts, TransactionId *subxacts)
{
    ...
    if (forced_timetravel || top_does_timetravel || sub_does_timetravel)
    {
        if (!TransactionIdIsValid(snapstate->xmax) ||
            NormalTransactionIdFollows(xid, snapstate->xmax))
        {
            snapstate->xmax = xid;
            TransactionIdAdvance(snapstate->xmax);
        }

> > > Just thinking but... It looks like a waste to store the transaction XIDs
> > > of all the committed transactions, but on the other hand there is no way
> > > to track the XIDs of transactions that modified a catalog in current core
> > > code. So yes, this approach is better, as refining the transaction XID
> > > tracking for snapshot reconstruction is something that could be improved
> > > later. Those are only thoughts though...
> >
> > We actually only track xids of catalog modifying transactions once we
> > hit the CONSISTENT state. Before the initial snapshot we can't detect
> > that.
> >
> How do you track them? I think I need to go deeper in the code before
> asking more...

You mean, how do I detect that they are catalog modifying? By asking the
reorderbuffer (ReorderBufferXidDoesTimetravel(...)). That one knows
because we told it so (ReorderBufferXidSetTimetravel()), and we do that
by looking at the type of xid records we've seen incoming (HEAP_INPLACE,
HEAP2_NEW_CID tell us it's doing timetravel).
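
Roughly sketched (heavily simplified; the record handling, constants and
signatures below are illustrative rather than the actual code):

    /* while decoding a WAL record: */
    switch (info)
    {
        case XLOG_HEAP_INPLACE:     /* in-place update of a catalog tuple */
        case XLOG_HEAP2_NEW_CID:    /* new CommandId logged for a catalog tuple */
            /* remember that this xid modified the catalog */
            ReorderBufferXidSetTimetravel(reorder, xid);
            break;
        default:
            break;
    }

    /* later, while building or using the snapshot: */
    if (ReorderBufferXidDoesTimetravel(reorder, xid))
        /* treat xid as catalog-modifying */ ;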

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] logical changeset generation v3

2012-11-20 Thread Michael Paquier
On Wed, Nov 21, 2012 at 4:30 PM, Andres Freund wrote:

> On 2012-11-21 15:28:30 +0900, Michael Paquier wrote:
> > On Tue, Nov 20, 2012 at 8:22 PM, Andres Freund wrote:
> >
> > > On 2012-11-20 09:30:40 +0900, Michael Paquier wrote:
> > > > Btw, here are some extra comments based on my progress; hope it will
> > > > be useful for other people playing around with your patches.
> > > > 1) It is necessary to install the contrib module test_decoding on the
> > > > server side or the test case will not work.
> > > > 2) I get the following logs on the server:
> > > > LOG:  forced to assume catalog changes for xid 1370 because it was
> > > > running too early
> > > > WARNING:  ABORT 1370
> > > > Actually I saw that there are many warnings like this.
> > >
> > > Those aren't unexpected. Perhaps I should not make it a warning then...
> > >
> > A NOTICE would be more appropriate; a WARNING means that something that
> > may endanger the system has happened, but as far as I understand from your
> > explanation this is not the case.
>
> I think it should go to DEBUG2 or so once we're a bit more confident about
> the code.
>
> > > A short explanation:
> > >
> > > We can only decode tuples we see in the WAL when we already have a
> > > timetravel catalog snapshot before that transaction started. To build
> > > such a snapshot we need to collect information about committed
> > > transactions which changed the catalog. Unfortunately we can't diagnose
> > > whether a txn changed the catalog without a snapshot so we just assume
> > > all committed ones do - it just costs a bit of memory. That's the
> > > background of the
> > > "forced to assume catalog changes for ..." message.
> > >
> > OK, so this snapshot only needs to include the XIDs of transactions that
> > have modified the catalogs. Do I get it right? This way you are able to
> > fetch the correct relation definition for replication decoding.
>
> Yes. We only carry those between (recentXmin, newestCatalogModifyingTxn),
> so it's not all of them. Normal snapshots carry all in-progress
> transactionids instead of the committed ones, but that would have been
> far more in our case (only a minority of txn's touch the catalog) and it
> has problems with subtransaction tracking.
>
Hmm. I might have missed something, but what is the variable tracking the
newest XID that modified catalogs?
I can see of course recentXmin in snapmgr.c, but nothing related to what you
describe.


>
> > Just thinking but... It looks like a waste to store the transaction XIDs
> > of all the committed transactions, but on the other hand there is no way
> > to track the XIDs of transactions that modified a catalog in current core
> > code. So yes, this approach is better, as refining the transaction XID
> > tracking for snapshot reconstruction is something that could be improved
> > later. Those are only thoughts though...
>
> We actually only track xids of catalog modifying transactions once we
> hit the CONSISTENT state. Before the initial snapshot we can't detect
> that.
>
How do you track them? I think I need to go deeper in the code before
asking more...
-- 
Michael Paquier
http://michael.otacoo.com


Re: [HACKERS] logical changeset generation v3

2012-11-20 Thread Andres Freund
On 2012-11-21 14:57:08 +0900, Michael Paquier wrote:
> On Tue, Nov 20, 2012 at 8:22 PM, Andres Freund wrote:
> 
> > Those aren't unexpected. Perhaps I should not make it a warning then...
> >
> > A short explanation:
> >
> > We can only decode tuples we see in the WAL when we already have a
> > timetravel catalog snapshot before that transaction started. To build
> > such a snapshot we need to collect information about committed
> > transactions which changed the catalog. Unfortunately we can't diagnose
> > whether a txn changed the catalog without a snapshot so we just assume
> > all committed ones do - it just costs a bit of memory. That's the
> > background of the
> > "forced to assume catalog changes for ..." message.
> >
> > The reason for the ABORTs is related but different. We start out in the
> > "SNAPBUILD_START" state when we try to build a snapshot. When we find
> > initial information about running transactions (i.e. xl_running_xacts)
> > we switch to the "SNAPBUILD_FULL_SNAPSHOT" state which means we can
> > decode all changes in transactions that start *after* the current
> > lsn. Earlier transactions might have tuples on a catalog state we can't
> > query.
> > Only when all transactions we observed as running before the
> > FULL_SNAPSHOT state have finished do we switch to SNAPBUILD_CONSISTENT.
> > As we want a consistent/reproducible set of transactions to produce
> > output via the logstream we only pass transactions to the output plugin
> > if they commit *after* CONSISTENT (they can start earlier though!). This
> > allows us to produce a pg_dump compatible snapshot in the moment we get
> > consistent that contains exactly the changes we won't stream out.
> >
> > Makes sense?
> >
> > > 3) Assertion failure while running pgbench; I was just curious to see
> > > how it reacted when logical replication was put under a little bit of load.
> > > TRAP: FailedAssertion("!(((xid) >= ((TransactionId) 3)) &&
> > > ((snapstate->xmin_running) >= ((TransactionId) 3)))", File:
> > > "snapbuild.c", Line: 877)
> > > => pgbench -i postgres; pgbench -T 500 -c 8 postgres
> >
> > Can you reproduce this one? I would be interested in log output. Because
> > I did run pgbench for quite some time and I haven't seen that one after
> > fixing some issues last week.
> >
> 
> > It implies that snapstate->nrrunning has lost touch with reality...
> >
> Yes, I can reproduce it in 10-20 seconds on one of my Linux boxes. I haven't
> captured anything in the logs, but here is the backtrace of the core file
> produced.

Ah, I see. Could you try the following diff?

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index df24b33..797a126 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -471,6 +471,7 @@ SnapBuildDecodeCallback(ReorderBuffer *reorder, Snapstate *snapstate,
 	 */
 	snapstate->transactions_after = buf->origptr;
 
+	snapstate->nrrunning = running->xcnt;
 	snapstate->xmin_running = InvalidTransactionId;
 	snapstate->xmax_running = InvalidTransactionId;


Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] logical changeset generation v3

2012-11-20 Thread Michael Paquier
On Wed, Nov 21, 2012 at 4:31 PM, Andres Freund wrote:

> On 2012-11-21 14:57:08 +0900, Michael Paquier wrote:
> > On Tue, Nov 20, 2012 at 8:22 PM, Andres Freund wrote:
> > > It implies that snapstate->nrrunning has lost touch with reality...
> > >
> > Yes, I can reproduce it in 10-20 seconds on one of my Linux boxes. I
> > haven't captured anything in the logs, but here is the backtrace of the
> > core file produced.
>
> Could you run it with log_level=DEBUG2?
>
Let me try.

> Do you run pgbench after you've reached a consistent state (by issuing a
> manual checkpoint)?
>
Yes. I issue a manual checkpoint to initialize the replication.
-- 
Michael Paquier
http://michael.otacoo.com


Re: [HACKERS] logical changeset generation v3

2012-11-20 Thread Andres Freund
On 2012-11-21 14:57:08 +0900, Michael Paquier wrote:
> On Tue, Nov 20, 2012 at 8:22 PM, Andres Freund wrote:
> 
> > Those aren't unexpected. Perhaps I should not make it a warning then...
> >
> > A short explanation:
> >
> > We can only decode tuples we see in the WAL when we already have a
> > timetravel catalog snapshot before that transaction started. To build
> > such a snapshot we need to collect information about committed
> > transactions which changed the catalog. Unfortunately we can't diagnose
> > whether a txn changed the catalog without a snapshot so we just assume
> > all committed ones do - it just costs a bit of memory. That's the
> > background of the
> > "forced to assume catalog changes for ..." message.
> >
> > The reason for the ABORTs is related but different. We start out in the
> > "SNAPBUILD_START" state when we try to build a snapshot. When we find
> > initial information about running transactions (i.e. xl_running_xacts)
> > we switch to the "SNAPBUILD_FULL_SNAPSHOT" state which means we can
> > decode all changes in transactions that start *after* the current
> > lsn. Earlier transactions might have tuples on a catalog state we can't
> > query.
> > Only when all transactions we observed as running before the
> > FULL_SNAPSHOT state have finished do we switch to SNAPBUILD_CONSISTENT.
> > As we want a consistent/reproducible set of transactions to produce
> > output via the logstream we only pass transactions to the output plugin
> > if they commit *after* CONSISTENT (they can start earlier though!). This
> > allows us to produce a pg_dump compatible snapshot in the moment we get
> > consistent that contains exactly the changes we won't stream out.
> >
> > Makes sense?
> >
> > > 3) Assertion failure while running pgbench; I was just curious to see
> > > how it reacted when logical replication was put under a little bit of load.
> > > TRAP: FailedAssertion("!(((xid) >= ((TransactionId) 3)) &&
> > > ((snapstate->xmin_running) >= ((TransactionId) 3)))", File:
> > > "snapbuild.c", Line: 877)
> > > => pgbench -i postgres; pgbench -T 500 -c 8 postgres
> >
> > Can you reproduce this one? I would be interested in log output. Because
> > I did run pgbench for quite some time and I haven't seen that one after
> > fixing some issues last week.
> >
> 
> > It implies that snapstate->nrrunning has lost touch with reality...
> >
> Yes, I can reproduce it in 10-20 seconds on one of my Linux boxes. I haven't
> captured anything in the logs, but here is the backtrace of the core file
> produced.

Could you run it with log_level=DEBUG2?

Do you run pgbench after you've reached a consistent state (by issuing a
manual checkpoint)?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] logical changeset generation v3

2012-11-20 Thread Andres Freund
On 2012-11-21 15:28:30 +0900, Michael Paquier wrote:
> On Tue, Nov 20, 2012 at 8:22 PM, Andres Freund wrote:
>
> > On 2012-11-20 09:30:40 +0900, Michael Paquier wrote:
> > > Btw, here are some extra comments based on my progress; hope it will be
> > > useful for other people playing around with your patches.
> > > 1) It is necessary to install the contrib module test_decoding on the
> > > server side or the test case will not work.
> > > 2) I get the following logs on the server:
> > > LOG:  forced to assume catalog changes for xid 1370 because it was
> > > running too early
> > > WARNING:  ABORT 1370
> > > Actually I saw that there are many warnings like this.
> >
> > Those aren't unexpected. Perhaps I should not make it a warning then...
> >
> A NOTICE would be more appropriate; a WARNING means that something that may
> endanger the system has happened, but as far as I understand from your
> explanation this is not the case.

I think it should go to DEBUG2 or so once we're a bit more confident about
the code.

> > A short explanation:
> >
> > We can only decode tuples we see in the WAL when we already have a
> > timetravel catalog snapshot before that transaction started. To build
> > such a snapshot we need to collect information about committed
> > transactions which changed the catalog. Unfortunately we can't diagnose
> > whether a txn changed the catalog without a snapshot so we just assume
> > all committed ones do - it just costs a bit of memory. That's the
> > background of the
> > "forced to assume catalog changes for ..." message.
> >
> OK, so this snapshot only needs to include the XIDs of transactions that
> have modified the catalogs. Do I get it right? This way you are able to
> fetch the correct relation definition for replication decoding.

Yes. We only carry those between (recentXmin, newestCatalogModifyingTxn),
so it's not all of them. Normal snapshots carry all in-progress
transactionids instead of the committed ones, but that would have been
far more in our case (only a minority of txn's touch the catalog) and it
has problems with subtransaction tracking.

> Just thinking but... It looks like a waste to store the transaction XIDs
> of all the committed transactions, but on the other hand there is no way to
> track the XIDs of transactions that modified a catalog in current core
> code. So yes this approach is better as refining the transaction XID
> tracking for snapshot reconstruction is something that could be improved
> later. Those are only thoughts though...

We actually only track xids of catalog modifying transactions once we
hit the CONSISTENT state. Before the initial snapshot we can't detect
that.

> > The reason for the ABORTs is related but different. We start out in the
> > "SNAPBUILD_START" state when we try to build a snapshot. When we find
> > initial information about running transactions (i.e. xl_running_xacts)
> > we switch to the "SNAPBUILD_FULL_SNAPSHOT" state which means we can
> > decode all changes in transactions that start *after* the current
> > lsn. Earlier transactions might have tuples on a catalog state we can't
> > query.
> >
> Just to be clear, lsn means the log sequence number associated with each xlog
> record?

Yes. And that number is just the position in the stream.

Greetings,

Andres

--
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] logical changeset generation v3

2012-11-20 Thread Michael Paquier
On Tue, Nov 20, 2012 at 8:22 PM, Andres Freund wrote:

> On 2012-11-20 09:30:40 +0900, Michael Paquier wrote:
> > On Mon, Nov 19, 2012 at 5:50 PM, Andres Freund wrote:
> > > On 2012-11-19 16:28:55 +0900, Michael Paquier wrote:
> > > > I am just looking at this patch and will provide some comments.
> > > > By the way, you forgot the installation part of pg_receivellog;
> > > > please see the patch attached.
> > >
> > > That actually was somewhat intended, I thought people wouldn't like the
> > > name and I didn't want a binary that's going to be replaced anyway lying
> > > around ;)
> > >
> > OK, no problem. For sure this is going to happen. I was wondering myself
> > if it could be possible to merge pg_receivexlog and pg_receivellog into a
> > single utility with multiple modes :)
>
> Don't really see that, the differences already are significant and imo
> are bound to get bigger. Shouldn't live in pg_basebackup/ either...
>
I am sure that this will be the object of many future discussions.


> > Btw, here are some extra comments based on my progress; hope it will be
> > useful for other people playing around with your patches.
> > 1) It is necessary to install the contrib module test_decoding on the
> > server side or the test case will not work.
> > 2) I get the following logs on the server:
> > LOG:  forced to assume catalog changes for xid 1370 because it was
> > running too early
> > WARNING:  ABORT 1370
> > Actually I saw that there are many warnings like this.
>
> Those aren't unexpected. Perhaps I should not make it a warning then...
>
A NOTICE would be more appropriate; a WARNING means that something that may
endanger the system has happened, but as far as I understand from your
explanation this is not the case.


> A short explanation:
>
> We can only decode tuples we see in the WAL when we already have a
> timetravel catalog snapshot before that transaction started. To build
> such a snapshot we need to collect information about committed
> transactions which changed the catalog. Unfortunately we can't diagnose
> whether a txn changed the catalog without a snapshot so we just assume
> all committed ones do - it just costs a bit of memory. That's the
> background of the
> "forced to assume catalog changes for ..." message.
>
OK, so this snapshot only needs to include the XIDs of transactions that
have modified the catalogs. Do I get it right? This way you are able to
fetch the correct relation definition for replication decoding.

Just thinking but... It looks like a waste to store the transaction XIDs
of all the committed transactions, but on the other hand there is no way to
track the XIDs of transactions that modified a catalog in current core
code. So yes this approach is better as refining the transaction XID
tracking for snapshot reconstruction is something that could be improved
later. Those are only thoughts though...

> The reason for the ABORTs is related but different. We start out in the
> "SNAPBUILD_START" state when we try to build a snapshot. When we find
> initial information about running transactions (i.e. xl_running_xacts)
> we switch to the "SNAPBUILD_FULL_SNAPSHOT" state which means we can
> decode all changes in transactions that start *after* the current
> lsn. Earlier transactions might have tuples on a catalog state we can't
> query.
>
Just to be clear, lsn means the log sequence number associated with each xlog
record?


> Only when all transactions we observed as running before the
> FULL_SNAPSHOT state have finished do we switch to SNAPBUILD_CONSISTENT.
> As we want a consistent/reproducible set of transactions to produce
> output via the logstream we only pass transactions to the output plugin
> if they commit *after* CONSISTENT (they can start earlier though!). This
> allows us to produce a pg_dump compatible snapshot in the moment we get
> consistent that contains exactly the changes we won't stream out.
>
> Makes sense?
>
OK, got it, thanks for your explanation.

So, coming back to it once again: the snapshot built for decoding only needs
the XIDs of transactions that modified the catalogs in order to get a
consistent view of relation info for decoding.
Really, I think that refining the XID tracking to minimize the size of the
snapshot built for decoding would be a key performance improvement,
especially for OLTP-type applications (lots of transactions involved, few of
them involving catalogs).
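
For reference, a rough sketch of the state machine from the explanation quoted
above (paraphrased from the description, not taken from the patch itself):

    typedef enum
    {
        SNAPBUILD_START,          /* no xl_running_xacts information seen yet */
        SNAPBUILD_FULL_SNAPSHOT,  /* can decode txns that *start* after this LSN */
        SNAPBUILD_CONSISTENT      /* all older txns have finished; txns committing
                                   * from here on are passed to the output plugin */
    } SnapBuildState;
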
-- 
Michael Paquier
http://michael.otacoo.com


Re: [HACKERS] StrategyGetBuffer questions

2012-11-20 Thread Amit Kapila
On Wednesday, November 21, 2012 4:21 AM Jeff Janes wrote:
> On Tue, Nov 20, 2012 at 1:26 PM, Merlin Moncure wrote:
> > In this sprawling thread on scaling issues [1], the topic meandered
> > into StrategyGetBuffer() -- in particular the clock sweep loop.  I'm
> > wondering:
> >


> >
> > *) Since the purpose of usage_count is to act on advisory basis to
> > keep recently/frequently accessed buffers from being discarded, is it
> > really necessary to rigorously guard the count with a spinlock?  If a
> > ++ or -- operation on the value gets missed here or there, how big of
> > a deal is it really?
> 
> I don't think it is all that big of a deal.
> 
> I've implemented this patch to do that.  It still applies to head.
> 
> http://archives.postgresql.org/pgsql-hackers/2011-08/msg00305.php
> 
> It was very effective at removing BufFreelistLock contention on the
> system I had at the time.

In that case, why don't we work towards reducing it?
Is the generic use case a problem, or will it affect any generic scenario in
a negative way?

With Regards,
Amit Kapila.






Re: [HACKERS] logical changeset generation v3

2012-11-20 Thread Michael Paquier
On Tue, Nov 20, 2012 at 8:22 PM, Andres Freund wrote:

> Those aren't unexpected. Perhaps I should not make it a warning then...
>
> A short explanation:
>
> We can only decode tuples we see in the WAL when we already have a
> timetravel catalog snapshot before that transaction started. To build
> such a snapshot we need to collect information about committed
> transactions which changed the catalog. Unfortunately we can't diagnose
> whether a txn changed the catalog without a snapshot so we just assume
> all committed ones do - it just costs a bit of memory. That's the
> background of the
> "forced to assume catalog changes for ..." message.
>
> The reason for the ABORTs is related but different. We start out in the
> "SNAPBUILD_START" state when we try to build a snapshot. When we find
> initial information about running transactions (i.e. xl_running_xacts)
> we switch to the "SNAPBUILD_FULL_SNAPSHOT" state which means we can
> decode all changes in transactions that start *after* the current
> lsn. Earlier transactions might have tuples on a catalog state we can't
> query.
> Only when all transactions we observed as running before the
> FULL_SNAPSHOT state have finished do we switch to SNAPBUILD_CONSISTENT.
> As we want a consistent/reproducible set of transactions to produce
> output via the logstream we only pass transactions to the output plugin
> if they commit *after* CONSISTENT (they can start earlier though!). This
> allows us to produce a pg_dump compatible snapshot in the moment we get
> consistent that contains exactly the changes we won't stream out.
>
> Makes sense?
>
> > 3) Assertion failure while running pgbench; I was just curious to see
> > how it reacted when logical replication was put under a little bit of load.
> > TRAP: FailedAssertion("!(((xid) >= ((TransactionId) 3)) &&
> > ((snapstate->xmin_running) >= ((TransactionId) 3)))", File:
> > "snapbuild.c", Line: 877)
> > => pgbench -i postgres; pgbench -T 500 -c 8 postgres
>
> Can you reproduce this one? I would be interested in log output. Because
> I did run pgbench for quite some time and I haven't seen that one after
> fixing some issues last week.
>

> It implies that snapstate->nrrunning has lost touch with reality...
>
Yes, I can reproduce it in 10-20 seconds on one of my Linux boxes. I haven't
captured anything in the logs, but here is the backtrace of the core file
produced.
 #2  0x00865145 in ExceptionalCondition (conditionName=0xa15100
"!(((xid) >= ((TransactionId) 3)) && ((snapstate->xmin_running) >=
((TransactionId) 3)))", errorType=0xa14f3b "FailedAssertion",
fileName=0xa14ed0 "snapbuild.c", lineNumber=877) at assert.c:54
#3  0x0070c409 in SnapBuildTxnIsRunning (snapstate=0x19e4f10,
xid=0) at snapbuild.c:877
#4  0x0070b8e4 in SnapBuildProcessChange (reorder=0x19e4e80,
snapstate=0x19e4f10, xid=0, buf=0x1a0a368, relfilenode=0x1a0a450) at
snapbuild.c:388
#5  0x0070c088 in SnapBuildDecodeCallback (reorder=0x19e4e80,
snapstate=0x19e4f10, buf=0x1a0a368) at snapbuild.c:732
#6  0x007080b9 in DecodeRecordIntoReorderBuffer (reader=0x1a08300,
state=0x19e4e20, buf=0x1a0a368) at decode.c:84
#7  0x00708cad in replay_finished_record (state=0x1a08300,
buf=0x1a0a368) at logicalfuncs.c:54
#8  0x004d8033 in XLogReaderRead (state=0x1a08300) at
xlogreader.c:965
#9  0x0070f7c3 in XLogSendLogical (caughtup=0x7fffb22c35fb "") at
walsender.c:1721
#10 0x0070ea05 in WalSndLoop (send_data=0x70f6e2 )
at walsender.c:1184
#11 0x0070e0eb in StartLogicalReplication (cmd=0x190eaa8) at
walsender.c:726
#12 0x0070e3ac in exec_replication_command (cmd_string=0x19a65c8
"START_LOGICAL_REPLICATION 'id-0' 0/7E1855C") at walsender.c:853
#13 0x00753ee0 in PostgresMain (argc=2, argv=0x18f63d8,
username=0x18f62a8 "michael") at postgres.c:3974
#14 0x006f13ea in BackendRun (port=0x1912600) at postmaster.c:3668
#15 0x006f0b76 in BackendStartup (port=0x1912600) at
postmaster.c:3352
#16 0x006ed900 in ServerLoop () at postmaster.c:1431
#17 0x006ed208 in PostmasterMain (argc=13, argv=0x18f40a0) at
postmaster.c:1180
#18 0x00657517 in main (argc=13, argv=0x18f40a0) at main.c:197
I'm keeping this core and the binary btw.
-- 
Michael Paquier
http://michael.otacoo.com


Re: [HACKERS] Suggestion for --truncate-tables to pg_restore

2012-11-20 Thread Josh Kupershmidt
Hi Karl,
I signed on to review this patch for the current CF. Most of the
background for the patch seems to be in the message below, so I'm
going to respond to this one first.

On Fri, Sep 21, 2012 at 8:54 AM, Karl O. Pinc wrote:
> On 09/20/2012 12:24:49 PM, Karl O. Pinc wrote:
>
>> I've had problems using pg_restore --data-only when
>> restoring individual schemas (which contain data which
>> has had bad things done to it).  --clean does not work
>> well because of dependent objects in other schemas.

OK.

> 
>
> First, the problem:
>
> Begin with the following structure:
>
> CREATE TABLE schemaA.foo (id INT PRIMARY KEY, data INT);
>
> CREATE VIEW schemaB.bar AS SELECT * FROM schemaA.foo;
>
> Then, by accident, somebody does:
>
> UPDATE schemaA.foo SET data = data + (RANDOM() * 1000)::INT;
>
> So, you want to restore the data into schemaA.foo.
> But schemaA.foo has (bad) data in it that must first
> be removed. It would seem that using
>
>   pg_restore --clean -n schemaA -t foo my_pg_dump_backup
>
> would solve the problem, it would drop schemaA.foo,
> recreate it, and then restore the data.  But this does
> not work.  schemaA.foo does not drop because it's
> got a dependent database object, schemaB.bar.

Right.

> Of course there are manual work-arounds.  One of these
> is truncating schemaA.foo and then doing a pg_restore
> with --data-only.

Simply doing TRUNCATE manually was the first thought that occurred to
me when I saw your example.

> The manual work-arounds become
> increasingly burdensome as you need to restore more
> tables.  The case that motivated me was an attempt
> to restore the data in an entire schema, one which
> contained a significant number of tables.

TBH, I didn't find the example above particularly compelling for
demonstrating the need for this feature. If you've just got one table
with dependent views which needs to be restored, it's pretty easy to
manually TRUNCATE and have pg_restore --data-only reload the table.
(And easy enough to combine the truncate and restore into a single
transaction in case anything goes wrong, if need be.)
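
Something along these lines, for instance (untested sketch; the database and
dump file names are just placeholders):

    {
      echo 'BEGIN;'
      echo 'TRUNCATE TABLE schemaA.foo;'
      pg_restore --data-only --schema=schemaA --table=foo my_pg_dump_backup
      echo 'COMMIT;'
    } | psql --set ON_ERROR_STOP=on mydb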

But I'm willing to grant that this proposed feature is potentially as
useful as existing restore-jiggering options like --disable-triggers.
And I guess I could see that if you're really stuck having to perform
a --data-only restore of many tables, this feature could come in
handy.

> So, the idea here is to be able to do a data-only
> restore, first truncating the data in the tables
> being restored to remove the existing corrupted data.
>
> The proposal is to add a --truncate-tables option
> to pg_restore.
>
> 
>
> There were some comments on syntax.
>
> I proposed to use -u as a short option.  This was
> thought confusing, given it's use in other
> Unix command line programs (mysql).   Since there's
> no obvious short option, forget it.  Just have
> a long option.

Agreed.

> Another choice is to avoid introducing yet another
> option and instead overload --clean so that when
> doing a --data-only restore --clean truncates tables
> and otherwise --clean retains the existing behavior of
> dropping and re-creating the restored objects.

I like the --truncate-tables flag idea better than overloading
--clean, since it makes the purpose immediately apparent.

> (I tested pg_restore with 9.1 and when --data-only is
> used --clean is ignored, it does not even produce a warning.
> This is arguably a bug.)

+1 for having pg_restore bail out with an error if the user specifies
--data-only --clean. By the same token, --clean and --truncate-tables
together should also raise a "not allowed" error.

> 
>
> More serious objections were raised regarding semantics.
>
> What if, instead, the initial structure looked like:
>
> CREATE TABLE schemaA.foo
>   (id INT PRIMARY KEY, data INT);
>
> CREATE TABLE schemaB.bar
>   (id INT CONSTRAINT "bar_on_foo" REFERENCES foo
>  , moredata INT);
>
> With a case like this, in most real-world situations, you'd
> have to use pg_restore with --disable-triggers if you wanted
> to use --data-only and --truncate-tables.  The possibility of
> foreign key referential integrity corruption is obvious.

Why would --disable-triggers be used in this example? I don't think
you could use --truncate-tables to restore only table "foo", because
you would get:

  ERROR:  cannot truncate a table referenced in a foreign key constraint

(At least, I think so, not having tried with the actual patch.) You
could use --disable-triggers when restoring "bar". Though if you're
just enabling that option for performance purposes, and are unable to
guarantee that the restore will leave the tables in a consistent
state, well then it seems like you shouldn't use the option.
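
For what it's worth, TRUNCATE ... CASCADE isn't a great escape hatch here
either (sketch; the exact message wording is from memory):

    TRUNCATE schemaA.foo;
    ERROR:  cannot truncate a table referenced in a foreign key constraint
    DETAIL:  Table "bar" references "foo".
    HINT:  Truncate table "bar" at the same time, or use TRUNCATE ... CASCADE.

    TRUNCATE schemaA.foo CASCADE;   -- "works", but also empties schemaB.bar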

> Aside:  Unless you're restoring databases in their entirety
> the pg_restore --disable-triggers option makes it easy to
> introduce foreign key referential integrity corruption.

Yup, and I think the docs could do more to warn users about
--disable-triggers in particular. And I see 

Re: [HACKERS] [PATCH] binary heap implementation

2012-11-20 Thread Abhijit Menon-Sen
At 2012-11-20 22:55:52 -0500, t...@sss.pgh.pa.us wrote:
>
> BTW, I probably missed some context upthread, but why do we have two
> fields at all?

I would also have preferred to handle the nodeMergeAppend case using a
context pointer as you suggest, but Andres needs to store two pointers
in his heap nodes.

Andres: suppose we replace binaryheap_node with just a Datum, could you
live with storing a pointer to a struct with two pointers? If so, that
would address the concerns raised.

If not, maybe we should explore Robert's earlier suggestion to make
binaryheap_node user-definable (in effect).

-- Abhijit




Re: [HACKERS] [PATCH] binary heap implementation

2012-11-20 Thread Tom Lane
Abhijit Menon-Sen writes:
>> While I'm thinking about it, why are the fields of a binaryheap_node
>> called "key" and "value"?  That implies a semantics not actually used
>> here.  Maybe "value1" and "value2" instead?

> Yes, I discussed this with Andres earlier (and considered ptr and value
> or p1 and p2), but decided to wait for others to comment on the naming.
> I have no problem with value1 and value2, though I'd very slightly
> prefer d1 and d2 (d for Datum) now.

d1/d2 would be fine with me.

BTW, I probably missed some context upthread, but why do we have two
fields at all?  At least for the purposes of nodeMergeAppend, it
would make a lot more sense for the binaryheap elements to be just
single Datums, without all the redundant pointers to the MergeAppend
node.  The need to get at the comparison support info could be handled
much more elegantly than that by providing a "void *" passthrough arg.
That is, the heap creation function becomes

binaryheap *binaryheap_allocate(size_t capacity,
                                binaryheap_comparator compare,
                                void *passthrough);

and the comparator signature becomes

typedef int (*binaryheap_comparator) (Datum a, Datum b, void *passthrough);

For nodeMergeAppend, the Datums would be the slot indexes and the
passthrough pointer could point to the MergeAppend node.

This would make equal sense in a lot of other contexts, such as if you
wanted a heap composed of tuples -- the info to compare tuples could be
handled via the passthrough argument, rather than doubling the size of
the heap in order to put that pointer into each heap element.  If you
have no use for the passthrough arg in a particular usage, just pass
NULL.
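
For instance, nodeMergeAppend's comparator could then look roughly like this
(a sketch only; the MergeAppendState field names and the helper are assumed,
not taken from any actual patch):

    static int
    heap_compare_slots(Datum a, Datum b, void *passthrough)
    {
        MergeAppendState *node = (MergeAppendState *) passthrough;
        int     slot1 = DatumGetInt32(a);
        int     slot2 = DatumGetInt32(b);

        /* compare the tuples currently stored in the two input slots */
        return compare_slot_tuples(node, slot1, slot2);    /* hypothetical helper */
    }

    ...
    node->heap = binaryheap_allocate(nplans, heap_compare_slots, (void *) node);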

regards, tom lane




Re: [HACKERS] [PATCH] binary heap implementation

2012-11-20 Thread Abhijit Menon-Sen
Hi.

A brief response to earlier messages in this thread:

1. I agree that it's a good idea to use Datum rather than void *.
2. I don't think it's worth getting rid of binaryheap_node.
3. I agree that it makes sense to go with what we have now (after
   Robert's reworking of my patch) and reconsider if needed when
   there are more users of the code.
4. I agree with all of Tom's suggestions as well.

At 2012-11-20 18:01:10 -0500, t...@sss.pgh.pa.us wrote:
>
> Is it actually required that the output be exactly these three values,
> or is the specification really <0, 0, >0 ?

The code did require -1/0/+1, but I changed it to expect only <0, 0, >0.
So you're right, the comment should be changed.

> This in binaryheap_replace_first is simply bizarre:
> 
>   if (key)
>       heap->nodes[0].key = key;
>   if (val)
>       heap->nodes[0].value = val;

It's a leftover from the earlier implementation that used pointers,
where you could pass in NULL to leave either key or value unchanged.

> While I'm thinking about it, why are the fields of a binaryheap_node
> called "key" and "value"?  That implies a semantics not actually used
> here.  Maybe "value1" and "value2" instead?

Yes, I discussed this with Andres earlier (and considered ptr and value
or p1 and p2), but decided to wait for others to comment on the naming.
I have no problem with value1 and value2, though I'd very slightly
prefer d1 and d2 (d for Datum) now.

Robert: thanks again for your work on the patch. Do you want me to make
the changes Tom suggests above, or are you already doing it?

-- Abhijit




Re: [HACKERS] Timing events WIP v1

2012-11-20 Thread Craig Ringer
On 11/20/2012 04:48 PM, Pavel Stehule wrote:
> 2012/11/20 Greg Smith:
>> On 11/16/12 12:20 AM, Craig Ringer wrote:
>>
>>> Not that implementing `hstore_to_json` is exactly hard, mind you, as a
>>> quick search shows: https://gist.github.com/2318757
>>
>> Both pulling hstore more firmly into core and adopting something like a
>> hstore_to_json call as the preferred UI for timing event data are all
>> directions I'd be happy to go.  That's why I stopped before trying to even
>> implement that part.  I think the general direction to go is clear, but the
>> details on how to present the resulting data is more of a tarp than even a
>> bikeshed at this point.  It's not a hard problem though.
>>
> I don't like to see current hstore in core - It doesn't support
> nesting, it doesn't support different datatypes, it is not well
> supported by plpgsql
>

... or by many client libraries, though converting hstore values to json
notation for output usually takes care of that.

I can't help but suspect that hstore will be subsumed by native storage
and indexing of json objects, since these are already widely supported
in KV or document stores. Of course, that requires that we come to some
agreement on how to represent literal scalars for interaction with json,
which hasn't seen much progress.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services





Re: [HACKERS] StrategyGetBuffer questions

2012-11-20 Thread Merlin Moncure
On Tue, Nov 20, 2012 at 4:50 PM, Jeff Janes wrote:
> On Tue, Nov 20, 2012 at 1:26 PM, Merlin Moncure wrote:
>> In this sprawling thread on scaling issues [1], the topic meandered
>> into StrategyGetBuffer() -- in particular the clock sweep loop.  I'm
>> wondering:
>>
>> *) If there shouldn't be a bound in terms of how many candidate
>> buffers you're allowed to skip for having a non-zero usage count.
>> Whenever an unpinned usage_count>0 buffer is found, trycounter is
>> reset (!) so that the code operates from point of view as it had just
>> entered the loop.  There is an implicit assumption that this is rare,
>> but how rare is it?
>
>
> How often is it that the trycounter would hit zero if it were not reset?
> I've instrumented something like that in the past, and could only get
> it to fire under pathologically small shared_buffers and workloads
> that caused most of them to be pinned simultaneously.
>
>> *) Shouldn't StrategyGetBuffer() bias down usage_count if it finds
>> itself examining too many unpinned buffers per sweep?
>
> It is a self correcting problem.  If it is examining a lot of unpinned
> buffers, it is also decrementing a lot of them.
>
>>
>> *) Since the purpose of usage_count is to act on advisory basis to
>> keep recently/frequently accessed buffers from being discarded, is it
>> really necessary to rigorously guard the count with a spinlock?  If a
>> ++ or -- operation on the value gets missed here or there, how big of
>> a deal is it really?
>
> I don't think it is all that big of a deal.
>
> I've implemented this patch to do that.  It still applies to head.
>
> http://archives.postgresql.org/pgsql-hackers/2011-08/msg00305.php
>
> It was very effective at removing BufFreelistLock contention on the
> system I had at the time.

Ah, interesting. Well, that's more aggressive in that you dispense
with the lwlock completely.  Hm.

merlin




Re: [HACKERS] [PATCH] binary heap implementation

2012-11-20 Thread Tom Lane
Robert Haas writes:
> Here's a revised patch that takes the approach of just changing void *
> to Datum; it includes a bunch of other cleanups that I felt were
> worthwhile.

The comment in binaryheap.h says

 * For a max-heap, the comparator must return -1 iff a < b, 0 iff a == b,
 * and +1 iff a > b.  For a min-heap, the conditions are reversed.

Is it actually required that the output be exactly these three values,
or is the specification really <0, 0, >0 ?  If the latter, it should
say that, rather than overdefine the implementation of comparison.

Also, wouldn't it be reasonable to const-ify the comparator function, ie

typedef int (*binaryheap_comparator) (const binaryheap_node *a,
                                      const binaryheap_node *b);

 *  nodes   the first of a list of "space" nodes

s/list/array/, no?  Also, it's standard to mark this sort of hack with
/* VARIABLE LENGTH ARRAY */, or even better use FLEXIBLE_ARRAY_MEMBER
(which would require fixing the space calculation in
binaryheap_allocate, not that that would be a bad thing anyway).

left_offset and friends are defined as taking size_t and returning int,
which is broken on machines where size_t is wider than int, or even on
machines where they're the same width since int is signed.  Personally
I'd replace pretty much every single usage of size_t in this module with
int, as I am unconvinced either that we need 8-byte indexes or that the
logic is correct if it tries to form index -1 (as would happen for
example with binaryheap_build applied to an empty heap).

The Assert() in binaryheap_add_unordered fills me with no confidence.
If we're not going to support expansion of the array (which might be a
better idea anyway) then this needs to be an actual run-time check, not
something that won't be there in production.

What I think *would* be worth asserting for is whether the heap is
correctly ordered or not --- that is, I think you should add a flag that
is cleared by binaryheap_add_unordered and set by binaryheap_build, and
then the functions that presume correct ordering should assert it's set.
An Assert in binaryheap_replace_first that the heap is not empty seems
appropriate as well.
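
In other words, roughly this (illustrative only; the flag name is made up):

    /* in binaryheap_add_unordered(): */
    heap->has_heap_property = false;

    /* in binaryheap_build(): */
    heap->has_heap_property = true;

    /* in the functions that presume ordering, e.g. binaryheap_replace_first(): */
    Assert(heap->has_heap_property);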

This in binaryheap_replace_first is simply bizarre:

    if (key)
        heap->nodes[0].key = key;
    if (val)
        heap->nodes[0].value = val;

Why are these assignments conditional?  If they're sane at all, they
should not be testing the Datums directly in any case, but the result of
DatumGetSomething.

  static int32
! heap_compare_slots(binaryheap_node *a, binaryheap_node *b)
  {
+   MergeAppendState *node = (MergeAppendState *) a->value;

This should use DatumGetPointer(a->value), and the function result
should be int not int32 to comport with the typedef for it.

While I'm thinking about it, why are the fields of a binaryheap_node
called "key" and "value"?  That implies a semantics not actually used
here.  Maybe "value1" and "value2" instead?

regards, tom lane




Re: [HACKERS] StrategyGetBuffer questions

2012-11-20 Thread Jeff Janes
On Tue, Nov 20, 2012 at 1:26 PM, Merlin Moncure wrote:
> In this sprawling thread on scaling issues [1], the topic meandered
> into StrategyGetBuffer() -- in particular the clock sweep loop.  I'm
> wondering:
>
> *) If there shouldn't be a bound in terms of how many candidate
> buffers you're allowed to skip for having a non-zero usage count.
> Whenever an unpinned usage_count>0 buffer is found, trycounter is
> reset (!) so that the code operates from point of view as it had just
> entered the loop.  There is an implicit assumption that this is rare,
> but how rare is it?


How often is it that the trycounter would hit zero if it were not reset?
I've instrumented something like that in the past, and could only get
it to fire under pathologically small shared_buffers and workloads
that caused most of them to be pinned simultaneously.

> *) Shouldn't StrategyGetBuffer() bias down usage_count if it finds
> itself examining too many unpinned buffers per sweep?

It is a self correcting problem.  If it is examining a lot of unpinned
buffers, it is also decrementing a lot of them.

>
> *) Since the purpose of usage_count is to act on advisory basis to
> keep recently/frequently accessed buffers from being discarded, is it
> really necessary to rigorously guard the count with a spinlock?  If a
> ++ or -- operation on the value gets missed here or there, how big of
> a deal is it really?

I don't think it is all that big of a deal.

I've implemented this patch to do that.  It still applies to head.

http://archives.postgresql.org/pgsql-hackers/2011-08/msg00305.php

It was very effective at removing BufFreelistLock contention on the
system I had at the time.
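
The core of the idea is just this (a sketch of the approach being discussed,
not the actual patch):

    buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];

    if (buf->refcount == 0)
    {
        if (buf->usage_count > 0)
            buf->usage_count--;     /* unlocked; an occasional lost ++/-- is harmless */
        else
            return buf;             /* candidate victim found */
    }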

Cheers,

Jeff




Re: [HACKERS] Do we need so many hint bits?

2012-11-20 Thread Tom Lane
Merlin Moncure writes:
> Hm, I wonder if you could squeeze two bits out. ISTM here are the
> interesting cases enumerated:

> 0: xmin unknown
> 1: xmin invalid
> 2: xmin valid, xmax unknown
> 3: xmin valid, xmax invalid
> 4: xmin valid, xmax valid

> Did I miss any?

Yes.  xmin unknown, xmax unknown is possible and different from all the
above, ie a tuple can be deleted by the creating transaction.  (But it
could still be visible to some of that transaction's snapshots, so you
can't equate this state to "xmin invalid".)

There's a fairly big problem with any of these ideas, and it's not
even on-disk compatibility.  It is that we assume that hint bits can be
set without exclusive lock on the buffer.  If any of the transitions
xmin unknown -> xmin committed, xmin unknown -> xmin aborted,
xmax unknown -> xmax committed, xmax unknown -> xmax aborted
aren't expressed by setting a bit that wasn't set before, we probably
lose that property, and thereby a whole lot of concurrency.
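
That is, with the current flag definitions each of those transitions just ORs
in a bit that was previously clear, for example:

    tuple->t_infomask |= HEAP_XMIN_COMMITTED;   /* xmin unknown -> known committed */
    tuple->t_infomask |= HEAP_XMIN_INVALID;     /* xmin unknown -> known aborted */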

regards, tom lane




Re: [HACKERS] [PATCH] binary heap implementation

2012-11-20 Thread Andres Freund
On 2012-11-20 15:47:25 -0500, Robert Haas wrote:
> On Tue, Nov 20, 2012 at 3:19 PM, Andres Freund wrote:
> > I don't have a fundamental problem with it, but I don't think it's worth
> > the cost. If you have variable sized (from the point of the compiler)
> > values you either have more complex offset computations to the
> > individual elements or one more indirection due to a pointer array.
> >
> > Do you feel strongly about it? If so, go ahead, otherwise I would prefer
> > the current way (with parameters changed to be Datum's of course).
>
> Here's a revised patch that takes the approach of just changing void *
> to Datum; it includes a bunch of other cleanups that I felt were
> worthwhile.  I tested this using the following setup:
>
> CREATE TABLE sample (a int, b numeric);
> DO $$
> BEGIN
>     FOR i IN 1 .. 1000 LOOP
>         EXECUTE 'CREATE TABLE sample' || i || ' (CHECK (b = ' || i
>             || ')) INHERITS (sample)';
>         EXECUTE 'INSERT INTO sample' || i
>             || ' SELECT g, ' || i || ' FROM generate_series(1,10) g;';
>         EXECUTE 'ALTER TABLE sample' || i
>             || ' ADD PRIMARY KEY (a)';
>     END LOOP;
> END
> $$;
> VACUUM ANALYZE;
>
> And this test query:
>
> select sum(1) from (select * from sample order by a limit 250) x;
>
> On my system, on repeated execution, this takes about 113 ms with
> unpatched master and about 114 ms with the patch.  I'm inclined to
> think that's not worth getting excited about, as it could be simply
> code shifting around across cache line boundaries and not indicative
> of an actual regression due to the difference in logic.  Even if there
> is a regression, it's pretty darn small: most people will not have
> 1000 partitions, and those that do will often find query execution
> time dominated by other factors.  Against that, there is positive
> value in having this code somewhat more abstracted.

To make sure we're not regressing I ran some more testcases that I could
think of/find and didn't find anything problematic. Either both are
equally fast or the new code is faster.

FWIW, for me the new code is very slightly faster in your example, but
only if I apply it on top of fklocks (where I applied it accidentally at
first), not if on top of 273986bf0d39e5166eb15ba42ebff4803e23a688 where
it's reversed, so your code-layout theory sounds likely.

> With respect to the question of the API, no, I don't feel
> overwhelmingly strongly that we have to whack the API with a bat
> that's larger than the one I've used here.  On the flip side, it
> wouldn't surprise me if, as the code acquires more users, we find that
> we actually do want to make some changes, either the one I already
> proposed or something else.  Fortunately, it's not a lot of code, so
> if we end up refactoring it some more later, it shouldn't be a big
> deal.  So I'm happy enough to commit this the way I have it and sort
> out the rest as we go.  If there are no objections I'll go ahead and
> do that.

Looks good to me. Adapting when we have the actual requirements sounds
fine & sensible as well...

Thanks,

Andres


--
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Do we need so many hint bits?

2012-11-20 Thread Merlin Moncure
On Thu, Nov 15, 2012 at 10:37 PM, Simon Riggs wrote:
> On 15 November 2012 22:21, Tom Lane wrote:
>
>>> Removing those 3 hints would give us 3 more flag bits (eventually, after
>>> we are sure they aren't just leftover), and it would also reduce the
>>> chance that a page is dirtied for no other reason than to set them.
>>
>> We aren't pressed for flag bits particularly.  I think the main
>> attraction of this idea is precisely to reduce unnecessary page dirties,
>> and so that leads me to suggest a variant: keep the four bits defined as
>> now, but do not attempt to set XMIN_INVALID or XMAX_COMMITTED unless the
>> page is already dirty.  This would make it a straight-up trade of more
>> clog consultation for fewer page dirties.
>
> Hmm, I thought Alvaro wanted an extra flag bit for foreign key locks
> but couldn't find it.
>
> Come to think of it, why do we have XMIN_INVALID and XMAX_INVALID? We
> never need both at the same time, so we can just use one...
> XMIN_INVALID -> XACT_INVALID
> XMAX_INVALID == XMIN_COMMITTED | XACT_INVALID

Hm, I wonder if you could squeeze two bits out. ISTM here are the
interesting cases enumerated:

0: xmin unknown
1: xmin invalid
2: xmin valid, xmax unknown
3: xmin valid, xmax invalid
4: xmin valid, xmax valid

Did I miss any? If not, I think case #3 could be covered by utilizing
xmax == InvalidTransactionId or simply ignored.  That makes the check
a little dirtier than a bit test though, but you could be sneaky and
map both xmin=valid cases to a bit.

merlin




Re: [HACKERS] foreign key locks

2012-11-20 Thread Andres Freund
On 2012-11-20 17:36:14 -0300, Alvaro Herrera wrote:
> Andres Freund wrote:
>
>
> > - I find using a default: clause in switches with enum types where
> >   everything is expected to be handled, like the following, a bad idea;
> >   this way the compiler won't warn you if you've missed cases, which
> >   makes changing/extending code harder:
> > switch (rc->strength)
> > {
>
> I eliminated some of these default clauses, but the compiler is not
> happy about not having some of them, so they remain.

You have removed the ones that looked removable to me...

>
> > -
> > #if 0
> > /*
> >  * The idea here is to remove the IS_MULTI bit, and replace the
> >  * xmax with the updater's Xid.  However, we can't really do it:
> >  * modifying the Xmax is not allowed under our buffer locking
> >  * rules, unless we have an exclusive lock; but we don't know that
> >  * we have it.  So the multi needs to remain in place :-(
> >  */
> > ResetMultiHintBit(tuple, buffer, xmax, true);
> > #endif
> >
> >  Three things:
> > - HeapTupleSatisfiesUpdate is actually always called exclusively locked ;)
> > - Extending something like LWLockHeldByMe to also return the current
> >   lockmode doesn't sound hard
> > - we seem to be using #ifdef NOT_YET for such cases
>
> After spending some time trying to make this work usefully, I observed
> that it's pointless, at least if we apply it only in
> HeapTupleSatisfiesUpdate, because the IS_MULTI bit is going to be
> removed by compute_new_xmax_infomask if the multi is gone.  Something
> like this would be useful if we could do it in other places; say when
> we're merely scanning page without the specific intention of modifying
> that particular tuple.  Maybe in heap_prune_page, for example.  ISTM we
> can see that as an optimization we can add later.

heap_prune_page sounds like a good fit, yes. One other place would be
when following hot chains, but the locking would be far more critical in
that path, so it's less likely to be beneficial.
Pushing it off sounds good to me.

> Some of your other observations have been fixed by commits that I have
> pushed to my github tree.

A short repetition of the previous pgbench run of many SELECT FOR
SHARE's:

10s test:
master:  22673 24637 23874 25527
fklocks: 24835 24601 24606 24868

60s test:
master:  32609 33087
fklocks: 33350 33359

Very nice!

Greetings,

Andres Freund

--
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] WIP: index support for regexp search

2012-11-20 Thread Pavel Stehule
hello

do you plan to support GiST?

Regards

Pavel

2012/11/20 Alexander Korotkov:
> On Tue, Nov 20, 2012 at 3:02 AM, Tomas Vondra wrote:
>>
>> 2) It's common to use upper-case names for macros, but trgm.h defines
>>macro "iswordchr" - I see it's moved from trgm_op.c but maybe we
>>could make it a bit more correct?
>>
>> 3) I see there are two '#ifdef KEEPONLYALNUM" blocks right next to each
>>other in trgm.h - maybe it'd be a good idea to join them?
>>
>> 4) The two new method prototypes at the end of trgm.h use different
>>indendation than the rest (spaces only instead of tabs).
>
> These issues are fixed in the attached patch.
> Additionally, 3 new macros are introduced: MAX_RESULT_STATES,
> MAX_RESULT_ARCS, MAX_RESULT_PATHS. They limit resource usage during
> regex processing.
>
> --
> With best regards,
> Alexander Korotkov.
>
>




Re: [HACKERS] [PATCH] binary heap implementation

2012-11-20 Thread Robert Haas
On Tue, Nov 20, 2012 at 3:19 PM, Andres Freund  wrote:
> I don't have a fundamental problem with it, but I don't think it's worth
> the cost. If you have variable sized (from the point of the compiler)
> values you either have more complex offset computations to the
> individual elements or one more indirection due to a pointer array.
>
> Do you feel strongly about it? If so, go ahead, otherwise I would prefer
> the current way (with parameters changed to be Datum's of course).

Here's a revised patch that takes the approach of just changing void *
to Datum; it includes a bunch of other cleanups that I felt were
worthwhile.  I tested this using the following setup:

CREATE TABLE sample (a int, b numeric);
DO $$
BEGIN
    FOR i IN 1 .. 1000 LOOP
        EXECUTE 'CREATE TABLE sample' || i || ' (CHECK (b = ' || i
            || ')) INHERITS (sample)';
        EXECUTE 'INSERT INTO sample' || i
            || ' SELECT g, ' || i || ' FROM generate_series(1,10) g;';
        EXECUTE 'ALTER TABLE sample' || i
            || ' ADD PRIMARY KEY (a)';
    END LOOP;
END
$$;
VACUUM ANALYZE;

And this test query:

select sum(1) from (select * from sample order by a limit 250) x;

On my system, on repeated execution, this takes about 113 ms with
unpatched master and about 114 ms with the patch.  I'm inclined to
think that's not worth getting excited about, as it could be simply
code shifting around across cache line boundaries and not indicative
of an actual regression due to the difference in logic.  Even if there
is a regression, it's pretty darn small: most people will not have
1000 partitions, and those that do will often find query execution
time dominated by other factors.  Against that, there is positive
value in having this code somewhat more abstracted.

With respect to the question of the API, no, I don't feel
overwhelmingly strongly that we have to whack the API with a bat
that's larger than the one I've used here.  On the flip side, it
wouldn't surprise me if, as the code acquires more users, we find that
we actually do want to make some changes, either the one I already
proposed or something else.  Fortunately, it's not a lot of code, so
if we end up refactoring it some more later, it shouldn't be a big
deal.  So I'm happy enough to commit this the way I have it and sort
out the rest as we go.  If there are no objections I'll go ahead and
do that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


binaryheap-rmh.patch
Description: Binary data



Re: [HACKERS] foreign key locks

2012-11-20 Thread Alvaro Herrera
Andres Freund wrote:


> - I find using a default: clause in switches with enum types where everything
>   is expected to be handled, like the following, a bad idea; this way the
>   compiler won't warn you if you've missed cases, which makes
>   changing/extending code harder:
>   switch (rc->strength)
>   {

I eliminated some of these default clauses, but the compiler is not
happy about not having some of them, so they remain.

> -
> #if 0
>   /*
>* The idea here is to remove the IS_MULTI bit, and 
> replace the
>* xmax with the updater's Xid.  However, we can't 
> really do it:
>* modifying the Xmax is not allowed under our buffer 
> locking
>* rules, unless we have an exclusive lock; but we 
> don't know that
>* we have it.  So the multi needs to remain in place 
> :-(
>*/
>   ResetMultiHintBit(tuple, buffer, xmax, true);
> #endif
> 
>  Three things:
> - HeapTupleSatisfiesUpdate is actually always called exclusively locked ;)
> - Extending something like LWLockHeldByMe to also return the current
>   lockmode doesn't sound hard
> - we seem to be using #ifdef NOT_YET for such cases

After spending some time trying to make this work usefully, I observed
that it's pointless, at least if we apply it only in
HeapTupleSatisfiesUpdate, because the IS_MULTI bit is going to be
removed by compute_new_xmax_infomask if the multi is gone.  Something
like this would be useful if we could do it in other places; say when
we're merely scanning a page without the specific intention of modifying
that particular tuple.  Maybe in heap_prune_page, for example.  ISTM we
can see that as an optimization we can add later.

For the record, the implementation of ResetMultiHintBit looks like this:

/*
 * When a tuple is found to be marked by a MultiXactId, all of whose members are
 * completed transactions, the HEAP_XMAX_IS_MULTI bit can be removed.
 * Additionally, at the request of caller, we can set the Xmax to the given
 * Xid, and set HEAP_XMAX_COMMITTED.
 *
 * Note that per buffer access rules, the caller to this function must hold
 * exclusive content lock on the affected buffer.
 */
static inline void
ResetMultiHintBit(HeapTupleHeader tuple, Buffer buffer,
                  TransactionId xid, bool mark_committed)
{
    Assert(LWLockModeHeld((&BufferDescriptors[buffer])->content_lock) ==
           LW_EXCLUSIVE);
    Assert(tuple->t_infomask & HEAP_XMAX_IS_MULTI);
    Assert(!(tuple->t_infomask & HEAP_XMAX_INVALID));
    Assert(!MultiXactIdIsRunning(HeapTupleHeaderGetRawXmax(tuple)));

    tuple->t_infomask &= ~HEAP_XMAX_BITS;
    HeapTupleHeaderSetXmax(tuple, xid);
    if (!TransactionIdIsValid(xid))
        tuple->t_infomask |= HEAP_XMAX_INVALID;
    /* note that HEAP_KEYS_UPDATED persists, if set */

    if (mark_committed)
        SetHintBits(tuple, buffer, HEAP_XMAX_COMMITTED, xid);
    else
        SetBufferCommitInfoNeedsSave(buffer);
}

(This is removed by commit 52f86f82fb3e01a6ed0b8106bac20f319bb3ad0a in
my github tree, but that commit might be lost in the future, hence this
paste.)


Some of your other observations have been fixed by commits that I have
pushed to my github tree.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [PATCH] binary heap implementation

2012-11-20 Thread Andres Freund
On 2012-11-20 15:06:58 -0500, Robert Haas wrote:
> On Tue, Nov 20, 2012 at 2:29 PM, Andres Freund  wrote:
> > On 2012-11-20 14:03:12 -0500, Robert Haas wrote:
> >> On Thu, Nov 15, 2012 at 8:56 PM, Abhijit Menon-Sen  
> >> wrote:
> >> > [ new patch ]
> >>
> >> I went over this again today with a view to getting it committed, but
> >> discovered some compiler warnings that look like this:
> >>
> >> warning: cast to pointer from integer of different size
> >>
> >> The problem seems to be that the binary heap implementation wants the
> >> key and value to be a void *, but the way it's being used in
> >> nodeMergeAppend.c the value is actually an int.  I don't think we want
> >> to commit a binary heap implementation which has an impedance mismatch
> >> of this type with its only client.  The obvious solution seems to be
> >> to make the key and value each a Datum, and then clients can use
> >> WhateverGetDatum and DatumGetWhatever to pass whatever built-in data
> >> type they happen to have.  I'm trying that approach now and will post
> >> an updated patch based on that approach if it seems to work OK and
> >> nobody suggests something better between now and then.
> >
> > Sounds like a good plan, one other alternative would have been declaring
> > the parameters a size_t (and thus explicitly always big enough to store
> > a pointer, but also valid to store inline data) instead of using a
> > void*. But given there is DatumGetPointer/PointerGetDatum et al, using
> > the postgres-y datatype seems fine to me.
> >
> > In the LCR case those really are pointers, so preserving that case is
> > important to me ;), but your proposal does that, so ...

> On further reflection, I'm wondering if it wouldn't be a smarter idea
> to just get rid of the binaryheap_node concept altogether.  Instead,
> when you allocate a binary heap, you specify an "element size" as well
> as a capacity.  Then, the binary heap can truly treat the elements as
> an opaque array of bytes rather than a struct of any particular type.
> Of course, the disadvantage of this approach is that it's likely to be
> slightly slower, because we'll need to do a multiplication by a
> variable rather than simple bit shift.  But the extra flexibility
> might be worth it, because for example for merge append this is kind
> of inefficient from a storage perspective.  We only really need N+1
> pointers, but we've got storage for 2N.  With the above change plus a
> user-specified third argument for the comparator function (passed as
> yet another argument to binaryheap_allocate) we could get around that
> while preserving the option for LCR or any other client code to do as
> it sees fit.

> Thoughts?

I don't have a fundamental problem with it, but I don't think it's worth
the cost. If you have variable sized (from the point of the compiler)
values you either have more complex offset computations to the
individual elements or one more indirection due to a pointer array.

Do you feel strongly about it? If so, go ahead, otherwise I would prefer
the current way (with parameters changed to be Datum's of course).

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [PATCH] binary heap implementation

2012-11-20 Thread Robert Haas
On Tue, Nov 20, 2012 at 2:29 PM, Andres Freund  wrote:
> On 2012-11-20 14:03:12 -0500, Robert Haas wrote:
>> On Thu, Nov 15, 2012 at 8:56 PM, Abhijit Menon-Sen  
>> wrote:
>> > [ new patch ]
>>
>> I went over this again today with a view to getting it committed, but
>> discovered some compiler warnings that look like this:
>>
>> warning: cast to pointer from integer of different size
>>
>> The problem seems to be that the binary heap implementation wants the
>> key and value to be a void *, but the way it's being used in
>> nodeMergeAppend.c the value is actually an int.  I don't think we want
>> to commit a binary heap implementation which has an impedance mismatch
>> of this type with its only client.  The obvious solution seems to be
>> to make the key and value each a Datum, and then clients can use
>> WhateverGetDatum and DatumGetWhatever to pass whatever built-in data
>> type they happen to have.  I'm trying that approach now and will post
>> an updated patch based on that approach if it seems to work OK and
>> nobody suggests something better between now and then.
>
> Sounds like a good plan, one other alternative would have been declaring
> the parameters a size_t (and thus explicitly always big enough to store
> a pointer, but also valid to store inline data) instead of using a
> void*. But given there is DatumGetPointer/PointerGetDatum et al, using
> the postgres-y datatype seems fine to me.
>
> In the LCR case those really are pointers, so preserving that case is
> important to me ;), but your proposal does that, so ...

On further reflection, I'm wondering if it wouldn't be a smarter idea
to just get rid of the binaryheap_node concept altogether.  Instead,
when you allocate a binary heap, you specify an "element size" as well
as a capacity.  Then, the binary heap can truly treat the elements as
an opaque array of bytes rather than a struct of any particular type.
Of course, the disadvantage of this approach is that it's likely to be
slightly slower, because we'll need to do a multiplication by a
variable rather than simple bit shift.  But the extra flexibility
might be worth it, because for example for merge append this is kind
of inefficient from a storage perspective.  We only really need N+1
pointers, but we've got storage for 2N.  With the above change plus a
user-specified third argument for the comparator function (passed as
yet another argument to binaryheap_allocate) we could get around that
while preserving the option for LCR or any other client code to do as
it sees fit.

Thoughts?
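
To make the idea concrete, a heap over opaque fixed-size elements might be
declared roughly as follows. Every name and signature below is hypothetical,
illustrating only the shape of the API sketched above; it is not code from
any patch.

#include <stddef.h>

typedef int (*binaryheap_comparator) (const void *a, const void *b, void *arg);

typedef struct binaryheap
{
    size_t      bh_element_size;    /* caller-specified size of one element */
    int         bh_size;            /* number of elements currently stored */
    int         bh_space;           /* allocated capacity */
    binaryheap_comparator bh_compare;
    void       *bh_arg;             /* extra argument passed to the comparator */
    char       *bh_nodes;           /* bh_space * bh_element_size opaque bytes */
} binaryheap;

extern binaryheap *binaryheap_allocate(int capacity, size_t element_size,
                                       binaryheap_comparator compare,
                                       void *arg);

/* addressing element i now costs a multiply instead of a simple shift */
#define binaryheap_element(h, i) \
    ((void *) ((h)->bh_nodes + (size_t) (i) * (h)->bh_element_size))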

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] WIP: index support for regexp search

2012-11-20 Thread Alexander Korotkov
On Tue, Nov 20, 2012 at 3:02 AM, Tomas Vondra  wrote:

> 2) It's common to use upper-case names for macros, but trgm.h defines
>macro "iswordchr" - I see it's moved from trgm_op.c but maybe we
>could make it a bit more correct?
>
> 3) I see there are two '#ifdef KEEPONLYALNUM" blocks right next to each
>other in trgm.h - maybe it'd be a good idea to join them?
>
> 4) The two new method prototypes at the end of trgm.h use different
>    indentation than the rest (spaces only instead of tabs).
>
These issues are fixed in the attached patch.
Additionally, 3 new macros are
introduced: MAX_RESULT_STATES, MAX_RESULT_ARCS, MAX_RESULT_PATHS. They
limit resource usage during regex processing.

--
With best regards,
Alexander Korotkov.


trgm-regexp-0.4.patch.gz
Description: GNU Zip compressed data



Re: [HACKERS] review: Reduce palloc's in numeric operations

2012-11-20 Thread Tom Lane
Heikki Linnakangas  writes:
> * Add init_var_from_num() function. This is the same as 
> set_var_from_num_nocopy in the original patch, but it doesn't require 
> the caller to have called init_var() first. IMHO this makes the calling 
> code slightly more readable. Also, it's now more evident what these vars 
> are: the digits array points to original array in the original Datum, 
> but 'buf' is NULL. This is the same arrangement that's used in the 
> constant NumericVars like const_zero.

init_var_from_num's header comment could use some more work.  The
statement that one "must not modify the returned var" is false in some
sense, since for instance numeric_ceil() does that.  The key is that you
have to replace the digit buffer not modify it in-place, but most
computational routines do that anyway.  Also it might be worth pointing
out explicitly that free_var() is not required unless the var is
modified subsequently.  (There are instances of both cases, and it might
not be clear to the reader why.)
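
For readers without the patch in front of them, init_var_from_num() as
described in this thread amounts to roughly the following (a sketch based on
the description, not the patch text itself):

static void
init_var_from_num(Numeric num, NumericVar *dest)
{
    dest->ndigits = NUMERIC_NDIGITS(num);
    dest->weight = NUMERIC_WEIGHT(num);
    dest->sign = NUMERIC_SIGN(num);
    dest->dscale = NUMERIC_DSCALE(num);
    dest->digits = NUMERIC_DIGITS(num);
    dest->buf = NULL;           /* digits array is borrowed, not palloc'd */
}

Because 'buf' is NULL, free_var() has nothing to release unless a later
operation replaces the digit buffer with its own palloc'd one -- which is
exactly why some callers need free_var() and others don't.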

> * get_str_from_var() no longer scribbles on its input. I noticed that 
> it's always called with a dscale that comes from the input var itself. 
> In other words, the rounding was unnecessary to begin with.

Hmm, it may have been necessary once upon a time, but I agree getting
rid of the rounding step is a win now.  Suggest adding a comment though
that the var is displayed to the number of digits indicated by its dscale.

Looks good otherwise.

regards, tom lane




Re: [HACKERS] [PATCH] binary heap implementation

2012-11-20 Thread Andres Freund
On 2012-11-20 14:03:12 -0500, Robert Haas wrote:
> On Thu, Nov 15, 2012 at 8:56 PM, Abhijit Menon-Sen  
> wrote:
> > [ new patch ]
>
> I went over this again today with a view to getting it committed, but
> discovered some compiler warnings that look like this:
>
> warning: cast to pointer from integer of different size
>
> The problem seems to be that the binary heap implementation wants the
> key and value to be a void *, but the way it's being used in
> nodeMergeAppend.c the value is actually an int.  I don't think we want
> to commit a binary heap implementation which has an impedance mismatch
> of this type with its only client.  The obvious solution seems to be
> to make the key and value each a Datum, and then clients can use
> WhateverGetDatum and DatumGetWhatever to pass whatever built-in data
> type they happen to have.  I'm trying that approach now and will post
> an updated patch based on that approach if it seems to work OK and
> nobody suggests something better between now and then.

Sounds like a good plan, one other alternative would have been declaring
the parameters a size_t (and thus explicitly always big enough to store
a pointer, but also valid to store inline data) instead of using a
void*. But given there is DatumGetPointer/PointerGetDatum et al, using
the postgres-y datatype seems fine to me.

In the LCR case those really are pointers, so preserving that case is
important to me ;), but your proposal does that, so ...

Andres




Re: [HACKERS] DEALLOCATE IF EXISTS

2012-11-20 Thread Marko Tiikkaja

On Tue, 09 Oct 2012 16:44:07 +0200, Vik Reykja  wrote:

I don't see Sébastien's message, but I made the same mistake in my patch.
Another one is attached with copyfuncs and equalfuncs.  I did a grep for
DeallocateStmt and I don't believe I have missed anything else.


The patch looks pretty straightforward to me, except for one thing:  
previously there was no NOTICE if you sent a Close message to the server  
and the prepared statement didn't exist.  But I'm leaving it for the  
committer to decide whether that's a problem or not, and marking this one  
Ready for Committer.
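
For context, the SQL-level behaviour the patch adds looks like this (a sketch;
the exact message wording is the committer's call, but presumably it follows
the other IF EXISTS variants):

PREPARE q AS SELECT 1;
DEALLOCATE q;              -- works
DEALLOCATE q;              -- ERROR:  prepared statement "q" does not exist
DEALLOCATE IF EXISTS q;    -- with the patch: a NOTICE instead of an error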



Regards,
Marko Tiikkaja




Re: [HACKERS] Dumping an Extension's Script

2012-11-20 Thread Simon Riggs
On 19 November 2012 16:25, Robert Haas  wrote:

> Beyond that, I think much of the appeal of the extension feature is
> that it dumps as "CREATE EXTENSION hstore;" and nothing more.  That
> allows you to migrate a dump between systems with different but
> compatible versions of the hstore and have things work as intended.
> I'm not opposed to the idea of being able to make extensions without
> files on disk work ... but I consider it a niche use case; the
> behavior we have right now works well for me and hopefully for others
> most of the time.

Distributing software should only happen by files?

So why does Stackbuilder exist on the Windows binary?

Why does yum exist? What's wrong with ftp huh?

Why does CPAN?

I've a feeling this case might be a sensible way forwards, not a niche at all.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [PATCH] binary heap implementation

2012-11-20 Thread Robert Haas
On Thu, Nov 15, 2012 at 8:56 PM, Abhijit Menon-Sen  wrote:
> [ new patch ]

I went over this again today with a view to getting it committed, but
discovered some compiler warnings that look like this:

warning: cast to pointer from integer of different size

The problem seems to be that the binary heap implementation wants the
key and value to be a void *, but the way it's being used in
nodeMergeAppend.c the value is actually an int.  I don't think we want
to commit a binary heap implementation which has an impedance mismatch
of this type with its only client.  The obvious solution seems to be
to make the key and value each a Datum, and then clients can use
WhateverGetDatum and DatumGetWhatever to pass whatever built-in data
type they happen to have.  I'm trying that approach now and will post
an updated patch based on that approach if it seems to work OK and
nobody suggests something better between now and then.
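
For concreteness, the conversion this implies on the client side is just the
existing Datum macros; the heap function names below are placeholders, since
the API isn't settled at this point in the thread:

/* storing nodeMergeAppend's int slot number as the heap element ... */
binaryheap_add(heap, Int32GetDatum(slot_index));

/* ... and getting it back, with no pointer cast and no compiler warning */
slot_index = DatumGetInt32(binaryheap_first(heap));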

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] review: Reduce palloc's in numeric operations

2012-11-20 Thread Pavel Stehule
2012/11/20 Heikki Linnakangas :
> On 19.11.2012 15:17, Pavel Stehule wrote:
>>
>> I tested this patch and I can confirm that it can increase
>> speed by about 16-22% (depending on IO waits and load type).
>
>
> Thanks for the review.
>
> I spent some more time on this, continuing with the thought that perhaps it
> would be better if get_str_from_var() didn't scribble on its input. I ended
> up with the attached patch, which contains a bunch of small tweaks:
>
> * Add init_var_from_num() function. This is the same as
> set_var_from_num_nocopy in the original patch, but it doesn't require the
> caller to have called init_var() first. IMHO this makes the calling code
> slightly more readable. Also, it's now more evident what these vars are: the
> digits array points to original array in the original Datum, but 'buf' is
> NULL. This is the same arrangement that's used in the constant NumericVars
> like const_zero.
>
> * get_str_from_var() no longer scribbles on its input. I noticed that it's
> always called with a dscale that comes from the input var itself. In other
> words, the rounding was unnecessary to begin with. I simply removed the
> dscale argument and the round_var() call from get_str_from_var(). If a
> someone wants to display a string with different dscale in the future, he
> can simply call round_var() before get_str_from_var().
>
> * numericvar_to_int8() no longer scribbles on its input either. It creates a
> temporary copy to avoid that. To compensate, the callers no longer need to
> create a temporary copy, so the net # of pallocs is the same, but this is
> nicer. (there's room for a micro-optimization to avoid making the temporary
> copy in numericvar_to_int8() when the argument is already suitably rounded - I
> left that out for now, dunno if it would make any difference in practice)
>
> * use a constant for the number 10 in get_str_from_var_sci(), when
> calculating 10^exponent. Saves a palloc() and some cycles to convert integer
> 10 to numeric.
>
> Comments? Assuming no-one sees some fatal flaw in this, I'll commit this
> tomorrow.

I have no objections

all regression tests passed, no warnings - it makes sense

Regards

Pavel

>
> - Heikki




[HACKERS] C-function, don't load external dll file

2012-11-20 Thread Przemek

I write my DLL files in Visual Studio 2010. The DLL (fservice.dll)
exports an external function written in C++ (VS2010; I have both the
.dll and the .lib file):

    char * convert(char *)

I tested fservice.dll from a console application that calls the function
in this DLL, and it works fine. The problem appears when I write and test
it in PostgreSQL. The DLL containing the C function that is exported to
PostgreSQL looks like this:

#include "postgres.h"
#include "fmgr.h"
#include
#include

typedef char* (__cdecl *MYPROC)(char *value);

__declspec(dllexport) PG_FUNCTION_INFO_V1(transform);
__declspec(dllexport) Datum
transform(PG_FUNCTION_ARGS)
{
    HINSTANCE hinstLib = LoadLibrary("fservice.dll");
    MYPROC    ProcAdd;
    char     *pointer;
    text     *t = PG_GETARG_TEXT_P(0);

    if (hinstLib != NULL)
    {
        ProcAdd = (MYPROC) GetProcAddress(hinstLib, "convert");
        pointer = ProcAdd("text");
        FreeLibrary(hinstLib);
    }
    else
        PG_RETURN_NULL();
    /* ... code ... */
    PG_RETURN_TEXT_P(new_t);
}

The problem is that the module does not seem to exist. I checked the path
to the DLL beforehand. The C function compiles, but when I debug it I can
see that the HINSTANCE hinstLib is never created -- it is not NULL or any
other value, it simply does not exist. In the end my C function never uses
the function from the external DLL.

How do I load the DLL and use my external function? Why is LoadLibrary()
of the external DLL not working when my wrapper DLL is called by
PostgreSQL, although it works when called from a standalone program?

Przemek

Re: [HACKERS] review: Reduce palloc's in numeric operations

2012-11-20 Thread Heikki Linnakangas

On 19.11.2012 15:17, Pavel Stehule wrote:

I tested this patch and I can confirm that it can increase
speed by about 16-22% (depending on IO waits and load type).


Thanks for the review.

I spent some more time on this, continuing with the thought that perhaps 
it would be better if get_str_from_var() didn't scribble on its input. I 
ended up with the attached patch, which contains a bunch of small tweaks:


* Add init_var_from_num() function. This is the same as 
set_var_from_num_nocopy in the original patch, but it doesn't require 
the caller to have called init_var() first. IMHO this makes the calling 
code slightly more readable. Also, it's now more evident what these vars 
are: the digits array points to original array in the original Datum, 
but 'buf' is NULL. This is the same arrangement that's used in the 
constant NumericVars like const_zero.


* get_str_from_var() no longer scribbles on its input. I noticed that 
it's always called with a dscale that comes from the input var itself. 
In other words, the rounding was unnecessary to begin with. I simply 
removed the dscale argument and the round_var() call from 
get_str_from_var(). If a someone wants to display a string with 
different dscale in the future, he can simply call round_var() before 
get_str_from_var().


* numericvar_to_int8() no longer scribbles on its input either. It creates 
a temporary copy to avoid that. To compensate, the callers no longer 
need to create a temporary copy, so the net # of pallocs is the same, 
but this is nicer. (there's room for a micro-optimization to avoid 
making the temporary copy in numericvar_to_int8() when the argument is 
already suitably rounded - I left that out for now, dunno if it would 
make any difference in practice)


* use a constant for the number 10 in get_str_from_var_sci(), when 
calculating 10^exponent. Saves a palloc() and some cycles to convert 
integer 10 to numeric.


Comments? Assuming no-one sees some fatal flaw in this, I'll commit this 
tomorrow.


- Heikki
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index 68c1f1d..67c6050 100644
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -276,6 +276,16 @@ static NumericDigit const_two_data[1] = {2};
 static NumericVar const_two =
 {1, 0, NUMERIC_POS, 0, NULL, const_two_data};
 
+#if DEC_DIGITS == 4 || DEC_DIGITS == 2
+static NumericDigit const_ten_data[1] = {10};
+static NumericVar const_ten =
+{1, 0, NUMERIC_POS, 0, NULL, const_ten_data};
+#elif DEC_DIGITS == 1
+static NumericDigit const_ten_data[1] = {1};
+static NumericVar const_ten =
+{1, 1, NUMERIC_POS, 0, NULL, const_ten_data};
+#endif
+
 #if DEC_DIGITS == 4
 static NumericDigit const_zero_point_five_data[1] = {5000};
 #elif DEC_DIGITS == 2
@@ -367,8 +377,9 @@ static void zero_var(NumericVar *var);
 static const char *set_var_from_str(const char *str, const char *cp,
  NumericVar *dest);
 static void set_var_from_num(Numeric value, NumericVar *dest);
+static void init_var_from_num(Numeric num, NumericVar *dest);
 static void set_var_from_var(NumericVar *value, NumericVar *dest);
-static char *get_str_from_var(NumericVar *var, int dscale);
+static char *get_str_from_var(NumericVar *var);
 static char *get_str_from_var_sci(NumericVar *var, int rscale);
 
 static Numeric make_result(NumericVar *var);
@@ -533,18 +544,10 @@ numeric_out(PG_FUNCTION_ARGS)
 
 	/*
 	 * Get the number in the variable format.
-	 *
-	 * Even if we didn't need to change format, we'd still need to copy the
-	 * value to have a modifiable copy for rounding.  set_var_from_num() also
-	 * guarantees there is extra digit space in case we produce a carry out
-	 * from rounding.
 	 */
-	init_var(&x);
-	set_var_from_num(num, &x);
+	init_var_from_num(num, &x);
 
-	str = get_str_from_var(&x, x.dscale);
-
-	free_var(&x);
+	str = get_str_from_var(&x);
 
 	PG_RETURN_CSTRING(str);
 }
@@ -616,12 +619,10 @@ numeric_out_sci(Numeric num, int scale)
 	if (NUMERIC_IS_NAN(num))
 		return pstrdup("NaN");
 
-	init_var(&x);
-	set_var_from_num(num, &x);
+	init_var_from_num(num, &x);
 
 	str = get_str_from_var_sci(&x, scale);
 
-	free_var(&x);
 	return str;
 }
 
@@ -695,8 +696,7 @@ numeric_send(PG_FUNCTION_ARGS)
 	StringInfoData buf;
 	int			i;
 
-	init_var(&x);
-	set_var_from_num(num, &x);
+	init_var_from_num(num, &x);
 
 	pq_begintypsend(&buf);
 
@@ -707,8 +707,6 @@ numeric_send(PG_FUNCTION_ARGS)
 	for (i = 0; i < x.ndigits; i++)
 		pq_sendint(&buf, x.digits[i], sizeof(NumericDigit));
 
-	free_var(&x);
-
 	PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
 }
 
@@ -1150,9 +1148,7 @@ numeric_ceil(PG_FUNCTION_ARGS)
 	if (NUMERIC_IS_NAN(num))
 		PG_RETURN_NUMERIC(make_result(&const_nan));
 
-	init_var(&result);
-
-	set_var_from_num(num, &result);
+	init_var_from_num(num, &result);
 	ceil_var(&result, &result);
 
 	res = make_result(&result);
@@ -1177,9 +1173,7 @@ numeric_floor(PG_FUNCTION_ARGS)
 	if (NUMERIC_IS_NAN(num))
 		PG_RETURN_NUMERIC(make_result(&const_nan));
 
-	init_var(&result)

Re: [HACKERS] [pgsql-www] Maintenance announcement for trill.postgresql.org

2012-11-20 Thread Stefan Kaltenbrunner
On 11/19/2012 06:24 PM, Stefan Kaltenbrunner wrote:
> Hi all!
> 
> 
> There will be planned hardware/software maintenance tomorrow,
> Tuesday (20th November 2012), starting at 16:30 CET - affecting some
> redundant services (ftp and www mirrors) as well as the following public
> hosts:
> 
>  * yridian.postgresql.org (www.postgresql.eu)
>  * antos.postgresql.org (anoncvs.postgresql.org)
>  * malur.postgresql.org (mailinglists)
> 
> During this maintenance window we will be doing some software and
> hardware changes involving a number of reboots and we expect a maximum
> outage of an hour.

this was completed at around 17:00 CET without any incidents, and all
systems should be back online again.




Stefan




Re: [HACKERS] [PATCH] Make pg_basebackup configure and start standby [Review]

2012-11-20 Thread Boszormenyi Zoltan

On 2012-11-20 17:03, Boszormenyi Zoltan wrote:

On 2012-11-18 17:20, Magnus Hagander wrote:


The whole tar writing part of the code needs a lot more comments. It's
entirely unclear what the code does there. Why does the recovery.conf
writing code need to be split up in multiple locations inside
ReceiveTarFile(), for example? It either needs to be simplified, or
explained why it can't be simplified in comments.


I also simplified (the multiple #ifdef blocks are moved out into a function
to make the code shorter) and added comments to this function.

Best regards,
Zoltán Böszörményi

--
--
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
 http://www.postgresql.at/





Re: [HACKERS] [PATCH] Make pg_basebackup configure and start standby [Review]

2012-11-20 Thread Boszormenyi Zoltan

On 2012-11-18 17:20, Magnus Hagander wrote:

On Tue, Oct 23, 2012 at 5:08 PM, Magnus Hagander  wrote:

On Oct 23, 2012 4:52 PM, "Alvaro Herrera"  wrote:

Boszormenyi Zoltan wrote:


Also, the check for conflict between -R and -x/-X is now removed.

The documentation for option -R has changed to reflect this and
there is no different error code 2 now: it would make the behaviour
inconsistent between -Fp and -Ft.


The PQconninfo patch is also attached but didn't change since the
last mail.

Magnus,

This patch is all yours to handle.  I'm guessing nothing will happen
until pgconf.eu is done and over, but hopefully you can share a few
beers with Zoltan over the whole subject (and maybe with Peter about the
PQconninfo stuff?)

I'm not closing this just yet, but if you're not able to handle this
soon, maybe it'd be better to punt it to the November commitfest.

It's on my to do list for when I get back, but correct, won't get to it
until after the conference.

I finally got around to looking at this patch now. Sorry about the way
too long delay.

A few thoughts:

First, on the libpq patch:

I'm not sure I like the for_replication flag to PQconninfo(). It seems
it's quite a limited API, and not very future proof. What's to
say that an app would only be interested in filtering based on
for_replication? One idea might be to have a bitmap field there, and
assign *all* conninfo options to a category. We could then have
categories for NORMAL and REPLICATION. But we could also for example
have a category for PASSWORD (or similar), so that you could get with
and without those. Passing in a 32-bit integer would allow us to have
32 different categories, and we could then use a bitmask to pick
things out.

It might sound a bit like overengineering, but it's also an API and
it's a PITA to change it in the future if more needs show up..


Check.
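
To illustrate the bitmap idea quoted above, the categories could look
something like this (names, values, and the extra parameter are purely
hypothetical -- nothing here is from the actual patch):

#include "libpq-fe.h"

#define PQCONNINFO_NORMAL       (1 << 0)    /* ordinary connection options */
#define PQCONNINFO_REPLICATION  (1 << 1)    /* options relevant for replication */
#define PQCONNINFO_PASSWORD     (1 << 2)    /* secrets, so callers can exclude them */

/* hypothetical signature: return only the options in the requested categories */
extern PQconninfoOption *PQconninfo(PGconn *conn, int categories);

/* e.g. everything a standby's recovery.conf needs, minus any password */
/* opts = PQconninfo(conn, PQCONNINFO_NORMAL | PQCONNINFO_REPLICATION); */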


Second, I wonder if we really need to add the code for requiressl=1,
or if we should just remove it. The spelling requiressl=1 was
deprecated back in 7.4 - which has obviously been out of support for a
long time now.


I removed this option, the code is simpler, thanks to this.


Third, in fillPGconn. If mapping has both conninfoValue and connvalue,
it does a free() on the old value in memberaddr, but if it doesn't it
just overwrites memberaddr with strdup(). Shouldn't those two paths be
the same, meaning shouldn't the  if (*memberaddr) free(*memberaddr);
check be outside the if block?


This point is now moot, see above.


Fourth, I'm not sure I like the "memberaddr" term. It's not wrong, but
it threw me off a couple of times while reading it. It's not all that
important, and I'm not sure about another idea for it though - maybe
just "connmember"?


The variable is now "connmember".

Also, I noticed that there was already a conninfo_storeval(),
the new patch uses it and there's no need to introduce a
new conninfo_setval() function.


Then, about the pg_basebackup patch:

What's the reason for the () around 512 for TARCHUNKSZ?


Removed the () from around the value.


We have a lot of hardcoded entries of the 512 for tar in that file. We
should either keep using 512 as a constant, or change all those to
*also* use the #define. Right now, the code will obviously break if
you change the #define (for example, it compares to 512, but then uses
hdrleft = TARCHUNKSZ - tarhdrsz; to do calculation).


All 512 constants are now using the #define.


The name choice of "basebackup" for the bool in ReceiveTarFile() is
unfortunate, since we use the term base backup to refer to the
complete thing, not just the main tablespace. Probably
"basetablespcae" instead. And it should then be assigned at the top of
the function (to the result of PQgetisnull()), and used in the main
if() branch as well.


Done without your typo, so the variable is "basetablespace". ;-)


Should we really print the status message even if not in verbose mode?
We only print the "base backup complete" messages in verbose mode, for
example.


The message is written only in verbose mode now.


It seems wrong to free() recoveryconf in ReceiveTarFile(). It's
allocated globally at the beginning. While that loop should only be
called once (since only one tablespace can be the main one), it's a
confusing location for the free.


See below.


The whole tar writing part of the code needs a lot more comments. It's
entirely unclear what the code does there. Why does the recovery.conf
writing code need to be split up in multiple locations inside
ReceiveTarFile(), for example? It either needs to be simplified, or
explained why it can't be simplified in comments.

_tarCreateHeader() is really badly named, since it specifically
creates a tar header for recovery.conf only. Either that needs to be a
parameter, or it needs to have a name that indicates this is the only
thing it does. The former is probably better.


_tarCreateHeader() now accepts the file name and the file size arguments.


Much of the tar stuff is very simi

Re: [HACKERS] [v9.3] writable foreign tables

2012-11-20 Thread Albe Laurenz
Kohei KaiGai wrote:
> Probably, it is helpful to provide a helper function that fetches an
attribute-
> number of pseudo "rowid" column from the supplied targetlist.
> If we have GetPseudoRowidColumn() at foreign/foreign.c, the avove
> routine can be rewritten as:
> 
> static AttrNumber
> fileGetForeignRelWidth(PlannerInfo *root,
>   RelOptInfo *baserel,
>   Relation foreignrel,
>   bool inhparent, List
*targetList)
> {
> FileFdwPlanState *fdw_private;
> AttrNumbernattrs = RelationGetNumberOfAttributes(foreignrel);
> AttrNumberanum_rowid;
> 
> fdw_private = palloc0(sizeof(FileFdwPlanState));
> anum_rowid = GetPseudoRowidColumn(..., targetList);
> if (anum_rowid > 0)
> {
> Assert(anum_rowid > nattrs);
> fdw_private->anum_rowid
>= makeDefElem("anum_rowid", (Node
*)makeInteger(anum_rowid));
> nattrs = anum_rowid;
> }
> baserel->fdw_private = fdw_private;
> 
> return nattrs;
> }
> 
> In case when FDW drive wants to push-down other target entry into
foreign-
> side, thus, it needs multiple pseudo-columns, it is decision of the
extension.
> In addition, it does not take API change in the future, if some more
additional
> pseudo-column is required by some other new features.
> 
> How about your opinion?

I think that this is better than what I suggested.

Yours,
Laurenz Albe




Re: [HACKERS] [v9.3] OAT_POST_ALTER object access hooks

2012-11-20 Thread Kohei KaiGai
>> The second hunk to alter.c does not apply anymore; please rebase.
>>
> OK,

Oops, I assumed the patch for ALTER RENAME TO reworks. Sorry.

2012/11/20 Alvaro Herrera :
> Kohei KaiGai wrote:
>
>> I'd like to have catalog/objectaccess.c to wrap-up invocation of hooks, 
>> rather
>> than doing all the stuffs with macros. It allows to use local variables, 
>> unlike
>> macros; that has a risk of naming conflict with temporary variable for
>> ObjectAccessPostCreate.
>
> No objection from me.
>
> --
> Álvaro Herrera                http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services



-- 
KaiGai Kohei 




Re: [HACKERS] [v9.3] writable foreign tables

2012-11-20 Thread Kohei KaiGai
2012/11/20 Albe Laurenz :
> Kohei KaiGai wrote:
 This design tries to kill two-birds with one-stone.
 It enables to add multiple number of pseudo-columns,
 not only "rowid", and makes possible to push-down
 complex calculation of target list into external computing
 resource.

 For example, when user gives the following query:

   SELECT ((c1 - c2) * (c2 - c3))^2 FROM ftable

 it contains a complex calculation in the target-list,
 thus, it also takes CPU cycles of local process.

 If we can replace the "((c1 - c2) * (c2 - c3))^2" by
 a reference to a pseudo-column that also references
 the calculation result on external node, it effectively
 off-load CPU cycles.

 In this case, all we need to do is (1) acquire a slot
 for pseudo-column at GetForeignRelInfo (2) replace
 TargetEntry::expr by Var node that reference this
 pseudo-column.

 It makes sense for performance optimization, so I don't
 want to restrict this handler for "rowid" only.
>
>>> I understand.
>>>
>>> But I think that you still can do that with the change that
>>> I suggest.  I suggest that GetForeignRelInfo (or whatever the
>>> name ends up being) gets the AttrNumber of the proposed "rowid"
>>> column in addition to the parameters you need for what
>>> you want to do.
>>>
>>> Then nothing would keep you from defining those
>>> pseudo-columns.  But all the setup necessary for the "rowid"
>>> column could be moved out of the FDW.  So for the 99% of all
>>> FDW which are only interested in the rowid, things would
>>> get much simpler and they don't all have to implement the
>>> same code.
>
>> All we have to do at get_relation_info() to deal with pseudo-
>> columns (including alternatives of complex calculation, not
>> only "rowid") is just expansion of rel->max_attr.
>> So, if FDW is not interested in something except for "rowid",
>> it can just inform the caller "Yeah, we need just one slot for
>> a pseudo-column of rowid". Otherwise, it can return another
>> value to acquire the slot for arbitrary pseudo-column.
>> I don't think it is a problematic design.
>>
>> However, I'm skeptical 99% of FDWs don't support target-list
>> push-down. At least, it was very desired feature when I had
>> a talk at PGconf.EU last month. :-)
>
> I agree that PostgreSQL should make this technique possible.
>
> My idea should not make this any more difficult.
>
>> So, if we rename it to GetForeignRelWidth, is it defined as
>> follows?
>>
>> extern AttrNumber
>> GetForeignRelWidth(PlannerInfo *root,
>>   RelOptInfo *baserel,
>>   Oid foreigntableid,
>>   bool inhparent,
>>   List *targetList);
>>
>> Right now, inhparent makes no sense because foreign table
>> does not support table inheritance, but it seems to me we
>> shall have this functionality near future.
>
> I am thinking of this declaration:
>
> extern bool
> GetForeignRelWidth(PlannerInfo *root,
>RelOptInfo *baserel,
>Oid foreigntableid,
>bool inhparent,
>List *targetList,
>AttrNumber rowidAttr);
>
> Let me illustrate my idea with some code.
>
> Here's your fileGetForeignRelInfo:
>
>
> static void
> fileGetForeignRelInfo(PlannerInfo *root,
>   RelOptInfo *baserel,
>   Oid foreigntableid,
>   bool inhparent, List *targetList)
> {
> FileFdwPlanState *fdw_private;
> ListCell   *cell;
>
> fdw_private = (FileFdwPlanState *)
> palloc0(sizeof(FileFdwPlanState));
>
> foreach(cell, targetList)
> {
> TargetEntry *tle = lfirst(cell);
>
> if (tle->resjunk &&
> tle->resname && strcmp(tle->resname, "rowid")==0)
> {
> Bitmapset  *temp = NULL;
> AttrNumber  anum_rowid;
> DefElem*defel;
>
> pull_varattnos((Node *)tle, baserel->relid, &temp);
> anum_rowid = bms_singleton_member(temp)
> + FirstLowInvalidHeapAttributeNumber;
> /* adjust attr_needed of baserel */
> if (anum_rowid > baserel->max_attr)
> baserel->max_attr = anum_rowid;
> defel = makeDefElem("anum_rowid",
> (Node *)makeInteger(anum_rowid));
> fdw_private->anum_rowid = defel;
> }
> }
> baserel->fdw_private = fdw_private;
> }
>
>
> I hope that this can be reduced to:
>
>
> static bool
> fileGetForeignRelRowid(PlannerInfo *root,
>RelOptInfo *baserel,
>Oid foreigntableid,
>bool inhparent,
>List *targetList,
>AttrNumber rowidAttr)
> {
> FileFdwPlanState *fdw_private;
> fdw_private = (FileFdwPlanState *)
> palloc0(siz

Re: [HACKERS] Proposal for Allow postgresql.conf values to be changed via SQL

2012-11-20 Thread Amit Kapila
On Monday, November 19, 2012 9:07 PM Amit Kapila wrote:
> On Monday, November 19, 2012 8:36 PM Alvaro Herrera wrote:
> > Amit Kapila wrote:
> >
> > > The only point I can see against SET PERSISTENT is that other
> variants
> > of
> > > SET command can be used in
> > > transaction blocks means for them ROLLBACK TO SAVEPOINT
> functionality
> > works,
> > > but for SET PERSISTENT,
> > > it can't be done.
> > > So to handle that might be we need to mention this point in User
> > Manual, so
> > > that users can be aware of this usage.
> > > If that is okay, then I think SET PERSISTENT is good to go.
> >
> > I think that's okay.  There are other commands which have some forms
> > that can run inside a transaction block and others not.  CLUSTER is
> > one example (maybe the only one?  Not sure).
> 
> In that case, it can have one more advantage that all configuration
> setting
> can be done with one command
> and in future we might want to have option like BOTH where the command
> will
> take effect for memory as well
> as file.
> 
> Can you think of any strong reason why not to have with Alter System
> Command?
> 
> In any case SET PERSISTENT is fine.

If no objections to SET PERSISTENT .. syntax, I shall update the patch for
implementation of same.
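
For readers following along, the syntax under discussion would look roughly
like this (a sketch of the proposal; the exact spelling and semantics are
what this thread is still settling):

-- written to an auto-generated configuration file so it survives a restart;
-- not allowed inside a transaction block, like CLUSTER
SET PERSISTENT work_mem = '64MB';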

With Regards,
Amit Kapila.





Re: [HACKERS] [v9.3] OAT_POST_ALTER object access hooks

2012-11-20 Thread Alvaro Herrera
Kohei KaiGai wrote:

> I'd like to have catalog/objectaccess.c to wrap-up invocation of hooks, rather
> than doing all the stuffs with macros. It allows to use local variables, 
> unlike
> macros; that has a risk of naming conflict with temporary variable for
> ObjectAccessPostCreate.

No objection from me.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [v9.3] OAT_POST_ALTER object access hooks

2012-11-20 Thread Kohei KaiGai
2012/11/19 Alvaro Herrera :
> Kohei KaiGai wrote:
>> Sorry, I missed the attached version.
>> Please use this revision.
>
> All those direct uses of object_access_hook make me think that the
> InvokeObjectAccessHook() macro we have is insufficient.  Maybe we could
> have InvokeObjectAccessHookArg1() so that instead of
>
> +   if (object_access_hook)
> +   {
> +   ObjectAccessPostCreate  pc_arg;
> +
> +   memset(&pc_arg, 0, sizeof(ObjectAccessPostCreate));
> +   pc_arg.is_internal = is_internal;
> +   (*object_access_hook)(OAT_POST_CREATE, AttrDefaultRelationId,
> + RelationGetRelid(rel), attnum, (void *)&pc_arg);
> +   }
>
> we can have
>
> InvokeObjectAccessHookWithArgs1(OAT_POST_CREATE, 
> AttrDefaultRelationId,
> RelationGetRelid(rel), attnum,
> ObjectAccessPostCreate, is_internal, 
> is_internal)
>
> which would expand into the above.  (Since ObjectAccessPostCreate has
> two members, we presumably need InvokeObjectAccessHookWithArgs2 as
> well.  But that doesn't seem to be used anywhere, so maybe that struct
> member is not necessary?)
>
I'd like to have catalog/objectaccess.c to wrap-up invocation of hooks, rather
than doing all the stuffs with macros. It allows to use local variables, unlike
macros; that has a risk of naming conflict with temporary variable for
ObjectAccessPostCreate.

So, how about to have a following definition for example?

void
InvokePostAlterHookArgs(Oid classId, Oid objectId, int subId,
                        Oid auxiliaryId, bool is_internal)
{
    if (object_access_hook)
    {
        ObjectAccessPostAlter pa_arg;

        memset(&pa_arg, 0, sizeof(ObjectAccessPostAlter));
        pa_arg.auxiliary_id = auxiliaryId;
        pa_arg.is_internal = is_internal;
        (*object_access_hook)(OAT_POST_ALTER,
                              classId, objectId, subId,
                              (void *) &pa_arg);
    }
}

and

#define InvokePostAlterHook(classId, objectId, subId) \
InvokePostAlterHookArgs(classId, objectId, subId, InvalidOid, false)

for most points to call.
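
A typical call site would then shrink to a single line, e.g. (hypothetical
location, using the macro defined above):

    InvokePostAlterHook(RelationRelationId, RelationGetRelid(rel), 0);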

Even if ObjectAccessPostCreate is extended in a future version, it does not
require changes at most of the hook invocation points.
We will also be able to have the same semantics for OAT_POST_CREATE and
OAT_DROP as well.

> The second hunk to alter.c does not apply anymore; please rebase.
>
OK,

Thanks,
-- 
KaiGai Kohei 




Re: [HACKERS] Switching timeline over streaming replication

2012-11-20 Thread Amit Kapila
On Monday, November 19, 2012 10:54 PM Heikki Linnakangas wrote:
> On 10.10.2012 17:54, Thom Brown wrote:
>  > Hmm... I get something different.  When I promote standby B, standby
> > C's log shows:
>  >
 
> > The following problems are observed while testing of the patch.
> > Defect-1:
> >
> >1. start primary A
> >2. start standby B following A
> >3. start cascade standby C following B.
> >4. Promote standby B.
> >5. After successful time line switch in cascade standby C, stop
> C.
> >6. Restart C, startup is failing with the following error.
> >
> >  LOG:  database system was shut down in recovery at 2012-11-16
> > 16:26:29 IST
> >  FATAL:  requested timeline 2 does not contain minimum
> > recovery point 0/30143A0 on timeline 1
> >  LOG:  startup process (PID 415) exited with exit code 1
> >  LOG:  aborting startup due to startup process failure
> >
> > The above defect is already discussed in the following link.
> > http://archives.postgresql.org/message-id/00a801cda6f3$4aba27b0$e02e77
> > 10$@ka
> > p...@huawei.com
> 
> Fixed now, sorry for neglecting this earlier. The problem was that if
> the primary switched to a new timeline at position X, and the standby
> followed that switch, on restart it would set minRecoveryPoint to X, and
> the new

Not sure if the above is fixed, as I don't see any code change for this, and
in my tests it still fails.

Below is result of further testing:

Some strange logs are observed during testing. 

Note: Stop the node means doing a smart shutdown. 

Scenario-1: 
1. start primary A 
2. start standby B following A 
3. start cascade standby C following B. 
4. Execute the following commands in the primary A. 
   create table tbl(f int); 
   insert into tbl values(generate_series(1,1000)); 
5. Promote standby B. 
6. Execute the following commands in the primary B. 
   insert into tbl values(generate_series(1001,2000)); 
   insert into tbl values(generate_series(2001,3000)); 

The following logs are present on standby C. 
Please check whether these are expected or not. 

LOG:  restarted WAL streaming at position 0/B00 on tli 2 
LOG:  record with zero length at 0/B024C68 
LOG:  record with zero length at 0/B035528 
LOG:  out-of-sequence timeline ID 1 (after 2) in log segment
0002000B, offset 0 


The following two defects were found while testing the new patch. 

Defect - 1: 

1. start primary A 
2. start standby B following A 
3. start cascade standby C following B. 
4. start another standby D following C. 
5. Promote standby B. 
6. After successful time line switch in cascade standby C & D, stop D. 
7. Restart D, Startup is successful and connecting to standby C. 
8. Stop C. 
9. Restart C, startup is failing. 

Defect-2: 
1. start primary A 
2. start standby B following A 
3. start cascade standby C following B. 
4. Start another standby D following C. 
5. Execute the following commands in the primary A. 
   create table tbl(f int); 
   insert into tbl values(generate_series(1,1000)); 
6. Promote standby B. 
7. Execute the following commands in the primary B. 
   insert into tbl values(generate_series(1001,2000)); 
   insert into tbl values(generate_series(2001,3000)); 

The following logs are observed on standby C: 

LOG:  restarted WAL streaming at position 0/700 on tli 2 
ERROR:  requested WAL segment 00020007 has already been
removed 
LOG:  record with zero length at 0/7028190 
LOG:  record with zero length at 0/7048540 
LOG:  out-of-sequence timeline ID 1 (after 2) in log segment
00020007, offset 0 

The following logs are observed on standby D: 

LOG:  restarted WAL streaming at position 0/700 on tli 2 
LOG:  replication terminated by primary server 
DETAIL:  End of WAL reached on timeline 2 
FATAL:  error reading result of streaming command: ERROR:  requested WAL
segment 00020007 has already been removed 

LOG:  streaming replication successfully connected to primary 

8. Stop standby D normally and restart D. Restart is failing.


Code Review
--
1. 
> Agreed. Cleaning up at end-of-xact won't help walsender or other
non-backend processes, though, because they don't do 
> transactions. But a top-level ResourceOwner that's reset in the
sigsetjmp() cleanup routine would work. 

Do you think the cleanup of files should be done as part of this patch, or
should it be handled separately,
since it already exists in other code paths? In that case maybe a ToDo
item can be added.

2. Also for forbidden_in_wal_sender(firstchar);, instead of handling it as
part of each message, 
   isn't it better if we call only once, something like 
   is_command_allowed(firstchar); 

   switch (fi

Re: [HACKERS] FDW for PostgreSQL

2012-11-20 Thread Kohei KaiGai
2012/11/15 Shigeru Hanada :
> Hi Kaigai-san,
>
> Sorry for the delayed response.  I updated the patch, although I didn't
> change anything about the timing issue you and Fujita-san are concerned about.
>
> 1) add some FDW options for cost estimation.  Default behavior is not
> changed.
> 2) get rid of array of libpq option names, similary to recent change of
> dblink
> 3) enhance document, especially remote query optimization
> 4) rename to postgres_fdw, to avoid naming conflict with the validator which
> exists in core
> 5) cope with changes about error context handling
>
> On Tue, Nov 6, 2012 at 7:36 PM, Kohei KaiGai  wrote:
>>
>> Isn't it possible to pick-up only columns to be used in targetlist or
>> local qualifiers,
>> without modification of baserestrictinfo?
>
>
> IMO, it's possible.  postgres_fdw doesn't modify baserestrictinfo at all; it
> just create two new lists which exclusively point RestrictInfo elements in
> baserestrictinfo.  Pulling vars up from conditions which can't be pushed
> down would gives us list of  necessary columns.  Am I missing something?
>
Hanada-san,

Let me comments on several points randomly.

I also think the new "use_remote_explain" option is good. It works fine
when we use this FDW over a network with more or less latency.
It seems to me its default is "false", so GetForeignRelSize will return
the estimated cost according to ANALYZE instead of a remote EXPLAIN.
Even though you mention the default behavior was not changed, is that
expected? My preference is the current behavior, as is.

The deparseFuncExpr() still has case handling for whether it is an explicit
cast, an implicit cast or a regular function. If my previous proposition has
no flaw, could you fix it up to use the regular function invocation manner?
In case the remote node has an incompatible implicit-cast definition, this
logic can cause a problem.

At InitPostgresFdwOptions(), the source comment says we don't use
malloc() here for simplification of the code. Hmm. I'm not sure why that is
simpler. It seems to me we have no reason to avoid malloc here, even
though the libpq options are acquired using malloc().

Regarding to the regression test.

[kaigai@iwashi sepgsql]$ cat contrib/postgres_fdw/regression.diffs
*** /home/kaigai/repo/sepgsql/contrib/postgres_fdw/expected/postgres_fdw.out
   Sat Nov 17 21:31:19 2012
--- /home/kaigai/repo/sepgsql/contrib/postgres_fdw/results/postgres_fdw.out
Tue Nov 20 13:53:32 2012
***
*** 621,627 
  -- ===
  ALTER FOREIGN TABLE ft1 ALTER COLUMN c5 TYPE int;
  SELECT * FROM ft1 WHERE c1 = 1;  -- ERROR
! ERROR:  invalid input syntax for integer: "1970-01-02 00:00:00"
  CONTEXT:  column c5 of foreign table ft1
  ALTER FOREIGN TABLE ft1 ALTER COLUMN c5 TYPE timestamp;
  -- ===
--- 621,627 
  -- ===
  ALTER FOREIGN TABLE ft1 ALTER COLUMN c5 TYPE int;
  SELECT * FROM ft1 WHERE c1 = 1;  -- ERROR
! ERROR:  invalid input syntax for integer: "Fri Jan 02 00:00:00 1970"
  CONTEXT:  column c5 of foreign table ft1
  ALTER FOREIGN TABLE ft1 ALTER COLUMN c5 TYPE timestamp;
  -- ===

==

I guess this test tries to check the case where a remote column has a data
type incompatible with the local side. Please check timestamp_out(). Its
output format follows the "datestyle" GUC setting, which is influenced by
the OS configuration at initdb time. (Please grep for "datestyle" in
initdb.c!) I'd recommend using another data type for this regression test
to avoid false-positive failures.
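
(For reference, the two outputs in the diff above correspond to different
datestyle settings, e.g.:

SET datestyle TO 'ISO, MDY';        -- timestamp_out() gives 1970-01-02 00:00:00
SET datestyle TO 'Postgres, MDY';   -- gives Fri Jan 02 00:00:00 1970

so any type whose output is datestyle-dependent will be unstable across
machines.)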

Elsewhere, I could not find problems right now.

Thanks,
-- 
KaiGai Kohei 




[HACKERS] faster ts_headline

2012-11-20 Thread Marcin Mańk
Hello,
I've started implementing a system for faster headline generation. WIP
patch is attached.

The idea is to add a new type, currently called hltext (different
names welcome), that stores the text along with the lexization result.
Conceptually it stores an array of tuples like
(word text, type int, lexemes text[])
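
Something along these lines, perhaps -- a sketch of one possible varlena
layout, not necessarily what the attached patch does:

typedef struct HLWord
{
    int32   type;           /* token type from the TS parser */
    int32   wordlen;        /* length of the original token text */
    int32   nlexemes;       /* number of lexemes that follow */
    /* token text and lexemes follow, each NUL-terminated */
} HLWord;

typedef struct HLText
{
    int32   vl_len_;        /* varlena header (use SET_VARSIZE, don't touch) */
    int32   nwords;         /* number of packed HLWord entries that follow */
} HLText;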

A console log is also attached - it shows a 5x performance increase. The
problem is not academic: I have texts this long in an app, and making 20
headlines takes 3s+.

The patch lacks documentation, regression tests, and most auxiliary
functions (especially the I/O functions).


I have a question about the I/O functions of the new type. What format
should they use?

I could make the input function read something like 'english: the
text', where english is the name of the text search configuration. The
input function would do the lexizing.
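
As a sketch of that first option (hltext_from_document() is a made-up
helper that would run the parser and dictionaries and pack the result;
nothing here is final):

PG_FUNCTION_INFO_V1(hltext_in);

Datum
hltext_in(PG_FUNCTION_ARGS)
{
    char       *str = PG_GETARG_CSTRING(0);
    char       *sep = strchr(str, ':');
    char       *cfgname;
    Oid         cfgid;

    if (sep == NULL)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
                 errmsg("hltext input must look like \"config: text\"")));

    /* the text search configuration name comes before the colon */
    cfgname = pnstrdup(str, sep - str);
    cfgid = get_ts_config_oid(stringToQualifiedNameList(cfgname), false);

    /* lexize the rest and build the datum */
    PG_RETURN_POINTER(hltext_from_document(cfgid, sep + 1));
}

One thing to keep in mind with this variant is that the stored value then
depends on the configuration and dictionaries present at input time.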

I could make it read some custom format which would contain the
tokens, token types and lexemes. Can I use flex/bison for that, or is
there a good reason not to, in which case I should write a hand-made parser?

Finally, I could make the type an actual "create type
hltext_element(word text, type int, lexemes text[])", by manually
filling in the applicable catalogs, and have users declare columns as
hltext_element[]. Is there a nice way to manipulate objects of such a
type from within the backend? Is there an example? I suppose that in
this case storage would not be as efficient as what I implemented.

Which one should I choose? Other ideas?

Regards
Marcin Mańk
$ psql -p 5454 postgres -c 'create table tmp(t text, hlt hltext);';
CREATE TABLE
$ bash -c 'echo "insert into tmp(t) values(\$CUTCUT\$" ; curl -d "" 
http://en.wikipedia.org/wiki/Michael_Jackson; echo "\$CUTCUT\$)"' | ./bin/psql 
-p 5454 postgres
  % Total% Received % Xferd  Average Speed   TimeTime Time  Current
 Dload  Upload   Total   SpentLeft  Speed
100  636k  100  636k0 0   224k  0  0:00:02  0:00:02 --:--:--  258k
INSERT 0 1
$ psql -p 5454 postgres
psql (9.0.5, server 9.3devel)
WARNING: psql version 9.0, server version 9.3.
 Some psql features might not work.
Type "help" for help.

postgres=# update tmp set hlt = to_hltext('english', t);
UPDATE 1
postgres=# \timing
Timing is on.
postgres=# select ts_headline('english', t, to_tsquery('janet & jackson'), 
'MaxFragments=2 MinWords=5 MaxWords=15') from tmp;

 ts_headline
 
-
 Jackson-Style. (video production; Michael and Janet 
Jackson video) .  29 . Theatre Crafts International ... Green Day Look 
Forward To Janet Jackson's VMA Tribute To Michael" .  MTV . 
September
(1 row)

Time: 414,588 ms
postgres=# select ts_headline('english', t, to_tsquery('janet & jackson'), 
'MaxFragments=2 MinWords=5 MaxWords=15') from tmp;

 ts_headline
 
-
 Jackson-Style. (video production; Michael and Janet 
Jackson video) .  29 . Theatre Crafts International ... Green Day Look 
Forward To Janet Jackson's VMA Tribute To Michael" .  MTV . 
September
(1 row)

Time: 75,912 ms
postgres=# select ts_headline('english', hlt, to_tsquery('janet & jackson'), 
'MaxFragments=2 MinWords=5 MaxWords=15') from tmp;

 ts_headline
 
-
 Jackson-Style. (video production; Michael and Janet 
Jackson video) .  29 . Theatre Crafts International ... Green Day Look 
Forward To Janet Jackson's VMA Tribute To Michael" .  MTV . 
September
(1 row)

Time: 17,539 ms
postgres=# select ts_headline('english', hlt, to_tsquery('janet & jackson'), 
'MaxFragments=2 MinWords=5 MaxWords=15') from tmp;

 ts_headline
 

Re: [HACKERS] logical changeset generation v3 - Source for Slony

2012-11-20 Thread Andres Freund
Hi,

On 2012-11-19 19:50:32 -0500, Steve Singer wrote:
> On 12-11-18 11:07 AM, Andres Freund wrote:
> >I think we should provide some glue code to do this, otherwise people
> >will start replicating all the bugs I hacked into this... More
> >seriously: I think we should have support code here, no user will want
> >to learn the intricacies of feedback messages and such. Where would that
> >live? No idea.

> libpglogicalrep.so ?

Yea. We don't really have the infrastructure for that yet
though... Robert and I were just talking about that recently...


> >I wholeheartedly agree. It should also be cleaned up a fair bit before
> >others copy it, should we not go for having some client-side library.
> >
> >Imo the library could very roughly be something like:
> >
> >state = SetupStreamingLLog(replication-slot, ...);
> >while((message = StreamingLLogNextMessage(state))
> >{
> >  write(outfd, message->data, message->length);
> >  if (received_100_messages)
> >  {
> >   fsync(outfd);
> >   StreamingLLogConfirm(message);
> >  }
> >}
> >
> >Although I guess that's not good enough because StreamingLLogNextMessage
> >would be blocking, but that shouldn't be too hard to work around.
> >
>
> How about we pass a timeout value to StreamingLLogNextMessage(..), so that it
> returns if no data is available after the timeout, giving the caller a
> chance to do something else.

That doesn't really integrate into the sort of loop that's often built around
poll(2), select(2) and similar. It probably should return NULL if
there's nothing there yet, and we should have a
StreamingLLogWaitForMessage() or such.
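
Roughly like this, say (sketch only -- none of these functions exist yet;
StreamingLLogSocket() and StreamingLLogConsumeInput() are additional
made-up names, and the usual unistd.h/sys/select.h includes are assumed):

long        received = 0;

for (;;)
{
    StreamingLLogMessage *message;
    fd_set      readfds;
    int         sock;

    /* drain whatever is already buffered; NULL means "nothing yet" */
    while ((message = StreamingLLogNextMessage(state)) != NULL)
    {
        write(outfd, message->data, message->length);
        if (++received % 100 == 0)
        {
            fsync(outfd);
            StreamingLLogConfirm(message);
        }
    }

    /*
     * Nothing pending: wait until the replication socket (or any other fd
     * the application cares about) becomes readable.  A simple client could
     * instead just call StreamingLLogWaitForMessage(state) here.
     */
    sock = StreamingLLogSocket(state);
    FD_ZERO(&readfds);
    FD_SET(sock, &readfds);
    if (select(sock + 1, &readfds, NULL, NULL, NULL) < 0 && errno != EINTR)
        break;                  /* real code would report the error */

    StreamingLLogConsumeInput(state);
}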

> >>This is basically the Slony 2.2 sl_log format minus a few columns we no
> >>longer need (txid, actionseq).
> >>command_args is a PostgreSQL text array of column=value pairs, i.e. [
> >>{id=1},{name='steve'},{project='slony'}]
> >It seems to me that that makes escaping unnecessarily complicated, but
> >given you already have all the code... ;)
>
> When I look at the actual code/representation we picked it is closer to
> {column1,value1,column2,value2...}

That still means you need to escape and later parse the columnN, valueN
values. I would have expected something like (length:data, length:data)+
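
For instance (sketch only, using the backend's StringInfo; walking the
tuple descriptor is left out):

static void
append_item(StringInfo buf, const char *data, int len)
{
    /* length prefix first, then the raw bytes -- no escaping needed */
    appendStringInfo(buf, "%d:", len);
    appendBinaryStringInfo(buf, data, len);
}

/*
 * e.g. append_item(buf, "id", 2); append_item(buf, "1", 1);
 * produces "2:id1:1" -- the reader parses the length, then copies bytes.
 */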

> >>I don't think our output plugin will be much more complicated than the
> >>test_decoding plugin.
> >Good. That's the idea ;). Are you OK with the interface as it is now, or
> >would you like to change something?
>
> I'm going to think about this some more and maybe try to write an example
> plugin before I can say anything with confidence.

That would be very good.

> >Yes. We will also need something like that. If you remember the first
> >prototype we sent to the list, it included the concept of an
> >'origin_node' in the WAL record. I think you actually reviewed that one ;)
> >
> >That was exactly aimed at something like this...
> >
> >Since then my thoughts about how the origin_id looks like have changed a
> >bit:
> >- origin id is internally still represented as an uint32/Oid
> >   - never visible outside of wal/system catalogs
> >- externally visible it gets
> >   - assigned an uuid
> >   - optionally assigned a user defined name
> >- user settable (permissions?) origin when executing sql:
> >   - SET change_origin_uuid = 'uuid';
> >   - SET change_origin_name = 'user-settable-name';
> >   - defaults to the local node
> >- decoding callbacks get passed the origin of a change
> >   - txn->{origin_uuid, origin_name, origin_internal?}
> >- the init decoding callback can setup an array of interesting origins,
> >   so the others don't even get the ReorderBuffer treatment
> >
> >I have to thank the discussion on -hackers and a march through prague
> >with Marko here...

> So would the uuid and optional name assignment be done in the output plugin
> or somewhere else?

That would be postgres infrastructure. The output plugin would get
passed at least uuid and name and potentially the internal name as well
(might be useful to build some internal caching of information).
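
To make that a bit more concrete, a sketch of how a plugin could then
filter (ReorderBufferTXN and txn->origin_name are assumed names here,
none of this exists yet):

static bool
origin_is_interesting(ReorderBufferTXN *txn)
{
    /* replicate only changes that originated on this node */
    return txn->origin_name == NULL ||
        strcmp(txn->origin_name, "local") == 0;
}

    /* ...near the top of the plugin's change callback: */
    if (!origin_is_interesting(txn))
        return;                 /* skip changes replayed from elsewhere */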

> When/how does the uuid get generated, and where do we store it so the same
> uuid gets returned when postgres restarts?  Slony today stores all this type
> of stuff in user-level tables and user-level functions (because it has no
> other choice).

Would need to be its own system catalog.

> What is the connection between these values and the
> 'slot-id' in your proposal for the init arguments? Does the slot-id need to
> be the external uuid of the other end or is there no direct connection?

None, really. The "slot-id" is only an identifier for a
replication connection (which should live longer than a single
postmaster run) and records the point up to which
you have replicated. We need to manage some local resources based on that.

> Today slony allows us to replicate between two databases in the same
> postgresql cluster (I use this for testing all the time)
> Slony also allows for two different 'slony clusters' to be setup in the 

Re: [HACKERS] logical changeset generation v3

2012-11-20 Thread Andres Freund
On 2012-11-20 09:30:40 +0900, Michael Paquier wrote:
> On Mon, Nov 19, 2012 at 5:50 PM, Andres Freund wrote:
> > On 2012-11-19 16:28:55 +0900, Michael Paquier wrote:
> > > I am just looking at this patch and will provide some comments.
> > > By the way, you forgot the installation part of pg_receivellog, please see
> > > patch attached.
> >
> > That actually was somewhat intended, I thought people wouldn't like the
> > name and I didn't want a binary that's going to be replaced anyway lying
> > around ;)
> >
> OK no problem. For sure this is going to happen, I was wondering myself if
> it could be possible to merge pg_receivexlog and pg_receivellog into a
> single utility with multiple modes :)

I don't really see that; the differences are already significant and imo
are bound to get bigger. It shouldn't live in pg_basebackup/ either.

> Btw, here are some extra comments based on my progress, hope it will be
> useful for other people playing around with your patches.
> 1) Necessary to install the contrib module test_decoding on server side or
> the test case will not work.
> 2) Obtention of the following logs on server:
> LOG:  forced to assume catalog changes for xid 1370 because it was running
> to early
> WARNING:  ABORT 1370
> Actually I saw that there are many warnings like this.

Those aren't unexpected. Perhaps I should not make it a warning then...

A short explanation:

We can only decode tuples we see in the WAL when we already have a
timetravel catalog snapshot before that transaction started. To build
such a snapshot we need to collect information about committed which
changed the catalog. Unfortunately we can't diagnose whether a txn
changed the catalog without a snapshot so we just assume all committed
ones do - it just costs a bit of memory. Thats the background of the
"forced to assume catalog changes for ..." message.

The reason for the ABORTs is related but different. We start out in the
"SNAPBUILD_START" state when we try to build a snapshot. When we find
initial information about running transactions (i.e. xl_running_xacts)
we switch to the "SNAPBUILD_FULL_SNAPSHOT" state, which means we can
decode all changes in transactions that start *after* the current
lsn. Earlier transactions might have tuples based on a catalog state we
can't query.
Only when all transactions we observed as running before reaching the
FULL_SNAPSHOT state have finished do we switch to SNAPBUILD_CONSISTENT.
As we want a consistent/reproducible set of transactions to produce
output via the logstream, we only pass transactions to the output plugin
if they commit *after* CONSISTENT (they can start earlier though!). This
allows us to produce a pg_dump-compatible snapshot at the moment we become
consistent that contains exactly the changes we won't stream out.
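
In pseudocode the decision looks roughly like this (names simplified;
snapstate->consistent_lsn is made up for illustration, the real code may
track this differently):

static bool
should_stream_out(Snapstate *snapstate, XLogRecPtr commit_lsn)
{
    switch (snapstate->state)
    {
        case SNAPBUILD_START:
            /* no xl_running_xacts seen yet, we cannot decode anything */
            return false;

        case SNAPBUILD_FULL_SNAPSHOT:
            /* can decode txns starting after this point, but not stream them */
            return false;

        case SNAPBUILD_CONSISTENT:
            /* stream txns committing at or after the consistent point */
            return commit_lsn >= snapstate->consistent_lsn;
    }
    return false;               /* keep the compiler quiet */
}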

Does that make sense?

> 3) Assertion failure while running pgbench, I was  just curious to see how
> it reacted when logical replication was put under a little bit of load.
> TRAP: FailedAssertion("!(((xid) >= ((TransactionId) 3)) &&
> ((snapstate->xmin_running) >= ((TransactionId) 3)))", File: "snapbuild.c",
> Line: 877)
> => pgbench -i postgres; pgbench -T 500 -c 8 postgres

Can you reproduce this one? I would be interested in the log output,
because I ran pgbench for quite some time and haven't seen that one
since fixing some issues last week.

It implies that snapstate->nrrunning has lost touch with reality...

Greetings,

Andres Freund

--
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] WIP: index support for regexp search

2012-11-20 Thread Heikki Linnakangas
Glad to see this patch hasn't been totally forgotten. Being able to use 
indexes for regular expressions would be really cool!


Back in January, I asked for some high-level description of how the 
algorithm works 
(http://archives.postgresql.org/message-id/4f187d5c.30...@enterprisedb.com). 
That's still sorely needed. Googling around, I found the slides for your 
presentation on this from PGConf.EU - it would be great to have the 
information from that presentation included in the patch.


- Heikki




Re: [HACKERS] Dumping an Extension's Script

2012-11-20 Thread Dimitri Fontaine
Robert Haas  writes:
> I'm not opposed to the idea of being able to make extensions without
> files on disk work ... but I consider it a niche use case; the
> behavior we have right now works well for me and hopefully for others
> most of the time.

Apparently I'm not the only one doing extensions without anything to
compile, all SQL:

   http://keithf4.com/extension_tips_3

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support




Re: [HACKERS] [v9.3] writable foreign tables

2012-11-20 Thread Albe Laurenz
Kohei KaiGai wrote:
>>> This design tries to kill two birds with one stone.
>>> It enables adding multiple pseudo-columns,
>>> not only "rowid", and makes it possible to push down
>>> complex target-list calculations to an external computing
>>> resource.
>>>
>>> For example, when user gives the following query:
>>>
>>>   SELECT ((c1 - c2) * (c2 - c3))^2 FROM ftable
>>>
>>> it contains a complex calculation in the target-list,
>>> thus it also consumes CPU cycles of the local process.
>>>
>>> If we can replace the "((c1 - c2) * (c2 - c3))^2" by
>>> a reference to a pseudo-column that also references
>>> the calculation result on the external node, it effectively
>>> off-loads CPU cycles.
>>>
>>> In this case, all we need to do is (1) acquire a slot
>>> for the pseudo-column at GetForeignRelInfo and (2) replace
>>> TargetEntry::expr by a Var node that references this
>>> pseudo-column.
>>>
>>> It makes sense for performance optimization, so I don't
>>> want to restrict this handler to "rowid" only.

>> I understand.
>>
>> But I think that you still can do that with the change that
>> I suggest.  I suggest that GetForeignRelInfo (or whatever the
>> name ends up being) gets the AttrNumber of the proposed "rowid"
>> column in addition to the parameters you need for what
>> you want to do.
>>
>> Then nothing would keep you from defining those
>> pseudo-columns.  But all the setup necessary for the "rowid"
>> column could be moved out of the FDW.  So for the 99% of all
>> FDWs which are only interested in the rowid, things would
>> get much simpler, and they wouldn't all have to implement the
>> same code.

> All we have to do at get_relation_info() to deal with pseudo-
> columns (including alternatives to complex calculations, not
> only "rowid") is just to expand rel->max_attr.
> So, if the FDW is not interested in anything except "rowid",
> it can just inform the caller "Yeah, we need just one slot for
> a pseudo-column of rowid". Otherwise, it can return another
> value to acquire slots for arbitrary pseudo-columns.
> I don't think it is a problematic design.
> 
> However, I'm skeptical that 99% of FDWs won't want target-list
> push-down. At least, it was a much-requested feature when I gave
> a talk at PGconf.EU last month. :-)

I agree that PostgreSQL should make this technique possible.

My idea should not make this any more difficult.

> So, if we rename it to GetForeignRelWidth, is it defined as
> follows?
> 
> extern AttrNumber
> GetForeignRelWidth(PlannerInfo *root,
>   RelOptInfo *baserel,
>   Oid foreigntableid,
>   bool inhparent,
>   List *targetList);
> 
> Right now, inhparent makes no sense because foreign tables
> do not support table inheritance, but it seems to me we
> shall have this functionality in the near future.

I am thinking of this declaration:

extern bool
GetForeignRelWidth(PlannerInfo *root,
   RelOptInfo *baserel,
   Oid foreigntableid,
   bool inhparent,
   List *targetList,
   AttrNumber rowidAttr);

Let me illustrate my idea with some code.

Here's your fileGetForeignRelInfo:


static void
fileGetForeignRelInfo(PlannerInfo *root,
                      RelOptInfo *baserel,
                      Oid foreigntableid,
                      bool inhparent, List *targetList)
{
    FileFdwPlanState *fdw_private;
    ListCell   *cell;

    fdw_private = (FileFdwPlanState *) palloc0(sizeof(FileFdwPlanState));

    foreach(cell, targetList)
    {
        TargetEntry *tle = lfirst(cell);

        if (tle->resjunk &&
            tle->resname && strcmp(tle->resname, "rowid") == 0)
        {
            Bitmapset  *temp = NULL;
            AttrNumber  anum_rowid;
            DefElem    *defel;

            pull_varattnos((Node *) tle, baserel->relid, &temp);
            anum_rowid = bms_singleton_member(temp)
                + FirstLowInvalidHeapAttributeNumber;
            /* adjust attr_needed of baserel */
            if (anum_rowid > baserel->max_attr)
                baserel->max_attr = anum_rowid;
            defel = makeDefElem("anum_rowid",
                                (Node *) makeInteger(anum_rowid));
            fdw_private->anum_rowid = defel;
        }
    }
    baserel->fdw_private = fdw_private;
}


I hope that this can be reduced to:


static bool
fileGetForeignRelRowid(PlannerInfo *root,
                       RelOptInfo *baserel,
                       Oid foreigntableid,
                       bool inhparent,
                       List *targetList,
                       AttrNumber rowidAttr)
{
    FileFdwPlanState *fdw_private;
    DefElem    *defel;

    fdw_private = (FileFdwPlanState *) palloc0(sizeof(FileFdwPlanState));

    defel = makeDefElem("anum_rowid",
                        (Node *) makeInteger(rowidAttr));
    fdw_private->anum_rowid = defel;

    baserel->fdw_private = fdw_private;

    return true;  /* we'll use ro

Re: [HACKERS] Timing events WIP v1

2012-11-20 Thread Pavel Stehule
2012/11/20 Greg Smith :
> On 11/16/12 12:20 AM, Craig Ringer wrote:
>
>> Not that implementing `hstore_to_json` is exactly hard, mind you, as a
>> quick search shows: https://gist.github.com/2318757
>
>
> Both pulling hstore more firmly into core and adopting something like a
> hstore_to_json call as the preferred UI for timing event data are both
> directions I'd be happy to go in.  That's why I stopped before trying to even
> implement that part.  I think the general direction to go is clear, but the
> details on how to present the resulting data is more of a tarp than even a
> bikeshed at this point.  It's not a hard problem though.
>

I don't like to see the current hstore in core - it doesn't support
nesting, it doesn't support different data types, and it is not well
supported by plpgsql.

regards

Pavel

> --
> Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
> PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
>
>




Re: [HACKERS] Timing events WIP v1

2012-11-20 Thread Greg Smith

On 11/16/12 12:20 AM, Craig Ringer wrote:


Not that implementing `hstore_to_json` is exactly hard, mind you, as a
quick search shows: https://gist.github.com/2318757


Both pulling hstore more firmly into core and adopting something like a 
hstore_to_json call as the preferred UI for timing event data are both 
directions I'd be happy to go in.  That's why I stopped before trying to 
even implement that part.  I think the general direction to go is clear, 
but the details on how to present the resulting data is more of a tarp 
than even a bikeshed at this point.  It's not a hard problem though.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

