Re: [HACKERS] Speed dblink using alternate libpq tuple storage

2012-02-01 Thread Kyotaro HORIGUCHI
Hello,

 Another thing: if realloc() fails, the old pointer stays valid.
 So it needs to be assigned to a temp variable first, before
 overwriting the old pointer.

Mmm, I had misunderstood realloc. I'll fix that in the
next patch.
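
For reference, here is a minimal sketch of the safe pattern (names are
illustrative only, not taken from the patch):

#include <stdlib.h>

/* Grow a buffer without leaking it on failure: realloc() returns NULL
 * on failure but leaves the old allocation valid, so the old pointer
 * must not be overwritten until we know the call succeeded. */
static int
grow_buffer(char **buf, size_t newsize)
{
	char	   *tmp = realloc(*buf, newsize);

	if (tmp == NULL)
		return -1;				/* *buf is still valid and still owned */
	*buf = tmp;
	return 0;
}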

 And it seems malloc() is preferable to palloc(), to avoid
 exceptions inside the row processor.  Although the latter
 could be made to work, it might be unnecessary
 complexity.  (store current pqresult into remoteConn?)

Hmm, palloc may throw ERRCODE_OUT_OF_MEMORY, so I must catch it
and return NULL. In that case there seems to be no difference
from using malloc after all. However, the prohibition on throwing
exceptions in the RowProcessor is not based on any established
fact, so palloc here may make sense if we can do that.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: [HACKERS] Index-only scan performance regression

2012-02-01 Thread Dean Rasheed
On 31 January 2012 23:04, Tom Lane t...@sss.pgh.pa.us wrote:
 Dean Rasheed dean.a.rash...@gmail.com writes:
 The thing I'm unsure about is whether sending sinval
 messages when the visibility map is extended is a good idea.

 Seems perfectly reasonable to me.  They'd occur so seldom as to be
 more than repaid if we can scrape some cost out of the mainline paths.


OK, thanks. That's good.


 The real objection to this probably is that if it only saves anything
 for tables that don't have a VM yet, it's dubious whether it's worth
 doing.  But if we can avoid wasted checks for VM extension as well,
 then I think it's probably a no-brainer.

                        regards, tom lane

Yes, it applies in the same way to VM extension - the case where the
table has grown but the VM has not yet been extended - though I don't
see why that is any worse than the case of not having a VM yet.

Actually I think that this is not such an uncommon case - for a table
which has only had data inserted - no deletes or updates - it is
tempting to think that ANALYSE is sufficient, and that there is no
need to VACUUM. If it were simply the case that this caused an
index-only scan to have no real benefit, you might be willing to live
with normal index scan performance. But actually it causes a very
significant performance regression beyond that, to well below 9.1
performance.

Regards,
Dean



Re: [HACKERS] Speed dblink using alternate libpq tuple storage

2012-02-01 Thread Marko Kreen
On Wed, Feb 1, 2012 at 10:32 AM, Kyotaro HORIGUCHI
horiguchi.kyot...@oss.ntt.co.jp wrote:
 Another thing: if realloc() fails, the old pointer stays valid.
 So it needs to be assigned to temp variable first, before
 overwriting old pointer.

  Mmm, I had misunderstood realloc. I'll fix that in the
 next patch.

Please wait a moment; I started doing small cleanups,
and now have some larger ones too.  I'll send it soon.

OTOH, if you have already done something, you can send it;
I have various states in Git, so it should not be hard
to merge things.

 And it seems malloc() is preferable to palloc(), to avoid
 exceptions inside the row processor.  Although the latter
 could be made to work, it might be unnecessary
 complexity.  (store current pqresult into remoteConn?)

 Hmm, palloc may throw ERRCODE_OUT_OF_MEMORY, so I must catch it
 and return NULL. In that case there seems to be no difference
 from using malloc after all. However, the prohibition on throwing
 exceptions in the RowProcessor is not based on any established
 fact, so palloc here may make sense if we can do that.

No, I was thinking about storing the result in the connection
struct and then letting the exception pass, as the
PGresult can be cleaned up later.  Thus we could get rid
of TRY/CATCH in the per-row handler.  (At that point
the PGresult is already under the PGconn, so maybe
it's enough to clean that one up later?)

But for now, as the TRY is already there, it should also be
simple to move the palloc usage inside it.
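
Roughly this shape, say (a sketch only -- the row-processor signature
and the error handling details are assumptions, not the actual patch):

#include "postgres.h"
#include "libpq-fe.h"

/* Hypothetical per-row callback: a palloc() out-of-memory error is
 * trapped and turned into a failure return instead of a longjmp()
 * through libpq. */
static int
store_row(PGresult *res, void *param)
{
	MemoryContext oldcontext = CurrentMemoryContext;
	char	  **values = NULL;

	(void) param;				/* would carry per-query state */

	PG_TRY();
	{
		/* palloc() elogs ERROR on out-of-memory; trapped below */
		values = (char **) palloc(PQnfields(res) * sizeof(char *));
	}
	PG_CATCH();
	{
		/* recover the error state so we can report failure to libpq */
		MemoryContextSwitchTo(oldcontext);
		FlushErrorState();
		return -1;
	}
	PG_END_TRY();

	/* ... fill values[] and store the row ... */
	return 1;
}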

-- 
marko



[HACKERS] Refactoring log_newpage

2012-02-01 Thread Simon Riggs
At present log_newpage() produces log records called XLOG_HEAP_NEWPAGE.

That routine is used by HEAP, BTREE, GIN, SPGIST rmgrs, as well as
various forks.

WAL contains no information as to which rmgr the data refers to,
which makes debugging much harder, skews efforts to optimise WAL
traffic, and is a pretty gross modularity violation of the whole rmgr
concept.

This refactoring adds an RmgrId field onto each new-page record and
makes it clearer that certain heap routines are actually generic. The
WAL records are still marked as HEAP rmgr and have the XLOG_NEWPAGE
record type, but at least we can tell them apart. (We already had
forknum, just not rmgrid.)
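
For reference, the record header ends up looking roughly like this
(reconstructed from the patch below, not copied from it; the types come
from the usual storage headers):

/* Sketch of the refactored new-page record header */
typedef struct xl_newpage
{
	RelFileNode node;			/* physical relation identifier */
	ForkNumber	forknum;		/* fork the page belongs to */
	BlockNumber blkno;			/* block number within that fork */
	RmgrId		rmid;			/* rmgr that actually logged the page */
} xl_newpage;

#define SizeOfXLogNewpage	(offsetof(xl_newpage, rmid) + sizeof(RmgrId))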

Another refactoring option would be to have specific record types for
each rmgr; that would normally be my preferred option, but it seems
likely to use up too many record type numbers in the index rmgrs.

For immediate commit.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index fe06bdc..055ce0b 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -528,10 +528,10 @@ ginbuildempty(PG_FUNCTION_ARGS)
 	MarkBufferDirty(RootBuffer);
 
 	/* XLOG the new pages */
-	log_newpage(index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
+	log_newpage(RM_GIN_ID, index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
 BufferGetBlockNumber(MetaBuffer),
 BufferGetPage(MetaBuffer));
-	log_newpage(index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
+	log_newpage(RM_GIN_ID, index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
 BufferGetBlockNumber(RootBuffer),
 BufferGetPage(RootBuffer));
 	END_CRIT_SECTION();
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 99a431a..feabe28 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -4502,7 +4502,7 @@ log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
 }
 
 /*
- * Perform XLogInsert of a HEAP_NEWPAGE record to WAL. Caller is responsible
+ * Perform XLogInsert of an XLOG_NEWPAGE record to WAL. Caller is responsible
  * for writing the page to disk after calling this routine.
  *
  * Note: all current callers build pages in private memory and write them
@@ -4514,10 +4514,10 @@ log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
  * not do anything that assumes we are touching a heap.
  */
 XLogRecPtr
-log_newpage(RelFileNode *rnode, ForkNumber forkNum, BlockNumber blkno,
-			Page page)
+log_newpage(RmgrId rmid, RelFileNode *rnode, ForkNumber forkNum,
+			BlockNumber blkno, Page page)
 {
-	xl_heap_newpage xlrec;
+	xl_newpage xlrec;
 	XLogRecPtr	recptr;
 	XLogRecData rdata[2];
 
@@ -4527,9 +4527,10 @@ log_newpage(RelFileNode *rnode, ForkNumber forkNum, BlockNumber blkno,
 	xlrec.node = *rnode;
 	xlrec.forknum = forkNum;
 	xlrec.blkno = blkno;
+	xlrec.rmid = rmid;
 
 	rdata[0].data = (char *) &xlrec;
-	rdata[0].len = SizeOfHeapNewpage;
+	rdata[0].len = SizeOfXLogNewpage;
 	rdata[0].buffer = InvalidBuffer;
 	rdata[0].next = &(rdata[1]);
 
@@ -4538,7 +4539,7 @@ log_newpage(RelFileNode *rnode, ForkNumber forkNum, BlockNumber blkno,
 	rdata[1].buffer = InvalidBuffer;
 	rdata[1].next = NULL;
 
-	recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_NEWPAGE, rdata);
+	recptr = XLogInsert(rmid, XLOG_NEWPAGE, rdata);
 
 	/*
 	 * The page may be uninitialized. If so, we can't set the LSN and TLI
@@ -4800,25 +4801,25 @@ heap_xlog_visible(XLogRecPtr lsn, XLogRecord *record)
 	}
 }
 
-static void
-heap_xlog_newpage(XLogRecPtr lsn, XLogRecord *record)
+/*
+ * Generic helper function used by many Rmgrs to restore new pages, so do
+ * not do anything that assumes we are touching a heap or a specific fork.
+ */
+void
+redo_xlog_newpage(XLogRecPtr lsn, XLogRecord *record)
 {
-	xl_heap_newpage *xlrec = (xl_heap_newpage *) XLogRecGetData(record);
+	xl_newpage *xlrec = (xl_newpage *) XLogRecGetData(record);
 	Buffer		buffer;
 	Page		page;
 
-	/*
-	 * Note: the NEWPAGE log record is used for both heaps and indexes, so do
-	 * not do anything that assumes we are touching a heap.
-	 */
 	buffer = XLogReadBufferExtended(xlrec->node, xlrec->forknum, xlrec->blkno,
 	RBM_ZERO);
 	Assert(BufferIsValid(buffer));
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 	page = (Page) BufferGetPage(buffer);
 
-	Assert(record->xl_len == SizeOfHeapNewpage + BLCKSZ);
-	memcpy(page, (char *) xlrec + SizeOfHeapNewpage, BLCKSZ);
+	Assert(record->xl_len == SizeOfXLogNewpage + BLCKSZ);
+	memcpy(page, (char *) xlrec + SizeOfXLogNewpage, BLCKSZ);
 
 	/*
 	 * The page may be uninitialized. If so, we can't set the LSN and TLI
@@ -5515,8 +5516,8 @@ heap_redo(XLogRecPtr lsn, XLogRecord *record)
 		case XLOG_HEAP_HOT_UPDATE:
 			heap_xlog_update(lsn, record, true);
 			break;
-		case XLOG_HEAP_NEWPAGE:
-			heap_xlog_newpage(lsn, record);
+		case XLOG_NEWPAGE:
+			redo_xlog_newpage(lsn, record);
 			break;
 		case 

[HACKERS] patch for implementing SPI_gettypemod()

2012-02-01 Thread Chetan Suttraway
Hi All,

This is regarding the TODO item :
Add SPI_gettypmod() to return a field's typemod from a TupleDesc

The related message is:
http://archives.postgresql.org/pgsql-hackers/2005-11/msg00250.php

This basically talks about having an SPI_gettypemod() which returns the
typmod of a field of a TupleDesc.

Please refer to the attached patch, based on the suggested implementation.
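
For illustration, usage would look something like this (hypothetical
query and names, assuming SPI_connect() has already succeeded):

/* Fetch the typmod of the first output column of a query */
if (SPI_execute("SELECT v FROM t", true, 0) == SPI_OK_SELECT)
{
	int4		typmod = SPI_gettypemod(SPI_tuptable->tupdesc, 1);

	if (SPI_result == SPI_ERROR_NOATTRIBUTE)
		elog(ERROR, "invalid attribute number");

	/* for varchar(n), typmod is n + VARHDRSZ; -1 means no typmod */
}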


Regards,
Chetan

-- 
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

 Website: www.enterprisedb.com
EnterpriseDB Blog : http://blogs.enterprisedb.com
Follow us on Twitter : http://www.twitter.com/enterprisedb
diff --git a/src/backend/executor/spi.c b/src/backend/executor/spi.c
index 81f284c..659122e 100644
--- a/src/backend/executor/spi.c
+++ b/src/backend/executor/spi.c
@@ -955,6 +955,24 @@ SPI_gettypeid(TupleDesc tupdesc, int fnumber)
 		return (SystemAttributeDefinition(fnumber, true))->atttypid;
 }
 
+int4
+SPI_gettypemod(TupleDesc tupdesc, int fnumber)
+{
+	SPI_result = 0;
+
+	if (fnumber > tupdesc->natts || fnumber == 0 ||
+		fnumber <= FirstLowInvalidHeapAttributeNumber)
+	{
+		SPI_result = SPI_ERROR_NOATTRIBUTE;
+		return -1;
+	}
+
+	if (fnumber > 0)
+		return tupdesc->attrs[fnumber - 1]->atttypmod;
+	else
+		return (SystemAttributeDefinition(fnumber, true))->atttypmod;
+}
+
 char *
 SPI_getrelname(Relation rel)
 {
diff --git a/src/include/executor/spi.h b/src/include/executor/spi.h
index cfbaa14..a358710 100644
--- a/src/include/executor/spi.h
+++ b/src/include/executor/spi.h
@@ -113,6 +113,7 @@ extern char *SPI_getvalue(HeapTuple tuple, TupleDesc tupdesc, int fnumber);
 extern Datum SPI_getbinval(HeapTuple tuple, TupleDesc tupdesc, int fnumber, bool *isnull);
 extern char *SPI_gettype(TupleDesc tupdesc, int fnumber);
 extern Oid	SPI_gettypeid(TupleDesc tupdesc, int fnumber);
+extern int4 SPI_gettypemod(TupleDesc tupdesc, int fnumber);
 extern char *SPI_getrelname(Relation rel);
 extern char *SPI_getnspname(Relation rel);
 extern void *SPI_palloc(Size size);



Re: [HACKERS] Confusing EXPLAIN output in case of inherited tables

2012-02-01 Thread Ashutosh Bapat
Looking at the code, it seems that the fake aliases (eref) for relations
(and possibly views as well) are not generated per se; rather, they do
not get changed when the relation name changes, or, in the case of
inherited tables, when the inheritance is expanded
(expand_inherited_rtentry). So there is no question of generating them so
as not to collide with other aliases in the query. However, I did not
find answers to these questions:
1. What is the use of eref in the case of a relation, when the relation
name itself can be provided by the reloid?
2. Can we use a schema-qualified relation name in get_from_clause_item()
and get_variable() instead of using eref->aliasname? I have noticed that
the logic in get_rte_attribute_name() gives preference to the column
names in catalog tables over eref->colnames.

Anyone?

On Mon, Jan 30, 2012 at 10:26 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Ashutosh Bapat ashutosh.ba...@enterprisedb.com writes:
  So, as I understand it, we have two problems here:
  1. Prefixing the schema name to the fake aliases if there is another RTE
  with the same name. There may not be a relation with that name (the fake
  alias name given) in the schema chosen as prefix.
  2. Fake aliases themselves can be conflicting.

 Well, the issue is more that a fake alias might unintentionally collide
 with a regular alias elsewhere in the query.  There's no guard against
 that in the current behavior, and ISTM there needs to be.

 The one possibly-simplifying thing about this whole issue is that we
 needn't cater for references that couldn't have been made in the
 original query.  For instance, if the inner and outer queries both have
 explicit aliases tx, it's impossible for the inner query to have
 referred to any columns of the outer tx --- so we don't have to try to
 make it possible in the dumped form.

  If I understand correctly, if we solve the second problem, first problem
  will not occur. Is that correct?

 I don't believe that the problem has anything to do with the names of
 other tables that might happen to exist in the database.  It's a matter
 of what RTE names/aliases are exposed for variable references in
 different parts of the query.

regards, tom lane




-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Enterprise Postgres Company


[HACKERS] Review of pg_archivecleanup -x option patch

2012-02-01 Thread Alex Shulgin


Hello,

This is my first patch review ever, so please bear with me.

The patch[1] is in the context diff format and applies cleanly to
current git HEAD.  Documentation is updated by the patch.

The code does implement what the patch is supposed to do.

Do we want that?  According to Greg's original mail, yes.

One problem I've found while trying the example workflow is this:

~/local/postgresql/HEAD$ ./bin/pg_archivecleanup -d -x .gz data/archive/ 
00010002.0020.backup.gz
pg_archivecleanup: invalid filename input
Try pg_archivecleanup --help for more information.

I think we could accept the suffixed WAL filename and strip the suffix;
another idea is to actually detect the suffix (if any) from the last
command line argument, thus rendering the -x option unnecessary.
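
The stripping could look something like this (a sketch only, not
pg_archivecleanup's actual code):

#include <string.h>

/* Strip a compression suffix such as ".gz" (the -x argument) from a
 * WAL file name before validating it. */
static void
strip_suffix(char *fname, const char *suffix)
{
	size_t		flen = strlen(fname);
	size_t		slen = strlen(suffix);

	if (slen > 0 && flen > slen &&
		strcmp(fname + flen - slen, suffix) == 0)
		fname[flen - slen] = '\0';
}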

By the way, I for one would use the word 'suffix' instead of 'extension',
which sounds like some MS-DOS thingy, but after briefly grepping the
docs I can see that both words are used in this context already (and the
ratio is not in favor of my choice of wording).

Other than that the patch looks good.

--
Alex

[1] http://archives.postgresql.org/pgsql-hackers/2011-02/msg00547.php



[HACKERS] libpq: fix sslcompression leak

2012-02-01 Thread Marko Kreen
freePGconn() does not free sslcompression.  Fix.

Found with valgrind.

-- 
marko

diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index 5add143..4605e49 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -2803,6 +2803,8 @@ freePGconn(PGconn *conn)
 		free(conn->sslrootcert);
 	if (conn->sslcrl)
 		free(conn->sslcrl);
+	if (conn->sslcompression)
+		free(conn->sslcompression);
 	if (conn->requirepeer)
 		free(conn->requirepeer);
 #if defined(KRB5) || defined(ENABLE_GSS) || defined(ENABLE_SSPI)



Re: [HACKERS] Dry-run mode for pg_archivecleanup

2012-02-01 Thread Josh Kupershmidt
On Tue, Jan 31, 2012 at 10:29 AM, Gabriele Bartolini
gabriele.bartol...@2ndquadrant.it wrote:

 Here is my final version which embeds comments from Josh. I have also added
 debug information to be printed in case '-d' is given.

Looks fine; will mark Ready For Committer.

Josh



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-01 Thread Peter Geoghegan
On 31 January 2012 19:47, Robert Haas robertmh...@gmail.com wrote:
 On Fri, Jan 27, 2012 at 3:33 PM, Peter Geoghegan pe...@2ndquadrant.com 
 wrote:
 Patch is attached. I have not changed the duplicate functions. This is
 because I concluded that it was the lesser of two evils to have to get
 the template to generate both declarations in the header file, and
 definitions in the .c file - that seemed particularly obscure. We're
 never going to have to expose/duplicate any more comparators anyway.
 Do you agree?

 Not really.  You don't really need macros to generate the prototypes;
 you could just write them out longhand.

Well, since that would have the advantage of forcing the client of
those macros to explicitly acknowledge what specialisations they were
getting in order to get a warning-free build, that might be a better
approach (though uncalled static functions are discarded anyway).
However, I don't really see how you expect me to make the
comparator-proper functions within nbtcompare.c inline without some
duplication. How can I have inline functions in that .c file, while
still having those functions callable from other TUs, while avoiding
duplicating/moving the functions? I suppose that I could do some trick
with macros, but once again I believe that the cure is worse than the
disease - is it really such a big deal to duplicate a handful of
trivial functions, given that we're never going to have to expand that
number?
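
(For what it's worth, the usual way to make a comparator both inline
and callable from other TUs is to move its body into a header as a
static inline -- a hypothetical sketch, not part of the patch:

/* nbtcompare_inline.h -- hypothetical shared header */
static inline int
int4_fastcmp(Datum x, Datum y)
{
	int32		a = DatumGetInt32(x);
	int32		b = DatumGetInt32(y);

	if (a > b)
		return 1;
	if (a < b)
		return -1;
	return 0;
}

Each including .c file then gets its own inlinable copy, at the cost of
moving the function out of nbtcompare.c.)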

 I think there's a mess of naming confusion in here, though, as perhaps
 best illustrated by this macro definition:

 #define TEMPLATE_QSORT_ARG_HEAP(TYPE, COMPAR)                               \
 DO_TEMPLATE_QSORT_ARG(TYPE, COMPAR, inlheap,                                \
        SING_ADDITIONAL_CODE, TYPE##inlheapcomparetup_inline)               \
 DO_TEMPLATE_QSORT_ARG(TYPE, COMPAR, regheap,                                \
        MULT_ADDITIONAL_CODE(TYPE##regheapAppFunc),                         \
            TYPE##regheapcomparetup_inline)

 The idea here is that when we have only a single sort key, we include
 SING_ADDITIONAL_CODE in the tuple comparison function, whereas when we
 have more than one, we instead include MULT_ADDITIONAL_CODE.  Right
 there, I think we have a naming problem, because abbreviating single
 to sing and multiple to mult is less than entirely clear.  For a
 minute or two I was trying to figure out whether our sorting code was
 musically inclined, and I'm a native English speaker.  But then we
 switch to another set of terminology completely for the generated
 functions: inlheap for the single-key case, and regheap for the
 multiple-key case.   I find that even more confusing.

I'm happy to clean that up, though I think that when you try and
browbeat C into the role of supporting generic programming, confusing
code is more or less inevitable.

 I think we ought to get rid of this:

 +typedef enum TypeCompar
 +{
 +       TYPE_COMP_OTHER,
 +       TYPE_COMP_INT4,
 +       TYPE_COMP_INT8,
 +       TYPE_COMP_FLOAT4,
 +       TYPE_COMP_FLOAT8,
 +       TYPE_COMP_FULL_SPECIALISATION
 +} TypeCompar;

 Instead, just modify SortSupportData to have two function pointers as
 members, one for the single-key case and another for the multiple-key
 case, and have the sortsupport functions initialize them to the actual
 functions that should be called.  The layer of indirection, AFAICS,
 serves no purpose.

It anticipates a time when the system will have both btree and heap
variants of the specialisations. It also serves to usefully document
the intent of the optimisation - the most direct interface isn't
necessarily the best. That said, I really am looking for the path of
least resistance towards getting this committed at this stage, so if
you still think it's a bad idea, I won't complain if you modify the
patch to have multiple function pointers as you described.
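
Something like this, presumably (member and struct names invented for
illustration; this is not the patch's actual definition):

struct Tuplesortstate;			/* forward declaration */

/* Hypothetical shape of the proposal: the sortsupport function for a
 * datatype installs whichever specialized sort routines exist for it;
 * NULL would mean "fall back to the generic comparison path". */
typedef struct SortSupportSpecializations
{
	void		(*sort_onekey) (struct Tuplesortstate *state);
	void		(*sort_multikey) (struct Tuplesortstate *state);
} SortSupportSpecializations;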

 It's pretty easy to remove a specialisation at any time - just remove
 less than 10 lines of code. It's also pretty difficult to determine,
 with everyone's absolute confidence, where the right balance lies.
 Perhaps the sensible thing to do is to not be so conservative in what
 we initially commit, while clearly acknowledging that we may not have
 the balance right, and that it may have to change. We then have the
 entire beta part of the cycle in which to decide to roll back from
 that position, without any plausible downside. If, on the other hand,
 we conservatively lean towards fewer specialisations in the initial
 commit, no one will complain about the lack of an improvement in
 performance that they never had.

 Eh, really?  Typically when we do something good, the wolves are
 howling at the door to make it work in more cases.

Well, if anyone was complaining about a regression, because the
optimisation represented a net loss for their reasonable use-case,
then clearly we wouldn't be doing well, so we'd have to reconsider our
position, and no sensible wolf is going to argue with that. Obviously

Re: [HACKERS] JSON for PG 9.2

2012-02-01 Thread Merlin Moncure
On Tue, Jan 31, 2012 at 11:46 PM, Andrew Dunstan and...@dunslane.net wrote:
 The array(select...) locution turns out to have less flexibility than the
 array_agg(record-ref) locution.

Less flexible maybe, but it can be cleaner for exactly the type of
queries that will tend to come up in exactly the type of functionality
people are looking for with JSON output.  libpqtypes does exactly the
same stuff but for C clients -- so I've done tons of this kind of
programming and am maybe a bit ahead of the curve here.  Note: while
the following contrived example may seem a bit complex, it has a
certain elegance and shows how the postgres type system can whip out
document-style 'nosql' objects to clients who can handle them.
Perhaps there is more simplification through syntax possible, but as
it stands things are pretty functional.  The equivalent production
through array_agg I find to be pretty awful looking, although it can
produce a better plan since it doesn't force everything through
flattened subqueries:

create table foo
(
  foo_id serial primary key,
  a int
);

create table bar
(
  bar_id serial primary key,
  foo_id int references foo,
  b int
);

create table baz
(
  baz_id serial primary key,
  bar_id int references bar,
  c int
);

create type baz_t as
(
  c int
);

create type bar_t as
(
  bazs baz_t[],
  b int
);

create type foo_t as
(
  bars bar_t[],
  a int
);

INSERT INTO foo(a) VALUES (1);
INSERT INTO bar(foo_id, b) VALUES (currval('foo_foo_id_seq'), 100);
INSERT INTO baz(bar_id, c) VALUES (currval('bar_bar_id_seq'), 1000);
INSERT INTO baz(bar_id, c) VALUES (currval('bar_bar_id_seq'), 2000);
INSERT INTO bar(foo_id, b) VALUES (currval('foo_foo_id_seq'), 200);
INSERT INTO baz(bar_id, c) VALUES (currval('bar_bar_id_seq'), 3000);
INSERT INTO baz(bar_id, c) VALUES (currval('bar_bar_id_seq'), 4000);
INSERT INTO foo(a) VALUES (2);
INSERT INTO bar(foo_id, b) VALUES (currval('foo_foo_id_seq'), 300);
INSERT INTO baz(bar_id, c) VALUES (currval('bar_bar_id_seq'), 5000);
INSERT INTO baz(bar_id, c) VALUES (currval('bar_bar_id_seq'), 6000);
INSERT INTO bar(foo_id, b) VALUES (currval('foo_foo_id_seq'), 400);
INSERT INTO baz(bar_id, c) VALUES (currval('bar_bar_id_seq'), 7000);
INSERT INTO baz(bar_id, c) VALUES (currval('bar_bar_id_seq'), 8000);

-- nosql!
select array(
  select row(
array(
  select row(
array(
  select row(
c
  )::baz_t from baz where baz.bar_id = bar.bar_id
)::baz_t[],
b
  )::bar_t from bar where bar.foo_id = foo.foo_id
)::bar_t[],
a
  )::foo_t from foo
)::foo_t[];

foo_t
---
 
{(\{\\(\\{(1000),(2000)}\\,100)\\,\\(\\{(3000),(4000)}\\,200)\\}\,1),(\{\\(\\{(5000),(6000)}\\,300)\\,\\(\\{(7000),(8000)}\\,400)\\}\,2)}

as you can see, the postgres default escaping format sucks for sending
nested data -- throw even one quote or backslash in there and your
data can explode in size 10+ times -- this is why we insisted on
binary.  json, of course, is much better suited for this type of
communication.  despite the complicated look of the above, this
type of code is in fact very easy to write once you get the knack.
This type of coding also leads to much simpler code on the client
side, since relationships are directly built into the structure and
don't have to be inferred or duplicated.


merlin



Re: [HACKERS] libpq: fix sslcompression leak

2012-02-01 Thread Magnus Hagander
On Wed, Feb 1, 2012 at 13:56, Marko Kreen mark...@gmail.com wrote:
 freePGconn() does not free sslcompression.  Fix.

Thanks, applied.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



Re: [HACKERS] JSON for PG 9.2

2012-02-01 Thread Robert Haas
On Tue, Jan 31, 2012 at 3:47 PM, Joey Adams joeyadams3.14...@gmail.com wrote:
 I'm mostly in favor of allowing \u0000.  Banning \u0000 means users
 can't use JSON strings to marshal binary blobs, e.g. by escaping
 non-printable characters and only using U+0000..U+00FF.  Instead, they
 have to use base64 or similar.

I agree.  I mean, representing data using six bytes per source byte is
a bit unattractive from an efficiency point of view, but I'm sure
someone is going to want to do it.  It's also pretty clear that JSON
string -> PG text data type conversion is going to admit of a number of
error conditions (transcoding errors and perhaps invalid surrogate
pairs), so throwing one more on the pile doesn't cost much.
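
For example, a client could marshal arbitrary bytes into a JSON string
along these lines (illustrative only):

#include <stdio.h>
#include <stddef.h>

/* Escape every byte of a binary blob as \u00XX, producing six output
 * bytes per input byte. */
static void
emit_json_escaped(const unsigned char *data, size_t len, FILE *out)
{
	size_t		i;

	fputc('"', out);
	for (i = 0; i < len; i++)
		fprintf(out, "\\u%04X", (unsigned int) data[i]);
	fputc('"', out);
}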

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Should I implement DROP INDEX CONCURRENTLY?

2012-02-01 Thread Simon Riggs
On Wed, Feb 1, 2012 at 2:56 AM, Robert Haas robertmh...@gmail.com wrote:

 I improved the grammar issues in the attached version of the patch -
 the syntax is now simpler and more consistent, IF EXISTS now works,
 and RESTRICT is accepted (without changing the behavior) while CASCADE
 fails with a nicer error message.  I also fixed a bug in
 RangeVarCallbackForDropRelation.

Plus tests as well. Many thanks.

I fixed the main bug you observed and your test case now works
perfectly. I used pgbench to continuously drop/create an index, so a
little more than manual testing.

v5 Attached.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/doc/src/sgml/ref/drop_index.sgml b/doc/src/sgml/ref/drop_index.sgml
index 7177ef2..343f7ac 100644
--- a/doc/src/sgml/ref/drop_index.sgml
+++ b/doc/src/sgml/ref/drop_index.sgml
@@ -21,7 +21,7 @@ PostgreSQL documentation
 
  <refsynopsisdiv>
 <synopsis>
-DROP INDEX [ IF EXISTS ] <replaceable class="PARAMETER">name</replaceable> [, ...] [ CASCADE | RESTRICT ]
+DROP INDEX [ CONCURRENTLY ] [ IF EXISTS ] <replaceable class="PARAMETER">name</replaceable> [, ...] [ CASCADE | RESTRICT ]
 </synopsis>
  </refsynopsisdiv>
 
@@ -50,6 +50,29 @@ DROP INDEX [ IF EXISTS ] <replaceable class="PARAMETER">name</replaceable> [, ..
    </varlistentry>
 
    <varlistentry>
+<term><literal>CONCURRENTLY</literal></term>
+<listitem>
+ <para>
+  When this option is used, <productname>PostgreSQL</> will drop the
+  index without taking any locks that prevent concurrent selects, inserts,
+  updates, or deletes on the table; whereas a standard index drop
+  waits for a lock that locks out everything on the table until it's done.
+  Concurrent drop index is a two-stage process. First, we mark the index
+  both invalid and not ready, then commit the change. Next, we wait until
+  there are no users locking the table who can see the index.
+ </para>
+ <para>
+  There are several caveats to be aware of when using this option.
+  Only one index name can be specified if the <literal>CONCURRENTLY</literal>
+  parameter is specified.  A regular <command>DROP INDEX</> command can be
+  performed within a transaction block, but
+  <command>DROP INDEX CONCURRENTLY</> cannot.
+  The CASCADE option is not supported when dropping an index concurrently.
+ </para>
+</listitem>
+   </varlistentry>
+
+   <varlistentry>
 <term><replaceable class="PARAMETER">name</replaceable></term>
 <listitem>
  <para>
diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index db86262..58d8f5c 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -173,8 +173,8 @@ static void reportDependentObjects(const ObjectAddresses *targetObjects,
 	   const ObjectAddress *origObject);
 static void deleteOneObject(const ObjectAddress *object,
 			Relation depRel, int32 flags);
-static void doDeletion(const ObjectAddress *object);
-static void AcquireDeletionLock(const ObjectAddress *object);
+static void doDeletion(const ObjectAddress *object, int flags);
+static void AcquireDeletionLock(const ObjectAddress *object, int flags);
 static void ReleaseDeletionLock(const ObjectAddress *object);
 static bool find_expr_references_walker(Node *node,
 			find_expr_references_context *context);
@@ -232,7 +232,7 @@ performDeletion(const ObjectAddress *object,
 	 * Acquire deletion lock on the target object.	(Ideally the caller has
 	 * done this already, but many places are sloppy about it.)
 	 */
-	AcquireDeletionLock(object);
+	AcquireDeletionLock(object, 0);
 
 	/*
 	 * Construct a list of objects to delete (ie, the given object plus
@@ -316,7 +316,7 @@ performMultipleDeletions(const ObjectAddresses *objects,
 		 * Acquire deletion lock on each target object.  (Ideally the caller
 		 * has done this already, but many places are sloppy about it.)
 		 */
-		AcquireDeletionLock(thisobj);
+		AcquireDeletionLock(thisobj, flags);
 
 		findDependentObjects(thisobj,
 			 DEPFLAG_ORIGINAL,
@@ -350,7 +350,11 @@ performMultipleDeletions(const ObjectAddresses *objects,
 	/* And clean up */
 	free_object_addresses(targetObjects);
 
-	heap_close(depRel, RowExclusiveLock);
+	/*
+	 * We closed depRel earlier in deleteOneObject if doing a drop concurrently
+	 */
+	if ((flags & PERFORM_DELETION_CONCURRENTLY) != PERFORM_DELETION_CONCURRENTLY)
+		heap_close(depRel, RowExclusiveLock);
 }
 
 /*
@@ -380,7 +384,7 @@ deleteWhatDependsOn(const ObjectAddress *object,
 	 * Acquire deletion lock on the target object.	(Ideally the caller has
 	 * done this already, but many places are sloppy about it.)
 	 */
-	AcquireDeletionLock(object);
+	AcquireDeletionLock(object, 0);
 
 	/*
 	 * Construct a list of objects to delete (ie, the given object plus
@@ -611,7 +615,7 @@ findDependentObjects(const ObjectAddress *object,
  * deletion of the owning object.)
  */
 ReleaseDeletionLock(object);
-

Re: [HACKERS] Confusing EXPLAIN output in case of inherited tables

2012-02-01 Thread Tom Lane
Ashutosh Bapat ashutosh.ba...@enterprisedb.com writes:
 Looking at the code, it seems that the fake aliases (eref) for relations
 (and possibly views as well) are not generated per se; rather, they do
 not get changed when the relation name changes, or, in the case of
 inherited tables, when the inheritance is expanded
 (expand_inherited_rtentry). So there is no question of generating them
 so as not to collide with other aliases in the query.

Well, what I was considering was exactly generating new aliases that
don't collide with anything else in the query.  The fact that the code
doesn't do that now doesn't mean we can't make it do that.

 However I did not find answers to these questions
 1. What is the use of eref in case of relation when the relation name
 itself can be provided by the reloid?

eref is stored mainly so that parsing code doesn't have to repeatedly
look up what the effective RTE name is.  The alias field is meant to
represent whether there was an AS clause or not, and if so exactly what
it said.  So eref is a derived result whereas alias is essentially raw
grammar output.  Because of the possibility that the relation gets
renamed, it's probably best if we don't rely on eref anymore after
initial parsing of a query, ie ruleutils.c probably shouldn't use it.
(Too lazy to go check right now if that's already true, but it seems
like a good goal to pursue if we're going to change this code.)

  2. Can we use a schema-qualified relation name in get_from_clause_item()
  and get_variable() instead of using eref->aliasname?

No.  If there is an alias, it is flat wrong to use the relation name
instead, with or without schema name.  You might want to go study the
SQL spec a bit in this area.

  I have noticed that the
  logic in get_rte_attribute_name() gives preference to the column names in
  catalog tables over eref->colnames.

Hm.  What it should probably do is look at alias first, and if the alias
field doesn't specify a column name, then go to the catalogs to get the
current name.


Thinking about this some more, it seems like there are ways for a user
to shoot himself in the foot pretty much irretrievably.  Consider

CREATE TABLE t (x int);
CREATE VIEW v AS SELECT y FROM t t(y);
ALTER TABLE t ADD COLUMN y int;

On dump and reload, we'll have

CREATE TABLE t (x int, y int);
CREATE VIEW v AS SELECT y FROM t t(y);

and now the CREATE VIEW will fail, complaining (correctly) that the
column reference y is ambiguous.  Should ruleutils be expected to take
it upon itself to prevent that?  We could conceive of fixing it by
inventing column aliases out of whole cloth:

CREATE VIEW v AS SELECT y FROM t t(y, the_other_y);

but that seems a little much, not to mention that such a view definition
would be horribly confusing to work with.  On the other hand it isn't
all that far beyond what I had in mind of inventing relation aliases
to cure relation-name conflicts.  Should we take the existence of such
cases as evidence that we shouldn't try hard in this area?  It seems
reasonable to me to try to handle relation renames but draw the line
at disambiguating column names.  But others might find that distinction
artificial.

regards, tom lane



Re: [HACKERS] patch for parallel pg_dump

2012-02-01 Thread Robert Haas
On Tue, Jan 31, 2012 at 4:46 PM, Joachim Wieland j...@mcknight.de wrote:
 On Tue, Jan 31, 2012 at 9:05 AM, Robert Haas robertmh...@gmail.com wrote:
 And just for added fun and excitement, they all have inconsistent
 naming conventions and inadequate documentation.

 I think if we need more refactoring in order to support multiple
 database connections, we should go do that refactoring.  The current
 situation is not serving anyone well.

 I guess I'd find it cleaner to have just one connection per Archive
 (or ArchiveHandle). If you need two connections, why not just have two
 Archive objects, as they would have different characteristics anyway,
 one for dumping data, one to restore.

I think we're more-or-less proposing to rename Archive to
Connection, aren't we?

And then ArchiveHandle can store all the things that aren't related to
a specific connection.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Issues with C++ exception handling in an FDW

2012-02-01 Thread Soules, Craig
 I suggest that you generalise from the example of PLV8. The basic
 problem is that the effect of longjmp()ing over an area of the stack
 with a C++ non-POD type is undefined. I don't think you can even use
 structs, as they have implicit destructors in C++.

I had thought that this was only an issue if you tried to longjmp() over a 
section of C++ code starting from a postgres backend C function?  From the 
PostgreSQL documentation:

 If calling backend functions from C++ code, be sure that the C++ call stack 
contains only plain old data structures (POD). This is necessary because 
backend errors generate a distant longjmp() that does not properly unroll a C++ 
call stack with non-POD objects.

But this is not what our code is doing.  Our code is a C++ function that only 
does the following:

try {
throw 1;
} catch (int e) {
} catch (...) {
}

which causes an immediate segmentation fault.  To answer another responder's
question, the stack trace looks as follows:

#0  0x2b3ce8f40fa5 in __cxa_allocate_exception ()
   from /usr/lib64/libstdc++.so.6
#1  0x2b3ce77b6256 in initMBSource (state=0x1ab87a80)
at 
/data/soules/metaboxA-bugfix/Metabox/debug_build/src/lib/query/dsFdwShim.cpp:16791
#2  0x2b3ce6c0b0aa in dsBeginForeignScan (node=0x1ab872d0, 
eflags=<value optimized out>) at dataseries_fdw.c:819
#3  0x0057606c in ExecInitForeignScan ()
#4  0x0055c715 in ExecInitNode ()
#5  0x0056874c in ExecInitAgg ()
#6  0x0055c6a5 in ExecInitNode ()
#7  0x0055b944 in standard_ExecutorStart ()
#8  0x00621b96 in PortalStart ()
#9  0x0061edad in exec_simple_query ()
#10 0x0061f624 in PostgresMain ()
#11 0x005e4c5c in ServerLoop ()
#12 0x005e595c in PostmasterMain ()
#13 0x0058a77e in main ()

Note: #2 is the entry into our C library and #1 is the entry into our C++ 
library

This appears to be some kind of allocation error, but the machine on which I'm
running has plenty of free RAM:

[~]$ free
             total       used       free     shared    buffers     cached
Mem:      24682888   10505920   14176968          0    1220496    7412352
-/+ buffers/cache:    1873072   22809816
Swap:      2096472          0    2096472

I also don't understand how it could truly be an allocation issue since we 
new/delete plenty of memory during a successful run (as well as using plenty of 
C++ containers which do internal allocation).

Hopefully this helps jog thoughts on my issue!

Thanks again!
Craig



Re: [HACKERS] some longer, larger pgbench tests with various performance-related patches

2012-02-01 Thread Robert Haas
On Wed, Jan 25, 2012 at 8:49 AM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Jan 24, 2012 at 4:28 PM, Simon Riggs si...@2ndquadrant.com wrote:
 I think we should be working to commit XLogInsert and then Group
 Commit, then come back to the discussion.

 I definitely agree that those two have way more promise than anything
 else on the table.  However, now that I understand how badly we're
 getting screwed by checkpoints, I'm curious to do some more
 investigation of what's going on there.  I can't help thinking that in
 these particular cases the full page writes may be a bigger issue than
 the checkpoint itself.  If that turns out to be the case it's not
 likely to be something we're able to address for 9.2, but I'd like to
 at least characterize it.

I did another run with full_page_writes=off:

http://wiki.postgresql.org/wiki/File:Tps-master-nofpw.png

This is run with the master branch as of commit
4f42b546fd87a80be30c53a0f2c897acb826ad52, same as the previous tests,
so that the results are comparable.

The graph is pretty dramatically different from the earlier one:

http://wiki.postgresql.org/wiki/File:Tps-master.png

So are the TPS numbers:

full page writes on: tps = 2200.848350 (including connections establishing)
full page writes off: tps = 9494.757916 (including connections establishing)

It seems pretty clear that, even with full_page_writes=off, the
checkpoint is affecting things negatively: the first 700 seconds or so
show much better and more consistent performance than the remaining
portion of the test.  I'm not sure why that is, but I'm guessing there
was a checkpoint around that time.  But the effect is much worse with
full_page_writes=on: the distinctive parabolic shape of those graphs
is apparently caused by the gradually decreasing frequency of full
page writes as the number of transactions processed since the last
checkpoint grows.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] disable prompting by default in createuser

2012-02-01 Thread Peter Eisentraut
On sön, 2012-01-15 at 18:14 -0500, Josh Kupershmidt wrote:
 I see this patch includes a small change to dropuser, to make the
 'username' argument mandatory if --interactive is not set, for
 symmetry with createuser's new behavior. That's dandy, though IMO we
 shouldn't have -i be shorthand for --interactive with dropuser,
 and something different with createuser (i.e. we should just get rid
 of the i alias for dropuser).

Well, all the other tools also support -i for prompting.  I'd rather get
rid of -i for --inherit, but I fear that will break things as well.  I'm
not sure what to do.
 
 Another little inconsistency I see with the behavior when no username
 to create or drop is given:
 
 $  createuser
 createuser: creation of new role failed: ERROR:  role josh already
 exists
 $ dropuser
 dropuser: missing required argument role name
 Try dropuser --help for more information.
 
 i.e. createuser tries taking either $PGUSER or the current username as
 a default user to create, while dropuser just bails out. Personally, I
 prefer just bailing out if no create/drop user is specified, but
 either way I think they should be consistent.

That is intentional long-standing behavior.  createdb/dropdb work the
same way.





Re: [HACKERS] [PATCH] Fix float8 parsing of denormal values (on some platforms?)

2012-02-01 Thread Tom Lane
Marti Raudsepp ma...@juffo.org writes:
 On Wed, Dec 21, 2011 at 18:21, Marti Raudsepp ma...@juffo.org wrote:
 I think the least invasive fix, as proposed by Jeroen, is to fail only
 when ERANGE is set *and* the return value is 0.0 or +/-HUGE_VAL.
 Reading relevant specifications, this seems to be a fairly safe
 assumption. That's what the attached patch does.

 Oops, now attached the patch too.

Applied with minor revisions.  Notably, after staring at the code a bit
I got uncomfortable with its assumption that pg_strncasecmp() cannot
change errno, so I fixed it to not assume that.  Also, on some platforms
HUGE_VAL isn't infinity but the largest finite value, so I made the
range tests be like >= HUGE_VAL not just == HUGE_VAL.  I know the
man page for strtod() specifies it should return exactly HUGE_VAL for
overflow, but who's to say that math.h is on the same page as the
actual function?
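
In outline, the committed check behaves something like this (a sketch,
not the exact committed code):

#include <errno.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* ERANGE alone is not proof of failure, since some strtod()
 * implementations set it for denormal (but representable) inputs;
 * require the sentinel return values too, and use >= / <= in case
 * HUGE_VAL is merely the largest finite value. */
static double
parse_float8(const char *num)
{
	char	   *endptr;
	double		val;

	errno = 0;
	val = strtod(num, &endptr);
	(void) endptr;				/* a real parser also checks trailing junk */
	if (errno == ERANGE &&
		(val == 0.0 || val >= HUGE_VAL || val <= -HUGE_VAL))
		fprintf(stderr, "\"%s\" is out of range for double\n", num);
	return val;
}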

regards, tom lane



Re: [HACKERS] psql case preserving completion

2012-02-01 Thread Peter Eisentraut
On tis, 2012-01-17 at 16:46 +0900, Fujii Masao wrote:
 When I tested the patch, "create ta" was converted unexpectedly to
 "create TABLE",
 though "alter ta" was successfully converted to "alter table". As far
 as I read the patch,
 you seems to have forgotten to change create_or_drop_command_generator() or
 something.

Thanks, fixed and committed.





Re: [HACKERS] some longer, larger pgbench tests with various performance-related patches

2012-02-01 Thread Simon Riggs
On Wed, Feb 1, 2012 at 5:47 PM, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Jan 25, 2012 at 8:49 AM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Jan 24, 2012 at 4:28 PM, Simon Riggs si...@2ndquadrant.com wrote:
 I think we should be working to commit XLogInsert and then Group
 Commit, then come back to the discussion.

 I definitely agree that those two have way more promise than anything
 else on the table.  However, now that I understand how badly we're
 getting screwed by checkpoints, I'm curious to do some more
 investigation of what's going on there.  I can't help thinking that in
 these particular cases the full page writes may be a bigger issue than
 the checkpoint itself.  If that turns out to be the case it's not
 likely to be something we're able to address for 9.2, but I'd like to
 at least characterize it.

 I did another run with full_page_writes=off:

 http://wiki.postgresql.org/wiki/File:Tps-master-nofpw.png

 This is run with the master branch as of commit
 4f42b546fd87a80be30c53a0f2c897acb826ad52, same as the previous tests,
 so that the results are comparable.

 The graph is pretty dramatically different from the earlier one:

 http://wiki.postgresql.org/wiki/File:Tps-master.png

 So are the TPS numbers:

 full page writes on: tps = 2200.848350 (including connections establishing)
 full page writes off: tps = 9494.757916 (including connections establishing)

 It seems pretty clear that, even with full_page_writes=off, the
 checkpoint is affecting things negatively: the first 700 seconds or so
 show much better and more consistent performance than the remaining
 portion of the test.  I'm not sure why that is, but I'm guessing there
 was a checkpoint around that time.  But the effect is much worse with
 full_page_writes=on: the distinctive parabolic shape of those graphs
 is apparently caused by the gradually decreasing frequency of full
 page writes as the number of transactions processed since the last
 checkpoint grows.

Sounds like time to test the checkpoint smoothing patch.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Should I implement DROP INDEX CONCURRENTLY?

2012-02-01 Thread Peter Eisentraut
On sön, 2012-01-29 at 22:01 +, Simon Riggs wrote:
 Patch now locks index in AccessExclusiveLock in final stage of drop.

Doesn't that defeat the point of doing the CONCURRENTLY business in the
first place?




Re: [HACKERS] Should I implement DROP INDEX CONCURRENTLY?

2012-02-01 Thread Simon Riggs
On Wed, Feb 1, 2012 at 7:31 PM, Peter Eisentraut pete...@gmx.net wrote:
 On sön, 2012-01-29 at 22:01 +, Simon Riggs wrote:
 Patch now locks index in AccessExclusiveLock in final stage of drop.

 Doesn't that defeat the point of doing the CONCURRENTLY business in the
 first place?

That was my initial reaction.

We lock the index in AccessExclusiveLock only once we are certain
nobody else is looking at it any more.

So it's a Kansas City Shuffle, with safe locking in case of people
doing strange low-level things.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] how to create a non-inherited CHECK constraint in CREATE TABLE

2012-02-01 Thread Peter Eisentraut
On ons, 2012-01-18 at 18:17 -0500, Robert Haas wrote:
 I agree with Peter that we should have CHECK ONLY.
 ONLY is really a property of the constraint, not the ALTER TABLE
 command -- if it were otherwise, we wouldn't need to store it in the
 system catalogs, but of course we do.  The fact that it's not a
 standard property isn't a reason not to have proper syntax for it.

Clearly, we will eventually want to support inherited and non-inherited
constraints of all types.  Currently, each type of constraint has an
implicit default regarding this property:

check - inherited
not null - inherited
foreign key - not inherited
primary key - not inherited
unique - not inherited
exclusion - not inherited

As discussed above, we need to have a syntax that is attached to the
constraint, not the table operation that creates the constraint, so that
we can also create these in CREATE TABLE.

How should we resolve these different defaults?

Also, in ALTER TABLE, if you want to add either an inherited or not
inherited constraint to a parent table, you should really say ALTER
TABLE ONLY in either case.  Because it's conceivably valid that ALTER
TABLE foo ADD CHECK () NOINHERIT would add an independent, not inherited
check constraint to each child table.

So, there are all kinds of inconsistencies and backward compatibility
problems lurking here.  We might need either a grand transition plan or
document the heck out of these inconsistencies.





Re: [HACKERS] Review of patch renaming constraints

2012-02-01 Thread Peter Eisentraut
On tor, 2012-01-12 at 22:43 -0600, Joshua Berkus wrote:
 Most normal uses of alter table ... rename constraint ... worked
 normally.  However, the patch does not deal correctly with constraints
 which are not inherited, such as primary key constraints:

New patch which works around that issue.

diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index 951b63b..c3039c8 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -25,6 +25,8 @@ ALTER TABLE [ IF EXISTS ] [ ONLY ] <replaceable class="PARAMETER">name</replacea
 <replaceable class="PARAMETER">action</replaceable> [, ... ]
 ALTER TABLE [ IF EXISTS ] [ ONLY ] <replaceable class="PARAMETER">name</replaceable> [ * ]
 RENAME [ COLUMN ] <replaceable class="PARAMETER">column</replaceable> TO <replaceable class="PARAMETER">new_column</replaceable>
+ALTER TABLE [ IF EXISTS ] [ ONLY ] <replaceable class="PARAMETER">name</replaceable> [ * ]
+RENAME CONSTRAINT <replaceable class="PARAMETER">constraint_name</replaceable> TO <replaceable class="PARAMETER">new_constraint_name</replaceable>
 ALTER TABLE [ IF EXISTS ] <replaceable class="PARAMETER">name</replaceable>
 RENAME TO <replaceable class="PARAMETER">new_name</replaceable>
 ALTER TABLE [ IF EXISTS ] <replaceable class="PARAMETER">name</replaceable>
@@ -569,8 +571,8 @@ ALTER TABLE [ IF EXISTS ] <replaceable class="PARAMETER">name</replaceable>
 <listitem>
  <para>
   The <literal>RENAME</literal> forms change the name of a table
-  (or an index, sequence, or view) or the name of an individual column in
-  a table. There is no effect on the stored data.
+  (or an index, sequence, or view), the name of an individual column in
+  a table, or the name of a constraint of the table. There is no effect on the stored data.
  </para>
 </listitem>
</varlistentry>
@@ -883,7 +885,8 @@ ALTER TABLE [ IF EXISTS ] <replaceable class="PARAMETER">name</replaceable>
 
<para>
 If a table has any descendant tables, it is not permitted to add,
-rename, or change the type of a column in the parent table without doing
+rename, or change the type of a column, or rename an inherited constraint
+in the parent table without doing
 the same to the descendants.  That is, <command>ALTER TABLE ONLY</command>
 will be rejected.  This ensures that the descendants always have
 columns matching the parent.
@@ -983,6 +986,13 @@ ALTER TABLE distributors RENAME TO suppliers;
   </para>
 
   <para>
+   To rename an existing constraint:
+<programlisting>
+ALTER TABLE distributors RENAME CONSTRAINT zipchk TO zip_check;
+</programlisting>
+  </para>
+
+  <para>
 To add a not-null constraint to a column:
 <programlisting>
 ALTER TABLE distributors ALTER COLUMN street SET NOT NULL;
diff --git a/src/backend/commands/alter.c b/src/backend/commands/alter.c
index 9175405..4dd9927 100644
--- a/src/backend/commands/alter.c
+++ b/src/backend/commands/alter.c
@@ -57,6 +57,10 @@ ExecRenameStmt(RenameStmt *stmt)
 			RenameCollation(stmt->object, stmt->newname);
 			break;
 
+		case OBJECT_CONSTRAINT:
+			RenameConstraint(stmt);
+			break;
+
 		case OBJECT_CONVERSION:
 			RenameConversion(stmt->object, stmt->newname);
 			break;
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 07dc326..8dcaedc 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -2324,6 +2324,108 @@ renameatt(RenameStmt *stmt)
 	   stmt->behavior);
 }
 
+
+/*
+ * same logic as renameatt_internal
+ */
+static void
+rename_constraint_internal(Oid myrelid,
+		   const char *oldconname,
+		   const char *newconname,
+		   bool recurse,
+		   bool recursing,
+		   int expected_parents)
+{
+	Relation	targetrelation;
+	Oid			constraintOid;
+	HeapTuple   tuple;
+	Form_pg_constraint con;
+
+	targetrelation = relation_open(myrelid, AccessExclusiveLock);
+	/* don't tell it whether we're recursing; we allow changing typed tables here */
+	renameatt_check(myrelid, RelationGetForm(targetrelation), false);
+
+	constraintOid = get_constraint_oid(myrelid, oldconname, false);
+
+	tuple = SearchSysCache1(CONSTROID, ObjectIdGetDatum(constraintOid));
+	if (!HeapTupleIsValid(tuple))
+		elog(ERROR, "cache lookup failed for constraint %u",
+			 constraintOid);
+	con = (Form_pg_constraint) GETSTRUCT(tuple);
+
+	if (con->contype == CONSTRAINT_CHECK && !con->conisonly)
+	{
+		if (recurse)
+		{
+			List	   *child_oids,
+					   *child_numparents;
+			ListCell   *lo,
+					   *li;
+
+			child_oids = find_all_inheritors(myrelid, AccessExclusiveLock,
+											 &child_numparents);
+
+			forboth(lo, child_oids, li, child_numparents)
+			{
+				Oid			childrelid = lfirst_oid(lo);
+				int			numparents = lfirst_int(li);
+
+				if (childrelid == myrelid)
+					continue;
+
+				rename_constraint_internal(childrelid, oldconname, newconname, false, true, numparents);
+			}
+		}
+		else
+		{
+			if (expected_parents == 0 &&
+				find_inheritance_children(myrelid, NoLock) != NIL)
+				ereport(ERROR,
+						

Re: [HACKERS] Confusing EXPLAIN output in case of inherited tables

2012-02-01 Thread Robert Haas
On Wed, Feb 1, 2012 at 12:23 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 On the other hand it isn't
 all that far beyond what I had in mind of inventing relation aliases
 to cure relation-name conflicts.  Should we take the existence of such
 cases as evidence that we shouldn't try hard in this area?  It seems
 reasonable to me to try to handle relation renames but draw the line
 at disambiguating column names.  But others might find that distinction
 artificial.

I sure do.

I mean, in Oracle, if you rename a table or column involved in a view,
then the view breaks.  Blammo!  The reference is by object name, not
by some internal identifier a la OID.  If you put back an object with
the correct name (either the original one or a different one), you can
re-enable the view.

We've decided that we don't want that behavior: instead, our references
are to the object itself rather than to the name of the object.
Renaming the object doesn't change what the reference points to.  But
given that position, it seems to me that we ought to be willing to
work pretty hard to make sure that when we dump-and-reload the
database, things stay sane.  Otherwise, we're sort of in this
unsatisfying in-between place where references are *mostly* by
internal identifier but every once in a while it falls apart and
name collisions can break everything.  Yech!

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [GENERAL] pg_dump -s dumps data?!

2012-02-01 Thread Dimitri Fontaine
Hi,

Sorry to be late to the thread; I'm too busy right now.  Cédric called
it to my immediate attention, though.

Martijn van Oosterhout klep...@svana.org writes:
 Perhaps a better way of dealing with this is providing a way of dumping
 extensions explicitly. Then you could say:

 pg_dump --extension=postgis -s

That's something I'm working on in this commit fest under the “inline
extensions” topic, and we should have that facility in 9.2 barring major
obstacles (consensus is made).

Tom Lane t...@sss.pgh.pa.us writes:
 On Mon, Jan 30, 2012 at 11:18 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 What's not apparent to me is whether there's an argument for doing more
 than that.  It strikes me that the current design is not very friendly
 towards the idea of an extension that creates a table that's meant
 solely to hold user data --- you'd have to mark it as config which
 seems a bit unfortunate terminology for that case.  Is it important to
 do something about that, and if so what?

 My thought exactly --- maybe it's only a minor cosmetic issue that will
 affect few people, or maybe this will someday be a major use-case.
 I don't know.  I was hoping Dimitri had an opinion.

So, being able to stuff data into an extension has been made possible to
address two use cases:

 - postgis
 - (sql only) data extensions

The former is very specific and, as we didn't hear back from them, I guess
we addressed it well enough; the latter is still WIP. It's about being
able to ship data as an extension (think timezone updates, geo ip, bank
cards database, exchange rates, etc). You need to be able to easily ship
those (CSV isn't the best we can do here, as generally it doesn't
include the schema nor the COPY recipe that can be non-trivial) and to
easily update those.

The case for a table that is partly user data and partly extension data
is very thin; I think that if I had this need I would use inheritance
and a CHECK(user_data is true/false) constraint to filter the data, as
sketched below.
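
A rough sketch of that layout (a hypothetical data extension; table and
column names made up for illustration):

    create table rates (valid_on date, rate numeric, user_data bool not null);
    create table rates_ext  (check (user_data is false)) inherits (rates);
    create table rates_user (check (user_data is true))  inherits (rates);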

So I sure would appreciate being able to call that "data" rather than
"config", and to mark any table at once. If that doesn't need any pg_dump
stretching I think providing that in 9.2 would be great.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-01 Thread Jim Nasby
On Jan 26, 2012, at 9:32 PM, Robert Haas wrote:
 But if we want to put it on a diet, the first thing I'd probably be
 inclined to lose is the float4 specialization.  Some members of the
 audience will recall that I take a dim view of floating point arithmetic
 generally, but I'll concede that there are valid reasons for using
 float8.  I have a harder time coming up with a good reason to use
 float4 - ever, for anything you care about.  So I would be inclined to
 think that if we want to trim this back a bit, maybe that's the one to
 let go.  If we want to be even more aggressive, the next thing I'd
 probably lose is the optimization of multiple sortkey cases, on the
 theory that single sort keys are probably by far the most common
 practical case.

I do find float4 to be useful, though it's possible that my understanding is 
flawed…

We end up using float to represent ratios in our database; things that really, 
honest to God do NOT need to be exact.

In most cases, 7 digits of precision (which AFAIK is what you're guaranteed 
with float4) is plenty, so we use float4 rather than bloat the database 
(though, since we're on 64bit hardware I guess that distinction is somewhat 
moot…).
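
For instance, a quick check (displayed precision also depends on
extra_float_digits):

    select 1.23456789::float4 as f4, 1.23456789::float8 as f8;
    --    f4    |     f8
    --  1.23457 | 1.23456789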

Is there something I'm missing that would make float4 useless as compared to 
float8?
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Refactoring log_newpage

2012-02-01 Thread Jim Nasby
On Feb 1, 2012, at 4:25 AM, Simon Riggs wrote:
 At present log_newpage() produces log records called XLOG_HEAP_NEWPAGE.
 
 That routine is used by HEAP, BTREE, GIN, SPGIST rmgrs, as well as
 various forks.
 
 WAL contains no information as to which rmgr the data refers to,
 making debugging much harder and skewing efforts to optimise WAL
 traffic and is a pretty gross modularity violation of the whole rmgr
 concept.
 
 This refactoring adds an RmgrId field onto each new page record and
 makes clearer that certain heap routines are actually generic. The
 WAL records are still marked as HEAP rmgr and have XLOG_NEWPAGE record
 type, but at least we can tell them apart. (We already had forknum,
 just not rmgrid).


But we already had RelFileNode; wouldn't that be enough to tell what rmgr was 
responsible for the new page? Can 2 different rmgrs write to the same file node?
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] spgist text_ops and LIKE

2012-02-01 Thread Robert Haas
Is spgist intended to support prefix searches with LIKE?

I ask because, first, it seems like something spgist ought to be good
at (unless I'm confused), and, second, the text_ops opfamily includes
these operators:

 ~<~(text,text)
 ~<=~(text,text)
 ~>=~(text,text)
 ~>~(text,text)

...which seem to be the same operators that are used for btree
pattern-matching searches:

rhaas=# explain select count(*) from person where last_name like 'WAR%';
QUERY PLAN
--
 Aggregate  (cost=2519.27..2519.28 rows=1 width=0)
   ->  Bitmap Heap Scan on person  (cost=24.70..2496.75 rows=9005 width=0)
         Filter: (last_name ~~ 'WAR%'::text)
         ->  Bitmap Index Scan on person_tpo  (cost=0.00..22.45
rows=900 width=0)
               Index Cond: ((last_name ~>=~ 'WAR'::text) AND
(last_name ~<~ 'WAS'::text))
(5 rows)

...but when I create an index like this:

create index person_spg on person using spgist (last_name text_ops);

...I can't get LIKE to use it, even if I disable seqscans.
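
For reference, the btree plan above presumably comes from an index built
with the pattern opclass, along these lines (reconstructed from the
person_tpo name shown in the plan):

    create index person_tpo on person using btree (last_name text_pattern_ops);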

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] feature request - datum_compute_size and datum_write should be public

2012-02-01 Thread Jim Nasby
On Feb 1, 2012, at 12:45 AM, Pavel Stehule wrote:
 I looked at the sources and I found some useful routines for people who
 write extensions and probably PL too.
 
 There are datum_compute_size and datum_write from rangetypes.c. These
 routines can be used in PL libs and maybe in other places.
 
 Should these routines be moved to varlena.c and made public?
 
 Why?  It is not common for types to contain other types, and it
 certainly isn't likely to happen without needing lots of other
 infrastructure --- the existing examples are arrays, records, and
 rangetypes, and all of those come with lots of baggage.  And there
 are a number of choices in those functions that are pretty specific to
 rangetypes, as illustrated by the fact that they're not already sharing
 code with either arrays or records.
 
 For example I can use this code in my implementation of a set-of-enum
 (enumset) datatype, because I have to wrap an array sometimes (I reuse
 the array infrastructure).
 
 In orafce I can use this code for serialisation and deserialisation of
 Datums - it is used several times there

I'm not certain this is what Pavel is referring to, but I have often wished 
that I could pass something like an array into a function and have the function 
tell me exactly how much space that would require on-disk. It's pretty easy to 
figure that out for things like varchar and numeric, but doing so for arrays or 
composite types requires pretty detailed knowledge of PG internals.
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] JSON output functions.

2012-02-01 Thread Andrew Dunstan


I've just been running some timings of my JSON-producing functions, in 
particular array_to_json, and comparing them with the current 
XML-producing functions. Here's a typical result:


   andrew=# explain analyse select array_to_json(array_agg(q),true)
   from (select * from pg_attribute) q;
 QUERY PLAN
   
-
 Aggregate  (cost=70.77..70.78 rows=1 width=203) (actual
   time=38.919..38.920 rows=1 loops=1)
   ->  Seq Scan on pg_attribute  (cost=0.00..65.01 rows=2301
   width=203) (actual time=0.007..1.454 rows=2253 loops=1)
 Total runtime: 39.300 ms
   (3 rows)

   Time: 62.753 ms
   andrew=# explain analyse select table_to_xml('pg_attribute',
   true,false,'');
   QUERY PLAN
   

 Result  (cost=0.00..0.26 rows=1 width=0) (actual
   time=519.170..526.737 rows=1 loops=1)
 Total runtime: 526.780 ms
   (2 rows)


As you can see, producing the JSON is a heck of a lot faster than 
producing the equivalent XML. I had thought it might be necessary for 
good performance to cache the type output info in the FunctionCallInfo 
structure, rather than fetch it for each Datum we output, but that 
doesn't seem to be so. For now I'm inclined not to proceed with that, 
and leave it as an optimization to be considered later if necessary. 
Thoughts?


cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] foreign key locks, 2nd attempt

2012-02-01 Thread Jim Nasby
On Jan 31, 2012, at 10:58 AM, Alvaro Herrera wrote:
 I think it's butt-ugly, but it's only slightly uglier than
 relfrozenxid which we're already stuck with.  The slight amount of
 additional ugliness is that you're going to use an XID column to store
 a uint4 that is not an XID - but I don't have a great idea how to fix
 that.  You could mislabel it as an OID or a (signed) int4, but I'm not
 sure that either of those is any better.  We could also create an mxid
 data type, but that seems like it might be overkill.
 
 Well, we're already storing a multixact in Xmax, so it's not like we
 don't assume that we can store multis in space normally reserved for
 Xids.  What I've been wondering is not how ugly it is, but rather about
 the fact that we're bloating pg_class some more.

FWIW, users have been known to request uint datatypes; so if this really is a 
uint perhaps we should just create a uint datatype...
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Index-only scan performance regression

2012-02-01 Thread Robert Haas
On Wed, Feb 1, 2012 at 4:09 AM, Dean Rasheed dean.a.rash...@gmail.com wrote:
 The real objection to this probably is that if it only saves anything
 for tables that don't have a VM yet, it's dubious whether it's worth
 doing.  But if we can avoid wasted checks for VM extension as well,
 then I think it's probably a no-brainer.

 Yes it applies in the same way to VM extension - if the table has
 grown and the VM has not yet been extended, but I don't see why that
 is any worse than the case of not having a VM yet.

 Actually I think that this is not such an uncommon case - for a table
 which has only had data inserted - no deletes or updates - it is
 tempting to think that ANALYSE is sufficient, and that there is no
 need to VACUUM. If it were simply the case that this caused an
 index-only scan to have no real benefit, you might be willing to live
 with normal index scan performance. But actually it causes a very
 significant performance regression beyond that, to well below 9.1
 performance.

So, I guess the trade-off here is that, since sinval messages aren't
processed immediately, we often won't notice the VM extension until
the next statement starts, whereas with the current implementation, we
notice it right away.  On the other hand, noticing it right away is
costing us a system call or two per tuple.  So on further thought, I
think we should do this.
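
In the meantime, the workaround for the insert-only case Dean describes
remains an explicit VACUUM to populate the visibility map, e.g. (table
name made up):

    vacuum (analyze) insert_only_table;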

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] foreign key locks, 2nd attempt

2012-02-01 Thread Alvaro Herrera

Excerpts from Jim Nasby's message of Wed Feb 01 21:33:47 -0300 2012:
 
 On Jan 31, 2012, at 10:58 AM, Alvaro Herrera wrote:
  I think it's butt-ugly, but it's only slightly uglier than
  relfrozenxid which we're already stuck with.  The slight amount of
  additional ugliness is that you're going to use an XID column to store
  a uint4 that is not an XID - but I don't have a great idea how to fix
  that.  You could mislabel it as an OID or a (signed) int4, but I'm not
  sure that either of those is any better.  We could also create an mxid
  data type, but that seems like it might be overkill.
  
  Well, we're already storing a multixact in Xmax, so it's not like we
  don't assume that we can store multis in space normally reserved for
  Xids.  What I've been wondering is not how ugly it is, but rather about
  the fact that we're bloating pg_class some more.
 
 FWIW, users have been known to request uint datatypes; so if this really is a 
 uint perhaps we should just create a uint datatype...

Yeah.  This is just for internal consumption, though, not a full-blown
datatype.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] heap_tuple_needs_freeze false positive

2012-02-01 Thread Alvaro Herrera
Hi,

I noticed that heap_tuple_needs_freeze might return true in cases where
the Xmax is leftover junk from somebody who set HEAP_XMAX_INVALID in the
far past without resetting the Xmax value itself to Invalid.  I think
this is incorrect usage; the rule, I think, is that one shouldn't even
read Xmax at all unless HEAP_XMAX_INVALID is reset.

This might cause unnecessary acquisitions of the cleanup lock, if a
tuple is deemed freezable when in fact it isn't.

Suggested patch attached.  I'd backpatch this as far as it applies
cleanly.

-- 
Álvaro Herrera alvhe...@alvh.no-ip.org


heap_tuple_needs_freeze.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] feature request - datum_compute_size and datum_write should be public

2012-02-01 Thread Alvaro Herrera

Excerpts from Jim Nasby's message of Wed Feb 01 20:47:05 -0300 2012:

 I'm not certain this is what Pavel is referring to, but I have often wished 
 that I could pass something like an array into a function and have the 
 function tell me exactly how much space that would require on-disk. It's 
 pretty easy to figure that out for things like varchar and numeric, but doing 
 so for arrays or composite types requires pretty detailed knowledge of PG 
 internals.

I think you can just use pg_column_size on a composite datum (such as a
ROW() construct) and it will give you the right number.
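
For instance (a trivial example; any row expression works):

    select pg_column_size(row(1, 'some text'::text, now()));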

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-01 Thread Alvaro Herrera

Excerpts from Jim Nasby's message of Wed Feb 01 19:12:58 -0300 2012:
 On Jan 26, 2012, at 9:32 PM, Robert Haas wrote:
  But if we want to put it on a diet, the first thing I'd probably be
  inclined to lose is the float4 specialization.  Some members of the
  audience will recall that I take a dim view of floating point arithmetic
  generally, but I'll concede that there are valid reasons for using
  float8.  I have a harder time coming up with a good reason to use
  float4 - ever, for anything you care about.  So I would be inclined to
  think that if we want to trim this back a bit, maybe that's the one to
  let go.  If we want to be even more aggressive, the next thing I'd
  probably lose is the optimization of multiple sortkey cases, on the
  theory that single sort keys are probably by far the most common
  practical case.
 
 I do find float4 to be useful, though it's possible that my understanding is 
 flawed…
 
 We end up using float to represent ratios in our database; things that 
 really, honest to God do NOT need to be exact.

But then, do you sort using those ratios?

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Index-only scan performance regression

2012-02-01 Thread Robert Haas
On Wed, Feb 1, 2012 at 7:44 PM, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Feb 1, 2012 at 4:09 AM, Dean Rasheed dean.a.rash...@gmail.com wrote:
 The real objection to this probably is that if it only saves anything
 for tables that don't have a VM yet, it's dubious whether it's worth
 doing.  But if we can avoid wasted checks for VM extension as well,
 then I think it's probably a no-brainer.

 Yes it applies in the same way to VM extension - if the table has
 grown and the VM has not yet been extended, but I don't see why that
 is any worse than the case of not having a VM yet.

 Actually I think that this is not such an uncommon case - for a table
 which has only had data inserted - no deletes or updates - it is
 tempting to think that ANALYSE is sufficient, and that there is no
 need to VACUUM. If it were simply the case that this caused an
 index-only scan to have no real benefit, you might be willing to live
 with normal index scan performance. But actually it causes a very
 significant performance regression beyond that, to well below 9.1
 performance.

 So, I guess the trade-off here is that, since sinval messages aren't
 processed immediately, we often won't notice the VM extension until
 the next statement starts, whereas with the current implementation, we
 notice it right away.  On the other hand, noticing it right away is
 costing us a system call or two per tuple.  So on further thought, I
 think we should do this.

Patch committed.  I moved the smgr inval to after the actual extension
is done, which seems superior, and adjusted the comments slightly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Confusing EXPLAIN output in case of inherited tables

2012-02-01 Thread Ashutosh Bapat
On Wed, Feb 1, 2012 at 10:53 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Ashutosh Bapat ashutosh.ba...@enterprisedb.com writes:
  Looking at the code, it seems that the fake aliases (eref) for relations
  (may be views as well) are not generated per se, but they do not get
  changed when the relation name changes OR in case of inherited tables,
 they
  do not get changed when the inheritance is expanded
  (expand_inherited_rtentry). So, there is no question of generating them
 so
  as to not collide with other aliases in the query.

 Well, what I was considering was exactly generating new aliases that
 don't collide with anything else in the query.  The fact that the code
 doesn't do that now doesn't mean we can't make it do that.

  However I did not find answers to these questions
  1. What is the use of eref in case of relation when the relation name
  itself can be provided by the reloid?

 eref is stored mainly so that parsing code doesn't have to repeatedly
 look up what the effective RTE name is.  The alias field is meant to
 represent whether there was an AS clause or not, and if so exactly what
 it said.  So eref is a derived result whereas alias is essentially raw
 grammar output.  Because of the possibility that the relation gets
 renamed, it's probably best if we don't rely on eref anymore after
 initial parsing of a query, ie ruleutils.c probably shouldn't use it.
 (Too lazy to go check right now if that's already true, but it seems
 like a good goal to pursue if we're going to change this code.)

  2. Can we use schema qualified relation name in get_from_clause_item()
 and
  get_variable() instead of using eref->aliasname?

 No.  If there is an alias, it is flat wrong to use the relation name
 instead, with or without schema name.  You might want to go study the
 SQL spec a bit in this area.


To clarify matters a bit, item 2 is in conjunction with item 1. Aliases, if
provided, are output irrespective of whether we get the relation name from
eref or catalogs.  ruleutils should just ignore eref (for RTE_RELATION
only) and get the relation name from the given OID.



  I have noticed that the
  logic in get_rte_attribute_name() gives preference to the column names in
  catalog tables over eref->colnames.

 Hm.  What it should probably do is look at alias first, and if the alias
 field doesn't specify a column name, then go to the catalogs to get the
 current name.


It does give preference to aliases today. I compared the preference of
colnames in eref with that of names obtained from the catalogs.



 Thinking about this some more, it seems like there are ways for a user
 to shoot himself in the foot pretty much irretrievably.  Consider

 CREATE TABLE t (x int);
 CREATE VIEW v AS SELECT y FROM t t(y);
 ALTER TABLE t ADD COLUMN y int;

 On dump and reload, we'll have

 CREATE TABLE t (x int, y int);
 CREATE VIEW v AS SELECT y FROM t t(y);

 and now the CREATE VIEW will fail, complaining (correctly) that the
 column reference y is ambiguous.  Should ruleutils be expected to take
 it upon itself to prevent that?  We could conceive of fixing it by
 inventing column aliases out of whole cloth:

 CREATE VIEW v AS SELECT y FROM t t(y, the_other_y);

 but that seems a little much, not to mention that such a view definition
 would be horribly confusing to work with.  On the other hand it isn't
 all that far beyond what I had in mind of inventing relation aliases
 to cure relation-name conflicts.  Should we take the existence of such
 cases as evidence that we shouldn't try hard in this area?  It seems
 reasonable to me to try to handle relation renames but draw the line
 at disambiguating column names.  But others might find that distinction
 artificial.


I agree. The example of the colnames was only to show that the preference
alias > relation information from catalogs > eref exists somewhere in the
code.



regards, tom lane




-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Enterprise Postgres Company


Re: [HACKERS] Confusing EXPLAIN output in case of inherited tables

2012-02-01 Thread Ashutosh Bapat
On Wed, Feb 1, 2012 at 11:01 PM, Robert Haas robertmh...@gmail.com wrote:

 On Wed, Feb 1, 2012 at 12:23 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  On the other hand it isn't
  all that far beyond what I had in mind of inventing relation aliases
  to cure relation-name conflicts.  Should we take the existence of such
  cases as evidence that we shouldn't try hard in this area?  It seems
  reasonable to me to try to handle relation renames but draw the line
  at disambiguating column names.  But others might find that distinction
  artificial.

 I sure do.

 I mean, in Oracle, if you rename a table or column involved in a view,
 then the view breaks.  Blammo!  The reference is by object name, not
 by some internal identifier a la OID.  If you put back an object with
 the correct name (either the original one or a different one), you can
 re-enable the view.

 We've decided that we don't want that behavior: instead, our references
 are to the object itself rather than to the name of the object.
 Renaming the object doesn't change what the reference points to.  But
 given that position, it seems to me that we ought to be willing to
 work pretty hard to make sure that when we dump-and-reload the
 database, things stay sane.  Otherwise, we're sort of in this
 unsatisfying in-between place where references are *mostly* by
 internal identifier but every once in a while it falls apart and
 name collisions can break everything.  Yech!


For me the relation name problem and the column alias problem are two
independent problems. While the first one looks easy to fix, the other
may be hard to solve. We can solve the first problem and things
will be better than what we have today. If you agree, I will provide a
patch to fix the relation name problem by ignoring the eref (for
RTE_RELATION only) in ruleutils.


 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company




-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Enterprise Postgres Company


Re: [HACKERS] pg_stats_recovery view

2012-02-01 Thread Fujii Masao
On Fri, Jan 27, 2012 at 6:01 AM, Jaime Casanova ja...@2ndquadrant.com wrote:
 On Thu, Jan 26, 2012 at 4:03 AM, Bernd Helmle maili...@oopsware.de wrote:


 --On 15. Januar 2012 02:50:00 -0500 Jaime Casanova ja...@2ndquadrant.com
 wrote:

 Attached is a patch that implements a pg_stat_recovery view that
 keeps counters about processed wal records. I just notice that it
 still lacks documentation but i will add it during the week.


 Hi Jaime,

 do you have an updated patch? The current v1 patch doesn't apply cleanly
 anymore, and before I go and rebase the patch I thought I'd ask...


 here's the patch rebased to this morning's HEAD

Before reviewing the patch, I'd like to know: what's the purpose of this view?
Is it only for debug purposes? ISTM that most users don't care about this view at all.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgsql_fdw, FDW for PostgreSQL server

2012-02-01 Thread Shigeru Hanada
(2012/02/01 3:56), Robert Haas wrote:
 2012/1/29 Kohei KaiGai kai...@kaigai.gr.jp:
  Remote SQL: DECLARE pgsql_fdw_cursor_0 SCROLL CURSOR FOR SELECT
 a, b FROM public.t1 WHERE (a  2)
   (3 rows)
 
 Shouldn't we be using protocol-level cursors rather than SQL-level cursors?

Yes, we should, if we had a protocol-level cursor :)
I checked the libpq interface but I couldn't find any function for
protocol-level cursors.

 [Design comment]
 When BEGIN should be issued on the remote-side?
 The connect_pg_server() is the only chance to issue a BEGIN command
 at the remote-side on connection being opened. However, the connection
 shall be kept unless an error is raised. Thus, the remote side will
 continue to work within a single transaction block, even if local-side 
 iterates
 a pair of BEGIN and COMMIT.
 I'd like to suggest to close the transaction block at the timing of either
 end of the scan, transaction or sub-transaction.
 
 I suspect this is ultimately going to need to be configurable.  Some
 people might want to close the transaction on the remote side ASAP,
 while other people might want to hold it open until commit.  For a
 first version I think it's most likely best to do whatever seems
 simplest to code, planning to add more options later.

I fixed pgsql_fdw to abort the remote transaction at the end of each local
query.  I chose this timing because a local query might include multiple
scans on the same foreign server.  I think this would be the ASAP timing in
your comment.

It would be useful to make the length of the remote transaction match the
local one's; I'll try RegisterXactCallback for this purpose, though we need
to preload the FDW module to catch a BEGIN preceding a query that uses
foreign tables.

-- 
Shigeru Hanada

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] feature request - datum_compute_size and datum_write should be public

2012-02-01 Thread Tom Lane
Alvaro Herrera alvhe...@commandprompt.com writes:
 Excerpts from Jim Nasby's message of Wed Feb 01 20:47:05 -0300 2012:
 I'm not certain this is what Pavel is referring to, but I have often wished 
 that I could pass something like an array into a function and have the 
 function tell me exactly how much space that would require on-disk. It's 
 pretty easy to figure that out for things like varchar and numeric, but 
 doing so for arrays or composite types requires pretty detailed knowledge of 
 PG internals.

 I think you can just use pg_column_size on a composite datum (such as a
 ROW() construct) and it will give you the right number.

If it's a freshly-computed value, pg_column_size will give you the size
of the raw datum.  The actual size on disk might be less due to
compression, but I don't think we give you any way to find that out
short of actually storing it in a table.  Note that the rangetype
internal functions Pavel suggests we should expose won't give you the
latter either.
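
For instance (a quick illustration; exact numbers will vary):

    create temp table t as select repeat('x', 100000) as r;
    select pg_column_size(repeat('x', 100000));  -- raw datum, ~100 kB
    select pg_column_size(r) from t;             -- as stored, compressed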

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] spgist text_ops and LIKE

2012-02-01 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 Is spgist intended to support prefix searches with LIKE?

Too lazy to look at the code right now, but I think indxpath.c contains
hardwired assumptions that LIKE prefix optimizations are only possible
with btree indexes.  Possibly it would be worth relaxing that.  (The
whole special index operator mechanism is undesirably special-purpose,
but I currently have no ideas about how to make it more flexible.)

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] heap_tuple_needs_freeze false positive

2012-02-01 Thread Tom Lane
Alvaro Herrera alvhe...@alvh.no-ip.org writes:
 ! if (!(tuple->t_infomask & HEAP_XMAX_INVALID))
   {
 ! if (!(tuple->t_infomask & HEAP_XMAX_IS_MULTI))

How about just one test,

if (!(tuple->t_infomask & (HEAP_XMAX_INVALID | HEAP_XMAX_IS_MULTI)))

But other than that quibble, yeah, it's a bug.  XMAX_INVALID means just
that: the xmax is not to be thought valid.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] freezing multixacts

2012-02-01 Thread Alvaro Herrera

So freezing multixacts is not all that easy.  I mean, you just scan the
page looking for multis less than the cutoff; for those that are dead,
they can just be removed completely, but what about ones that still have
members running?  This is pretty unlikely but not impossible.

If there's only one remaining member, the problem is easy: replace it
with that transaction's xid, and set the appropriate hint bits.  But if
there's more than one, the only way out is to create a new multi.  This
increases multixactid consumption, but I don't see any other option.

However, there are cases where not even that is possible -- consider
tuple freezing during WAL recovery.  Recovery is going to need to
replace those multis with other multis, but it cannot create new multis
itself.  The only solution here appears to be that when multis are
frozen in the master, replacement multis have to be logged too.  So the
heap_freeze_tuple Xlog record will have a map of old multi to new.  That
way, recovery can just determine the new multi to use for any particular
old multi; since multixact creation is also logged, we're certain that
the replacement value has already been defined.

Sounds ugly, but not horrible.

Thoughts, opinions?

-- 
Álvaro Herrera alvhe...@alvh.no-ip.org

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] keywords in initdb are case-sensitive?

2012-02-01 Thread Alvaro Herrera
In miracee's review of Peter's patch for new -A options in initdb (in
commitfest app only), it is noted that pg_hba.conf keyword parsing is
done in a case-sensitive manner.  So if you write "Trust" rather than
"trust", it's not recognized.

This seemed pretty nonsensical to me, and it's not documented, so I came
up with the trivial attached patch.

Comparisons to user and database names and the like are unchanged and
thus require matching case.

Thoughts?

-- 
Álvaro Herrera alvhe...@alvh.no-ip.org


hba-casecmp.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Refactoring log_newpage

2012-02-01 Thread Simon Riggs
On Wed, Feb 1, 2012 at 10:42 PM, Jim Nasby j...@nasby.net wrote:
 On Feb 1, 2012, at 4:25 AM, Simon Riggs wrote:
 At present log_newpage() produces log records called XLOG_HEAP_NEWPAGE.

 That routine is used by HEAP, BTREE, GIN, SPGIST rmgrs, as well as
 various forks.

 WAL contains no information as to which rmgr the data refers to,
 making debugging much harder and skewing efforts to optimise WAL
 traffic and is a pretty gross modularity violation of the whole rmgr
 concept.

 This refactoring adds an RmgrId field onto each new page record and
 makes clearer that certain heap routines are actually generic. The
 WAL records are still marked as HEAP rmgr and have XLOG_NEWPAGE record
 type, but at least we can tell them apart. (We already had forknum,
 just not rmgrid).


 But we already had RelFileNode; wouldn't that be enough to tell what rmgr was 
 responsible for the new page? Can 2 different rmgrs write to the same file 
 node?

No, but which one? No way to tell unless you have a full list of
relfilenodes and check each one. So btree changes look like heap
changes, etc.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Refactoring log_newpage

2012-02-01 Thread Heikki Linnakangas

On 02.02.2012 08:19, Simon Riggs wrote:

On Wed, Feb 1, 2012 at 10:42 PM, Jim Nasby j...@nasby.net wrote:

On Feb 1, 2012, at 4:25 AM, Simon Riggs wrote:

At present log_newpage() produces log records called XLOG_HEAP_NEWPAGE.

That routine is used by HEAP, BTREE, GIN, SPGIST rmgrs, as well as
various forks.

WAL contains no information as to which rmgr the data refers to,
making debugging much harder and skewing efforts to optimise WAL
traffic and is a pretty gross modularity violation of the whole rmgr
concept.

This refactoring adds an RmgrId field onto each new page record and
makes clearer that certain heap routines are actually generic. The
WAL records are still marked as HEAP rmgr and have XLOG_NEWPAGE record
type, but at least we can tell them apart. (We already had forknum,
just not rmgrid).



But we already had RelFileNode; wouldn't that be enough to tell what rmgr was 
responsible for the new page? Can 2 different rmgrs write to the same file node?


No, but which one? No way to tell unless you have a full list of
relfilenodes and check each one.


Well, you can obviously check the catalogs for that, but you must be 
assuming that you don't have access to the catalogs or this would be a 
non-issue.


You can also identify the kind of page by looking at the special area of 
the stored page. See: 
http://archives.postgresql.org/pgsql-hackers/2007-04/msg00392.php


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_stats_recovery view

2012-02-01 Thread Jaime Casanova
On Wed, Feb 1, 2012 at 9:18 PM, Fujii Masao masao.fu...@gmail.com wrote:

 --On 15. Januar 2012 02:50:00 -0500 Jaime Casanova ja...@2ndquadrant.com
 wrote:

 Attached is a patch that implements a pg_stat_recovery view that
 keeps counters about processed wal records. I just notice that it
 still lacks documentation but i will add it during the week.



 Before reviewing the patch, I'd like to know: what's the purpose of this view?
 Is it only for debug purposes? ISTM that most users don't care about this view at 
 all.


yeah! you're right. most users won't care about it... did i mention that
i added a track_recovery GUC so only users that want it pay for it? i
probably did not mention that :D

-- 
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] unsigned and signed chars in libpq API

2012-02-01 Thread Dmitriy Igrishin
Hey all,

Could you please tell me an objective reason why PQunescapeBytea()
returns unsigned char* rather than just char*?
I am asking because I am a bit confused. How does this fit with the LO
API, which is based on signed chars (although, as we all know, a large
object is just a bytea split into chunks)?
And also, PQgetvalue() returns char* rather than unsigned char*.

Thanks.

-- 
// Dmitriy.


Re: [HACKERS] pg_stats_recovery view

2012-02-01 Thread Magnus Hagander
On Thu, Feb 2, 2012 at 08:26, Jaime Casanova ja...@2ndquadrant.com wrote:
 On Wed, Feb 1, 2012 at 9:18 PM, Fujii Masao masao.fu...@gmail.com wrote:

 --On 15. Januar 2012 02:50:00 -0500 Jaime Casanova ja...@2ndquadrant.com
 wrote:

 Attached is a patch that implements a pg_stat_recovery view that
 keeps counters about processed wal records. I just notice that it
 still lacks documentation but i will add it during the week.



 Before reviewing the patch, I'd like to know: what's the purpose of this 
 view?
 Is it only for debug purposes? ISTM that most users don't care about this view at 
 all.


 yeah! you're right. most users won't care about it... did i tell that
 i added a track_recovery GUC so only users that wanted pay for it? i
 probably did not tell that :D

I haven't looked through the code in detail, but one direct comment:
do we really need/want to send this through the stats collector? It
will only ever have one sender - perhaps we should just either store
it in shared memory or store it locally and only send it on demand?

(apologies if it already does the on-demand thing, I only spent about
30 seconds looking for it and noticed it did go through the stats
collector...)

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speed dblink using alternate libpq tuple storage

2012-02-01 Thread Kyotaro HORIGUCHI
Hello, this is a new version of dblink.c.

- Memory is properly freed when realloc returns NULL in storeHandler().

- The bug where free() in finishStoreInfo() would be fed a garbage
  pointer when malloc for sinfo->valbuflen fails is fixed.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c
index 36a8e3e..28c967c 100644
--- a/contrib/dblink/dblink.c
+++ b/contrib/dblink/dblink.c
@@ -63,11 +63,23 @@ typedef struct remoteConn
 	bool		newXactForCursor;		/* Opened a transaction for a cursor */
 } remoteConn;
 
+typedef struct storeInfo
+{
+	Tuplestorestate *tuplestore;
+	int nattrs;
+	MemoryContext oldcontext;
+	AttInMetadata *attinmeta;
+	char** valbuf;
+	int *valbuflen;
+	bool error_occurred;
+	bool nummismatch;
+	ErrorData *edata;
+} storeInfo;
+
 /*
  * Internal declarations
  */
 static Datum dblink_record_internal(FunctionCallInfo fcinfo, bool is_async);
-static void materializeResult(FunctionCallInfo fcinfo, PGresult *res);
 static remoteConn *getConnectionByName(const char *name);
 static HTAB *createConnHash(void);
 static void createNewConnection(const char *name, remoteConn *rconn);
@@ -90,6 +102,10 @@ static char *escape_param_str(const char *from);
 static void validate_pkattnums(Relation rel,
    int2vector *pkattnums_arg, int32 pknumatts_arg,
    int **pkattnums, int *pknumatts);
+static void initStoreInfo(storeInfo *sinfo, FunctionCallInfo fcinfo);
+static void finishStoreInfo(storeInfo *sinfo);
+static int storeHandler(PGresult *res, void *param, PGrowValue *columns);
+
 
 /* Global */
 static remoteConn *pconn = NULL;
@@ -503,6 +519,7 @@ dblink_fetch(PG_FUNCTION_ARGS)
 	char	   *curname = NULL;
 	int			howmany = 0;
 	bool		fail = true;	/* default to backward compatible */
+	storeInfo   storeinfo;
 
 	DBLINK_INIT;
 
@@ -559,15 +576,36 @@ dblink_fetch(PG_FUNCTION_ARGS)
 	appendStringInfo(&buf, "FETCH %d FROM %s", howmany, curname);
 
 	/*
+	 * Result is stored into storeinfo.tuplestore instead of
+	 * res->result returned by PQexec below
+	 */
+	initStoreInfo(&storeinfo, fcinfo);
+	PQregisterRowProcessor(conn, storeHandler, &storeinfo);
+
+	/*
 	 * Try to execute the query.  Note that since libpq uses malloc, the
 	 * PGresult will be long-lived even though we are still in a short-lived
 	 * memory context.
 	 */
 	res = PQexec(conn, buf.data);
+	finishStoreInfo(&storeinfo);
+
 	if (!res ||
 		(PQresultStatus(res) != PGRES_COMMAND_OK &&
 		 PQresultStatus(res) != PGRES_TUPLES_OK))
 	{
+		/* finishStoreInfo saves the fields referred to below. */
+		if (storeinfo.nummismatch)
+		{
+			/* This is only for backward compatibility */
+			ereport(ERROR,
+					(errcode(ERRCODE_DATATYPE_MISMATCH),
+					 errmsg("remote query result rowtype does not match "
+							"the specified FROM clause rowtype")));
+		}
+		else if (storeinfo.edata)
+			ReThrowError(storeinfo.edata);
+
 		dblink_res_error(conname, res, "could not fetch from cursor", fail);
 		return (Datum) 0;
 	}
@@ -579,8 +617,8 @@ dblink_fetch(PG_FUNCTION_ARGS)
 				(errcode(ERRCODE_INVALID_CURSOR_NAME),
 				 errmsg("cursor \"%s\" does not exist", curname)));
 	}
+	PQclear(res);
 
-	materializeResult(fcinfo, res);
 	return (Datum) 0;
 }
 
@@ -640,6 +678,7 @@ dblink_record_internal(FunctionCallInfo fcinfo, bool is_async)
 	remoteConn *rconn = NULL;
 	bool		fail = true;	/* default to backward compatible */
 	bool		freeconn = false;
+	storeInfo   storeinfo;
 
 	/* check to see if caller supports us returning a tuplestore */
 	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
@@ -715,164 +754,225 @@ dblink_record_internal(FunctionCallInfo fcinfo, bool is_async)
 	rsinfo->setResult = NULL;
 	rsinfo->setDesc = NULL;
 
+
+	/*
+	 * Result is stored into storeinfo.tuplestore instead of
+	 * res->result returned by PQexec/PQgetResult below
+	 */
+	initStoreInfo(&storeinfo, fcinfo);
+	PQregisterRowProcessor(conn, storeHandler, &storeinfo);
+
 	/* synchronous query, or async result retrieval */
 	if (!is_async)
 		res = PQexec(conn, sql);
 	else
-	{
 		res = PQgetResult(conn);
-		/* NULL means we're all done with the async results */
-		if (!res)
-			return (Datum) 0;
-	}
 
-	/* if needed, close the connection to the database and cleanup */
-	if (freeconn)
-		PQfinish(conn);
+	finishStoreInfo(&storeinfo);
 
-	if (!res ||
-		(PQresultStatus(res) != PGRES_COMMAND_OK &&
-		 PQresultStatus(res) != PGRES_TUPLES_OK))
+	/* NULL res from async get means we're all done with the results */
+	if (res || !is_async)
 	{
-		dblink_res_error(conname, res, "could not execute query", fail);
-		return (Datum) 0;
+		if (freeconn)
+			PQfinish(conn);
+
+		if (!res ||
+			(PQresultStatus(res) != PGRES_COMMAND_OK &&
+			 PQresultStatus(res) != PGRES_TUPLES_OK))
+		{
+			/* finishStoreInfo saves the fields referred to below. */
+			if (storeinfo.nummismatch)
+			{
+				/* This is only for backward compatibility */
+				ereport(ERROR,
+						(errcode(ERRCODE_DATATYPE_MISMATCH),
+						 errmsg("remote query result rowtype does