Re: [HACKERS] [PATCH] introduce XLogLockBlockRangeForCleanup()
At 2014-07-03 11:15:53 +0530, amit.khande...@enterprisedb.com wrote: In GetBufferWithoutRelcache(), I was wondering, rather than calling PinBuffer(), if we do this: LockBufHdr(buf); PinBuffer_Locked(buf); valid = ((buf->flags & BM_VALID) != 0); then we can avoid having the new buffer access strategy BAS_DISCARD that is introduced in this patch, and so the code changes in freelist.c would not be necessary. Thanks for the suggestion. It sounds sensible at first glance. I'll think about it a little more, then try it and see. I will follow up with some performance numbers soon. Yes, that would be nice. I'm afraid I haven't had a chance to do this yet. -- Abhijit
Re: [HACKERS] TODO : Allow parallel cores to be used by vacuumdb [ WIP ]
On Wed, Jul 2, 2014 at 11:45 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Jeff Janes wrote: I would only envision using the parallel feature for vacuumdb after a pg_upgrade or some other major maintenance window (that is the only time I ever envision using vacuumdb at all). I don't think autovacuum can be expected to handle such situations well, as it is designed to be a smooth background process. That's a fair point. One thing that would be pretty neat, but that I don't think I could get anyone to implement, is having the user control the autovacuum launcher in some way: for instance, "please vacuum this set of tables as quickly as possible", and it would launch as many workers as are configured. It would take months to get a UI settled for this, however. This sounds like a better way to have multiple workers working on vacuuming tables. For vacuum, as we already have some sort of infrastructure (vacuum workers) to perform tasks in parallel, why not leverage that instead of inventing a new one, even if we assume that we can reduce the duplicate code. I don't know how to calibrate the number of lines that is worthwhile. If you write in C and need cross-platform compatibility and robust error handling, it seems to take hundreds of lines to do much of anything. The code duplication is a problem, but I don't think raw line count is, especially since it has already been written. Well, there are (at least) two types of duplicate code: first you have these common routines such as pgpipe that are duplicates for no good reason. Just move them to src/port or something and it's all good. But the OP said there is code that cannot be shared even though it's very similar in both incarnations. That means we cannot (or it's difficult to) have just one copy, which means as they fix bugs in one copy we need to update the other. I briefly checked the duplicate code in both versions and I think we might be able to reduce it by a significant amount by making common functions and passing AH where needed (as an example, I have checked the function ParallelBackupStart(), which is more than 100 lines). If you see code duplication as a major reason not to prefer this patch, then I think that can be ameliorated, or at least it is worth a try to do so. However, I think it might be better to achieve this in the way you suggested, using the autovacuum launcher. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
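To make the first kind of duplication concrete, here is a minimal sketch of the sort of helper that could move to src/port. The name pgpipe matches the duplicated routine mentioned above, but the body is illustrative only; the real Windows emulation is socket-based and is elided here:

#include <stdio.h>
#include <unistd.h>

/* Illustrative stand-in for the duplicated pgpipe helper. */
static int
pgpipe(int handles[2])
{
#ifndef WIN32
    return pipe(handles);
#else
    return -1;   /* socket-based emulation elided in this sketch */
#endif
}

int
main(void)
{
    int  fds[2];
    char buf[6] = {0};

    if (pgpipe(fds) != 0 ||
        write(fds[1], "hello", 5) != 5 ||
        read(fds[0], buf, 5) != 5)
        return 1;
    printf("%s\n", buf);   /* prints "hello" */
    return 0;
}

Once such a helper lives in one place, both pg_dump's parallel mode and a parallel vacuumdb could link against it instead of carrying private copies.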
Re: [HACKERS] Autonomous Transaction (WIP)
On 01 July 2014 12:00, Amit Kapila Wrote: On Tue, Jul 1, 2014 at 11:46 AM, Rajeev rastogi rajeev.rast...@huawei.com wrote: On 30 June 2014 22:50, Pavel Stehule Wrote: I didn't find a related message. I think there has been some confusion; the design idea was never rejected, but yes, there were a few pieces of feedback/concern, which I had clarified. Also, some of the other concerns are already fixed in the latest patch. Simon has mentioned that exactly this idea was rejected at PGCon 2 years back. Please refer to that in the below mail: http://www.postgresql.org/message-id/bf2827dcce55594c8d7a8f7ffd3ab7713dde1...@szxeml508-mbx.china.huawei.com As far as I can see, you never came back with a different solution. Yeah, right. So for this I tried to search the archived mails to get the details about the discussion, but I could not find anything regarding the design. So I am not sure how I shall make my solution different from the earlier one, as the earlier solution is not accessible to me. Any help regarding this would be really great. Also, from the current autonomous transaction discussion thread (including http://www.postgresql.org/message-id/ca+u5nmkeum4abrqbndlyt5ledektae8rbiyw3977yhmeowq...@mail.gmail.com), I have summarized all the important feedback below, along with the resolutions suggested:

1. Pavel Stehule (07-04-2014): -1 for Oracle syntax - it is hardly inconsistent with Postgres. Changed the syntax to “START AUTONOMOUS TRANSACTION”.

2. Pavan (10-04-2014): Making autonomous transaction properties independent of the main transaction. Made all properties of the autonomous transaction (including read-only) independent from the main transaction except the isolation level, which I did not find very useful to keep different. But if others' opinion is different, we can make this property independent as well.

3. Alvaro Herrera (09-04-2014): Autonomous transactions should have their own separate proc entry. It was concluded not to have a separate proc entry for autonomous transactions.

4. Tom Lane (09-04-2014): The point being that you need to change both pg_subtrans and pg_clog to make that state transition. This is handled for autonomous transactions.

5. Robert Haas (09-04-2014): Not in favour of the current design related to maintaining a lockmask for autonomous transactions. I had replied to this mail explaining why this design was kept, but if the design for this part is still not acceptable, then I can rework it to make it better. In order to do so I would be very happy to have more discussion to get concrete feedback and direction to improve this.

6. Tom Lane (09-04-2014): No justification for distinguishing normal and autonomous transactions at this level (locking level). I had also replied to this earlier. The reason for distinguishing at this level is to handle any kind of deadlock possibility between the main and autonomous transactions. Deadlock handling between the main and autonomous transaction was one of the requirements discussed at PGCon 2012 as part of the autonomous transaction discussion. Please let me know if I am missing something here.

All of the above mentioned changes are included in the latest patch shared. Please let me know if I have missed any other important points from the earlier discussion; I would like to address those as well. Have you checked the discussion in the developer meeting notes?
Please check the same at the below link: http://wiki.postgresql.org/wiki/PgCon_2012_Developer_Meeting#Autonomous_Transactions From the discussion, I am able to make out two important points:

1. The main transaction and the autonomous transaction should be independent and can conflict. This is already included in our latest patch.

2. Utility commands like VACUUM and CREATE INDEX CONCURRENTLY should be able to work from an autonomous transaction. Both of the above mentioned utility commands are not supported even inside the main transaction, so they do not work within an autonomous transaction either. Any opinion about this?

Please let me know if I have missed any points from the link given. So I wanted to have this patch in the commitfest application, so that we can have a healthy discussion and rectify all the issues. But now I see that this patch has already been moved to the rejected category, which will put a brake on further review. I believe ideally this patch should have been marked as Returned with feedback, as the feedback was given long back and no different solution ever came back. Since this patch is very big and complex, it is better that we continue discussing from the first CommitFest itself, so that we can get ample time to share everyone’s opinion and then address all possible issues. Any opinions/suggestions are welcome. Also let me know if I have missed something. Thanks and Regards, Kumar Rajeev Rastogi
Re: [HACKERS] WAL replay bugs
Michael Paquier michael.paqu...@gmail.com writes: On Wed, Jul 2, 2014 at 5:32 PM, Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp wrote: Perhaps this is showing that there is no tidy or generalized way to tell what a page is used for. Many of the modules which have their own page format have a magic value, and the values seem to be selected carefully. But there is no one-for-all method to retrieve that. You have a point here. Yeah, it's a bit messy, but I believe it's currently always possible to tell which access method a PG page belongs to. Look at pg_filedump. The last couple times we added index access methods, we took pains to make sure pg_filedump could figure out what their pages were. (IIRC, it's a combination of the special-space size and contents, but I'm too tired to go check the details right now.) For gin, I'll investigate if it is possible to add an identifier like GIN_PAGE_ID; it would make the page analysis more consistent with the others. I am not sure what the 8 bytes allocated for the special area are used for now, though. There is exactly zero chance that anyone will accept an on-disk format change just to make this prettier. regards, tom lane
Re: [HACKERS] WAL replay bugs
On Thu, Jul 3, 2014 at 3:38 PM, Tom Lane t...@sss.pgh.pa.us wrote: Michael Paquier michael.paqu...@gmail.com writes: On Wed, Jul 2, 2014 at 5:32 PM, Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp wrote: Perhaps this is showing that there is no tidy or generalized way to tell what a page is used for. Many of the modules which have their own page format have a magic value, and the values seem to be selected carefully. But there is no one-for-all method to retrieve that. You have a point here. Yeah, it's a bit messy, but I believe it's currently always possible to tell which access method a PG page belongs to. Look at pg_filedump. The last couple times we added index access methods, we took pains to make sure pg_filedump could figure out what their pages were. (IIRC, it's a combination of the special-space size and contents, but I'm too tired to go check the details right now.) Yes, that's what the current code does btw, in this *messy* way. For gin, I'll investigate if it is possible to add an identifier like GIN_PAGE_ID; it would make the page analysis more consistent with the others. I am not sure what the 8 bytes allocated for the special area are used for now, though. There is exactly zero chance that anyone will accept an on-disk format change just to make this prettier. Yeah, thought so. -- Michael
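For readers following along: the special-space trick Tom alludes to is that several index access methods stamp a distinctive 16-bit page id into the last two bytes of the special area, which is how pg_filedump-style tools classify pages. A minimal sketch of the idea (the constants are quoted from memory of the backend headers, so treat them as assumptions; btree and gin need further checks on the special-space size and flags):

#include <stdint.h>
#include <stdio.h>

#define HASHO_PAGE_ID  0xFF80
#define GIST_PAGE_ID   0xFF81
#define SPGIST_PAGE_ID 0xFF82

/* Classify a page from the 16-bit id stored at the end of its special area. */
static const char *
identify_by_page_id(uint16_t page_id)
{
    switch (page_id)
    {
        case HASHO_PAGE_ID:  return "hash";
        case GIST_PAGE_ID:   return "gist";
        case SPGIST_PAGE_ID: return "spgist";
        default:             return "unknown (btree/gin need further checks)";
    }
}

int
main(void)
{
    printf("%s\n", identify_by_page_id(0xFF81));   /* gist */
    return 0;
}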
Re: [HACKERS] Autonomous Transaction (WIP)
On Thu, Jul 3, 2014 at 12:03 PM, Rajeev rastogi rajeev.rast...@huawei.com wrote: On 01 July 2014 12:00, Amit Kapila Wrote: Simon has mentioned that exactly this idea was rejected at PGCon 2 years back. Please refer to that in the below mail: http://www.postgresql.org/message-id/bf2827dcce55594c8d7a8f7ffd3ab7713dde1...@szxeml508-mbx.china.huawei.com As far as I can see, you never came back with a different solution. Yeah, right. So for this I tried to search the archived mails to get the details about the discussion, but I could not find anything regarding the design. So I am not sure how I shall make my solution different from the earlier one, as the earlier solution is not accessible to me. I haven't read your idea/patch in any detail, so I can't comment on whether it is good or bad. However, I think if one of the committers has already mentioned that the same idea has been rejected previously, then it makes little sense to further review or update the patch unless you know the reason for the rejection and handle it in an acceptable way. Now, as far as I can understand, the problem seems to be in the way you have defined autonomous transaction storage, which can lead to consumption of additional client slots; this is just what I could make out from the above mail, but I am not completely sure on this matter. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] gaussian distribution pgbench
Hello Gavin, decile percents: 69.9% 21.0% 6.3% 1.9% 0.6% 0.2% 0.1% 0.0% 0.0% 0.0% probability of first/last percent of the range: 11.3% 0.0% I would suggest that probabilities should NEVER be expressed in percentages! As a percentage, a probability looks weird, and is never used for serious statistical work - in my experience at least. I think probabilities should be expressed in the range 0 ... 1 - i.e. 0.35 rather than 35%. I could agree about the mathematics, but ISTM that 11.5% is more readable and intuitive than 0.115. I could change probability and replace it with frequency or maybe occurrence; what would you think about that? -- Fabien.
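As an aside, the quoted figures are consistent with a truncated exponential over the unit range with a threshold of about 12 (my inference from the output, not a value stated in the thread). A small sketch that reproduces them:

#include <math.h>
#include <stdio.h>

int
main(void)
{
    double t = 12.0;                 /* assumed threshold */
    double total = 1.0 - exp(-t);    /* normalization for the truncation */
    int    i;

    printf("decile percents:");
    for (i = 0; i < 10; i++)
        printf(" %.1f%%",
               100.0 * (exp(-t * i / 10.0) - exp(-t * (i + 1) / 10.0)) / total);
    printf("\nprobability of first/last percent of the range: %.1f%% %.1f%%\n",
           100.0 * (1.0 - exp(-t * 0.01)) / total,
           100.0 * (exp(-t * 0.99) - exp(-t)) / total);
    return 0;
}

Compiled with -lm, this prints 69.9% 21.0% 6.3% ... and 11.3% 0.0%, matching the output quoted above.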
[HACKERS] Setting PG-version without recompiling
Hi. I'm up for testing 9.4 but my JDBC-driver fails to connect due to PG's minor-version string: 4beta1. Is it possible to set this somewhere without recompiling PG? Thanks. -- Andreas Joseph Krogh CTO / Partner - Visena AS Mobile: +47 909 56 963 andr...@visena.com www.visena.com
Re: [HACKERS] RETURNING PRIMARY KEY syntax extension
On Thu, Jul 3, 2014 at 6:56 AM, Ian Barwick i...@2ndquadrant.com wrote: On 01/07/14 21:00, Rushabh Lathia wrote: I spent some more time on the patch and here are my review comments. .) Patch gets applied through patch -p1 (git apply fails) .) Trailing whitespace in the patch at various places Not sure where you see this, unless it's in the tests, where it's required. Please ignore the whitespace comment. Don't know how, but somehow that was due to my local changes in the branch. .) Unnecessary new line + and - in the patch. (src/backend/rewrite/rewriteManip.c::getInsertSelectQuery()) (src/include/rewrite/rewriteManip.h) Fixed. .) Patch has proper test coverage and regression runs cleanly. .) In map_primary_key_to_list() the patch uses the INDEX_ATTR_BITMAP_IDENTITY_KEY bitmap to get the keycols. In IndexAttrBitmapKind there is also INDEX_ATTR_BITMAP_KEY, so I was confused between INDEX_ATTR_BITMAP_KEY and INDEX_ATTR_BITMAP_IDENTITY_KEY, and also haven't found related comments in the code. Later, with the use of a test case and debugging, I confirmed that INDEX_ATTR_BITMAP_IDENTITY_KEY only returns the primary key. The revised patch version (see other mail) fixes this by introducing INDEX_ATTR_BITMAP_PRIMARY_KEY. Looks good. .) At present the patch throws an error when RETURNING PRIMARY KEY is used on a table having no primary key. If I am not wrong, JDBC will be using this in getGeneratedKeys(). Currently this feature is implemented in the JDBC driver by appending RETURNING * to the supplied statement. With this implementation it will replace RETURNING * with RETURNING PRIMARY KEY, right? So I am just wondering what JDBC expects getGeneratedKeys() to return when the query doesn't have a primary key and the user called executeUpdate() with Statement.RETURN_GENERATED_KEYS. I looked at the JDBC specification but it's not clear what it will return when the table doesn't have keys. Can you please let us know your findings on this? The spec [*] is indeed frustratingly vague: The method Statement.getGeneratedKeys, which can be called to retrieve the generated value, returns a ResultSet object with a column for each automatically generated value. The methods execute, executeUpdate or Connection.prepareStatement accept an optional parameter, which can be used to indicate that any auto generated values should be returned when the statement is executed or prepared. [*] http://download.oracle.com/otn-pub/jcp/jdbc-4_1-mrel-spec/jdbc4.1-fr-spec.pdf I understand this to mean that no rows will be returned if no auto-generated values are present. As-is, the patch will raise an error if the target table does not have a primary key, which makes sense from the point of view of the proposed syntax, but which will make it impossible for the JDBC driver to implement the above understanding of the spec (i.e. return nothing if no primary key exists). The basic reason for this patch is to allow JDBC to use this syntax for getGeneratedKeys(). Now, because this patch throws an error when the table doesn't have a primary key, the patch won't be useful for the getGeneratedKeys() implementation. It would be simple enough not to raise an error in this case, but that means the query would effectively be failing silently, and I don't think that's desirable behaviour. Yes, not raising an error wouldn't be desirable behavior. A better solution would be to have an optional IF EXISTS clause: RETURNING PRIMARY KEY [ IF EXISTS ] which would be easy enough to implement. Thoughts? Hmm, I liked the idea of adding [ IF EXISTS ].
Looking at the grammar, so far we have the [ IF EXISTS ] option only for DDL commands (mainly DROP); not sure whether it's ok to use such syntax for DML commands. Others, please share your thoughts/comments. Ian Barwick -- Ian Barwick http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Rushabh Lathia
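For illustration, this is how a libpq client could exercise the proposed syntax; it assumes a server built with the patch, and the table and column names are made up:

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn   *conn = PQconnectdb("dbname=postgres");
    PGresult *res;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "%s", PQerrorMessage(conn));
        return 1;
    }

    /* hypothetical table with a serial primary key "id" */
    res = PQexec(conn,
                 "INSERT INTO demo_tbl (val) VALUES (42) RETURNING PRIMARY KEY");
    if (PQresultStatus(res) == PGRES_TUPLES_OK)
        printf("generated key: %s\n", PQgetvalue(res, 0, 0));
    else
        fprintf(stderr, "%s", PQerrorMessage(conn));  /* e.g. table has no primary key */

    PQclear(res);
    PQfinish(conn);
    return 0;
}

Under the IF EXISTS variant discussed above, the error branch would instead see zero returned rows.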
Re: [HACKERS] Setting PG-version without recompiling
At 2014-07-03 11:02:34 +0200, andr...@visena.com wrote: Is it possible to set this somewhere without recompiling PG? I'm afraid not. -- Abhijit
Re: [HACKERS] Setting PG-version without recompiling
On Thursday 3 July 2014 at 11:13:44, Abhijit Menon-Sen a...@2ndquadrant.com wrote: At 2014-07-03 11:02:34 +0200, andr...@visena.com wrote: Is it possible to set this somewhere without recompiling PG? I'm afraid not. Ok -- Andreas Joseph Krogh CTO / Partner - Visena AS Mobile: +47 909 56 963 andr...@visena.com www.visena.com
Re: [HACKERS] gaussian distribution pgbench
On 03/07/14 20:58, Fabien COELHO wrote: Hello Gavin, decile percents: 69.9% 21.0% 6.3% 1.9% 0.6% 0.2% 0.1% 0.0% 0.0% 0.0% probability of first/last percent of the range: 11.3% 0.0% I would suggest that probabilities should NEVER be expressed in percentages! As a percentage, a probability looks weird, and is never used for serious statistical work - in my experience at least. I think probabilities should be expressed in the range 0 ... 1 - i.e. 0.35 rather than 35%. I could agree about the mathematics, but ISTM that 11.5% is more readable and intuitive than 0.115. I could change probability and replace it with frequency or maybe occurrence; what would you think about that? You may well be hitting a situation where you meet opposition whatever you do! :-) frequency implies a positive integer (though relative frequency might be okay) - and if you use occurrence, someone else is bound to complain... Though if %'s are used and you can't use values in the range 0 ... 1 for probabilities, I'd opt for relative frequency - so long as it does not generate a flame war. I suspect it may not be worth the grief to change. Cheers, Gavin
Re: [HACKERS] Perfomance degradation 9.3 (vs 9.2) for FreeBSD
Hi, Hi, Attached you can find a short (compile tested only ) patch implementing a 'shared_memory_type' GUC, akin to 'dynamic_shared_memory_type'. Will only apply to 9.4, not 9.3, but it should be easy to convert for it. Greetings, Andres Freund I have rebased Andres's patch against 9.3-STABLE tree. Please take a look at attached patches and let me know if you find anything strange. I am going to test the patch on a huge HP machine: DL980G7/64 cores/2TB mem. With the patch I would be able to report back if using SysV shared mem fixes the 9.3 performance problem. Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese:http://www.sraoss.co.jp diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c index f746c81..e82054a 100644 --- a/src/backend/port/sysv_shmem.c +++ b/src/backend/port/sysv_shmem.c @@ -72,6 +72,7 @@ static void IpcMemoryDetach(int status, Datum shmaddr); static void IpcMemoryDelete(int status, Datum shmId); static PGShmemHeader *PGSharedMemoryAttach(IpcMemoryKey key, IpcMemoryId *shmid); +static void *CreateAnonymousSegment(Size *size); /* @@ -389,49 +390,19 @@ PGSharedMemoryCreate(Size size, bool makePrivate, int port) * developer use, this shouldn't be a big problem. */ #ifndef EXEC_BACKEND + if (shared_memory_type == SHMEM_TYPE_MMAP) { - long pagesize = sysconf(_SC_PAGE_SIZE); - - /* - * Ensure request size is a multiple of pagesize. - * - * pagesize will, for practical purposes, always be a power of two. - * But just in case it isn't, we do it this way instead of using - * TYPEALIGN(). - */ - if (pagesize 0 size % pagesize != 0) - size += pagesize - (size % pagesize); - - /* - * We assume that no one will attempt to run PostgreSQL 9.3 or later - * on systems that are ancient enough that anonymous shared memory is - * not supported, such as pre-2.4 versions of Linux. If that turns - * out to be false, we might need to add a run-time test here and do - * this only if the running kernel supports it. - */ - AnonymousShmem = mmap(NULL, size, PROT_READ | PROT_WRITE, PG_MMAP_FLAGS, - -1, 0); - if (AnonymousShmem == MAP_FAILED) - { - int saved_errno = errno; - - ereport(FATAL, - (errmsg(could not map anonymous shared memory: %m), - (saved_errno == ENOMEM) ? -errhint(This error usually means that PostgreSQL's request - for a shared memory segment exceeded available memory - or swap space. To reduce the request size (currently - %lu bytes), reduce PostgreSQL's shared memory usage, - perhaps by reducing shared_buffers or - max_connections., - (unsigned long) size) : 0)); - } + AnonymousShmem = CreateAnonymousSegment(size); AnonymousShmemSize = size; - /* Now we need only allocate a minimal-sized SysV shmem block. */ sysvsize = sizeof(PGShmemHeader); } + else #endif + { + Assert(shared_memory_type == SHMEM_TYPE_SYSV); + sysvsize = size; + } /* Make sure PGSharedMemoryAttach doesn't fail without need */ UsedShmemSegAddr = NULL; @@ -631,3 +602,47 @@ PGSharedMemoryAttach(IpcMemoryKey key, IpcMemoryId *shmid) return hdr; } + +/* + * Creates an anonymous mmap()ed shared memory segment. + * + * Pass the requested size in *size. This function will modify *size to the + * actual size of the allocation, if it ends up allocating a segment that is + * larger than requested. + */ +#ifndef EXEC_BACKEND +static void * +CreateAnonymousSegment(Size *size) +{ + Size allocsize = *size; + void *ptr = MAP_FAILED; + int mmap_errno = 0; + + /* + * use the original size, not the rounded up value, when falling back + * to non-huge pages. 
+ */ + allocsize = *size; + ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE, + PG_MMAP_FLAGS, -1, 0); + mmap_errno = errno; + + if (ptr == MAP_FAILED) + { + errno = mmap_errno; + ereport(FATAL, +(errmsg(could not map anonymous shared memory: %m), + (mmap_errno == ENOMEM) ? + errhint(This error usually means that PostgreSQL's request + for a shared memory segment exceeded available memory, + swap space or huge pages. To reduce the request size + (currently %zu bytes), reduce PostgreSQL's shared + memory usage, perhaps by reducing shared_buffers or + max_connections., + *size) : 0)); + } + + *size = allocsize; + return ptr; +} +#endif diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c index 918ac51..51dccdc 100644 --- a/src/backend/storage/ipc/ipci.c +++ b/src/backend/storage/ipc/ipci.c @@ -39,6 +39,8 @@ #include storage/sinvaladt.h #include storage/spin.h +/* GUCs */ +int shared_memory_type = DEFAULT_SHARED_MEMORY_TYPE; shmem_startup_hook_type shmem_startup_hook = NULL; diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 2b6527f..2945a68 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -61,6 +61,7
Re: [HACKERS] WAL replay bugs
Hello, here are additional comments for the other parts. I haven't seen any significant defect in the code so far.

= bufcapt.c:

- buffer_capture_remember() or so: Pages of unlogged tables avoid being written, taking advantage of the fact that the LSN for such pages stays 0/0. I'd like to see a comment mentioning this in, say, buffer_capture_is_changed() or buffer_capture_forget() or somewhere.

- buffer_capture_forget(): Although this error is not likely to occur, in the error message could not find image..., the buffer id seems to bring no information. The LSN, or the quadruplet of LSN, rnode, forknum and blockno, seems more informative.

- buffer_capture_is_changed(): The name of the function seems misleading. What this function does is compare LSNs between the stored page image and the current page. lsn_is_changed(BufferImage) or something like that would be clearer.

= bufmgr.c:

- ConditionalLockBuffer(): Sorry for a trivial comment, but the variable 'res' conceals the meaning. acquired seems preferable, doesn't it?

- LockBuffer(): There is an 'XXX' comment. The discussion written there seems right to me. If you mind that peeking into the lock internals is bad behavior, adding a macro into lwlock.h would help:p

lwlock.h: #define LWLockHoldedExclusive(l) ((l)->exclusive > 0)
lwlock.h: #define LWLockHoldedShared(l) ((l)->shared > 0)

# I don't see the second one being used so much..

bufmgr.c: if (LWLockHoldedExclusive(buf->content_lock))

If there isn't any particular concern, the 'XXX:' should be removed.

= bufpage.c:

- #include storage/bufcapt.h: The include seems to be unnecessary.

= bufcapt.h:

- Header comment: The file name is misspelled as 'bufcaptr.h'. Is a copyright notice of UC needed? (Other files are the same.)

- The includes in this header, except for buf.h, seem not to be necessary.

= init_env.sh:

- pg_get_test_port(): It determines the server port using PG_VERSION_NUM, but is that necessary? In addition, the theoretical maximum initial PGPORT seems to be 65535, but this script searches for a free port up to a number larger by 16 from the start, which would be beyond that edge.

- pg_get_test_port(): It stops with an error after 16 failures, but the message complains only about the final attempt. If you want to mention the port numbers, it should perhaps be 'port $PGPORTSTART-$PGPORT ..', or 'All 16 ports attempted failed' or something.

regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [HACKERS] [PATCH] introduce XLogLockBlockRangeForCleanup()
FYI, I've attached a patch that does what you suggested. I haven't done anything else (i.e. testing) with it yet. -- Abhijit diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c index 5f9fc49..dc90c02 100644 --- a/src/backend/access/nbtree/nbtxlog.c +++ b/src/backend/access/nbtree/nbtxlog.c @@ -501,33 +501,9 @@ btree_xlog_vacuum(XLogRecPtr lsn, XLogRecord *record) * here during crash recovery. */ if (HotStandbyActiveInReplay()) - { - BlockNumber blkno; - - for (blkno = xlrec-lastBlockVacuumed + 1; blkno xlrec-block; blkno++) - { - /* - * We use RBM_NORMAL_NO_LOG mode because it's not an error - * condition to see all-zero pages. The original btvacuumpage - * scan would have skipped over all-zero pages, noting them in FSM - * but not bothering to initialize them just yet; so we mustn't - * throw an error here. (We could skip acquiring the cleanup lock - * if PageIsNew, but it's probably not worth the cycles to test.) - * - * XXX we don't actually need to read the block, we just need to - * confirm it is unpinned. If we had a special call into the - * buffer manager we could optimise this so that if the block is - * not in shared_buffers we confirm it as unpinned. - */ - buffer = XLogReadBufferExtended(xlrec-node, MAIN_FORKNUM, blkno, - RBM_NORMAL_NO_LOG); - if (BufferIsValid(buffer)) - { -LockBufferForCleanup(buffer); -UnlockReleaseBuffer(buffer); - } - } - } + XLogLockBlockRangeForCleanup(xlrec-node, MAIN_FORKNUM, + xlrec-lastBlockVacuumed + 1, + xlrec-block); /* * If we have a full-page image, restore it (using a cleanup lock) and diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c index b7829ff..d84f83a 100644 --- a/src/backend/access/transam/xlogutils.c +++ b/src/backend/access/transam/xlogutils.c @@ -287,10 +287,6 @@ XLogReadBuffer(RelFileNode rnode, BlockNumber blkno, bool init) * * In RBM_ZERO and RBM_ZERO_ON_ERROR modes, if the page doesn't exist, the * relation is extended with all-zeroes pages up to the given block number. - * - * In RBM_NORMAL_NO_LOG mode, we return InvalidBuffer if the page doesn't - * exist, and we don't check for all-zeroes. Thus, no log entry is made - * to imply that the page should be dropped or truncated later. */ Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum, @@ -331,8 +327,6 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum, log_invalid_page(rnode, forknum, blkno, false); return InvalidBuffer; } - if (mode == RBM_NORMAL_NO_LOG) - return InvalidBuffer; /* OK to extend the file */ /* we do this in recovery only - no rel-extension lock needed */ Assert(InRecovery); @@ -375,6 +369,62 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum, return buffer; } +/* + * XLogBlockRangeForCleanup is used in Hot Standby mode to emulate the + * locking effects imposed by VACUUM for nbtrees. + */ +void +XLogLockBlockRangeForCleanup(RelFileNode rnode, ForkNumber forkNum, + BlockNumber startBlkno, + BlockNumber uptoBlkno) +{ + BlockNumber blkno; + BlockNumber lastblock; + BlockNumber endingBlkno; + Buffer buffer; + SMgrRelation smgr; + + Assert(startBlkno != P_NEW); + Assert(uptoBlkno != P_NEW); + + /* Open the relation at smgr level */ + smgr = smgropen(rnode, InvalidBackendId); + + /* + * Create the target file if it doesn't already exist. This lets us cope + * if the replay sequence contains writes to a relation that is later + * deleted. 
(The original coding of this routine would instead suppress + * the writes, but that seems like it risks losing valuable data if the + * filesystem loses an inode during a crash. Better to write the data + * until we are actually told to delete the file.) + */ + smgrcreate(smgr, forkNum, true); + + lastblock = smgrnblocks(smgr, forkNum); + + endingBlkno = uptoBlkno; + if (lastblock endingBlkno) + endingBlkno = lastblock; + + for (blkno = startBlkno; blkno endingBlkno; blkno++) + { + /* + * All we need to do here is prove that we can lock each block + * with a cleanup lock. It's not an error to see all-zero pages + * here because the original btvacuumpage would not have thrown + * an error either. + * + * We don't actually need to read the block, we just need to + * confirm it is unpinned. + */ + buffer = GetBufferWithoutRelcache(rnode, forkNum, blkno); + if (BufferIsValid(buffer)) + { + LockBufferForCleanup(buffer); + UnlockReleaseBuffer(buffer); + } + } +} /* * Struct actually returned by XLogFakeRelcacheEntry, though the declared diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 07ea665..d3c7eb7 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -222,8 +222,6 @@ ReadBuffer(Relation reln, BlockNumber blockNum) * current physical EOF; that is
Re: [HACKERS] [PATCH] introduce XLogLockBlockRangeForCleanup()
On 3 July 2014 06:45, Amit Khandekar amit.khande...@enterprisedb.com wrote: In GetBufferWithoutRelcache(), I was wondering, rather than calling PinBuffer(), if we do this: LockBufHdr(buf); PinBuffer_Locked(buf); valid = ((buf->flags & BM_VALID) != 0); then we can avoid having the new buffer access strategy BAS_DISCARD that is introduced in this patch. And so the code changes in freelist.c would not be necessary. That looks like a good idea, thanks. I think we should say this though:

LockBufHdr(buf);
valid = ((buf->flags & BM_VALID) != 0);
if (valid)
    PinBuffer_Locked(buf);
else
    UnlockBufHdr(buf);

since otherwise we would access the buffer flags without the spinlock, and we would leak a pin if the buffer was not valid. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
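To see why the reordering matters, here is a toy model (stub types and stub locks, not backend code) of the sequence Simon describes: the flags are read only while the header lock is held, and an invalid buffer is never pinned, so no pin can leak:

#include <stdbool.h>
#include <stdio.h>

#define BM_VALID (1 << 0)

typedef struct
{
    int  flags;
    int  refcount;
    bool hdr_locked;
} BufDesc;

/* Stubs standing in for the backend's buffer-header spinlock primitives. */
static void LockBufHdr(BufDesc *buf)   { buf->hdr_locked = true; }
static void UnlockBufHdr(BufDesc *buf) { buf->hdr_locked = false; }

/* Stub PinBuffer_Locked: pins, then releases the header lock. */
static void
PinBuffer_Locked(BufDesc *buf)
{
    buf->refcount++;
    UnlockBufHdr(buf);
}

static bool
pin_if_valid(BufDesc *buf)
{
    bool valid;

    LockBufHdr(buf);
    valid = ((buf->flags & BM_VALID) != 0);  /* flags read under the lock */
    if (valid)
        PinBuffer_Locked(buf);               /* releases the header lock */
    else
        UnlockBufHdr(buf);                   /* no pin taken, none leaked */
    return valid;
}

int
main(void)
{
    BufDesc buf = {BM_VALID, 0, false};
    bool    ok = pin_if_valid(&buf);

    printf("valid=%d refcount=%d\n", ok, buf.refcount);  /* valid=1 refcount=1 */
    return 0;
}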
Re: [HACKERS] gaussian distribution pgbench
On Wed, Jul 2, 2014 at 6:05 PM, Fabien COELHO coe...@cri.ensmp.fr wrote: Hello Mitsumasa-san, And I'm also interested in your decile percents output, like the following: decile percents: 39.6% 24.0% 14.6% 8.8% 5.4% 3.3% 2.0% 1.2% 0.7% 0.4% Sure, I'm really fine with that. I think that it is easier than before. The sum of the decile percents is just 100%. That's a good property:-) However, I don't prefer the highest/lowest percentage because it will be confused with the decile percentage by users, and no one can understand these digits. I could not understand 4.9%, 0.0% when I saw them the first time. Then I checked the source code and understood it:( It's not a good design... #Why does this parameter use 100? What else? People have ten fingers and like powers of 10, and are used to percents? So I'd like to remove it, if you like. It will be simpler. I think that for the exponential distribution it helps, especially for a high threshold, to have the lowest/highest percent density. For low thresholds, the decile is also definitely useful. So I'm fine with both outputs as you have put them. I have just updated the wording so that it may be clearer: decile percents: 69.9% 21.0% 6.3% 1.9% 0.6% 0.2% 0.1% 0.0% 0.0% 0.0% probability of first/last percent of the range: 11.3% 0.0% Attached patch is a fixed version; please confirm it. Attached is a v15 which just fixes a typo and the above wording update. I'm validating it for committers. #Of course, the World Cup is being held now. I'm not in a hurry at all. I'm not a soccer kind of person, so it does not influence my availability.:-) Suggested commit message: Add drawing random integers with a Gaussian or truncated exponential distribution to pgbench. Test variants with these distributions are also provided and triggered with options --gaussian=... and --exponential= IIRC we've not reached consensus about whether we should support such options in pgbench. Several hackers disagreed with supporting them. OTOH, we've almost reached consensus on supporting the gaussian and exponential options in \setrandom. So I think that you should separate those two features into two patches, and we should apply the \setrandom one first. Then we can discuss whether the other patch should be applied or not. Regards, -- Fujii Masao
Re: [HACKERS] gaussian distribution pgbench
On 2014-07-03 21:27:53 +0900, Fujii Masao wrote: Add drawing random integers with a Gaussian or truncated exponential distribution to pgbench. Test variants with these distributions are also provided and triggered with options --gaussian=... and --exponential= IIRC we've not reached consensus about whether we should support such options in pgbench. Several hackers disagreed with supporting them. Yea. I certainly disagree with the patch in its current state because it copies the same 15 lines several times with a two-word difference. Independent of whether we want those options, I don't think that's going to fly. OTOH, we've almost reached consensus on supporting the gaussian and exponential options in \setrandom. So I think that you should separate those two features into two patches, and we should apply the \setrandom one first. Then we can discuss whether the other patch should be applied or not. Sounds like a good plan. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Should extension--unpackaged-$ver.sql's \echo use FROM unpackaged;?
On Thu, Jul 3, 2014 at 4:26 AM, Andres Freund and...@2ndquadrant.com wrote: Hi, Fujii noticed that I'd used \echo Use CREATE EXTENSION pg_buffercache to load this file. \quit in an extension upgrade script. That's obviously wrong, because it should say ALTER EXTENSION. The reason it's that way is that I copied the line from the unpackaged--1.0.sql file. I noticed all unpackaged--$version transition files miss the FROM unpackaged in that echo. I guess that's just an oversight which we should correct? Maybe, for user-friendliness. But I think that fix is not important enough to backpatch. Regards, -- Fujii Masao
Re: [HACKERS] docs: additional subsection for page-level locks in explicit-locking section
On Thu, Jul 3, 2014 at 4:34 AM, Michael Banck michael.ba...@credativ.de wrote: Hi, While reading through the Explicit Locking section of the manual today, I felt the last paragraph of section 13.3.2 (Row-level Locks) might merit its own subsection. It talks about page-level locks as distinct from table- and row-level locks. Then again, it is just one paragraph, so maybe this was deliberate and/or rejected before (though I couldn't find prior discussion off-hand). Proposed patch attached. This seems to make sense. Barring objection, I will commit this only in HEAD. Regards, -- Fujii Masao
Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins
On Tue, Jul 1, 2014 at 4:54 AM, Tomas Vondra t...@fuzzy.cz wrote: On 30.6.2014 23:12, Tomas Vondra wrote: Hi, attached is v5 of the patch. The main change is that scaling the number of buckets is done only once, after the initial hash table is built. The main advantage of this is the lower price. This also allowed some cleanup of unnecessary code. However, this new patch causes warnings like this: WARNING: temporary file leak: File 231 still referenced I've been eyeballing the code for a while now, but I still fail to see where this comes from :-( Any ideas? Meh, the patches are wrong - I hadn't realized the tight coupling between buckets/batches in ExecHashGetBucketAndBatch: *bucketno = hashvalue & (nbuckets - 1); *batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1); The previous patches worked mostly by pure luck (the nbuckets was set only in the first batch), but once I moved the code to the end, it started failing. And by worked I mean didn't throw an error, but probably returned bogus results with (nbatch > 1). However, ISTM this nbuckets-nbatch coupling is not really necessary; it's only constructed this way to assign independent batch/bucket bits. So I went and changed the code like this: *bucketno = hashvalue & (nbuckets - 1); *batchno = (hashvalue >> (32 - hashtable->log2_nbatch)); I.e. using the top bits for batchno, low bits for bucketno (as before). Hopefully I got it right this time. At least it seems to be working for cases that failed before (no file leaks, proper rowcounts so far). Hi, I had a look at the patch and I was wondering if there is a way we can set a cap on the resized size, and instead increase the number of batches? Since this patch evaluates the growth of tuples vs. the increase of space, we could look at increasing the number of batches itself instead of the number of buckets, if the resized number of buckets exceeds a threshold. It shouldn't be too hard; AIUI it would involve a call in MultiExecHash right after the new cost evaluation the patch adds. It isn't a target of the current patch, but I think it is a logical extension to it. Thoughts/Comments? Regards, Atri -- Regards, Atri *l'apprenant*
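A standalone sketch of the revised derivation may make the bit layout easier to see: the low bits pick the bucket and the topmost log2(nbatch) bits pick the batch, so the two no longer overlap and nbuckets can grow independently of nbatch. Names mirror the PostgreSQL ones, but this is not the actual executor code:

#include <stdint.h>
#include <stdio.h>

/* nbuckets and nbatch must both be powers of two. */
static void
get_bucket_and_batch(uint32_t hashvalue, uint32_t nbuckets, int log2_nbatch,
                     uint32_t *bucketno, uint32_t *batchno)
{
    *bucketno = hashvalue & (nbuckets - 1);        /* low bits */
    *batchno = (log2_nbatch == 0) ? 0
        : (hashvalue >> (32 - log2_nbatch));       /* top bits */
}

int
main(void)
{
    uint32_t bucketno, batchno;

    get_bucket_and_batch(0xDEADBEEF, 1024, 2, &bucketno, &batchno);
    printf("bucket=%u batch=%u\n", bucketno, batchno);
    return 0;
}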
Re: [HACKERS] [v9.5] Custom Plan API
Kaigai-san, Sorry for the lagged response. Here are my random thoughts about the patch. I couldn't understand the patch fully, because some of the APIs are not used by ctidscan.

Custom Scan patch v2 review

* Custom plan class comparison
In backend/optimizer/util/pathnode.c, custclass is compared by bit-and with 's'. Do you plan to use custclass as a bit field? If so, values for the custom plan class should not be a character. Otherwise, custclass should be compared with the == operator.

* Purpose of GetSpecialCustomVar()
The reason why the FinalizeCustomPlan callback is necessary is not clear to me. Could you show a case where the API would be useful?

* Purpose of FinalizeCustomPlan()
The reason why the FinalizeCustomPlan callback is necessary is not clear to me, because ctidscan just calls finalize_primnode() and bms_add_members() with the given information. Could you show a case where the API would be useful?

* Is it ok to call set_cheapest() for all relkinds?
Now set_cheapest() is called not only for relations and foreign tables but also for custom plans, and other relations such as subqueries, functions, and values. Calling call_custom_scan_provider() and set_cheapest() in the case of RTE_RELATION seems similar to the old construct; what do you think about this?

* Is it hard to get rid of CopyCustomPlan()?
Copying a ForeignScan node doesn't need a per-FDW copy function, by limiting fdw_private to hold only copyable objects. Can't we use the same approach for CustomPlan? Letting authors call NodeSetTag or copyObject() sounds uncomfortable to me. This would apply to TextOutCustomPlan() and TextOutCustomPath() too.

* Is MultiExec support appropriate for the first version?
The cases needing MultiExec seem a little complex for the first version of custom scan. What kind of plan do you have in mind for this feature?

* Does SupportBackwardScan() have enough information?
Other scans check the target list with TargetListSupportsBackwardScan(). Isn't it necessary to check it for CustomPlan too in ExecSupportsBackwardScan()?

* Place to call the custom plan provider
Is it necessary to call the provider when relkind != RELKIND_RELATION? If yes, isn't it necessary to call it for append relations? I know that we are concentrating on replacing scans in the initial version, so it would not be a serious problem, but it would be good to consider an extensible design.

* Custom Plan Provider is addpath?
Passing an addpath handler as the only attribute of CUSTOM PLAN PROVIDER seems a little odd. Would using a handler like FDW does make the design too complex and/or messy?

* Superclass of CustomPlanState
CustomPlanState derives from ScanState, instead of deriving from PlanState directly. I worry about the case of non-heap-scan custom plans, but it might be ok to postpone consideration of that for the first cut.

* Naming and granularity of objects related to custom plan
I'm not sure the current naming is appropriate, especially the difference between custom plan, provider, and handler. In the context of the CREATE CUSTOM PLAN statement, what does the term custom plan mean? My impression is that a custom plan is an alternative plan type, e.g. ctidscan or pg_strom_scan. Then what does the term provider mean? My impression for that is an extension, such as ctidscan or pg_strom. The grammar allows users to pass a function via the PROVIDER clause of CREATE CUSTOM SCAN, so the function would be the provider of the custom plan created by the statement.

* enable_customscan
A GUC parameter enable_customscan would be useful for users who want to disable the custom plan feature temporarily.
In the case of pg_strom, using the GPU for limited sessions for analytic or batch applications seems handy.

* Adding a pg_custom_plan catalog
Using cust as the prefix for pg_custom_plan causes an ambiguity which makes it difficult to choose a catalog prefix for a feature named Custom Foo in the future. How about using cusp (CUStom Plan)? Or is it better to use pg_custom_plan_provider as the catalog relation name, as the document says that CREATE CUSTOM PLAN defines a custom plan provider? Then the prefix could be cpp (Custom Plan Provider). This seems to match the wording used for pg_foreign_data_wrapper.

* CREATE CUSTOM PLAN statement
This is just a question: we need to emit CREATE CUSTOM PLAN if we want to use I wonder how it is extended when supporting join as a custom class.

* New operators for TID comparison
IMO this portion should be a separate patch, because it adds OID definitions of existing operators such as tidgt and tidle. Is there any (explicit or implicit) rule about defining a macro for the OID of an operator?

* Prototype of get_custom_plan_oid()
custname (or cppname if we use the rule I proposed above) seems appropriate as the parameter name of get_custom_plan_oid(), because similar functions use catalog column names in such cases.

* Coding conventions
Some lines are indented with whitespace. Will a future pgindent run fix this issue?

* Unnecessary struct forward declarations
Forward declarations of CustomPathMethods, Plan, and CustomPlan in
Re: [HACKERS] Proposing pg_hibernate
Amit Kapila amit.kapil...@gmail.com wrote: Overall I agree that following Robert's idea will increase the time to bring the database server up and reach a state where apps can connect and start operations, I agree that warming the cache before beginning to apply WAL would be best. but I think at least with such an approach we can claim that after warming buffers with pg_hibernator, apps will always have performance greater than or equal to the case when there is no pg_hibernator. I would be careful with this claim in a NUMA environment. The whole reason I investigated NUMA issues and wrote the NUMA patch is that a customer had terrible performance which turned out to be caused by cache warming using a single connection (and thus a single process, and thus a single NUMA memory node). To ensure that this doesn't trash performance on machines with many cores and memory nodes, it would be best to try to spread the data around among nodes. One simple way of doing this would be to find some way to use multiple processes, in hopes that they would run on different CPU packages. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
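As a rough illustration of the multi-process idea (not pg_hibernator code; the worker count and data file are placeholders), each forked child below reads a disjoint slice of a file, so the page-cache pages it faults in tend to be allocated on the memory node of whichever CPU runs that child:

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

#define NWORKERS 4   /* placeholder; ideally one worker per NUMA node */

int
main(int argc, char **argv)
{
    struct stat st;
    int         i;

    if (argc != 2 || stat(argv[1], &st) != 0)
    {
        fprintf(stderr, "usage: %s datafile\n", argv[0]);
        return 1;
    }

    for (i = 0; i < NWORKERS; i++)
    {
        if (fork() == 0)
        {
            /* each child warms one slice of the file */
            char  buf[8192];
            off_t slice = (st.st_size + NWORKERS - 1) / NWORKERS;
            off_t off = (off_t) i * slice;
            off_t end = off + slice;
            int   fd = open(argv[1], O_RDONLY);

            if (fd < 0)
                _exit(1);
            while (off < end && pread(fd, buf, sizeof(buf), off) > 0)
                off += sizeof(buf);
            _exit(0);
        }
    }
    while (wait(NULL) > 0)
        ;
    return 0;
}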
Re: [HACKERS] Setting PG-version without recompiling
Andreas Joseph Krogh andr...@visena.com writes: Hi. I'm up for testing 9.4 but my JDBC-driver fails to connect due to PG's minor-version string: 4beta1. Is it possible to set this somewhere without recompiling PG? No, and even if you could, that would be the wrong approach. The right approach is to fix the JDBC driver to not complain about such version strings. I'm a bit surprised they haven't done so long since, considering how long PG beta versions have been tagged like that. For that matter, they really ought not complain about strings like 9.5devel or 9.5alpha2 either. regards, tom lane
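A minimal sketch (not the actual driver code) of the tolerant parsing Tom is suggesting: take the leading integer of each dot-separated component and ignore trailing suffixes such as beta1, devel, or alpha2:

#include <stdio.h>
#include <stdlib.h>

static void
parse_server_version(const char *s, int *major, int *minor)
{
    char *end;

    /* strtol stops at the first non-digit, so suffixes are ignored */
    *major = (int) strtol(s, &end, 10);
    *minor = (*end == '.') ? (int) strtol(end + 1, NULL, 10) : 0;
}

int
main(void)
{
    int maj, min;

    parse_server_version("9.4beta1", &maj, &min);
    printf("%d.%d\n", maj, min);   /* 9.4 */
    parse_server_version("9.5devel", &maj, &min);
    printf("%d.%d\n", maj, min);   /* 9.5 */
    return 0;
}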
Re: [HACKERS] RETURNING PRIMARY KEY syntax extension
Rushabh Lathia rushabh.lat...@gmail.com writes: On Thu, Jul 3, 2014 at 6:56 AM, Ian Barwick i...@2ndquadrant.com wrote: A better solution would be to have an optional IF EXISTS clause: RETURNING PRIMARY KEY [ IF EXISTS ] Hmm, I liked the idea of adding [ IF EXISTS ]. Looking at the grammar, so far we have the [ IF EXISTS ] option only for DDL commands (mainly DROP); not sure whether it's ok to use such syntax for DML commands. Others, please share your thoughts/comments. TBH, I thought that RETURNING PRIMARY KEY was already fairly iffy in that the application would have little idea what it was getting back. IF EXISTS would make it so spongy as to be useless, IMO. It sounds to me like designing this for JDBC's getGeneratedKeys method is a mistake. There isn't going to be any way that the driver can support that without having looked at the table's metadata for itself, and if it's going to do that then it doesn't need a crutch that only partially solves the problem anyhow. regards, tom lane
Re: [HACKERS] docs: additional subsection for page-level locks in explicit-locking section
Fujii Masao masao.fu...@gmail.com wrote: This seems to make sense. Barring objection, I will commit this only in HEAD. I'm inclined to think this is a slight improvement, just for the sake of consistency with peer level information. You probably already noticed, but the patch as submitted neglects to close the prior sect2 block before opening the new one. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Setting PG-version without recompiling
On Thursday 3 July 2014 at 16:16:01, Tom Lane t...@sss.pgh.pa.us wrote: Andreas Joseph Krogh andr...@visena.com writes: Hi. I'm up for testing 9.4 but my JDBC-driver fails to connect due to PG's minor-version string: 4beta1. Is it possible to set this somewhere without recompiling PG? No, and even if you could, that would be the wrong approach. The right approach is to fix the JDBC driver to not complain about such version strings. I'm a bit surprised they haven't done so long since, considering how long PG beta versions have been tagged like that. For that matter, they really ought not complain about strings like 9.5devel or 9.5alpha2 either. regards, tom lane I'm using the pgjdbc-ng driver; I'm sure the official driver handles this. The driver will be fixed (the issue has a pull-request), I just wondered if there was a magic knob I could use until the PR is merged. -- Andreas Joseph Krogh CTO / Partner - Visena AS Mobile: +47 909 56 963 andr...@visena.com www.visena.com
Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins
On 1.7.2014 01:24, Tomas Vondra wrote: On 30.6.2014 23:12, Tomas Vondra wrote: Hi, Hopefully I got it right this time. At least it seems to be working for cases that failed before (no file leaks, proper rowcounts so far). Attached v7, fixing nbatch/ntuples in an assert. regards Tomas diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c index 0d9663c..db3a953 100644 --- a/src/backend/commands/explain.c +++ b/src/backend/commands/explain.c @@ -1900,18 +1900,20 @@ show_hash_info(HashState *hashstate, ExplainState *es) if (es-format != EXPLAIN_FORMAT_TEXT) { ExplainPropertyLong(Hash Buckets, hashtable-nbuckets, es); + ExplainPropertyLong(Original Hash Buckets, +hashtable-nbuckets_original, es); ExplainPropertyLong(Hash Batches, hashtable-nbatch, es); ExplainPropertyLong(Original Hash Batches, hashtable-nbatch_original, es); ExplainPropertyLong(Peak Memory Usage, spacePeakKb, es); } - else if (hashtable-nbatch_original != hashtable-nbatch) + else if ((hashtable-nbatch_original != hashtable-nbatch) || (hashtable-nbuckets_original != hashtable-nbuckets)) { appendStringInfoSpaces(es-str, es-indent * 2); appendStringInfo(es-str, - Buckets: %d Batches: %d (originally %d) Memory Usage: %ldkB\n, - hashtable-nbuckets, hashtable-nbatch, - hashtable-nbatch_original, spacePeakKb); + Buckets: %d (originally %d) Batches: %d (originally %d) Memory Usage: %ldkB\n, + hashtable-nbuckets, hashtable-nbuckets_original, + hashtable-nbatch, hashtable-nbatch_original, spacePeakKb); } else { diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c index 589b2f1..53642d1 100644 --- a/src/backend/executor/nodeHash.c +++ b/src/backend/executor/nodeHash.c @@ -37,8 +37,10 @@ #include utils/lsyscache.h #include utils/syscache.h +bool enable_hashjoin_bucket = true; static void ExecHashIncreaseNumBatches(HashJoinTable hashtable); +static void ExecHashIncreaseNumBuckets(HashJoinTable hashtable); static void ExecHashBuildSkewHash(HashJoinTable hashtable, Hash *node, int mcvsToUse); static void ExecHashSkewTableInsert(HashJoinTable hashtable, @@ -48,6 +50,33 @@ static void ExecHashSkewTableInsert(HashJoinTable hashtable, static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable); +/* + * Compute appropriate size for hashtable given the estimated size of the + * relation to be hashed (number of rows and average row width). + * + * This is exported so that the planner's costsize.c can use it. + */ + +/* Target bucket loading (tuples per bucket) */ +#define NTUP_PER_BUCKET 10 + +/* Multiple of NTUP_PER_BUCKET triggering the increase of nbuckets. + * + * Once we reach the threshold we double the number of buckets, and we + * want to get 1.0 on average (to get NTUP_PER_BUCKET on average). That + * means these two equations should hold + * + * b = 2a (growth) + * (a + b)/2 = 1 (oscillate around NTUP_PER_BUCKET) + * + * which means b=1. (a = b/2). If we wanted higher threshold, we + * could grow the nbuckets to (4*nbuckets), thus using (b=4a) for + * growth, leading to (b=1.6). Or (b=8a) giving 1. etc. + * + * Let's start with doubling the bucket count, i.e. 1.333. 
*/ +#define NTUP_GROW_COEFFICIENT 1.333 +#define NTUP_GROW_THRESHOLD (NTUP_PER_BUCKET * NTUP_GROW_COEFFICIENT) + /* * ExecHash * @@ -116,6 +145,7 @@ MultiExecHash(HashState *node) /* It's a skew tuple, so put it into that hash table */ ExecHashSkewTableInsert(hashtable, slot, hashvalue, bucketNumber); +hashtable-skewTuples += 1; } else { @@ -126,6 +156,23 @@ MultiExecHash(HashState *node) } } + /* If average number of tuples per bucket is over the defined threshold, + * increase the number of buckets to get below it. */ + if (enable_hashjoin_bucket) { + + /* consider only tuples in the non-skew buckets */ + double nonSkewTuples = (hashtable-totalTuples - hashtable-skewTuples); + + if ((nonSkewTuples / hashtable-nbatch) (hashtable-nbuckets * NTUP_GROW_THRESHOLD)) { + +#ifdef HJDEBUG + printf(Increasing nbucket to %d (average per bucket = %.1f)\n, + nbuckets, (batchTuples / hashtable-nbuckets)); +#endif + ExecHashIncreaseNumBuckets(hashtable); + } + } + /* must provide our own instrumentation support */ if (node-ps.instrument) InstrStopNode(node-ps.instrument, hashtable-totalTuples); @@ -239,6 +286,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls) int nbatch; int num_skew_mcvs; int log2_nbuckets; + int log2_nbatch; int nkeys; int i; ListCell *ho; @@ -263,6 +311,10 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls) log2_nbuckets = my_log2(nbuckets); Assert(nbuckets == (1 log2_nbuckets)); + /* nbatch must be a power of 2 */ + log2_nbatch = my_log2(nbatch); + Assert(nbatch
Re: [HACKERS] tweaking NTUP_PER_BUCKET
On 3.7.2014 02:13, Tomas Vondra wrote: Hi, while hacking on the 'dynamic nbucket' patch, scheduled for the next CF (https://commitfest.postgresql.org/action/patch_view?id=1494) I was repeatedly stumbling over NTUP_PER_BUCKET. I'd like to propose a change in how we handle it. TL;DR; version -- I propose dynamic increase of the nbuckets (up to NTUP_PER_BUCKET=1) once the table is built and there's free space in work_mem. The patch mentioned above makes implementing this possible / rather simple. Attached is v1 of this experimental patch. It's supposed to be applied on top of v7 of this patch http://www.postgresql.org/message-id/53b59498.3000...@fuzzy.cz as it shared most of the implementation. So apply it like this: patch -p1 hashjoin-nbuckets-growth-v7.patch patch -p1 hashjoin-dynamic-ntup-v1.patch It implements the ideas outlined in the previous message, most importantly: (a) decreases NTUP_PER_BUCKET to 4 (b) checks free work_mem when deciding whether to add batch (c) after building the batches, increases the number of buckets as much as possible, i.e. up to the number of batch tuples, but not exceeding work_mem The improvements for the queries I tried so far are quite significant (partially due to the NTUP_PER_BUCKET decrease, partially due to the additional bucket count increase). The rebuild is quite fast - the patch measures and reports this as a WARNING, and the timings I've seen are ~12ms per 7MB (i.e. ~120ms for 70MB and ~1200ms for 700MB). Of course, this only makes sense when compared to how much time it saved, but for the queries I tested so far this was a good investment. However it's likely there are queries where this may not be the case, i.e. where rebuilding the hash table is not worth it. Let me know if you can construct such query (I wasn't). regards Tomas diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c index 53642d1..a4623dc 100644 --- a/src/backend/executor/nodeHash.c +++ b/src/backend/executor/nodeHash.c @@ -58,7 +58,7 @@ static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable); */ /* Target bucket loading (tuples per bucket) */ -#define NTUP_PER_BUCKET 10 +#define NTUP_PER_BUCKET 4 /* Multiple of NTUP_PER_BUCKET triggering the increase of nbuckets. * @@ -77,6 +77,8 @@ static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable); #define NTUP_GROW_COEFFICIENT 1.333 #define NTUP_GROW_THRESHOLD (NTUP_PER_BUCKET * NTUP_GROW_COEFFICIENT) +#define SPACE_USED(hashtable, nbuckets) ((hashtable)-spaceUsed + (nbuckets) * sizeof(void*)) + /* * ExecHash * @@ -156,21 +158,33 @@ MultiExecHash(HashState *node) } } - /* If average number of tuples per bucket is over the defined threshold, - * increase the number of buckets to get below it. */ +/* + * Consider resizing the hash table (number of buckets) for better + * lookup performance. The code in ExecHashTableInsert guarantees + * we have enough memory to reach NTUP_PER_BUCKET, but maybe we can + * do better - getting lower number of tuples per bucket (up to + * NTUP_PER_BUCKET=1). 
+	 */
 	if (enable_hashjoin_bucket) {

-		/* consider only tuples in the non-skew buckets */
-		double nonSkewTuples = (hashtable->totalTuples - hashtable->skewTuples);
-
-		if ((nonSkewTuples / hashtable->nbatch) > (hashtable->nbuckets * NTUP_GROW_THRESHOLD)) {
+		instr_time	start_time, end_time;

 #ifdef HJDEBUG
-		printf("Increasing nbucket to %d (average per bucket = %.1f)\n",
-			   nbuckets, (batchTuples / hashtable->nbuckets));
+		printf("Increasing nbucket to %d (average per bucket = %.1f)\n",
+			   nbuckets, (batchTuples / hashtable->nbuckets));
 #endif
-			ExecHashIncreaseNumBuckets(hashtable);
-		}
+
+		elog(WARNING, "hash resize (start) : nbuckets=%d", hashtable->nbuckets);
+
+		INSTR_TIME_SET_CURRENT(start_time);
+
+		ExecHashIncreaseNumBuckets(hashtable);
+
+		INSTR_TIME_SET_CURRENT(end_time);
+		INSTR_TIME_SUBTRACT(end_time, start_time);
+
+		elog(WARNING, "hash resize (end) : nbuckets=%d duration=%.3f",
+			 hashtable->nbuckets, INSTR_TIME_GET_MILLISEC(end_time));
 	}

 	/* must provide our own instrumentation support */
@@ -738,35 +752,34 @@ ExecHashIncreaseNumBuckets(HashJoinTable hashtable)
 	int			oldnbuckets = hashtable->nbuckets;
 	HashJoinTuple *oldbuckets = hashtable->buckets;
 	MemoryContext oldcxt;
-	double		batchTuples = (hashtable->totalTuples / hashtable->nbatch);
+
+	/* average number of tuples per batch */
+	double		batchTuples = (hashtable->totalTuples - hashtable->skewTuples) / hashtable->nbatch;
+
+	/* memory available for buckets */
+	Size		freeMemory = (hashtable->spaceAllowed - hashtable->spaceUsed);

 	/*
-	 * Determine the proper number of buckets, i.e. stop once the average
-	 * per bucket gets below the threshold (1.33 * NTUP_PER_BUCKET).
-
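[Editor's note: as a rough illustration of the sizing rule the patch describes - grow the bucket array toward one tuple per bucket, but never beyond what the remaining work_mem budget can pay for, given that each bucket costs sizeof(void *) - here is a self-contained sketch. It is not code from the patch; the function and variable names are made up for the illustration.]

	#include <stddef.h>

	/* Round n up to the next power of two (sketch; assumes n > 0). */
	static size_t
	next_pow2(size_t n)
	{
		size_t		p = 1;

		while (p < n)
			p <<= 1;
		return p;
	}

	/*
	 * Choose a new bucket count: aim for ~1 tuple per bucket (the
	 * NTUP_PER_BUCKET=1 target mentioned above), capped by the number of
	 * bucket pointers the free work_mem budget can hold. Only ever grows.
	 */
	static size_t
	choose_nbuckets(double batchTuples, size_t freeMemory, size_t nbuckets)
	{
		size_t		ideal = next_pow2((size_t) batchTuples);
		size_t		maxbuckets = freeMemory / sizeof(void *);

		/* largest power of two that still fits in the memory budget */
		size_t		affordable = (maxbuckets > 0) ? next_pow2(maxbuckets + 1) / 2 : 1;

		if (ideal > affordable)
			ideal = affordable;
		return (ideal > nbuckets) ? ideal : nbuckets;
	}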
Re: [HACKERS] tweaking NTUP_PER_BUCKET
Tomas, * Tomas Vondra (t...@fuzzy.cz) wrote: However it's likely there are queries where this may not be the case, i.e. where rebuilding the hash table is not worth it. Let me know if you can construct such a query (I wasn't able to). Thanks for working on this! I've been thinking on this for a while and this seems like it may be a good approach. Have you considered a bloom filter over the buckets..? Also, I'd suggest you check the archives from about this time last year for test cases that I was using which showed cases where hashing the larger table was a better choice- those same cases may also show regression here (or at least would be something good to test). Have you tried to work out what a 'worst case' regression for this change would look like? Also, how does the planning around this change? Are we more likely now to hash the smaller table (I'd guess 'yes' just based on the reduction in NTUP_PER_BUCKET, but did you make any changes due to the rehashing cost?)? Thanks, Stephen
Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins
On 3.7.2014 15:42, Atri Sharma wrote: On Tue, Jul 1, 2014 at 4:54 AM, Tomas Vondra t...@fuzzy.cz mailto:t...@fuzzy.cz wrote: On 30.6.2014 23:12, Tomas Vondra wrote: Hi, attached is v5 of the patch. The main change is that scaling the number of buckets is done only once, after the initial hash table is built. The main advantage of this is lower price. This also allowed some cleanup of unnecessary code. However, this new patch causes warnings like this:

WARNING: temporary file leak: File 231 still referenced

I've been eyeballing the code for a while now, but I still fail to see where this comes from :-( Any ideas? Meh, the patches are wrong - I hadn't realized the tight coupling between buckets/batches in ExecHashGetBucketAndBatch:

*bucketno = hashvalue & (nbuckets - 1);
*batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);

The previous patches worked mostly by pure luck (the nbuckets was set only in the first batch), but once I moved the code to the end, it started failing. And by worked I mean didn't throw an error, but probably returned bogus results with (nbatch > 1). However, ISTM this nbuckets-nbatch coupling is not really necessary, it's only constructed this way to assign independent batch/bucket bits. So I went and changed the code like this:

*bucketno = hashvalue & (nbuckets - 1);
*batchno = (hashvalue >> (32 - hashtable->log2_nbatch));

I.e. using the top bits for batchno, low bits for bucketno (as before). Hopefully I got it right this time. At least it seems to be working for cases that failed before (no file leaks, proper rowcounts so far).

Hi, I had a look at the patch and I was wondering if there is a way we can set a cap on the resized size, and increase the number of batches instead? Since this patch evaluates the growth of tuples vs increase of space, we could look at increasing the number of batches itself instead of the number of buckets, if the resized number of buckets exceeds a threshold. It shouldn't be too hard; AIUI it would involve a call in MultiExecHash right after the new cost evaluation the patch adds.

So you propose to fill the hash table, and when hitting NTUP_PER_BUCKET (i.e. after adding 'nbuckets * NTUP_PER_BUCKET' tuples) increase the number of batches? Hmmm, that'd be easy to implement. I don't think it's possible to do that in MultiExecHash (that's too late). But it's a matter of changing this condition in ExecHashTableInsert:

if (hashtable->spaceUsed > hashtable->spaceAllowed)
    ExecHashIncreaseNumBatches(hashtable);

to something like this

if ((hashtable->spaceUsed > hashtable->spaceAllowed) ||
    (hashtable->totalTuples / hashtable->nbatch == NTUP_PER_BUCKET * hashtable->nbuckets))
    ExecHashIncreaseNumBatches(hashtable);

But I don't think increasing the number of batches is a good approach, as it means more rescans. I think using memory is more efficient (after all, that's what work_mem is for, right?).

regards Tomas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
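[Editor's note: to make the two bit layouts above concrete, here is a small self-contained sketch of the old and new schemes (an illustration, not code from the patch). With the new layout, batchno comes from the top bits of the hash value, so growing nbuckets never changes a tuple's batch assignment - exactly the property the resize-after-build approach needs.]

	#include <stdint.h>

	/* Old scheme: batchno taken from the bits just above the bucket bits,
	 * so changing nbuckets silently changes batch assignment. */
	static void
	old_split(uint32_t hashvalue, int nbuckets, int log2_nbuckets, int nbatch,
			  int *bucketno, int *batchno)
	{
		*bucketno = hashvalue & (nbuckets - 1);
		*batchno = (hashvalue >> log2_nbuckets) & (nbatch - 1);
	}

	/* New scheme: batchno from the top bits, bucketno from the low bits;
	 * nbuckets can grow without moving tuples between batches.
	 * (log2_nbatch == 0 means a single batch; shifting by 32 would be
	 * undefined behavior, so handle that case explicitly.) */
	static void
	new_split(uint32_t hashvalue, int nbuckets, int log2_nbatch,
			  int *bucketno, int *batchno)
	{
		*bucketno = hashvalue & (nbuckets - 1);
		*batchno = (log2_nbatch == 0) ? 0 : (int) (hashvalue >> (32 - log2_nbatch));
	}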
Re: [HACKERS] tweaking NTUP_PER_BUCKET
On Thu, Jul 3, 2014 at 11:40 PM, Stephen Frost sfr...@snowman.net wrote: Tomas, * Tomas Vondra (t...@fuzzy.cz) wrote: However it's likely there are queries where this may not be the case, i.e. where rebuilding the hash table is not worth it. Let me know if you can construct such a query (I wasn't able to). Thanks for working on this! I've been thinking on this for a while and this seems like it may be a good approach. Have you considered a bloom filter? IIRC, last time we tried doing bloom filters, I was short of real-world, useful hash functions that we could use for building the bloom filter. If we are restarting experiments on this, I would be glad to assist. Regards, Atri
Re: [HACKERS] tweaking NTUP_PER_BUCKET
Hi Stephen, On 3.7.2014 20:10, Stephen Frost wrote: Tomas, * Tomas Vondra (t...@fuzzy.cz) wrote: However it's likely there are queries where this may not be the case, i.e. where rebuilding the hash table is not worth it. Let me know if you can construct such a query (I wasn't able to). Thanks for working on this! I've been thinking on this for a while and this seems like it may be a good approach. Have you considered a bloom filter over the buckets..? Also, I'd suggest you check the

I know you've experimented with it, but I haven't looked into that yet.

archives from about this time last year for test cases that I was using which showed cases where hashing the larger table was a better choice- those same cases may also show regression here (or at least would be something good to test).

Good idea, I'll look at the test cases - thanks.

Have you tried to work out what a 'worst case' regression for this change would look like? Also, how does the planning around this change? Are we more likely now to hash the smaller table (I'd guess 'yes' just based on the reduction in NTUP_PER_BUCKET, but did you make any changes due to the rehashing cost?)?

The case I was thinking about is underestimated cardinality of the inner table and a small outer table. That'd lead to a large hash table and very few lookups (thus making the rehash inefficient). I.e. something like this:

Hash Join
  ->  Seq Scan on small_table (rows=100) (actual rows=100)
  ->  Hash
        ->  Seq Scan on bad_estimate (rows=100) (actual rows=10)
              Filter: ((a < 100) AND (b < 100))

But I wasn't able to reproduce this reasonably, because in practice that'd lead to a nested loop or something like that (which is a planning issue, impossible to fix in hashjoin code).

Tomas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tweaking NTUP_PER_BUCKET
On Thu, Jul 3, 2014 at 11:40 AM, Atri Sharma atri.j...@gmail.com wrote: IIRC, last time we tried doing bloom filters, I was short of real-world, useful hash functions that we could use for building the bloom filter. Last time, we wanted to use bloom filters in hash joins to filter out tuples that won't match any of the future hash batches, to reduce the amount of tuples that need to be spilled to disk. However the problem was that it was unclear, for a given amount of memory usage, how to pick the right size of bloom filter and how to model how much it would save versus how much it would cost in reduced hash table size. I think it just required some good empirical tests and hash-join-heavy workloads to come up with some reasonable guesses. We don't need a perfect model, just some reasonable bloom filter size that we're pretty sure will usually help more than it hurts. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
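[Editor's note: for context on the sizing question raised here, the textbook Bloom filter formulas relate memory to false-positive rate. This is a sketch of that standard math, not anything from a PostgreSQL patch: m = -n * ln(p) / (ln 2)^2 bits and k = (m/n) * ln 2 hash functions for n entries at false-positive rate p.]

	#include <math.h>

	/* Textbook Bloom filter sizing for n expected entries and target
	 * false-positive rate p (0 < p < 1). */
	static void
	bloom_size(double n, double p, double *m_bits, int *k_hashes)
	{
		*m_bits = -n * log(p) / (M_LN2 * M_LN2);
		*k_hashes = (int) round((*m_bits / n) * M_LN2);
	}

For example, n = 1e6 entries at p = 0.01 works out to roughly 9.6 bits per entry and k = 7 hash functions, which is the kind of memory-versus-benefit tradeoff being discussed.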
Re: [HACKERS] RETURNING PRIMARY KEY syntax extension
On Thu, Jul 3, 2014 at 7:37 AM, Tom Lane t...@sss.pgh.pa.us wrote: TBH, I thought that RETURNING PRIMARY KEY was already fairly iffy in that the application would have little idea what it was getting back. IF EXISTS would make it so spongy as to be useless, IMO. Seems easy enough to look at the count of columns in the result set. But it seems like noise words -- if you don't put IF EXISTS then surely you'll get the same behaviour anyway, no? Fwiw, when I wrote ORM-like layers I used to describe the output of the query, including sometimes issuing WHERE 1<0 queries and looking at the output column types when I needed that before executing the query. Using table metadata would have required a much more in-depth understanding of how the database worked and also wouldn't handle expressions, joins, set operations, etc. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
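[Editor's note: the describe trick Greg mentions works because a provably false predicate returns zero rows while still reporting the result set's column names and types. An illustrative example with a hypothetical table:]

	-- Returns no rows, but the result-set metadata still describes
	-- every output column, including expressions and joins:
	SELECT * FROM some_table WHERE 1 < 0;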
Re: [HACKERS] Aggregate function API versus grouping sets
Tom == Tom Lane t...@sss.pgh.pa.us writes:

 Tom If we're going to do that, I think it needs to be in 9.4. Are
 Tom you going to work up a patch?

How's this? (applies and passes regression on 9.4 and master)

-- Andrew (irc:RhodiumToad)

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 09ff035..0388735 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -2199,44 +2199,56 @@ AggGetAggref(FunctionCallInfo fcinfo)
 }

 /*
- * AggGetPerTupleEContext - fetch per-input-tuple ExprContext
+ * AggGetTempMemoryContext - fetch short-term memory context
  *
- * This is useful in agg final functions; the econtext returned is the
- * same per-tuple context that the transfn was called in (which can
- * safely get reset during the final function).
+ * This is useful in agg final functions; the context returned is one that the
+ * final function can safely reset as desired. This isn't useful for transition
+ * functions, since the context returned MAY (we don't promise) be the same as
+ * the context those are called in.
  *
  * As above, this is currently not useful for aggs called as window functions.
  */
-ExprContext *
-AggGetPerTupleEContext(FunctionCallInfo fcinfo)
+MemoryContext
+AggGetTempMemoryContext(FunctionCallInfo fcinfo)
 {
 	if (fcinfo->context && IsA(fcinfo->context, AggState))
 	{
 		AggState   *aggstate = (AggState *) fcinfo->context;

-		return aggstate->tmpcontext;
+		return aggstate->tmpcontext->ecxt_per_tuple_memory;
 	}
 	return NULL;
 }

 /*
- * AggGetPerAggEContext - fetch per-output-tuple ExprContext
+ * AggRegisterCallback - register a cleanup callback linked to aggcontext
  *
  * This is useful for aggs to register shutdown callbacks, which will ensure
- * that non-memory resources are freed.
+ * that non-memory resources are freed. These callbacks will occur just before
+ * the associated aggcontext (as returned by AggCheckCallContext) is reset,
+ * either between groups or as a result of rescanning the query. They will NOT
+ * be called on error paths. The primary intent is to allow for freeing of
+ * tuplestores or tuplesorts maintained in aggcontext, or pins held by slots
+ * created by the agg functions. (They will not be called until after the
+ * result of the finalfn is no longer needed, so it's safe for the finalfn to
+ * return data that will be freed by the callback.)
 *
 * As above, this is currently not useful for aggs called as window functions.
 */
-ExprContext *
-AggGetPerAggEContext(FunctionCallInfo fcinfo)
+void
+AggRegisterCallback(FunctionCallInfo fcinfo,
+					ExprContextCallbackFunction func,
+					Datum arg)
 {
 	if (fcinfo->context && IsA(fcinfo->context, AggState))
 	{
 		AggState   *aggstate = (AggState *) fcinfo->context;

-		return aggstate->ss.ps.ps_ExprContext;
+		RegisterExprContextCallback(aggstate->ss.ps.ps_ExprContext, func, arg);
+
+		return;
 	}
-	return NULL;
+	elog(ERROR, "Aggregate function cannot register a callback in this context");
 }
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
index efb0411..d116a63 100644
--- a/src/backend/utils/adt/orderedsetaggs.c
+++ b/src/backend/utils/adt/orderedsetaggs.c
@@ -49,8 +49,6 @@ typedef struct OSAPerQueryState
 	MemoryContext qcontext;
 	/* Memory context containing per-group data: */
 	MemoryContext gcontext;
-	/* Agg plan node's output econtext: */
-	ExprContext *peraggecontext;

 	/* These fields are used only when accumulating tuples: */

@@ -117,7 +115,6 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
 	Aggref	   *aggref;
 	MemoryContext qcontext;
 	MemoryContext gcontext;
-	ExprContext *peraggecontext;
 	List	   *sortlist;
 	int			numSortCols;

@@ -133,10 +130,6 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
 		elog(ERROR, "ordered-set aggregate called in non-aggregate context");
 	if (!AGGKIND_IS_ORDERED_SET(aggref->aggkind))
 		elog(ERROR, "ordered-set aggregate support function called for non-ordered-set aggregate");
-	/* Also get output exprcontext so we can register shutdown callback */
-	peraggecontext = AggGetPerAggEContext(fcinfo);
-	if (!peraggecontext)
-		elog(ERROR, "ordered-set aggregate called in non-aggregate context");

 	/*
 	 * Prepare per-query structures in the fn_mcxt, which we assume is the
@@ -150,7 +143,6 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
 	qstate->aggref = aggref;
 	qstate->qcontext = qcontext;
 	qstate->gcontext = gcontext;
-	qstate->peraggecontext = peraggecontext;

 	/* Extract the sort information */
 	sortlist = aggref->aggorder;
@@ -291,9 +283,9 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
 	osastate->number_of_rows = 0;

 	/* Now register a shutdown callback to clean things up */
-	RegisterExprContextCallback(qstate->peraggecontext,
-								ordered_set_shutdown,
-								PointerGetDatum(osastate));
+	AggRegisterCallback(fcinfo,
+						ordered_set_shutdown,
+
[HACKERS] Pg_upgrade and toast tables bug discovered
There have been periodic reports of pg_upgrade errors related to toast tables. The most recent one was from May of this year: http://www.postgresql.org/message-id/flat/20140520202223.gb3...@momjian.us#20140520202223.gb3...@momjian.us

The error was:

Copying user relation files /var/lib/postgresql/8.4/main/base/4275487/4278965
Mismatch of relation OID in database "FNBooking": old OID 4279499, new OID 19792
Failure, exiting

and the fix is to add a dummy TEXT column to the table on the old cluster to force a toast table, then drop the dummy column. I have had trouble getting a table schema that is causing problems, but received a report via EDB support recently that had a simple schema (anonymized):

CREATE TABLE pg_upgrade_toast_test (
    x1 numeric(15,0),
    x2 numeric(15,0),
    x3 character varying(15),
    x4 character varying(60),
    x5 numeric(15,0),
    x6 numeric(15,0),
    x7 character varying(15),
    x8 character varying(60),
    x9 numeric(15,0),
    x10 character varying(15),
    x11 character varying(60),
    x12 numeric(15,0),
    x13 numeric(15,0),
    x14 character varying(15),
    x15 character varying(60),
    x16 numeric(15,0),
    x17 character varying(15),
    x18 character varying(60),
    x19 numeric(15,0),
    x20 character varying(15),
    x21 character varying(60)
) WITH (OIDS=FALSE);

needs_toast_table() computes the length of this table as 2024 bytes in 9.0, and 2064 bytes on 9.1, with the TOAST threshold being 2032 bytes. It turns out it is this commit that causes the difference:

commit 97f38001acc61449f7ac09c539ccc29e40fecd26
Author: Robert Haas rh...@postgresql.org
Date: Wed Aug 4 17:33:09 2010 +

    Fix numeric_maximum_size() calculation.

    The old computation can sometimes underestimate the necessary space
    by 2 bytes; however we're not back-patching this, because this result
    isn't used for anything critical. Per discussion with Tom Lane, make
    the typmod test in this function match the ones in numeric() and
    apply_typmod() exactly.

It seems the impact of this patch on pg_upgrade wasn't considered, or even realized until now. Suggestions on a fix? My initial idea is to allow for toast tables in the new cluster that aren't in the old cluster by skipping over the extra toast tables. This would only be for pre-9.1 old clusters. It would not involve adding toast tables to the old cluster, as pg_upgrade never modifies the old cluster. We already handle cases where the old cluster had toast tables and the new cluster wouldn't ordinarily have them. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
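[Editor's note: the workaround described above can be scripted on the old cluster before running pg_upgrade. Adding any varlena column such as TEXT forces creation of a toast table, and the toast table remains after the column is dropped. Shown with the example table; the column name is arbitrary:]

	ALTER TABLE pg_upgrade_toast_test ADD COLUMN dummy_toast_forcer TEXT;
	ALTER TABLE pg_upgrade_toast_test DROP COLUMN dummy_toast_forcer;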
Re: [HACKERS] Aggregate function API versus grouping sets
Here are two more, cumulative with the previous patch and each other: The first removes the assumption in ordered_set_startup that the aggcontext can be cached in fn_extra, and puts it in the transvalue instead. The second exposes the OSA* structures in a header file, so that user-defined ordered-set aggs can use ordered_set_transition[_multi] directly rather than having to cargo-cult it into their own code.

-- Andrew (irc:RhodiumToad)

diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
index d116a63..92fa9f3 100644
--- a/src/backend/utils/adt/orderedsetaggs.c
+++ b/src/backend/utils/adt/orderedsetaggs.c
@@ -47,8 +47,6 @@ typedef struct OSAPerQueryState
 	Aggref	   *aggref;
 	/* Memory context containing this struct and other per-query data: */
 	MemoryContext qcontext;
-	/* Memory context containing per-group data: */
-	MemoryContext gcontext;

 	/* These fields are used only when accumulating tuples: */

@@ -86,6 +84,8 @@ typedef struct OSAPerGroupState
 {
 	/* Link to the per-query state for this aggregate: */
 	OSAPerQueryState *qstate;
+	/* MemoryContext for per-group data */
+	MemoryContext gcontext;
 	/* Sort object we're accumulating data in: */
 	Tuplesortstate *sortstate;
 	/* Number of normal rows inserted into sortstate: */
@@ -104,6 +104,15 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
 	OSAPerGroupState *osastate;
 	OSAPerQueryState *qstate;
 	MemoryContext oldcontext;
+	MemoryContext gcontext;
+
+	/*
+	 * Check we're called as aggregate (and not a window function), and
+	 * get the Agg node's group-lifespan context (which might not be the
+	 * same throughout the query, but will be the same for each transval)
+	 */
+	if (AggCheckCallContext(fcinfo, &gcontext) != AGG_CONTEXT_AGGREGATE)
+		elog(ERROR, "ordered-set aggregate called in non-aggregate context");

 	/*
 	 * We keep a link to the per-query state in fn_extra; if it's not there,
@@ -114,16 +123,9 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
 	{
 		Aggref	   *aggref;
 		MemoryContext qcontext;
-		MemoryContext gcontext;
 		List	   *sortlist;
 		int			numSortCols;

-		/*
-		 * Check we're called as aggregate (and not a window function), and
-		 * get the Agg node's group-lifespan context
-		 */
-		if (AggCheckCallContext(fcinfo, &gcontext) != AGG_CONTEXT_AGGREGATE)
-			elog(ERROR, "ordered-set aggregate called in non-aggregate context");
 		/* Need the Aggref as well */
 		aggref = AggGetAggref(fcinfo);
 		if (!aggref)
@@ -142,7 +144,6 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
 		qstate = (OSAPerQueryState *) palloc0(sizeof(OSAPerQueryState));
 		qstate->aggref = aggref;
 		qstate->qcontext = qcontext;
-		qstate->gcontext = gcontext;

 		/* Extract the sort information */
 		sortlist = aggref->aggorder;
@@ -259,10 +260,11 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
 	}

 	/* Now build the stuff we need in group-lifespan context */
-	oldcontext = MemoryContextSwitchTo(qstate->gcontext);
+	oldcontext = MemoryContextSwitchTo(gcontext);

 	osastate = (OSAPerGroupState *) palloc(sizeof(OSAPerGroupState));
 	osastate->qstate = qstate;
+	osastate->gcontext = gcontext;

 	/* Initialize tuplesort object */
 	if (use_tuples)
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
index 92fa9f3..bbad0e5 100644
--- a/src/backend/utils/adt/orderedsetaggs.c
+++ b/src/backend/utils/adt/orderedsetaggs.c
@@ -16,6 +16,8 @@

 #include <math.h>

+#include "utils/orderedset.h"
+
 #include "catalog/pg_aggregate.h"
 #include "catalog/pg_operator.h"
 #include "catalog/pg_type.h"
@@ -30,71 +32,8 @@
 #include "utils/tuplesort.h"


-/*
- * Generic support for ordered-set aggregates
- *
- * The state for an ordered-set aggregate is divided into a per-group struct
- * (which is the internal-type transition state datum returned to nodeAgg.c)
- * and a per-query struct, which contains data and sub-objects that we can
- * create just once per query because they will not change across groups.
- * The per-query struct and subsidiary data live in the executor's per-query
- * memory context, and go away implicitly at ExecutorEnd().
- */
-
-typedef struct OSAPerQueryState
-{
-	/* Aggref for this aggregate: */
-	Aggref	   *aggref;
-	/* Memory context containing this struct and other per-query data: */
-	MemoryContext qcontext;
-
-	/* These fields are used only when accumulating tuples: */
-
-	/* Tuple descriptor for tuples inserted into sortstate: */
-	TupleDesc	tupdesc;
-	/* Tuple slot we can use for inserting/extracting tuples: */
-	TupleTableSlot *tupslot;
-	/* Per-sort-column sorting information */
-	int			numSortCols;
-	AttrNumber *sortColIdx;
-	Oid		   *sortOperators;
-	Oid		   *eqOperators;
-	Oid		   *sortCollations;
-	bool	   *sortNullsFirsts;
-	/* Equality operator call info, created only if needed: */
-	FmgrInfo   *equalfns;
-
-	/* These fields are used only when accumulating datums: */
-
-	/* Info about datatype
Re: [HACKERS] Array of composite types returned from python
I wrote: Ronan Dunklau ronan.dunk...@dalibo.com writes: I don't know how to do that without implementing the cache itself. I don't either, but my thought was that we could hack up a simple one-element cache pretty trivially, eg static info and desc variables in PLyObject_ToComposite that are initialized the first time through. You could only test one composite-array type per session with that sort of kluge, but that would be good enough for doing some simple performance testing. I did that, and found that building and returning a million-element composite array took about 4.2 seconds without any optimization, and 3.2 seconds with the hacked-up cache (as of HEAD, asserts off). I'd say that means we might want to do something about it eventually, but it's hardly the first order of business. I've committed the patch with a bit of additional cleanup. I credited Ronan and Ed equally as authors, since I'd say the fix for plpy.prepare was at least as complex as the original patch. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
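[Editor's note: for readers curious what the quick hack Tom describes looks like, a one-element cache of this kind is just a pair of statics keyed on the type. This is a sketch with hypothetical variable names, not the committed code; lookup_rowtype_tupdesc_copy() is the real typcache lookup:]

	/* one-element cache: remembers the last composite type seen */
	static Oid			cached_typoid = InvalidOid;
	static TupleDesc	cached_desc = NULL;

	if (cached_desc == NULL || cached_typoid != typoid)
	{
		/* first call, or a different composite type: refill the cache */
		cached_desc = lookup_rowtype_tupdesc_copy(typoid, -1);
		cached_typoid = typoid;
	}
	/* reuse cached_desc for every element of the array */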
Re: [HACKERS] RETURNING PRIMARY KEY syntax extension
Greg Stark st...@mit.edu writes: On Thu, Jul 3, 2014 at 7:37 AM, Tom Lane t...@sss.pgh.pa.us wrote: TBH, I thought that RETURNING PRIMARY KEY was already fairly iffy in that the application would have little idea what it was getting back. IF EXISTS would make it so spongy as to be useless, IMO. Seems easy enough to look at the count of columns in the result set. Sure, if you know how many columns are in the table to begin with; but that goes back to having table metadata. If you don't, you're probably going to be issuing RETURNING *, and AFAICS RETURNING *, PRIMARY KEY would be entirely useless (even without IF EXISTS :-(). I'm still unconvinced that there's a robust use-case here. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pg_upgrade and toast tables bug discovered
Bruce Momjian br...@momjian.us writes: I have had trouble getting a table schema that is causing problems, but received a report via EDB support recently that had a simple schema (anonymized): ... needs_toast_table() computes the length of this table as 2024 bytes in 9.0, and 2064 bytes on 9.1, with the TOAST threshold being 2032 bytes. My initial idea is to to allow for toast tables in the new cluster that aren't in the old cluster by skipping over the extra toast tables. This would only be for pre-9.1 old clusters. TBH, it has never been more than the shakiest of assumptions that the new version could not create toast tables where the old one hadn't. I think you should just allow for that case, independently of specific PG versions. Fortunately it seems easy enough, since you certainly don't need to put any data into the new toast tables. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tweaking NTUP_PER_BUCKET
On 3.7.2014 20:50, Tomas Vondra wrote: Hi Stephen, On 3.7.2014 20:10, Stephen Frost wrote: Tomas, * Tomas Vondra (t...@fuzzy.cz) wrote: However it's likely there are queries where this may not be the case, i.e. where rebuilding the hash table is not worth it. Let me know if you can construct such query (I wasn't). Thanks for working on this! I've been thinking on this for a while and this seems like it may be a good approach. Have you considered a bloom filter over the buckets..? Also, I'd suggest you check the archives from about this time last year for test cases that I was using which showed cases where hashing the larger table was a better choice- those same cases may also show regression here (or at least would be something good to test). Good idea, I'll look at the test cases - thanks. I can't find the thread / test cases in the archives. I've found this thread in hackers: http://www.postgresql.org/message-id/caoezvif-r-ilf966weipk5by-khzvloqpwqurpak3p5fyw-...@mail.gmail.com Can you point me to the right one, please? regards Tomas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pg_upgrade and toast tables bug discovered
On 2014-07-03 17:09:41 -0400, Tom Lane wrote: Bruce Momjian br...@momjian.us writes: I have had trouble getting a table schema that is causing problems, but received a report via EDB support recently that had a simple schema (anonymized): ... needs_toast_table() computes the length of this table as 2024 bytes in 9.0, and 2064 bytes on 9.1, with the TOAST threshold being 2032 bytes. My initial idea is to to allow for toast tables in the new cluster that aren't in the old cluster by skipping over the extra toast tables. This would only be for pre-9.1 old clusters. TBH, it has never been more than the shakiest of assumptions that the new version could not create toast tables where the old one hadn't. I think you should just allow for that case, independently of specific PG versions. Fortunately it seems easy enough, since you certainly don't need to put any data into the new toast tables. I don't think it's just that simple unfortunately. If pg_class entries get created that didn't exist on the old server there's a chance for oid conflicts. Consider SELECT binary_upgrade.set_next_heap_pg_class_oid('17094'::pg_catalog.oid); CREATE TABLE table_without_toast_in_old_server(...); -- heap oid 17094, toast oid 16384 SELECT binary_upgrade.set_next_heap_pg_class_oid('16384'::pg_catalog.oid); CREATE TABLE another_table(...); ERROR: could not create file ...: File exists I think we even had reports of such a problem. The most robust, but not trivial, approach seems to be to prevent toast table creation if there wasn't a set_next_toast_pg_class_oid(). Then, after all relations are created, iterate over all pg_class entries that possibly need toast tables and recheck if they now might need one. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Aggregate function API versus grouping sets
Andrew Gierth and...@tao11.riddles.org.uk writes: Tom == Tom Lane t...@sss.pgh.pa.us writes: Tom If we're going to do that, I think it needs to be in 9.4. Are Tom you going to work up a patch? How's this? (applies and passes regression on 9.4 and master) Pushed with minor cosmetic adjustments. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Aggregate function API versus grouping sets
Tom == Tom Lane t...@sss.pgh.pa.us writes: How's this? (applies and passes regression on 9.4 and master) Tom Pushed with minor cosmetic adjustments. Thanks! -- Andrew (irc:RhodiumToad) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Aggregate function API versus grouping sets
Andrew Gierth and...@tao11.riddles.org.uk writes: Here are two more, cumulative with the previous patch and each other: The first removes the assumption in ordered_set_startup that the aggcontext can be cached in fn_extra, and puts it in the transvalue instead. OK, done. (ATM it seems like we wouldn't need gcontext in the transvalue either, but I left it there in case we want it later.) The second exposes the OSA* structures in a header file, so that user-defined ordered-set aggs can use ordered_set_transition[_multi] directly rather than having to cargo-cult it into their own code. I'm not very happy about this one; we intentionally did not expose these structs, so that we could freely make fixes like, for example, your immediately preceding patch. I think it'd be worth considering whether there's a way to allow third-party ordered-set aggs to use the infrastructure in orderedsetaggs.c while still treating these structs as opaque. In any case we'd need a more carefully thought-through API than just decreeing that some private structs are now public. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] RETURNING PRIMARY KEY syntax extension
On 4 July 2014 00:07, Tom Lane t...@sss.pgh.pa.us wrote: TBH, I thought that RETURNING PRIMARY KEY was already fairly iffy in that the application would have little idea what it was getting back. IF EXISTS would make it so spongy as to be useless, IMO.

IF EXISTS is pretty pointless - while the behaviour of getGeneratedKeys() isn't defined for cases where there aren't any, it's only meaningful if the caller has previously asked for the keys to be returned, and someone asking to do that where it doesn't make sense can get an error as far as I'm concerned. No-one does this in practice. Looking at the feature as a more general SQL construct, ISTM that if someone requests RETURNING PRIMARY KEY where there isn't one, an error is appropriate. And for the IF EXISTS case, when on earth will someone request a primary key if they're not even sure one exists?

It sounds to me like designing this for JDBC's getGeneratedKeys method is a mistake. There isn't going to be any way that the driver can support that without having looked at the table's metadata for itself, and if it's going to do that then it doesn't need a crutch that only partially solves the problem anyhow.

Sure it can - it appends RETURNING PRIMARY KEY and hands back a ResultSet from whatever was returned. It's CURRENTLY doing that, but it's appending RETURNING * and leaving it up to the caller of getGeneratedKeys() to work out which columns the caller is interested in. Turns out that's actually ok - most Java-based ORMs have more than enough metadata about the tables they manage to know which columns are the primary key. It's the driver that doesn't have that information, which is where RETURNING PRIMARY KEY can help, by not forcing the driver to request that every single column is returned.

The only downside that I see is cases where someone requests the keys to be returned but already has a RETURNING clause in their insert statement - what if the requested columns don't include the primary key? IMO it's probably enough to document that if the caller wants to issue a RETURNING clause then they'd better make sure it contains the columns they're interested in. This is already way outside anything that standard ORMs will do (they don't know about RETURNING), so anyone doing this is hand-crafting anyway.

Cheers Tom
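[Editor's note: to make the driver-side rewriting concrete, the difference being discussed looks like this - illustrative SQL only; the table and columns are hypothetical:]

	-- What the JDBC driver sends today when generated keys are requested:
	INSERT INTO orders (customer, total) VALUES ('acme', 42.0) RETURNING *;

	-- What it could send with this patch, avoiding the full-row round trip:
	INSERT INTO orders (customer, total) VALUES ('acme', 42.0) RETURNING PRIMARY KEY;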
Re: [HACKERS] RETURNING PRIMARY KEY syntax extension
Tom Dunstan pg...@tomd.cc writes: On 4 July 2014 00:07, Tom Lane t...@sss.pgh.pa.us wrote: It sounds to me like designing this for JDBC's getGeneratedKeys method is a mistake. There isn't going to be any way that the driver can support that without having looked at the table's metadata for itself, and if it's going to do that then it doesn't need a crutch that only partially solves the problem anyhow. Sure it can - it append RETURNING PRIMARY KEY and hand back a ResultSet from whatever was returned. It's CURRENTLY doing that, but it's appending RETURNING * and leaving it up to the caller of getGeneratedKeys() to work out which columns the caller is interested in. I might be mistaken, but as I read the spec for getGeneratedKeys, whether or not a column is in a primary key has got nothing to do with whether that column should be returned. It looks to me like what you're supposed to return is columns with computed default values, eg, serial columns. (Whether we should define it as being *exactly* columns created by the SERIAL macro is an interesting question, but for sure those ought to be included.) Now, the pkey might be a serial column ... but it equally well might not be; it might not have any default expression at all. And there certainly could be serial columns that weren't in the pkey. The fact that the spec is kinda fuzzy doesn't entitle us to ignore the plain meaning of the term generated key. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] RETURNING PRIMARY KEY syntax extension
On 4 July 2014 11:37, Tom Lane t...@sss.pgh.pa.us wrote: Sure it can - it append RETURNING PRIMARY KEY and hand back a ResultSet from whatever was returned. It's CURRENTLY doing that, but it's appending RETURNING * and leaving it up to the caller of getGeneratedKeys() to work out which columns the caller is interested in. I might be mistaken, but as I read the spec for getGeneratedKeys, whether or not a column is in a primary key has got nothing to do with whether that column should be returned. It looks to me like what you're supposed to return is columns with computed default values, eg, serial columns. (Whether we should define it as being *exactly* columns created by the SERIAL macro is an interesting question, but for sure those ought to be included.) Now, the pkey might be a serial column ... but it equally well might not be; it might not have any default expression at all. And there certainly could be serial columns that weren't in the pkey. The fact that the spec is kinda fuzzy doesn't entitle us to ignore the plain meaning of the term generated key. Well, strictly speaking no, but in reality it doesn't look like other databases that support this feature support anything other than returning auto-increment fields on primary keys, often only supporting a single column (see [1]). Oracle doesn't support even that - callers have to call the variant that specifies a column list, since historically there was no real support for recognising such columns, oracle being sequence based. As such, there can't be any callers in the wild that will expect this to return anything other than the primary key, because there's no support for doing anything else. And if the caller really DOES want other columns returned, they can always call the variant that allows them to specify those columns, or just issue a normal query with a returning clause. This feature really exists to support the default case where those columns aren't specified by the caller, and given current driver support (and sanity) the only thing any caller will expect from such a result is the primary key. Cheers Tom [1] http://www.postgresql.org/message-id/cappfruwqy0z66trv4xmdqnyv0prjky+38x+p4vkhmrrw5rb...@mail.gmail.com
Re: [HACKERS] docs: additional subsection for page-level locks in explicit-locking section
On Fri, Jul 4, 2014 at 12:51 AM, Kevin Grittner kgri...@ymail.com wrote: Fujii Masao masao.fu...@gmail.com wrote: This seems to make sense. Barring objection, I will commit this only in HEAD. Committed. I'm inclined to think this is a slight improvement, just for the sake of consistency with peer level information. You probably already noticed, but the patch as submitted neglects to close the prior sect2 block before opening the new one. Yes, thanks for pointing out that! Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Can simplify 'limit 1' with slow function?
slow query (8531 ms):

SELECT ST_Distance_Sphere(shape,ST_GeomFromText('POINT(116.41386186784513 40.12211338311868)')) FROM road order by id LIMIT 1;

explain output:

Limit  (cost=4653.48..4653.48 rows=1 width=3612)
  ->  Sort  (cost=4653.48..4683.06 rows=11832 width=3612)
        Sort Key: id
        ->  Seq Scan on road  (cost=0.00..4594.32 rows=11832 width=3612)

fast query (16 ms):

select ST_Distance_Sphere(shape,ST_GeomFromText('POINT(116.41386186784513 40.12211338311868)')) from (SELECT shape FROM road order by id LIMIT 1) a

explain output:

Subquery Scan on a  (cost=1695.48..1695.74 rows=1 width=3608)
  ->  Limit  (cost=1695.48..1695.48 rows=1 width=3612)
        ->  Sort  (cost=1695.48..1725.06 rows=11832 width=3612)
              Sort Key: road.id
              ->  Seq Scan on road  (cost=0.00..1636.32 rows=11832 width=3612)

CREATE TABLE road (
    shape geometry,
    id integer
) WITH (OIDS=FALSE);

Are there redundant calls when sorting?

On Tue, Jul 1, 2014 at 2:16 PM, Martijn van Oosterhout klep...@svana.org wrote: On Sun, Jun 29, 2014 at 10:05:50PM +0800, gotoschool6g wrote: The simplified scene: select slowfunction(s) from a order by b limit 1; is slower than select slowfunction(s) from (select s from a order by b limit 1) as z; if there are many records in table 'a'. The real scene: Function ST_Distance_Sphere is slow, so the query: SELECT ST_Distance_Sphere(s, ST_GeomFromText('POINT(1 1)')) from road order by c limit 1; is slower than: select ST_Distance_Sphere(s, ST_GeomFromText('POINT(1 1)')) from (SELECT s from road order by c limit 1) as a; There are about 7000 records in 'road'.

I think to help here we need the EXPLAIN ANALYSE output for both queries.

Well, I think the problem is a well-understood one: there is no guarantee that functions-in-select-list are called exactly once per output row. This is documented -- for example see here: http://www.postgresql.org/docs/9.1/static/explicit-locking.html#ADVISORY-LOCKS. In short, if you want very precise control of function evaluation use a subquery, or, if you're really paranoid, a CTE.

I'm probably dense, but I'm not sure I understand. Or is it that the slowfunction() is called prior to the sort? That seems insane.

Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
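[Editor's note: following the subquery-or-CTE advice above, the CTE form of the fast query would look like this (same example table). In the 9.x releases under discussion a CTE acts as an optimization fence, so the sort-plus-limit runs to completion before the expensive function is evaluated, on exactly one row:]

	WITH nearest AS (
	    SELECT shape FROM road ORDER BY id LIMIT 1
	)
	SELECT ST_Distance_Sphere(shape,
	           ST_GeomFromText('POINT(116.41386186784513 40.12211338311868)'))
	FROM nearest;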
Re: [HACKERS] pg_receivexlog add synchronous mode
On Tue, Jul 1, 2014 at 10:11 PM, Fujii Masao masao.fu...@gmail.com wrote: On Mon, Jun 30, 2014 at 7:09 PM, furu...@pm.nttdata.co.jp wrote: Thanks for the review!

+		if (secs <= 0)
+			secs = 1;	/* Always sleep at least 1 sec */
+
+		sleeptime = secs * 1000 + usecs / 1000;

The above is the code which caused that problem. 'usecs' should have been reset to zero when 'secs' is rounded up to 1 second. But it was not. Attached is the updated version of the patch.

Thank you for the refactoring v2 patch. I did a review of the patch.

1. applied cleanly, and compilation was without warnings and errors
2. all regression tests passed ok
3. sleeptime is ok when --status-interval is set to 1

Thanks for reviewing the patch! I think that this refactoring patch is useful for improving source code readability and making the future patches simpler, whether we adopt your patch or not. So, barring any objections, I'm thinking to commit this refactoring patch.

Committed!

Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
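[Editor's note: for clarity, the fix described above amounts to resetting the leftover microseconds whenever the seconds are clamped. A sketch of the corrected logic, not the committed code:]

	if (secs <= 0)
	{
		secs = 1;		/* always sleep at least 1 second ... */
		usecs = 0;		/* ... and drop the leftover microseconds */
	}
	sleeptime = secs * 1000 + usecs / 1000;		/* milliseconds */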
Re: [HACKERS] Pg_upgrade and toast tables bug discovered
On Thu, Jul 3, 2014 at 05:09:41PM -0400, Tom Lane wrote: Bruce Momjian br...@momjian.us writes: I have had trouble getting a table schema that is causing problems, but received a report via EDB support recently that had a simple schema (anonymized): ... needs_toast_table() computes the length of this table as 2024 bytes in 9.0, and 2064 bytes on 9.1, with the TOAST threshold being 2032 bytes. My initial idea is to allow for toast tables in the new cluster that aren't in the old cluster by skipping over the extra toast tables. This would only be for pre-9.1 old clusters.

TBH, it has never been more than the shakiest of assumptions that the new version could not create toast tables where the old one hadn't. I think you should just allow for that case, independently of specific PG versions. Fortunately it seems easy enough, since you certainly don't need to put any data into the new toast tables.

OK, patch attached.

-- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +

diff --git a/contrib/pg_upgrade/info.c b/contrib/pg_upgrade/info.c
new file mode 100644
index 6205c74..17e4d5b
*** a/contrib/pg_upgrade/info.c
--- b/contrib/pg_upgrade/info.c
*** gen_db_file_maps(DbInfo *old_db, DbInfo
*** 38,58 ****
  			   int *nmaps, const char *old_pgdata, const char *new_pgdata)
  {
  	FileNameMap *maps;
! 	int			relnum;
  	int			num_maps = 0;
  
  	maps = (FileNameMap *) pg_malloc(sizeof(FileNameMap) * old_db->rel_arr.nrels);
  
! 	for (relnum = 0; relnum < Min(old_db->rel_arr.nrels, new_db->rel_arr.nrels);
! 		 relnum++)
  	{
! 		RelInfo    *old_rel = &old_db->rel_arr.rels[relnum];
! 		RelInfo    *new_rel = &new_db->rel_arr.rels[relnum];
  
  		if (old_rel->reloid != new_rel->reloid)
! 			pg_fatal("Mismatch of relation OID in database \"%s\": old OID %d, new OID %d\n",
! 					 old_db->db_name, old_rel->reloid, new_rel->reloid);
  
  		/*
  		 * TOAST table names initially match the heap pg_class oid. In
--- 38,85 ----
  			   int *nmaps, const char *old_pgdata, const char *new_pgdata)
  {
  	FileNameMap *maps;
! 	int			old_relnum, new_relnum;
  	int			num_maps = 0;
  
  	maps = (FileNameMap *) pg_malloc(sizeof(FileNameMap) * old_db->rel_arr.nrels);
  
! 	/*
! 	 * The old database shouldn't have more relations than the new one.
! 	 * We force the new cluster to have a TOAST table if the old table
! 	 * had one.
! 	 */
! 	if (old_db->rel_arr.nrels > new_db->rel_arr.nrels)
! 		pg_fatal("old and new databases \"%s\" have a mismatched number of relations\n",
! 				 old_db->db_name);
! 
! 	/* Drive the loop using new_relnum, which might be higher. */
! 	for (old_relnum = new_relnum = 0; new_relnum < new_db->rel_arr.nrels;
! 		 new_relnum++)
  	{
! 		RelInfo    *old_rel = &old_db->rel_arr.rels[old_relnum];
! 		RelInfo    *new_rel = &new_db->rel_arr.rels[new_relnum];
  
! 		if (old_rel->reloid != new_rel->reloid)
! 		{
! 			if (strcmp(new_rel->nspname, "pg_toast") == 0)
! 				/*
! 				 * It is possible that the new cluster has a TOAST table
! 				 * for a table that didn't need one in the old cluster,
! 				 * e.g. 9.0 to 9.1 changed the NUMERIC length computation.
! 				 * Therefore, if we have a TOAST table in the new cluster
! 				 * that doesn't match, skip over it and continue processing.
! 				 * It is possible this TOAST table used an OID that was
! 				 * reserved in the old cluster, but we have no way of
! 				 * testing that, and we would have already gotten an error
! 				 * at the new cluster schema creation stage.
! 				 */
! 				continue;
! 			else
! 				pg_fatal("Mismatch of relation OID in database \"%s\": old OID %d, new OID %d\n",
! 						 old_db->db_name, old_rel->reloid, new_rel->reloid);
! 		}
  
  		/*
  		 * TOAST table names initially match the heap pg_class oid. In
*** gen_db_file_maps(DbInfo *old_db, DbInfo
*** 76,89 ****
  		create_rel_filename_map(old_pgdata, new_pgdata, old_db, new_db,
  								old_rel, new_rel, maps + num_maps);
  		num_maps++;
  	}
  
! 	/*
! 	 * Do this check after the loop so hopefully we will produce a clearer
! 	 * error above
! 	 */
! 	if (old_db->rel_arr.nrels != new_db->rel_arr.nrels)
! 		pg_fatal("old and new databases \"%s\" have a different number of relations\n",
  				 old_db->db_name);
  
  	*nmaps = num_maps;
--- 103,114 ----
  		create_rel_filename_map(old_pgdata, new_pgdata, old_db, new_db,
  								old_rel, new_rel, maps + num_maps);
  		num_maps++;
+ 		old_relnum++;
  	}
  
! 	/* Did we fail to exhaust the old array? */
! 	if (old_relnum != old_db->rel_arr.nrels)
! 		pg_fatal("old and new databases \"%s\" have a mismatched number of relations\n",
  				 old_db->db_name);
  
  	*nmaps = num_maps;

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pg_upgrade and toast tables bug discovered
On Thu, Jul 3, 2014 at 11:55:40PM +0200, Andres Freund wrote: I don't think it's just that simple unfortunately. If pg_class entries get created that didn't exist on the old server there's a chance for oid conflicts. Consider SELECT binary_upgrade.set_next_heap_pg_class_oid('17094'::pg_catalog.oid); CREATE TABLE table_without_toast_in_old_server(...); -- heap oid 17094, toast oid 16384 SELECT binary_upgrade.set_next_heap_pg_class_oid('16384'::pg_catalog.oid); CREATE TABLE another_table(...); ERROR: could not create file ...: File exists I think we even had reports of such a problem.

I had not considered this. I don't remember ever seeing such a report. We have had oid mismatch reports, but we now know the cause of those.

The most robust, but not trivial, approach seems to be to prevent toast table creation if there wasn't a set_next_toast_pg_class_oid(). Then, after all relations are created, iterate over all pg_class entries that possibly need toast tables and recheck if they now might need one.

Wow, that is going to be kind of odd, in that there really isn't a good way to create toast tables except perhaps to add a dummy TEXT column and remove it. There also isn't an easy way to not create a toast table but also find out that one was needed. I suppose we would have to insert some dummy value into the toast pg_class column and come back later to clean it up. I am wondering what the probability is of having a table that didn't need a toast table in the old cluster but needs one in the new cluster, combined with an oid collision. I think the big question is whether we want to go down that route. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [v9.5] Custom Plan API
Hanada-san, Thanks for your dedicated reviewing. It's a very long message, so let me summarize the things I shall do in the next patch:

* fix bug: custom-plan class comparison
* fix up naming convention and syntax: CREATE CUSTOM PLAN PROVIDER, rather than CREATE CUSTOM PLAN. Prefix shall be cpp_.
* fix up: definition of get_custom_plan_oid()
* fix up: unexpected white spaces, to be tabs.
* fix up: remove unnecessary forward declarations.
* fix up: revert replace_nestloop_params() to static
* make SetCustomPlanRef an optional callback
* fix up: typos in various points
* add documentation to explain the custom-plan interface.

Also, I want a committer's opinion about the issues below:

* whether set_cheapest() is called for all relkinds?
* how the argument of the add_path handler shall be derived?

Individual comments are put below:

Kaigai-san, Sorry for the lagged response. Here are my random thoughts about the patch. I couldn't understand the patch fully, because some of the APIs are not used by ctidscan. If

Custom Scan patch v2 review

* Custom plan class comparison

In backend/optimizer/util/pathnode.c, custclass is compared by bit-and with 's'. Do you plan to use custclass as a bit field? If so, values for the custom plan class should not be a character. Otherwise, custclass should be compared with the == operator.

Sorry, it is a bug that came from the previous design. I had an idea that allows a custom plan provider to support multiple kinds of exec nodes; however, I concluded it does not make much sense. (we can define multiple CPPs for each)

* Purpose of GetSpecialCustomVar()

The reason why the FinalizeCustomPlan callback is necessary is not clear to me. Could you show a case where the API would be useful?

It is a feature needed to replace a built-in join by a custom scan, though it might be unclear for scan workloads. Let me explain why join replacement needs it. A join node has two input slots (inner and outer); its expression nodes, including Var nodes, reference either slot according to their varno (INNER_VAR or OUTER_VAR). In case a CPP replaced a join, it has to generate an equivalent result, but it may not be the best choice to use two input streams. (Please remember that when we construct a remote join in postgres_fdw, all the materialization is done on the remote side, thus we have one input stream to generate the local equivalent of the join.) On the other hand, the EXPLAIN command has to understand which column is the source of the Var nodes in the targetlist of the custom node, even if it is rewritten to use just one slot. For example, which label shall be shown in case the 3rd item of the targetlist originally came from the 2nd item of the inner slot, but all the materialized results are stored in the outer slot? Only the CPP can track the relationship between the original varnode and the newer one. This interface provides a way to resolve what a varnode actually references.

* Purpose of FinalizeCustomPlan()

The reason why the FinalizeCustomPlan callback is necessary is not clear to me, because ctidscan just calls finalize_primnode() and bms_add_members() with the given information. Could you show a case where the API would be useful?

The main purpose of this callback is to give the extension a chance to apply finalize_primnode() if the custom node holds an expression tree in its private fields. In case the CPP picked up a part of the clauses to run in its own way, they shall be attached on neither plan->targetlist nor plan->qual; only the CPP knows where they are attached. So, these orphan expression nodes have to be treated by the CPP.

* Is it ok to call set_cheapest() for all relkinds?
Now set_cheapest() is called not only for plain relations and foreign tables but also for custom plans, and for other relations such as subqueries, functions, and values. Calling call_custom_scan_provider() and set_cheapest() in the case of RTE_RELATION seems similar to the old construct; what do you think about this?

I don't think we would actually be able to have useful custom scan logic on these special relation forms; however, I also don't have a special reason why custom plans need not support these special relations. I'd like to see a committer's opinion here.

* Is it hard to get rid of CopyCustomPlan()?

Copying a ForeignScan node doesn't need a per-FDW copy function, by limiting fdw_private to have only copyable objects. Can't we use the same way for CustomPlan? Letting authors call NodeSetTag or copyObject() sounds uncomfortable to me. This could apply to TextOutCustomPlan() and TextOutCustomPath() too.

The FDW-like design was the original one, but the latest design was a suggestion by Tom in the v9.4 development cycle, because some data types, like Bitmapset, are not compliant with copyObject.

* Is MultiExec support appropriate for the first version?

The cases that need MultiExec seem a little complex for the first version of custom scan. What kind of plan do you imagine for this feature?

It is definitely necessary to exchange multiple rows in custom format with the upper level if
Re: [HACKERS] RETURNING PRIMARY KEY syntax extension
On Fri, Jul 4, 2014 at 7:37 AM, Tom Lane t...@sss.pgh.pa.us wrote: Tom Dunstan pg...@tomd.cc writes: On 4 July 2014 00:07, Tom Lane t...@sss.pgh.pa.us wrote: It sounds to me like designing this for JDBC's getGeneratedKeys method is a mistake. There isn't going to be any way that the driver can support that without having looked at the table's metadata for itself, and if it's going to do that then it doesn't need a crutch that only partially solves the problem anyhow. Sure it can - it append RETURNING PRIMARY KEY and hand back a ResultSet from whatever was returned. It's CURRENTLY doing that, but it's appending RETURNING * and leaving it up to the caller of getGeneratedKeys() to work out which columns the caller is interested in. I might be mistaken, but as I read the spec for getGeneratedKeys, whether or not a column is in a primary key has got nothing to do with whether that column should be returned. It looks to me like what you're supposed to return is columns with computed default values, eg, serial columns. (Whether we should define it as being *exactly* columns created by the SERIAL macro is an interesting question, but for sure those ought to be included.) Now, the pkey might be a serial column ... but it equally well might not be; it might not have any default expression at all. And there certainly could be serial columns that weren't in the pkey. 100% agree with Tom. The fact that the spec is kinda fuzzy doesn't entitle us to ignore the plain meaning of the term generated key. regards, tom lane -- Rushabh Lathia
[HACKERS] Issue while calling new PostgreSQL command from a Java Application
Hi,

I have defined a new command my_command in PostgreSQL. This command takes the code path of ANALYZE, and inside analyze.c I have a function that does some operations if it's my_command. The command takes these input arguments: table name, column name, and an input string.

my_command nation (n_nationkey) 'input string';

When I run this command from command-line psql, it works as expected. But when I call the same command from a Java application, the variable that stores the input string is NULL. I printed the value of the input string in the gram.y file where I have defined my_command:

fprintf(stderr, "I am inside gram.y %s\n", n->inp_str);

and the input string is printed correctly. But when I print stmt->inp_str in the function standard_ProcessUtility() of utility.c for the case T_VacuumStmt, I get the value as NULL. This is as far as I could trace back from analyze.c. I am not sure how executing the same command from an application can make a difference.

gram.y content gist:

MyStmt:
	my_keyword qualified_name name_list my_inp_str
		{
			VacuumStmt *n = makeNode(VacuumStmt);
			n->options = VACOPT_ANALYZE;
			n->freeze_min_age = -1;
			n->freeze_table_age = -1;
			n->relation = $2;
			n->va_cols = $3;
			n->inp_str = $4;
			fprintf(stderr, "I am inside gram.y %s\n", n->inp_str);
			$$ = (Node *)n;
		};

char *inp_str is added to the struct VacuumStmt in parsenodes.h.

Only the newly added char *inp_str (the part that differs from ANALYZE) is NULL; I was able to retrieve the column name from va_cols. Any help is appreciated. Thanks!

-- Regards, Ashoke
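[Editor's note: a likely cause, offered as a guess rather than something established in this thread: statements arriving through a JDBC driver usually go through the extended query protocol, where the parse tree is duplicated with copyObject(). A field added to VacuumStmt must therefore also be handled in _copyVacuumStmt() in src/backend/nodes/copyfuncs.c (and ideally in equalfuncs.c), or the copied statement carries a NULL pointer - which matches the symptom of va_cols surviving while the new field is NULL. A sketch, with the field list abbreviated:]

	/* In _copyVacuumStmt() in src/backend/nodes/copyfuncs.c: */
	static VacuumStmt *
	_copyVacuumStmt(const VacuumStmt *from)
	{
		VacuumStmt *newnode = makeNode(VacuumStmt);

		COPY_SCALAR_FIELD(options);
		COPY_SCALAR_FIELD(freeze_min_age);
		COPY_SCALAR_FIELD(freeze_table_age);
		COPY_NODE_FIELD(relation);
		COPY_NODE_FIELD(va_cols);
		COPY_STRING_FIELD(inp_str);		/* the new field; without this line
										 * the copy gets a NULL pointer */

		return newnode;
	}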
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Fri, Jul 4, 2014 at 4:58 AM, Rahila Syed rahilasye...@gmail.com wrote: Hello, Updated versions of the patches are attached. Changes are as follows:

1. Improved readability of the code as per the review comments.
2. Addition of block_compression field in the BkpBlock structure to store information about compression of the block. This provides for switching compression on/off and changing the compression algorithm as required.
3. Handling of OOM in a critical section by checking the return value of malloc and proceeding without compression of the FPW if the return value is NULL.

Thanks for updating the patches! But 0002-CompressBackupBlock_snappy_lz4_pglz-2.patch doesn't seem to be able to apply to HEAD cleanly.

---
$ git am ~/Desktop/0001-Support-for-LZ4-and-Snappy-2.patch
Applying: Support for LZ4 and Snappy-2
$ git am ~/Desktop/0002-CompressBackupBlock_snappy_lz4_pglz-2.patch
Applying: CompressBackupBlock_snappy_lz4_pglz-2
/home/postgres/pgsql/git/.git/rebase-apply/patch:42: indent with spaces.
    /*Allocates memory for compressed backup blocks according to the compression algorithm used. Once per session at the time of insertion of first XLOG record.
/home/postgres/pgsql/git/.git/rebase-apply/patch:43: indent with spaces.
    This memory stays till the end of session. OOM is handled by making the code proceed without FPW compression*/
/home/postgres/pgsql/git/.git/rebase-apply/patch:58: indent with spaces.
    if(compressed_pages[j] == NULL)
/home/postgres/pgsql/git/.git/rebase-apply/patch:59: space before tab in indent.
    {
/home/postgres/pgsql/git/.git/rebase-apply/patch:60: space before tab in indent.
    compress_backup_block=BACKUP_BLOCK_COMPRESSION_OFF;
error: patch failed: src/backend/access/transam/xlog.c:60
error: src/backend/access/transam/xlog.c: patch does not apply
Patch failed at 0001 CompressBackupBlock_snappy_lz4_pglz-2
When you have resolved this problem run "git am --resolved".
If you would prefer to skip this patch, instead run "git am --skip".
To restore the original branch and stop patching run "git am --abort".
---

Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers