Re: [HACKERS] Patch: Allow SQL-language functions to reference parameters by parameter name

2012-01-25 Thread Hitoshi Harada
On Mon, Jan 23, 2012 at 7:13 PM, Matthew Draper matt...@trebex.net wrote:
 On 19/01/12 20:28, Hitoshi Harada wrote:
 (Now it occurred to me that forgetting the #include parse_func.h might
 hit this breakage..., so I'll fix it here and continue to test, but if
 you'll fix it yourself, let me know)

 I fixed it here and it now works with my environment.

 Well spotted; that's exactly what I'd done. :/


 The regression tests pass and the feature seems to work as aimed, but it
 seems to me that it needs more test cases and documentation. For the
 tests, I believe we at least need the ambiguous case given upthread, so
 that we can ensure compatibility is kept. For the documentation, it should
 describe the name resolution rule, as stated in the patch comment.

 Attached are a new pair of patches, fixing the missing include (and the
 other warning), plus addressing the above.

 I'm still not sure whether to just revise (almost) all the SQL function
 examples to use parameter names, and declare them the right choice; as
 it's currently written, named parameters still seem rather second-class.


Agreed. The patch seems OK, except for an example I've just found.

db1=# create function t(a int, t t) returns int as $$ select t.a $$
language sql;
CREATE FUNCTION
db1=# select t(0, row(1, 2)::t);
 t
---
 1
(1 row)

Should we throw an error for such ambiguity? Or did you make it work this way
intentionally? If the latter, we should also mention the rule in the
manual.

Other than that, the patch looks good to me.

Thanks,
-- 
Hitoshi Harada



Re: [HACKERS] Group commit, revised

2012-01-25 Thread Heikki Linnakangas
I've been thinking, what exactly is the important part of this group 
commit patch that gives the benefit? Keeping the queue sorted isn't all 
that important - XLogFlush() requests for commits will come in almost 
the correct order anyway.


I also don't much like the division of labour between groupcommit.c and 
xlog.c. XLogFlush() calls GroupCommitWaitForLSN(), which calls back 
DoXLogFlush(), which is a bit hard to follow. (that's in my latest 
version; the original patch had similar division but it also cut across 
processes, with the wal writer actually doing the flush)


It occurs to me that the WALWriteLock already provides much of the same 
machinery we're trying to build here. It provides FIFO-style queuing, 
with the capability to wake up the next process or processes in the 
queue. Perhaps all we need is some new primitive to LWLock, to make it 
do exactly what we need.


Attached is a patch to do that. It adds a new mode to 
LWLockConditionalAcquire(), LW_EXCLUSIVE_BUT_WAIT. If the lock is free, 
it is acquired and the function returns immediately. However, unlike 
normal LWLockConditionalAcquire(), if the lock is not free it goes to 
sleep until it is released. But unlike normal LWLockAcquire(), when the 
lock is released, the function returns *without* acquiring the lock.
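
To make the intended calling pattern concrete, here is a minimal sketch
(SomeLock, work_already_done() and do_the_work() are placeholders; the real
usage is the XLogFlush() change in the attached patch):

	for (;;)
	{
		if (work_already_done())
			break;		/* someone else did it while we slept */

		if (!LWLockConditionalAcquire(SomeLock, LW_EXCLUSIVE_BUT_WAIT))
		{
			/*
			 * The lock wasn't free; we slept until it was released but
			 * did NOT acquire it.  Loop back and recheck before retrying.
			 */
			continue;
		}

		/* Got the lock: do the work ourselves, then stop. */
		do_the_work();
		LWLockRelease(SomeLock);
		break;
	}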


I modified XLogFlush() to use that new function for WALWriteLock. It 
tries to get WALWriteLock, but if it's not immediately free, it waits 
until it is released. Then, before trying to acquire the lock again, it 
rechecks LogwrtResult to see if the record was already flushed by the 
process that released the lock.


This is a much smaller patch than the group commit patch, and in my 
pgbench-tools tests on my laptop, it has the same effect on performance. 
The downside is that it touches lwlock.c, which is a change at a lower 
level. Robert's flexlocks patch probably would've been useful here.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ce659ec..d726a98 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2066,23 +2066,42 @@ XLogFlush(XLogRecPtr record)
 	/* initialize to given target; may increase below */
 	WriteRqstPtr = record;
 
-	/* read LogwrtResult and update local state */
+	/*
+	 * Now wait until we get the write lock, or someone else does the
+	 * flush for us.
+	 */
+	for (;;)
 	{
 		/* use volatile pointer to prevent code rearrangement */
 		volatile XLogCtlData *xlogctl = XLogCtl;
 
+		/* read LogwrtResult and update local state */
 		SpinLockAcquire(&xlogctl->info_lck);
 		if (XLByteLT(WriteRqstPtr, xlogctl->LogwrtRqst.Write))
 			WriteRqstPtr = xlogctl->LogwrtRqst.Write;
 		LogwrtResult = xlogctl->LogwrtResult;
 		SpinLockRelease(&xlogctl->info_lck);
-	}
 
-	/* done already? */
-	if (!XLByteLE(record, LogwrtResult.Flush))
-	{
-		/* now wait for the write lock */
-		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
+		/* done already? */
+		if (XLByteLE(record, LogwrtResult.Flush))
+			break;
+
+		/*
+		 * Try to get the write lock. If we can't get it immediately, wait
+		 * until it's released, but before actually acquiring it, loop back
+		 * to check if the backend holding the lock did the flush for us
+		 * already. This helps to maintain a good rate of group committing
+		 * when the system is bottlenecked by the speed of fsyncing.
+		 */
+		if (!LWLockConditionalAcquire(WALWriteLock, LW_EXCLUSIVE_BUT_WAIT))
+		{
+			/*
+			 * Didn't get the lock straight away, but we might be done
+			 * already
+			 */
+			continue;
+		}
+		/* Got the lock */
 		LogwrtResult = XLogCtl->Write.LogwrtResult;
 		if (!XLByteLE(record, LogwrtResult.Flush))
 		{
@@ -2111,6 +2130,8 @@ XLogFlush(XLogRecPtr record)
 			XLogWrite(WriteRqst, false, false);
 		}
 		LWLockRelease(WALWriteLock);
+		/* done */
+		break;
 	}
 
 	END_CRIT_SECTION();
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index cc41568..488f573 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -430,6 +430,7 @@ LWLockAcquire(LWLockId lockid, LWLockMode mode)
 			elog(PANIC, "cannot wait without a PGPROC structure");
 
 		proc->lwWaiting = true;
+		proc->lwWaitOnly = false;
 		proc->lwExclusive = (mode == LW_EXCLUSIVE);
 		proc->lwWaitLink = NULL;
 		if (lock->head == NULL)
@@ -496,7 +497,9 @@ LWLockAcquire(LWLockId lockid, LWLockMode mode)
 /*
  * LWLockConditionalAcquire - acquire a lightweight lock in the specified mode
  *
- * If the lock is not available, return FALSE with no side-effects.
+ * If the lock is not available, return FALSE with no side-effects. In
+ * LW_EXCLUSIVE_BUT_WAIT mode, if the lock is not available, waits until it
+ * is available, but then returns false without acquiring it.
  *
  * If successful, cancel/die interrupts are held off until lock release.
  */
@@ -504,7 +507,9 @@ bool
 LWLockConditionalAcquire(LWLockId lockid, 

Re: [HACKERS] Online base backup from the hot-standby

2012-01-25 Thread Fujii Masao
On Tue, Jan 24, 2012 at 7:54 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Tue, Jan 24, 2012 at 9:51 AM, Fujii Masao masao.fu...@gmail.com wrote:

 I'll proceed to commit for this now.

 Thanks a lot!

 Can I just check a few things?

Sure!

 You say
 /*
 +        * Update full_page_writes in shared memory and write an
 +        * XLOG_FPW_CHANGE record before resource manager writes cleanup
 +        * WAL records or checkpoint record is written.
 +        */

 why does it need to be before the cleanup and checkpoint?

Because the cleanup and checkpoint need to see FPW in shared memory.
If FPW in shared memory is not updated there, the cleanup and (end-of-recovery)
checkpoint will always use the initial value (false) of FPW in shared memory.

 You say
 /*
 +        * Currently only non-exclusive backup can be taken during recovery.
 +        */

 why?

At first I proposed to allow an exclusive backup to be taken during recovery,
but Heikki disagreed with the proposal because he thought that the exclusive
backup procedure I proposed was too fragile. No one could come up with a good,
user-friendly, easy-to-implement procedure, so we decided to allow only
non-exclusive backups to be taken during recovery. With a non-exclusive
backup, the complicated procedure is performed by pg_basebackup, so the user
doesn't need to care about it.

 You mention in the docs
 The backup history file is not created in the database cluster backed up.
 but we need to explain the bad effect, if any.

Users cannot obtain various information about the backups that have been taken
so far (e.g., which WAL files are required for each backup, the backup
start/end times, etc). If they need such information, they have to record it
manually.

Users also cannot pass the backup history file to pg_archivecleanup, which
might make pg_archivecleanup more difficult to use.

After a little thought, pg_basebackup would be able to create the backup
history file in the backup itself, though it could not be archived. Should
we implement that feature to alleviate the bad effect?

 You say
 If the standby is promoted to the master during online backup, the
 backup fails.
 but no explanation of why?

 I could work those things out, but I don't want to have to, plus we
 may disagree if I did.

If the backup succeeded in that case, then when we start an archive recovery
from that backup, the recovery needs to cross between two timelines, which
means that we need to set recovery_target_timeline before starting recovery.
Whether recovery_target_timeline needs to be set or not would depend on
whether the standby was promoted while the backup was being taken. Leaving
such a decision to the user seems fragile.

pg_basebackup -x ensures that all required files are included in the backup,
so we can start recovery without restoring any file from the archive. But if
the standby is promoted during the backup, the timeline history file becomes
essential for recovery, and it is not included in the backup.

 There are some good explanations in comments of other things, just not
 everywhere needed.

 What happens if we shutdown the WALwriter and then issue SIGHUP?

SIGHUP doesn't affect full_page_writes in that case. Oh, you are concerned about
the case where smart shutdown kills walwriter but some backends are
still running?
Currently SIGHUP affects full_page_writes and running backends use the changed
new value of full_page_writes. But in the patch, SIGHUP doesn't affect...

To address the problem, we should either postpone the shutdown of walwriter
until all backends have gone away, or leave the update of full_page_writes to
checkpointer process instead of walwriter. Thought?

 Are we sure we want to make the change of file format mandatory? That
 means earlier versions of clients such as pg_basebackup will fail
 against this version.

Really? Unless I'm missing something, pg_basebackup doesn't care about the
file format of backup_label, so I don't think that earlier versions of
pg_basebackup will fail.

 There are no docs to explain the new feature is available in the main
 docs, or to explain the restrictions.
 I expect you will add that later after commit.

Okay. Will do.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Online base backup from the hot-standby

2012-01-25 Thread Simon Riggs
On Wed, Jan 25, 2012 at 8:16 AM, Fujii Masao masao.fu...@gmail.com wrote:

 What happens if we shutdown the WALwriter and then issue SIGHUP?

 SIGHUP doesn't affect full_page_writes in that case. Oh, you are concerned 
 about
 the case where smart shutdown kills walwriter but some backends are
 still running?
 Currently SIGHUP affects full_page_writes and running backends use the changed
 new value of full_page_writes. But in the patch, SIGHUP doesn't affect...

 To address the problem, we should either postpone the shutdown of walwriter
 until all backends have gone away, or leave the update of full_page_writes to
 checkpointer process instead of walwriter. Thought?

checkpointer seems the correct place to me

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] PgNext: CFP

2012-01-25 Thread Oleg Bartunov

What about travel/accommodation support?

On Tue, 24 Jan 2012, Joshua D. Drake wrote:



Hello,

The call for papers for PgNext (the old PgWest/PgEast) is now open:

January 19th: Talk submission opens
April 15th: Talk submission closes
April 30th: Speaker notification

Submit: https://www.postgresqlconference.org/talk_types

Sincerely,

JD




Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83



Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements

2012-01-25 Thread Peter Eisentraut
On tis, 2012-01-24 at 20:13 -0500, Tom Lane wrote:
 Yeah.  In both cases, the (proposed) new output format is
 self-identifying *to clients that know what to look for*.
 Unfortunately it would only be the most anally-written pre-existing
 client code that would be likely to spit up on the unexpected
 variations.  What's much more likely to happen, and did happen in the
 bytea case, is silent data corruption. 

The problem in the bytea case is that the client libraries are written
to ignore encoding errors.  No amount of protocol versioning will help
you in that case.




Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements

2012-01-25 Thread Marko Kreen
On Tue, Jan 24, 2012 at 09:33:52PM -0500, Robert Haas wrote:
 Furthermore, while we haven't settled the question of exactly what a
 good negotiation facility would look like, we seem to agree that a GUC
 isn't it.  I think that means this isn't going to happen for 9.2, so
 we should mark this patch Returned with Feedback and return to this
 topic for 9.3.

Simply extending the text/bin flags should be quite
uncontroversial first step.  How to express the
capability in startup packet, I leave to others to decide.

But my proposal would be following:

bit 0 : text/bin
bit 1..15 : format version number, maps to best formats in some
Postgres version.  

It does not solve the resultset problem, where I'd like to say
gimme well-known types in optimal representation, others in text.  
I don't know the perfect solution for that, but I suspect the
biggest danger here is the urge to go to maximal complexity
immediately.  So perhaps the good idea would simply give one
additional bit (0x8000?) in result flag to say that only
well-known types should be optimized.  That should cover 95%
of use-cases, and we can design more flexible packet format
when we know more about actual needs.
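
For illustration, a minimal sketch of packing/unpacking such a 16-bit format
code (the constants are hypothetical; reserving the top bit for the
well-known-only result flag and bits 1..14 for the version is an assumption
of this sketch, since the exact layout is left open above):

#include <stdbool.h>
#include <stdint.h>

#define FMT_BINARY          0x0001	/* bit 0: 0 = text, 1 = binary */
#define FMT_VERSION_SHIFT   1		/* bits 1..14: format version */
#define FMT_VERSION_MASK    0x3FFF
#define FMT_WELLKNOWN_ONLY  0x8000	/* result flag: optimize well-known types only */

static uint16_t
make_format_code(bool binary, uint16_t version, bool wellknown_only)
{
	uint16_t	code = (uint16_t) ((version & FMT_VERSION_MASK) << FMT_VERSION_SHIFT);

	if (binary)
		code |= FMT_BINARY;
	if (wellknown_only)
		code |= FMT_WELLKNOWN_ONLY;
	return code;
}

static uint16_t
format_version(uint16_t code)
{
	return (code >> FMT_VERSION_SHIFT) & FMT_VERSION_MASK;
}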

libpq suggestions:

  PQsetformatcodes(bool)
only if it's called with TRUE does libpq start interpreting
text/bin codes as non-booleans.  IOW, we stay compatible
with old code using -1 as TRUE.

protocol suggestions:

  On startup, the server sends the highest text/bin code it supports,
  and gives an error if it finds a higher code than supported.
  Poolers/proxies with different server versions in the pool
  will simply advertise the lowest common code.


Small QA, to put obvious aspects into writing
--

* Does that mean we need to keep old formats around infinitely?

Yes.  On-wire formats have *much* higher visibility than
on-disk formats.  Also, except for some basic types, they are
not parsed in adapters but in client code.  libpq offers the
least help in that respect.

Basically, changing the on-wire format is a big deal;
don't do it willy-nilly.


* Does that mean we cannot turn on new formats automatically?

Yes.  Should be obvious..


-- 
marko




[HACKERS] Configuring Postgres to Add A New Source File

2012-01-25 Thread Tareq Aljabban
Hi,
I'm doing some development on the storage manager module of Postgres.

I have already added a few source files to the smgr folder, and I was able to
have the make system include them by adding their names to the OBJS list
in the Makefile inside the smgr folder (i.e., when I add A.c, I add
A.o to the OBJS list).

That was working fine. Now I'm trying to add a new file hdfs_test.c to the
project. The problem with this file is that it requires some extra
directives in its compilation command (-I and -L directives).


Using gcc the file is compiled with the command:

gcc hdfs_test.c -I/HDFS_HOME/hdfs/src/c++/libhdfs
-I/usr/lib/jvm/default-java/include -L/HDFS_HOME/hdfs/src/c++/libhdfs
-L/HDFS_HOME/build/c++/Linux-i386-32/lib
-L/usr/lib/jvm/default-java/jre/lib/i386/server -ljvm -lhdfs -o
hdfs_test

I was told that in order to add the extra directives, I need to modify
$LDFLAGS in configure.in and $LIBS in the Makefile.

I read about autoconf and checked the configure.in of Postgres, but I still
wasn't able to work out exactly what I should be doing.

Thanks


Re: [HACKERS] patch : Allow toast tables to be moved to a different tablespace

2012-01-25 Thread Julien Tachoires
2012/1/24 Robert Haas robertmh...@gmail.com:
 On Sun, Jan 22, 2012 at 11:04 AM, Julien Tachoires jul...@gmail.com wrote:
 2011/12/15 Alvaro Herrera alvhe...@commandprompt.com:

 Uhm, surely you could compare the original toast tablespace to the heap
 tablespace, and if they differ, handle appropriately when creating the
 new toast table?  Just pass down the toast tablespace into
 AlterTableCreateToastTable, instead of having it assume that
 rel->rd_rel->relnamespace is sufficient.  This should be done in all
 cases where a toast tablespace is created, which shouldn't be more than
 a handful of them.

 Thank you, that way seems right.
 Now, I distinguish before each creation of a TOAST table with
 AlterTableCreateToastTable() : if it will create a new one or recreate
 an existing one.
 Thus, in create_toast_table() when toastTableSpace is equal to
 InvalidOid, we are able :
 - to fallback to the main table tablespace in case of new TOAST table 
 creation
 - to keep it previous tablespace in case of recreation.

 Here's a new version rebased against HEAD.

 To ask more directly the question that's come up a few times upthread,
 why do *you* think this is useful?  What motivated you to want this
 behavior, and/or how do you think it could benefit other PostgreSQL
 users?


Sorry, I didn't get this question to me. I've just picked up this item
from the TODO list and then I was thinking that it could be useful. My
motivation was to learn more about PostgreSQL dev. and to work on a
concrete case. Now, I'm not sure anymore this is useful.

--
JT



Re: [HACKERS] patch : Allow toast tables to be moved to a different tablespace

2012-01-25 Thread Robert Haas
On Wed, Jan 25, 2012 at 7:13 AM, Julien Tachoires jul...@gmail.com wrote:
 2012/1/24 Robert Haas robertmh...@gmail.com:
 On Sun, Jan 22, 2012 at 11:04 AM, Julien Tachoires jul...@gmail.com wrote:
 2011/12/15 Alvaro Herrera alvhe...@commandprompt.com:

 Uhm, surely you could compare the original toast tablespace to the heap
 tablespace, and if they differ, handle appropriately when creating the
 new toast table?  Just pass down the toast tablespace into
 AlterTableCreateToastTable, instead of having it assume that
  rel->rd_rel->relnamespace is sufficient.  This should be done in all
 cases where a toast tablespace is created, which shouldn't be more than
 a handful of them.

 Thank you, that way seems right.
 Now, I distinguish before each creation of a TOAST table with
 AlterTableCreateToastTable() : if it will create a new one or recreate
 an existing one.
 Thus, in create_toast_table() when toastTableSpace is equal to
 InvalidOid, we are able :
 - to fallback to the main table tablespace in case of new TOAST table 
 creation
 - to keep it previous tablespace in case of recreation.

 Here's a new version rebased against HEAD.

 To ask more directly the question that's come up a few times upthread,
 why do *you* think this is useful?  What motivated you to want this
 behavior, and/or how do you think it could benefit other PostgreSQL
 users?

 Sorry, I didn't get this question to me. I've just picked up this item
 from the TODO list and then I was thinking that it could be useful. My
 motivation was to learn more about PostgreSQL dev. and to work on a
 concrete case. Now, I'm not sure anymore this is useful.

OK.  In that case, I don't think it makes sense to add this right now.
 I share the feeling that it could possibly be useful under some set
of circumstances, but it doesn't seem prudent to add features because
there might hypothetically be a use case.  I suggest that we mark this
patch Returned with Feedback and add links to your latest version of
this patch and one of the emails expressing concerns about the utility
of this patch to the Todo list.  If we later have a clearer idea in
mind why this might be useful, we can add it then - and document the
use case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Client Messages

2012-01-25 Thread Jim Mlodgenski
On Tue, Jan 24, 2012 at 7:38 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 23.01.2012 22:52, Jim Mlodgenski wrote:

 On Wed, Jan 18, 2012 at 9:19 AM, Jim Mlodgenskijimm...@gmail.com  wrote:

 On Wed, Jan 18, 2012 at 3:08 AM, Heikki Linnakangas

 I don't think that's a problem, it's just a free-form message to
 display.

 But it also doesn't seem very useful to have it PGC_USERSET: if it's
 only
 displayed at connect time, there's no point in changing it after
 connecting.

 Should we make it PGC_BACKEND?


 In hindsight, making it PGC_BACKEND makes it much less useful, because then
 you can't set it on a per-database basis with ALTER DATABASE foo SET ...
 So I made it PGC_USERSET again.


 Here is the revised patch based on the feedback.


 Thanks! I renamed the GUC to welcome_message, to avoid confusion with
 client_min_messages. I also moved it to Connection Settings category.
 Although it's technically settable within a session, the purpose is to
 display a message when connecting, so Connection Settings feels more
 fitting.

 There's one little problem remaining with this, which is what to do if the
 message is in a different encoding than used by the client? That's not a new
 problem, we have the same problem with any string GUC, if you do SHOW
 setting. We restricted application_name to ASCII characters, which is an
 obvious way to avoid the problem, but I think it would be a shame to
 restrict this to ASCII-only.
Isn't that an issue of the administrator understanding his audience?
Maybe some additional documentation explaining the encoding issue?



 --
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] some longer, larger pgbench tests with various performance-related patches

2012-01-25 Thread Robert Haas
On Tue, Jan 24, 2012 at 4:28 PM, Simon Riggs si...@2ndquadrant.com wrote:
 I think we should be working to commit XLogInsert and then Group
 Commit, then come back to the discussion.

I definitely agree that those two have way more promise than anything
else on the table.  However, now that I understand how badly we're
getting screwed by checkpoints, I'm curious to do some more
investigation of what's going on there.  I can't help thinking that in
these particular cases the full page writes may be a bigger issue than
the checkpoint itself.  If that turns out to be the case it's not
likely to be something we're able to address for 9.2, but I'd like to
at least characterize it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [COMMITTERS] pgsql: Add new replication mode synchronous_commit = 'write'.

2012-01-25 Thread Robert Haas
On Wed, Jan 25, 2012 at 1:23 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Wed, Jan 25, 2012 at 5:35 AM, Jaime Casanova ja...@2ndquadrant.com wrote:
 On Tue, Jan 24, 2012 at 3:22 PM, Simon Riggs si...@2ndquadrant.com wrote:
 Add new replication mode synchronous_commit = 'write'.
 Replication occurs only to memory on standby, not to disk,
 so provides additional performance if user wishes to
 reduce durability level slightly. Adds concept of multiple
 independent sync rep queues.

 Fujii Masao and Simon Riggs


 i guess, you should add the new value in postgresql.conf too.

 Yes, I forgot to do that. Patch attached.

I think that this would be a lot more clear if we described this as
synchronous_commit = remote_write rather than just synchronous_commit
= write.  Actually, the internal constant is named that way already,
but it's not exposed as such to the user.

There's a logical hierarchy here:

fully async commit (off) < local flush only (local) < local flush
+ remote write (currently write) < local flush + remote flush
(currently on) < local flush + remote apply

All of the levels except for off involve waiting for local flush;
the higher levels also involve waiting for something on the remote
side.  But the name of the variable is just synchronous_commit, so I
think that if the word remote doesn't appear anywhere in the
user-visible parameter name, it's not as clear as it could be.  In
addition to renaming write to remote_write, I think we might also
want to add remote_flush as an alias for on.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [COMMITTERS] pgsql: Add new replication mode synchronous_commit = 'write'.

2012-01-25 Thread Simon Riggs
On Wed, Jan 25, 2012 at 1:57 PM, Robert Haas robertmh...@gmail.com wrote:

 I think that this would be a lot more clear if we described this as
 synchronous_commit = remote_write rather than just synchronous_commit
 = write.  Actually, the internal constant is named that way already,
 but it's not exposed as such to the user.

That's something to discuss at the end of the CF when people are less
busy and we get more input.

It's an easy change whatever we decide to do.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Avoid FK validations for no-rewrite ALTER TABLE ALTER TYPE

2012-01-25 Thread Alvaro Herrera

Excerpts from Noah Misch's message of dom ene 22 02:05:31 -0300 2012:

 Thanks.  I've attached a new version fixing this problem.

Looks good to me.  Can you please clarify this bit?

 * Since we elsewhere require that all collations share 
the same
 * notion of equality, don't compare collation here.

Since I'm not familiar with this code, it's difficult for me to figure
out where this "elsewhere" is, and what "same notion of equality"
means.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements

2012-01-25 Thread Tom Lane
Marko Kreen mark...@gmail.com writes:
 On Tue, Jan 24, 2012 at 09:33:52PM -0500, Robert Haas wrote:
 Furthermore, while we haven't settled the question of exactly what a
 good negotiation facility would look like, we seem to agree that a GUC
 isn't it.  I think that means this isn't going to happen for 9.2, so
 we should mark this patch Returned with Feedback and return to this
 topic for 9.3.

 Simply extending the text/bin flags should be quite
 uncontroversial first step.  How to express the
 capability in startup packet, I leave to others to decide.

 But my proposal would be following:

 bit 0 : text/bin
 bit 1..15 : format version number, maps to best formats in some
 Postgres version.  

 It does not solve the resultset problem, where I'd like to say
 gimme well-known types in optimal representation, others in text.  
 I don't know the perfect solution for that, but I suspect the
 biggest danger here is the urge to go to maximal complexity
 immediately.  So perhaps the good idea would simply give one
 additional bit (0x8000?) in result flag to say that only
 well-known types should be optimized.  That should cover 95%
 of use-cases, and we can design more flexible packet format
 when we know more about actual needs.

Huh?  How can that work?  If we decide to change the representation of
some other well known type, say numeric, how do we decide whether a
client setting that bit is expecting that change or not?

regards, tom lane



Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements

2012-01-25 Thread Marko Kreen
On Wed, Jan 25, 2012 at 10:23:14AM -0500, Tom Lane wrote:
 Marko Kreen mark...@gmail.com writes:
  On Tue, Jan 24, 2012 at 09:33:52PM -0500, Robert Haas wrote:
  Furthermore, while we haven't settled the question of exactly what a
  good negotiation facility would look like, we seem to agree that a GUC
  isn't it.  I think that means this isn't going to happen for 9.2, so
  we should mark this patch Returned with Feedback and return to this
  topic for 9.3.
 
  Simply extending the text/bin flags should be quite
  uncontroversial first step.  How to express the
  capability in startup packet, I leave to others to decide.
 
  But my proposal would be following:
 
  bit 0 : text/bin
  bit 1..15 : format version number, maps to best formats in some
  Postgres version.  
 
  It does not solve the resultset problem, where I'd like to say
  gimme well-known types in optimal representation, others in text.  
  I don't know the perfect solution for that, but I suspect the
  biggest danger here is the urge to go to maximal complexity
  immediately.  So perhaps the good idea would simply give one
  additional bit (0x8000?) in result flag to say that only
  well-known types should be optimized.  That should cover 95%
  of use-cases, and we can design more flexible packet format
  when we know more about actual needs.
 
 Huh?  How can that work?  If we decide to change the representation of
 some other well known type, say numeric, how do we decide whether a
 client setting that bit is expecting that change or not?

It sets that bit *and* version code - which means that it is
up-to-date with all well-known type formats in that version.

The key here is to sanely define the well-known types
and document them, so clients can be up to date with them.

Variants:
- All built-in and contrib types in some Postgres version
- All built-in types in some Postgres version
- Most common types (text, numeric, bytes, int, float, bool, ..)

Also, as we have only one bit, the set of types cannot be
extended.  (Unless we provide more bits for that, but that
may get too confusing?)


Basically, I see 2 scenarios here:

1) Client knows the result types and can set the
text/bin/version code safely, without further restrictions.

2) There is generic framework, that does not know query contents
but can be expected to track Postgres versions closely.
Such framework cannot say binary for results safely,
but *could* do it for some well-defined subset of types.


Of course, it may be that 2) is not worth supporting, as
frameworks can throw errors on their own if they find a
format that they cannot parse.  Then the user needs
to either register their own parser, or simply turn off
optimized formats to get the plain-text values.

-- 
marko




Re: [HACKERS] pg_trigger_depth() v3 (was: TG_DEPTH)

2012-01-25 Thread Alvaro Herrera


Committed, with OID change.  Thanks.

I tested it with plphp just for the heck of it and it worked
wonderfully.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] [GENERAL] Why extract( ... from timestamp ) is not immutable?

2012-01-25 Thread Tom Lane
hubert depesz lubaczewski dep...@depesz.com writes:
 anyway - the point is that in \df date_part(, timestamp) says it's
 immutable, while it is not.

Hmm, you're right.  I thought we'd fixed that way back when, but
obviously not.  Or maybe the current behavior of the epoch case
postdates that.

regards, tom lane



Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements

2012-01-25 Thread Tom Lane
Marko Kreen mark...@gmail.com writes:
 On Wed, Jan 25, 2012 at 10:23:14AM -0500, Tom Lane wrote:
 Huh?  How can that work?  If we decide to change the representation of
 some other well known type, say numeric, how do we decide whether a
 client setting that bit is expecting that change or not?

 It sets that bit *and* version code - which means that it is
 up-to-date with all well-known type formats in that version.

Then why bother with the bit in the format code?  If you've already done
some other negotiation to establish what datatype formats you will
accept, this doesn't seem to be adding any value.

 Basically, I see 2 scenarios here:

 1) Client knows the result types and can set the
 text/bin/version code safely, without further restrictions.

 2) There is generic framework, that does not know query contents
 but can be expected to track Postgres versions closely.
 Such framework cannot say binary for results safely,
 but *could* do it for some well-defined subset of types.

The hole in approach (2) is that it supposes that the client side knows
the specific datatypes in a query result in advance.  While this is
sometimes workable for application-level code that knows what query it's
issuing, it's really entirely untenable for a framework or library.
The only way that a framework can deal with arbitrary queries is to
introduce an extra round trip (Describe step) to see what datatypes
the query will produce so it can decide what format codes to issue
... and that will pretty much eat up any time savings you might get
from a more efficient representation.
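
For concreteness, this is roughly what that extra round trip looks like with
today's libpq (a sketch only; the query, the statement name and the choice of
OIDs treated as well-known are made up for the example):

#include <stdio.h>
#include <libpq-fe.h>

static void
describe_then_execute(PGconn *conn)
{
	PGresult   *res;
	int			all_well_known = 1;
	int			i;

	res = PQprepare(conn, "q1", "SELECT id, payload FROM t", 0, NULL);
	PQclear(res);

	/* Extra round trip: ask the server which types the query returns. */
	res = PQdescribePrepared(conn, "q1");
	for (i = 0; i < PQnfields(res); i++)
	{
		Oid			typoid = PQftype(res, i);

		/* 23 = int4, 25 = text; treat anything else as not well-known */
		if (typoid != 23 && typoid != 25)
			all_well_known = 0;
	}
	PQclear(res);

	/* Now execute, asking for binary output only if every column is safe. */
	res = PQexecPrepared(conn, "q1", 0, NULL, NULL, NULL,
						 all_well_known ? 1 : 0);
	printf("%d rows\n", PQntuples(res));
	PQclear(res);
}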

You really want to do the negotiation once, at connection setup, and
then be able to process queries without client-side prechecking of what
data types will be sent back.

regards, tom lane



Re: [HACKERS] [v9.2] sepgsql's DROP Permission checks

2012-01-25 Thread Robert Haas
On Sun, Jan 22, 2012 at 9:54 AM, Kohei KaiGai kai...@kaigai.gr.jp wrote:
 I tried to implement this based on the idea of calling the
 object-access-hook with a flag that informs extensions of the context
 of the objects being removed.
 Because I missed that DROP_RESTRICT and DROP_CASCADE are an enum type,
 this patch added performInternalDeletion() instead of an OR-masked
 DROP_INTERNAL.  Its only difference from performDeletion() is that a
 flag (OAT_DROP_FLAGS_INTERNAL) is delivered to the extension module.
 I replaced several performDeletion() calls that clean up objects for
 internal reasons with performInternalDeletion().

 How about this approach?

I generally agree with this line of attack, but I think you've failed
to find all the cases where a drop should be considered internal, and
I'd rather add a new parameter to an existing function than define a
new one that someone might accidentally fail to use in some place
where it's needed.  Here's a cut-down patch that *just* adds a
PERFORM_DELETE_INTERNAL flag, plus some related comment additions.  If
this looks reasonable to you, I'll commit it and then we can work out
the remaining details.

Since sepgsql doesn't seem to need the DropBehavior, I'm inclined to
say we shouldn't go to any extra work to pass it just now.  We can
always add that later if some other client needs it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


perform-deletion-internal.patch
Description: Binary data



Re: [HACKERS] some longer, larger pgbench tests with various performance-related patches

2012-01-25 Thread Jeff Janes
On Tue, Jan 24, 2012 at 12:53 PM, Robert Haas robertmh...@gmail.com wrote:
 Early yesterday morning, I was able to use Nate Boley's test machine
 do a single 30-minute pgbench run at scale factor 300 using a variety
 of trees built with various patches, and with the -l option added to
 track latency on a per-transaction basis.  All tests were done using
 32 clients and permanent tables.  The configuration was otherwise
 identical to that described here:

 http://archives.postgresql.org/message-id/CA+TgmoboYJurJEOB22Wp9RECMSEYGNyHDVFv5yisvERqFw=6...@mail.gmail.com

Previously we mostly used this machine for CPU benchmarking.  Have you
previously described the IO subsystem?  That is becoming relevant for
these types of benchmarks.   For example, is WAL separated from data?


 By doing this, I hoped to get a better understanding of (1) the
 effects of a scale factor too large to fit in shared_buffers,

In my hands, the active part of data at scale of 300 fits very
comfortably into 8GB shared_buffers.

Indeed, until you have run a very long time so that pgbench_history
gets large, everything fits in 8GB.

Cheers,

Jeff



Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements

2012-01-25 Thread Robert Haas
On Wed, Jan 25, 2012 at 11:40 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Marko Kreen mark...@gmail.com writes:
 On Wed, Jan 25, 2012 at 10:23:14AM -0500, Tom Lane wrote:
 Huh?  How can that work?  If we decide to change the representation of
 some other well known type, say numeric, how do we decide whether a
 client setting that bit is expecting that change or not?

 It sets that bit *and* version code - which means that it is
 up-to-date with all well-known type formats in that version.

 Then why bother with the bit in the format code?  If you've already done
 some other negotiation to establish what datatype formats you will
 accept, this doesn't seem to be adding any value.

 Basically, I see 2 scenarios here:

 1) Client knows the result types and can set the
 text/bin/version code safely, without further restrictions.

 2) There is generic framework, that does not know query contents
 but can be expected to track Postgres versions closely.
 Such framework cannot say binary for results safely,
 but *could* do it for some well-defined subset of types.

 The hole in approach (2) is that it supposes that the client side knows
 the specific datatypes in a query result in advance.  While this is
 sometimes workable for application-level code that knows what query it's
 issuing, it's really entirely untenable for a framework or library.
 The only way that a framework can deal with arbitrary queries is to
 introduce an extra round trip (Describe step) to see what datatypes
 the query will produce so it can decide what format codes to issue
 ... and that will pretty much eat up any time savings you might get
 from a more efficient representation.

 You really want to do the negotiation once, at connection setup, and
 then be able to process queries without client-side prechecking of what
 data types will be sent back.

What might work is for clients to advertise a list of capability
strings, like compact_array_format, at connection startup time.  The
server can then adjust its behavior based on that list.  But the
problem with that is that as we make changes to the wire protocol, the
list of capabilities clients need to advertise could get pretty long
in a hurry.  A simpler alternative is to have the client send a server
version along with the initial connection attempt and have the server
do its best not to use any features that weren't present in that
server version - but that seems to leave user-defined types out in the
cold.
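
For illustration, a self-contained sketch of checking one capability string
against such a client-advertised list (the capability name and how the list
would travel in the startup packet are assumptions of the sketch):

#include <stdbool.h>
#include <string.h>

/* "caps" is a comma-separated list such as "compact_array_format,foo_v2". */
static bool
client_has_capability(const char *caps, const char *want)
{
	size_t		wantlen = strlen(want);
	const char *p = caps;

	while (*p != '\0')
	{
		const char *end = strchr(p, ',');
		size_t		len = end ? (size_t) (end - p) : strlen(p);

		if (len == wantlen && strncmp(p, want, wantlen) == 0)
			return true;
		if (end == NULL)
			break;
		p = end + 1;
	}
	return false;
}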

I reiterate my previous view that we don't have time to engineer a
good solution to this problem right now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements

2012-01-25 Thread Marko Kreen
On Wed, Jan 25, 2012 at 11:40:28AM -0500, Tom Lane wrote:
 Marko Kreen mark...@gmail.com writes:
  On Wed, Jan 25, 2012 at 10:23:14AM -0500, Tom Lane wrote:
  Huh?  How can that work?  If we decide to change the representation of
  some other well known type, say numeric, how do we decide whether a
  client setting that bit is expecting that change or not?
 
  It sets that bit *and* version code - which means that it is
  up-to-date with all well-known type formats in that version.
 
 Then why bother with the bit in the format code?  If you've already done
 some other negotiation to establish what datatype formats you will
 accept, this doesn't seem to be adding any value.

The other negotiation is done via Postgres release notes...

I specifically want to avoid any sort of per-connection
negotiation, except the max format version supported,
because it will mess up multiplexed usage of a single connection.
Then they need to either disable advanced formats completely,
or still do it per-query somehow (via GUCs?), which is a mess.

Also, I don't see any market for flexible negotiations;
instead I see that people want 2 things:

- Updated formats are easily available
- Old apps not to break

I might be mistaken here, and if so please correct me,
but currently I'm designing for simplicity.

  Basically, I see 2 scenarios here:
 
  1) Client knows the result types and can set the
  text/bin/version code safely, without further restrictions.
 
  2) There is generic framework, that does not know query contents
  but can be expected to track Postgres versions closely.
  Such framework cannot say binary for results safely,
  but *could* do it for some well-defined subset of types.
 
 The hole in approach (2) is that it supposes that the client side knows
 the specific datatypes in a query result in advance.  While this is
 sometimes workable for application-level code that knows what query it's
 issuing, it's really entirely untenable for a framework or library.

No, the list of well-known types is documented and fixed.
The bit is specifically for frameworks, so that they can say
I support all well-known types in Postgres version X.Y.

Note that I said the list cannot be extended, but that is wrong.
When this bit and the version code are taken together, they clearly
define the list as in version X.Y.  So considering that a
client should not send any higher version than the server supports,
the server always knows which list the client refers to.

-- 
marko




Re: [HACKERS] WIP patch for parameterized inner paths

2012-01-25 Thread Robert Haas
On Wed, Jan 25, 2012 at 11:24 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 After that I started doing some performance testing, and the initial
 news was bad: the planner was about 3x slower than 9.1 on a moderately
 large join problem.  I've spent the past several days hacking away at
 that, and have got it down to about 35% slower by dint of the following
 changes:

 * micro-optimization of add_path(), in particular avoiding duplicate
 calculations during cost comparisons.

 * introducing a two-step mechanism for testing whether proposed join
 paths are interesting.  The patch first calculates a quick and dirty
 lower bound on the cost of the join path (mostly, from the costs of
 its input paths) and looks through the joinrel's pathlist to see if
 there is already a path that is clearly better.  If not, it proceeds
 with the full cost calculation and add_path insertion.  In my testing
 the precheck is able to eliminate 80% or so of the proposed paths,
 more than repaying the extra pathlist search.

 Now this is of course cheating with both hands, in that either of these
 optimization techniques could have been applied before ... but they
 weren't.  I think that 35% slower on large join problems is probably
 acceptable, given that this is investigating a larger number of possible
 solutions and hopefully often finding a better plan.  I think I can
 knock some more off of that by refactoring the costsize.c APIs, anyway.
 Right now the first-pass and second-pass cost calculations are
 independent and so there's some repeated work, which I'd like to
 eliminate both for speed reasons and to get rid of duplicate code
 that'd have to be kept in sync if it's left as-is.

 If there are not objections, I'd like to go ahead and commit this
 after one more round of polishing.  There's still a lot left to do,
 but what it mainly needs now is some testing on real-world planning
 problems, and I don't think it's going to get that until it's been
 merged in.
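
To illustrate the two-step precheck idea described above, a minimal sketch
(the 1% fuzz and the cost-only comparison are simplifications; the real
precheck also has to consider pathkeys and parameterization, and this is not
the actual add_path() code):

static bool
join_path_maybe_useful(RelOptInfo *joinrel, Cost lower_bound_total_cost)
{
	ListCell   *lc;

	foreach(lc, joinrel->pathlist)
	{
		Path	   *old_path = (Path *) lfirst(lc);

		/*
		 * If an existing path is already at least as cheap as the proposed
		 * path could possibly be, skip the expensive full cost estimate
		 * and the add_path() call entirely.
		 */
		if (old_path->total_cost <= lower_bound_total_cost * 1.01)
			return false;
	}
	return true;				/* worth doing the full costing + add_path */
}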

I have to admit that I have a bad feeling about this.  It strikes me
that there is no way we're not going to get complaints about a 35%
slowdown on planning a large join problem.  It is true that some
people will have the benefit of finding a faster plan - but I think
many people won't.  We're really facing the same trade-off here that
we do with many other things people have asked for the query planner
to do over the years: is it worth slowing down everyone, on every
query, for an optimization that will apply rarely but provide huge
benefits when it does?  Of course, that's fundamentally a judgement
call.  But I can't help observing that the number of requests we've
had for the planner to deduce implied inequalities is far larger than
the number of people who have complained about the problem that this
fixes.  Now, maybe the speed penalty for deducing implied inequalities
would be even larger than 35%: I don't know.  But we've sweat blood to
avoid much smaller regressions on much less important cases.

To be clear, I'd love to have this feature.  But if there is a choice
between reducing planning time significantly for everyone and NOT
getting this feature, and increasing planning time significantly for
everyone and getting this feature, I think we will make more people
happy by doing the first one.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements

2012-01-25 Thread Tom Lane
Marko Kreen mark...@gmail.com writes:
 On Wed, Jan 25, 2012 at 11:40:28AM -0500, Tom Lane wrote:
 Then why bother with the bit in the format code?  If you've already done
 some other negotiation to establish what datatype formats you will
 accept, this doesn't seem to be adding any value.

 The other negotiation is done via Postgres release notes...

That is really not going to work if the requirement is to not break old
apps.  They haven't read the release notes.

 I specifically want to avoid any sort of per-connection
 negotiation, except the max format version supported,
 because it will mess up multiplexed usage of a single connection.
 Then they need to either disable advanced formats completely,
 or still do it per-query somehow (via GUCs?), which is a mess.

Hmm, that adds yet another level of not-obvious-how-to-meet requirement.
I tend to concur with Robert that we are not close to a solution.

 No, the list of well-known types is documented and fixed.
 The bit is specifically for frameworks, so that they can say
 I support all well-known types in Postgres version X.Y.

So in other words, if we have a client that contains a framework that
knows about version N, and we connect it up to a server that speaks
version N+1, it suddenly loses the ability to use any version-N
optimizations?  That does not meet my idea of not breaking old apps.

regards, tom lane



Re: [HACKERS] Online base backup from the hot-standby

2012-01-25 Thread Simon Riggs
On Wed, Jan 25, 2012 at 8:49 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Wed, Jan 25, 2012 at 8:16 AM, Fujii Masao masao.fu...@gmail.com wrote:

 What happens if we shutdown the WALwriter and then issue SIGHUP?

 SIGHUP doesn't affect full_page_writes in that case. Oh, you are concerned 
 about
 the case where smart shutdown kills walwriter but some backends are
 still running?
 Currently SIGHUP affects full_page_writes and running backends use the 
 changed
 new value of full_page_writes. But in the patch, SIGHUP doesn't affect...

 To address the problem, we should either postpone the shutdown of walwriter
 until all backends have gone away, or leave the update of full_page_writes to
 checkpointer process instead of walwriter. Thought?

 checkpointer seems the correct place to me


Done.


-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [v9.2] LEAKPROOF attribute of FUNCTION (Re: [HACKERS] [v9.2] Fix Leaky View Problem)

2012-01-25 Thread Robert Haas
On Sun, Jan 22, 2012 at 5:12 AM, Kohei KaiGai kai...@kaigai.gr.jp wrote:
 2012/1/21 Robert Haas robertmh...@gmail.com:
 On Sat, Jan 21, 2012 at 3:59 AM, Kohei KaiGai kai...@kaigai.gr.jp wrote:
 I marked the default leakproof functions according to the criterion
 that they do not leak the contents of their arguments.
 Indeed, timestamp_ne_timestamptz() has a code path that raises
 a timestamp out of range error. Is it a good idea to
 avoid marking these functions leakproof as well?

 I think that anything which looks at the data and uses that as a basis
 for whether or not to throw an error is non-leakproof.  Even if it
 doesn't directly leak an arbitrary value, I think that leaking even
 some information about the value is no good.  Otherwise, you
 might imagine that we would allow /(int, int), because it only leaks
 in the second_arg = 0 case.  And you might imagine we'd allow -(int,
 int) because it only leaks in the case where an overflow occurs.  But
 of course the combination of the two allows writing something of the
 form 1/(a-constant) and getting it pushed down, and now you have the
 ability to probe for an arbitrary value.  So I think it's just no good
 to allow any leaking at all: otherwise it'll be unclear how safe it
 really is, especially when combinations of different functions or
 operators are involved.

 OK. I checked the list of the default leakproof functions.
 
 Functions that contain a translation between date and timestamp(tz)
 can raise an error depending on the supplied arguments.
 Thus, I unmarked them as leakproof.
 
 In addition, varstr_cmp() contains a translation from UTF-8 to UTF-16 on
 the win32 platform; that may raise an error if the string contains a
 character that cannot be translated.
 Although I'm not sure in which cases the translation fails, it seems to
 fall under the principle that leaking any kind of information is no good.
 Thus, the related operator functions of bpchar and text also got unmarked.
 (Note that bpchareq, bpcharne, texteq and textne don't use it.)

Can you rebase this?  It seems that the pg_proc.h and
select_views{,_1}.out hunks no longer apply cleanly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] WIP patch for parameterized inner paths

2012-01-25 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 I have to admit that I have a bad feeling about this.  It strikes me
 that there is no way we're not going to get complaints about a 35%
 slowdown on planning a large join problem.

I have to admit that I'm worried about that too.  However, I hope to
continue whittling away at that number.  It's also worth pointing out
that the number is based on just a couple of test cases that are not
very real-world, in that I'm testing against empty tables with no
statistics.  I think that this is a worst case, because it results in a
lot of paths with basically the same costs, making them hard to prune;
but I can't do much better, because the examples I've got were supplied
without test data.

Also, you're assuming that the changes have no upside whatsoever, which
I fondly hope is not the case.  Large join problems tend not to execute
instantaneously --- so nobody is going to complain if the planner takes
awhile longer but the resulting plan is enough better to buy that back.
In my test cases, the planner *is* finding better plans, or at least
ones with noticeably lower estimated costs.  It's hard to gauge how
much that translates to in real-world savings, since I don't have
real data loaded up.  I also think, though I've not tried to measure,
that I've made planning cheaper for very simple queries by eliminating
some overhead in those cases.

Anyway, I'd be willing to hold off committing if someone were to
volunteer to test an unintegrated copy of the patch against some
moderately complicated application.  But it's a sufficiently large
patch that I don't really care to sit on it and try to maintain it
outside the tree for a long time.

 To be clear, I'd love to have this feature.  But if there is a choice
 between reducing planning time significantly for everyone and NOT
 getting this feature, and increasing planning time significantly for
 everyone and getting this feature, I think we will make more people
 happy by doing the first one.

We're not really talking about are we going to accept or reject a
specific feature.  We're talking about whether we're going to decide
that the last two years worth of planner development were headed in
the wrong direction and we're now going to reject that and try to
think of some entirely new concept.  This isn't an isolated patch,
it's the necessary next step in a multi-year development plan.  The
fact that it's a bit slower at the moment just means there's still
work to do.

regards, tom lane



Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements

2012-01-25 Thread Merlin Moncure
On Wed, Jan 25, 2012 at 11:24 AM, Marko Kreen mark...@gmail.com wrote:
 I specifically want to avoid any sort of per-connection
 negotation, except the max format version supported,
 because it will mess up multiplexed usage of single connection.
 Then they need to either disabled advanced formats completely,
 or still do it per-query somehow (via GUCs?) which is mess.

Being able to explicitly pick format version other than the one the
application was specifically written against adds a lot of complexity
and needs to be justified.  Maybe you're trying to translate data
between two differently versioned servers?  I'm trying to understand
the motive behind your wanting finer grained control of picking format
version...

merlin



Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements

2012-01-25 Thread Marko Kreen
On Wed, Jan 25, 2012 at 12:58:15PM -0500, Tom Lane wrote:
 Marko Kreen mark...@gmail.com writes:
  On Wed, Jan 25, 2012 at 11:40:28AM -0500, Tom Lane wrote:
  Then why bother with the bit in the format code?  If you've already done
  some other negotiation to establish what datatype formats you will
  accept, this doesn't seem to be adding any value.
 
  The other negotiation is done via Postgres release notes...
 
 That is really not going to work if the requirement is to not break old
 apps.  They haven't read the release notes.

Yes, but they also keep requesting the old formats so everything is fine?
Note that formats are under full control of client, server has no way
to send newer formats to client that has not requested it.

  I specifically want to avoid any sort of per-connection
  negotation, except the max format version supported,
  because it will mess up multiplexed usage of single connection.
  Then they need to either disabled advanced formats completely,
  or still do it per-query somehow (via GUCs?) which is mess.
 
 Hmm, that adds yet another level of not-obvious-how-to-meet requirement.
 I tend to concur with Robert that we are not close to a solution.

Well, my simple scheme seems to work fine with such requirement.

[My scheme - client-supplied 16bit type code is only thing
that decides format.]
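
For context, the existing v3 protocol already carries a client-supplied
int16 format code for each parameter and each result column (0 = text,
1 = binary); the scheme described above would widen the meaning of that
value rather than add a new negotiation step.  A minimal libpq call that
exercises the existing codes looks like this (plain PQexecParams usage,
nothing new is being assumed except the variable names):

	/* request the single result column in binary format (format code 1) */
	PGresult *res = PQexecParams(conn,
								 "SELECT now()",
								 0,			/* no parameters */
								 NULL, NULL, NULL, NULL,
								 1);		/* resultFormat: 1 = binary */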

  No, the list of well-known types is documented and fixed.
  The bit is specifically for frameworks, so that they can say
  I support all well-known types in Postgres version X.Y.
 
 So in other words, if we have a client that contains a framework that
 knows about version N, and we connect it up to a server that speaks
 version N+1, it suddenly loses the ability to use any version-N
 optimizations?  That does not meet my idea of not breaking old apps.

That is up to Postgres maintainers to decide, whether they want
to phase out some type from the list.  But my main point was
it's OK to add types into list.  I missed that aspect on my
previous mail.

-- 
marko




Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements

2012-01-25 Thread Marko Kreen
On Wed, Jan 25, 2012 at 12:54:00PM -0600, Merlin Moncure wrote:
 On Wed, Jan 25, 2012 at 11:24 AM, Marko Kreen mark...@gmail.com wrote:
  I specifically want to avoid any sort of per-connection
  negotation, except the max format version supported,
  because it will mess up multiplexed usage of single connection.
  Then they need to either disabled advanced formats completely,
  or still do it per-query somehow (via GUCs?) which is mess.
 
 Being able to explicitly pick format version other than the one the
 application was specifically written against adds a lot of complexity
 and needs to be justified.  Maybe you're trying to translate data
 between two differently versioned servers?  I'm trying to understand
 the motive behind your wanting finer grained control of picking format
 version...

You mean if client has written with version N formats, but connects
to server with version N-1 formats?  True, simply not supporting
such case simplifies client-side API.

But note that it does not change anything on protocol level, it's purely
client-API specific.  It may well be that some higher-level APIs
(JDBC, Npgsql, Psycopg) may support such downgrade, but with lower-level
API-s (raw libpq), it may be optional whether the client wants to
support such usage or not.

-- 
marko




Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements

2012-01-25 Thread Merlin Moncure
On Wed, Jan 25, 2012 at 1:24 PM, Marko Kreen mark...@gmail.com wrote:
 On Wed, Jan 25, 2012 at 12:54:00PM -0600, Merlin Moncure wrote:
 On Wed, Jan 25, 2012 at 11:24 AM, Marko Kreen mark...@gmail.com wrote:
  I specifically want to avoid any sort of per-connection
  negotation, except the max format version supported,
  because it will mess up multiplexed usage of single connection.
  Then they need to either disabled advanced formats completely,
  or still do it per-query somehow (via GUCs?) which is mess.

 Being able to explicitly pick format version other than the one the
 application was specifically written against adds a lot of complexity
 and needs to be justified.  Maybe you're trying to translate data
 between two differently versioned servers?  I'm trying to understand
 the motive behind your wanting finer grained control of picking format
 version...

 You mean if client has written with version N formats, but connects
 to server with version N-1 formats?  True, simply not supporting
 such case simplifies client-side API.

 But note that it does not change anything on protocol level, it's purely
 client-API specific.  It may well be that some higher-level APIs
 (JDBC, Npgsql, Psycopg) may support such downgrade, but with lower-level
 API-s (raw libpq), it may be optional whether the client wants to
 support such usage or not.

well, I see the following cases:
1) Vserver > Vapplication: server downgrades wire formats to
applications version
2) Vapplication > Vlibpq > Vserver: since the application is
reading/writing formats the server can't understand, an error should
be raised if they are used in either direction
3) Vlibpq >= VApplication > Vserver: same as above, but libpq can
'upconvert' low version wire format to application's wire format or
error otherwise.

By far, the most common cause of problems (both in terms of severity
and frequency) is case #1.  #3 allows a 'compatibility mode' via
libpq, but that comes at significant cost of complexity since libpq
needs to be able to translate wire formats up (but not down).  #2/3 is
a less common problem though as it's more likely the application can
be adjusted to get up to speed: so to keep things simple we can maybe
just error out in those scenarios.

In the database, we need to maintain outdated send/recv functions
basically forever and as much as possible try and translate old wire
format data to and from newer backend structures (maybe in very
specific cases that will be impossible such that the application is
SOL, but that should be rare).  All send/recv functions, including
user created ones need to be stamped with a version token (database
version?).  With the versions of the application, libpq, and all
server functions, we can determine all wire formats as long as we
assume the application's targeted database version represents all the
wire formats it was using.

My good ideas stop there: the exact mechanics of how the usable set of
functions are determined, how exactly the adjusted type look ups will
work, etc. would all have to be sorted out.  Most of the nastier parts
though (protocol changes notwithstanding) are not in libpq, but the
server.  There's just no quick fix on the client side I can see.

merlin



Re: [HACKERS] WIP patch for parameterized inner paths

2012-01-25 Thread David E. Wheeler
On Jan 25, 2012, at 10:24 AM, Tom Lane wrote:

 Anyway, I'd be willing to hold off committing if someone were to
 volunteer to test an unintegrated copy of the patch against some
 moderately complicated application.  But it's a sufficiently large
 patch that I don't really care to sit on it and try to maintain it
 outside the tree for a long time.

Why not create a branch? IIRC the build farm can be configured to run branches.

Best,

David




[HACKERS] pg_bulkload ON_DUPLICATE_MERGE

2012-01-25 Thread Benjamin Johnson
PG Gurus,

I have a table like this:

CREATE TABLE filemods (
  guid          BIGINT NOT NULL UNIQUE,
  filepath_guid BIGINT NOT NULL,
  createtime    TIMESTAMP WITH TIME ZONE DEFAULT NULL,
  writetime     TIMESTAMP WITH TIME ZONE DEFAULT NULL,
  deletetime    TIMESTAMP WITH TIME ZONE DEFAULT NULL
);

One event might have (1, '2012-01-25 11:00:00', NULL, NULL) and
another event might have (1, NULL, '2012-01-25 11:05:00', NULL) and the
end result should be:
(1, '2012-01-25 11:00:00', '2012-01-25 11:05:00', NULL).


I'm trying to modify pg_bulkload to merge two rows together.  The
changes I have made seem to be working, although I would like input on
what I am doing that is unsafe or terribly wrong.  You can be brutal.

I've seen incredible write speed with using pg_bulkload.  If I can just
get it to consolidate our rows based on the unique key it will remove
a lot of complexity in our software.

Also, I'm not entirely sure this mailing list is the correct one, but
with all the internals you all know, I'm hoping you can help point out
nasty flaws in my algorithm.  This is the first legitimate attempt I
have made at modifying PG source, so I'm not real familiar with the
proper way of loading pages and tuples and updating heaps and all that
pg core stuff.
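
A sketch of that column-level merge rule, written with the generic tuple
routines heap_deform_tuple() and heap_form_tuple(), is below.  It assumes
you already have both heap tuples and the table's TupleDesc in hand, it
ignores the buffer locking, WAL and TOAST questions that an in-place merge
inside pg_bulkload has to answer, and merge_heap_tuples() is a made-up name
rather than part of any existing API; it only illustrates the
fill-NULL-columns rule:

static HeapTuple
merge_heap_tuples(TupleDesc tupdesc, HeapTuple dst, HeapTuple src)
{
	int		natts = tupdesc->natts;
	Datum  *values = (Datum *) palloc(natts * sizeof(Datum));
	bool   *isnull = (bool *) palloc(natts * sizeof(bool));
	Datum  *src_values = (Datum *) palloc(natts * sizeof(Datum));
	bool   *src_isnull = (bool *) palloc(natts * sizeof(bool));
	int		i;

	heap_deform_tuple(dst, tupdesc, values, isnull);
	heap_deform_tuple(src, tupdesc, src_values, src_isnull);

	for (i = 0; i < natts; i++)
	{
		/* copy a column from src only where dst is NULL and src is not */
		if (isnull[i] && !src_isnull[i])
		{
			values[i] = src_values[i];
			isnull[i] = false;
		}
	}

	/* build a new tuple; the caller decides how it gets back onto the page */
	return heap_form_tuple(tupdesc, values, isnull);
}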

Here's the modifications to pg_btree.c (from pg_bulkload HEAD):

http://pastebin.com/U23CapvR

I also attached the patch.

Thank you!!

Ben


-- 
Benjamin Johnson
http://getcarbonblack.com/ | @getcarbonblack
cell: 312.933.3612
diff -r 3f065ec72ab8 pgbulkload/lib/pg_btree.c
--- a/pgbulkload/lib/pg_btree.c	Fri Jan 20 09:26:20 2012 -0600
+++ b/pgbulkload/lib/pg_btree.c	Wed Jan 25 13:37:43 2012 -0600
@@ -398,6 +398,8 @@
 	BTReaderTerm(reader);
 }
 
+void merge_tuples(Relation heap, IndexTuple itup_dst, IndexTuple itup_src);
+
 /*
  * _bt_mergeload - Merge two streams of index tuples into new index files.
  */
@@ -462,7 +464,6 @@
 				}
 				else
 				{
-					// TODO -- BSJ
 					if (on_duplicate == ON_DUPLICATE_KEEP_NEW)
 					{
 						self->dup_old++;
 						remove_duplicate(self, heapRel, itup2,
 							RelationGetRelationName(wstate->index));
 						itup2 = BTReaderGetNextItem(btspool2);
 					}
-					else
+					else if (on_duplicate == ON_DUPLICATE_MERGE)
+					{
+						self->dup_old++;
+
+						// merge from itup into itup2 where itup's col[i] is not null
+						// but itup2's col[i] IS null
+						merge_tuples(heapRel, itup2, itup);
+
+						ItemPointerCopy(&t_tid2, &itup2->t_tid);
+						self->dup_new++;
+						remove_duplicate(self, heapRel, itup,
+							RelationGetRelationName(wstate->index));
+						itup = BTSpoolGetNextItem(btspool, itup, &should_free);
+					}
+					else
 					{
 						ItemPointerCopy(&t_tid2, &itup2->t_tid);
 						self->dup_new++;
@@ -950,6 +965,113 @@
 			self->dup_old + self->dup_new, relname);
 }
 
+// returns Buffer after locking it (BUFFER_LOCK_SHARE then BUFFER_LOCK_UNLOCK)
+Buffer load_buffer(Relation heap, IndexTuple itup,
+				   HeapTupleData *tuple /* OUT */, ItemId *itemid /* OUT */)
+{
+	BlockNumber blknum;
+	BlockNumber offnum;
+	Buffer		buffer;
+	Page		page;
+
+	blknum = ItemPointerGetBlockNumber(&itup->t_tid);
+	offnum = ItemPointerGetOffsetNumber(&itup->t_tid);
+	buffer = ReadBuffer(heap, blknum);
+
+	LockBuffer(buffer, BUFFER_LOCK_SHARE);
+	page = BufferGetPage(buffer);
+	*itemid = PageGetItemId(page, offnum);
+	tuple->t_data = ItemIdIsNormal(*itemid)
+		? (HeapTupleHeader) PageGetItem(page, *itemid)
+		: NULL;
+	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+	return buffer;
+}
+
+void load_tuple(Relation	heap,
+				HeapTuple	tuple,
+				IndexTuple	itup,
+				ItemId		itemid,
+

Re: [HACKERS] some longer, larger pgbench tests with various performance-related patches

2012-01-25 Thread Jeff Janes
On Wed, Jan 25, 2012 at 9:09 AM, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Jan 25, 2012 at 12:00 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 On Tue, Jan 24, 2012 at 12:53 PM, Robert Haas robertmh...@gmail.com wrote:
 Early yesterday morning, I was able to use Nate Boley's test machine
 do a single 30-minute pgbench run at scale factor 300 using a variety
 of trees built with various patches, and with the -l option added to
 track latency on a per-transaction basis.  All tests were done using
 32 clients and permanent tables.  The configuration was otherwise
 identical to that described here:

 http://archives.postgresql.org/message-id/CA+TgmoboYJurJEOB22Wp9RECMSEYGNyHDVFv5yisvERqFw=6...@mail.gmail.com

 Previously we mostly used this machine for CPU benchmarking.  Have you
 previously described the IO subsystem?  That is becoming relevant for
 these types of benchmarks.   For example, is WAL separated from data?

 I actually don't know much about the I/O subsystem, but, no, WAL is
 not separated from data.  I believe $PGDATA is on a SAN, but I don't
 know anything about its characteristics.

I think the checkpointing issues that Greg is exploring are important,
and I'm pretty sure that that is what is limiting your TPS in this
case.  However, I don't think we can make much progress on that front
using a machine whose IO subsystem is largely unknown, and not
tweakable.

So I think the best use of this machine would be to explore the purely
CPU based scaling issues, like freelist, CLOG, and XLogInsert.

To do that, I'd set the scale factor small enough so that entire data
set is willing to be cached dirty by the kernel, based on the kernel
parameters:

egrep .  /proc/sys/vm/dirty_*

Then set shared_buffers to be less than the needs for that scale
factor, so freelist gets exercised.

And neutralize checkpoints, by setting fsync=off, so they don't
generate physical IO that we can't control for given the current
constraints on the machine set up.

Cheers,

Jeff



Re: [HACKERS] some longer, larger pgbench tests with various performance-related patches

2012-01-25 Thread Robert Haas
On Wed, Jan 25, 2012 at 12:00 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 On Tue, Jan 24, 2012 at 12:53 PM, Robert Haas robertmh...@gmail.com wrote:
 Early yesterday morning, I was able to use Nate Boley's test machine
 do a single 30-minute pgbench run at scale factor 300 using a variety
 of trees built with various patches, and with the -l option added to
 track latency on a per-transaction basis.  All tests were done using
 32 clients and permanent tables.  The configuration was otherwise
 identical to that described here:

 http://archives.postgresql.org/message-id/CA+TgmoboYJurJEOB22Wp9RECMSEYGNyHDVFv5yisvERqFw=6...@mail.gmail.com

 Previously we mostly used this machine for CPU benchmarking.  Have you
 previously described the IO subsystem?  That is becoming relevant for
 these types of benchmarks.   For example, is WAL separated from data?

I actually don't know much about the I/O subsystem, but, no, WAL is
not separated from data.  I believe $PGDATA is on a SAN, but I don't
know anything about its characteristics.

 By doing this, I hoped to get a better understanding of (1) the
 effects of a scale factor too large to fit in shared_buffers,

 In my hands, the active part of data at scale of 300 fits very
 comfortably into 8GB shared_buffers.

 Indeed, until you have run a very long time so that pgbench_history
 gets large, everything fits in 8GB.

Hmm, my mistake.  Maybe I need to crank up the scale factor some more.
 This is why good benchmarking is hard.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] WIP patch for parameterized inner paths

2012-01-25 Thread Tom Lane
David E. Wheeler da...@justatheory.com writes:
 On Jan 25, 2012, at 10:24 AM, Tom Lane wrote:
 Anyway, I'd be willing to hold off committing if someone were to
 volunteer to test an unintegrated copy of the patch against some
 moderately complicated application.  But it's a sufficiently large
 patch that I don't really care to sit on it and try to maintain it
 outside the tree for a long time.

 Why not create a branch? IIRC the build farm can be configured to run 
 branches.

I already know what the patch does against the regression tests.
Buildfarm testing is not of interest here.  What would be of help is,
say, Kevin volunteering to load up his Circuit Courts software and data
into a git-head server and see how performance looks with and without
the patch.  Distribution of the code doesn't strike me as being the
bottleneck.

regards, tom lane



Re: [HACKERS] WIP patch for parameterized inner paths

2012-01-25 Thread David E. Wheeler
On Jan 25, 2012, at 12:19 PM, Tom Lane wrote:

 Why not create a branch? IIRC the build farm can be configured to run 
 branches.
 
 I already know what the patch does against the regression tests.
 Buildfarm testing is not of interest here.  What would be of help is,
 say, Kevin volunteering to load up his Circuit Courts software and data
 into a git-head server and see how performance looks with and without
 the patch.  Distribution of the code doesn't strike me as being the
 bottleneck.

Yeah, but it would be easier to keep a branch up-to-date via `git rebase` than 
to maintain the patch, I would think. And if it’s a remote branch, then simpler 
distribution is a bonus.

Best,

David




Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements

2012-01-25 Thread Marko Kreen
On Wed, Jan 25, 2012 at 01:43:03PM -0600, Merlin Moncure wrote:
 On Wed, Jan 25, 2012 at 1:24 PM, Marko Kreen mark...@gmail.com wrote:
  On Wed, Jan 25, 2012 at 12:54:00PM -0600, Merlin Moncure wrote:
  On Wed, Jan 25, 2012 at 11:24 AM, Marko Kreen mark...@gmail.com wrote:
   I specifically want to avoid any sort of per-connection
   negotation, except the max format version supported,
   because it will mess up multiplexed usage of single connection.
   Then they need to either disabled advanced formats completely,
   or still do it per-query somehow (via GUCs?) which is mess.
 
  Being able to explicitly pick format version other than the one the
  application was specifically written against adds a lot of complexity
  and needs to be justified.  Maybe you're trying to translate data
  between two differently versioned servers?  I'm trying to understand
  the motive behind your wanting finer grained control of picking format
  version...
 
  You mean if client has written with version N formats, but connects
  to server with version N-1 formats?  True, simply not supporting
  such case simplifies client-side API.
 
  But note that it does not change anything on protocol level, it's purely
  client-API specific.  It may well be that some higher-level APIs
  (JDBC, Npgsql, Psycopg) may support such downgrade, but with lower-level
  API-s (raw libpq), it may be optional whether the client wants to
  support such usage or not.
 
 well, I see the following cases:
 1) Vserver > Vapplication: server downgrades wire formats to
 applications version
 2) Vapplication > Vlibpq > Vserver: since the application is
 reading/writing formats the server can't understand, an error should
 be raised if they are used in either direction
 3) Vlibpq >= VApplication > Vserver: same as above, but libpq can
 'upconvert' low version wire format to application's wire format or
 error otherwise.

I don't see why you special-case libpq here.  There is no reason
libpq cannot pass older/newer formats through.  The only thing that
matters is the parser/formatter version.  If that is done in libpq,
then app version does not matter.  If it's done in app, then
libpq version does not matter.

 By far, the most common cause of problems (both in terms of severity
 and frequency) is case #1.  #3 allows a 'compatibility mode' via
 libpq, but that comes at significant cost of complexity since libpq
 needs to be able to translate wire formats up (but not down).  #2/3 is
 a less common problem though as it's more likely the application can
 be adjusted to get up to speed: so to keep things simple we can maybe
 just error out in those scenarios.

I don't like the idea of conversion.  Instead either client
writes values through API that picks format based on server version,
or it writes them for specific version only.  In latter case it cannot
work with older server.  Unless the fixed version is the baseline.

 In the database, we need to maintain outdated send/recv functions
 basically forever and as much as possible try and translate old wire
 format data to and from newer backend structures (maybe in very
 specific cases that will be impossible such that the application is
 SOL, but that should be rare).  All send/recv functions, including
 user created ones need to be stamped with a version token (database
 version?).  With the versions of the application, libpq, and all
 server functions, we can determine all wire formats as long as we
 assume the application's targeted database version represents all the
 wire formats it was using.
 
 My good ideas stop there: the exact mechanics of how the usable set of
 functions are determined, how exactly the adjusted type look ups will
 work, etc. would all have to be sorted out.  Most of the nastier parts
 though (protocol changes notwithstanding) are not in libpq, but the
 server.  There's just no quick fix on the client side I can see.

It does not need to be complex - just bring the version number to
i/o function and let it decide whether it cares about it or not.
Most functions will not..  Only those that we want to change in
compatible manner need to look at it.
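
A minimal sketch of what that could look like for one type's binary-send
routine follows.  The extra wire_version argument and the choice between
the two encodings are purely illustrative assumptions about one possible
format change; no such versioned signature exists in the backend today:

static void
mytype_send_binary(StringInfo buf, int64 value, int wire_version)
{
	if (wire_version >= 2)
	{
		/* newer 8-byte on-wire encoding, sent only when the client asked */
		pq_sendint64(buf, value);
	}
	else
	{
		/* baseline 4-byte encoding that every old client understands */
		pq_sendint(buf, (int) value, 4);
	}
}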

But I don't see that there is danger of having regular changes in wire
formats.  So most of the functions will ignore the versioning.
Including the ones that don't care about compatibility.

But seriously - on-wire compatibility is good thing, do not fear it...

-- 
marko




Re: [HACKERS] Second thoughts on CheckIndexCompatible() vs. operator families

2012-01-25 Thread Robert Haas
On Sun, Jan 22, 2012 at 12:23 AM, Noah Misch n...@leadboat.com wrote:
 New version that repairs a defective test case.

Committed.  I don't find this to be particularly good style:

+   for (i = 0; i < old_natts && ret; i++)
+       ret = (!IsPolymorphicType(get_opclass_input_type(classObjectId[i])) &&
+              irel->rd_att->attrs[i]->atttypid == typeObjectId[i]);

...but I am not sure whether we have any formal policy against it, so
I just committed it as-is for now.  I would have surrounded the loop
with an if (ret) block and written the body of the loop as if
(condition) { ret = false; break; }.
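
For reference, that alternative shape would look roughly like this (same
variables as the committed loop; it is only a restatement of the style
suggestion, not committed code):

	if (ret)
	{
		for (i = 0; i < old_natts; i++)
		{
			if (IsPolymorphicType(get_opclass_input_type(classObjectId[i])) ||
				irel->rd_att->attrs[i]->atttypid != typeObjectId[i])
			{
				ret = false;
				break;
			}
		}
	}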

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements

2012-01-25 Thread Merlin Moncure
On Wed, Jan 25, 2012 at 2:29 PM, Marko Kreen mark...@gmail.com wrote:
 well, I see the following cases:
 1) Vserver > Vapplication: server downgrades wire formats to
 applications version
 2) Vapplication > Vlibpq > Vserver: since the application is
 reading/writing formats the server can't understand, an error should
 be raised if they are used in either direction
 3) Vlibpq >= VApplication > Vserver: same as above, but libpq can
 'upconvert' low version wire format to application's wire format or
 error otherwise.

 I don't see why you special-case libpq here.  There is no reason
 libpq cannot pass older/newer formats through.  Only thing that
 matters it parser/formatter version.  If that is done in libpq,
 then app version does not matter.  If it's done in app, then
 libpq version does not matter.

Only because if the app is targeting wire format N, but the server can
only handle N-1, libpq has the opportunity to fix it up.  That could
be just overthinking it, though.

 By far, the most common cause of problems (both in terms of severity
 and frequency) is case #1.  #3 allows a 'compatibility mode' via
 libpq, but that comes at significant cost of complexity since libpq
 needs to be able to translate wire formats up (but not down).  #2/3 is
 a less common problem though as it's more likely the application can
 be adjusted to get up to speed: so to keep things simple we can maybe
 just error out in those scenarios.

 I don't like the idea of conversion.  Instead either client
 writes values through API that picks format based on server version,
 or it writes them for specific version only.  In latter case it cannot
 work with older server.  Unless the fixed version is the baseline.

ok.  another point about that: libpq isn't really part of the solution
anyways since there are other popular fully native protocol consumers,
including (and especially) jdbc, but also python, node.js etc etc.

that's why I was earlier insisting on a protocol bump, so that we
could in the new protocol force application version to be advertised.
v3 would remain caveat emptor for wire formats but v4 would not.

 In the database, we need to maintain outdated send/recv functions
 basically forever and as much as possible try and translate old wire
 format data to and from newer backend structures (maybe in very
 specific cases that will be impossible such that the application is
 SOL, but that should be rare).  All send/recv functions, including
 user created ones need to be stamped with a version token (database
 version?).  With the versions of the application, libpq, and all
 server functions, we can determine all wire formats as long as we
 assume the application's targeted database version represents all the
 wire formats it was using.

 My good ideas stop there: the exact mechanics of how the usable set of
 functions are determined, how exactly the adjusted type look ups will
 work, etc. would all have to be sorted out.  Most of the nastier parts
 though (protocol changes notwithstanding) are not in libpq, but the
 server.  There's just no quick fix on the client side I can see.

 It does not need to be complex - just bring the version number to
 i/o function and let it decide whether it cares about it or not.
 Most functions will not..  Only those that we want to change in
 compatible manner need to look at it.

well, maybe instead of passing version number around, the server
installs the proper compatibility send/recv functions just once on
session start up so your code isn't littered with stuff like
if(version  n) do this; else do this;?

 But seriously - on-wire compatibility is good thing, do not fear it...

sure -- but for postgres I just don't think it's realistic, especially
for the binary wire formats.  a json based data payload could give it
to you (and I'm only half kidding) :-).

merlin



Re: [HACKERS] Second thoughts on CheckIndexCompatible() vs. operator families

2012-01-25 Thread Alvaro Herrera

Excerpts from Robert Haas's message of mié ene 25 17:32:49 -0300 2012:
 On Sun, Jan 22, 2012 at 12:23 AM, Noah Misch n...@leadboat.com wrote:
  New version that repairs a defective test case.
 
 Committed.  I don't find this to be particularly good style:
 
 +   for (i = 0; i < old_natts && ret; i++)
 +       ret = (!IsPolymorphicType(get_opclass_input_type(classObjectId[i])) &&
 +              irel->rd_att->attrs[i]->atttypid == typeObjectId[i]);
 
 ...but I am not sure whether we have any formal policy against it, so
 I just committed it as-is for now.  I would have surrounded the loop
 with an if (ret) block and written the body of the loop as if
 (condition) { ret = false; break; }.

I find that code way too clever.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] some longer, larger pgbench tests with various performance-related patches

2012-01-25 Thread Nathan Boley
 I actually don't know much about the I/O subsystem, but, no, WAL is
 not separated from data.  I believe $PGDATA is on a SAN, but I don't
 know anything about its characteristics.

The computer is here:
http://www.supermicro.nl/Aplus/system/2U/2042/AS-2042G-6RF.cfm

$PGDATA is on a 5 disk SATA software raid 5.

Is there anything else that would be useful to know? ( Sorry, this
isn't a subject that I'm very familiar with )

-Nathan



Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements

2012-01-25 Thread Marko Kreen
On Wed, Jan 25, 2012 at 02:50:09PM -0600, Merlin Moncure wrote:
 On Wed, Jan 25, 2012 at 2:29 PM, Marko Kreen mark...@gmail.com wrote:
  well, I see the following cases:
  1) Vserver > Vapplication: server downgrades wire formats to
  applications version
  2) Vapplication > Vlibpq > Vserver: since the application is
  reading/writing formats the server can't understand, an error should
  be raised if they are used in either direction
  3) Vlibpq >= VApplication > Vserver: same as above, but libpq can
  'upconvert' low version wire format to application's wire format or
  error otherwise.
 
  I don't see why you special-case libpq here.  There is no reason
  libpq cannot pass older/newer formats through.  Only thing that
  matters it parser/formatter version.  If that is done in libpq,
  then app version does not matter.  If it's done in app, then
  libpq version does not matter.
 
 Only because if the app is targeting wire format N, but the server can
 only handle N-1, libpq has the opportunity to fix it up.  That could
 be just overthinking it, though.

I think it's over thinking.  The value should be formatted/parsed just
once.  Server side must support processing different versions.
Whether client side supports downgrading, it's up to client-side
programmers.

If you want to write compatible client, you have a choice of using
proper wrapper API, or simply writing baseline formatting, ignoring
format changes in new versions.

Both are valid approaches and I think we should keep it that way.

  By far, the most common cause of problems (both in terms of severity
  and frequency) is case #1.  #3 allows a 'compatibility mode' via
  libpq, but that comes at significant cost of complexity since libpq
  needs to be able to translate wire formats up (but not down).  #2/3 is
  a less common problem though as it's more likely the application can
  be adjusted to get up to speed: so to keep things simple we can maybe
  just error out in those scenarios.
 
  I don't like the idea of conversion.  Instead either client
  writes values through API that picks format based on server version,
  or it writes them for specific version only.  In latter case it cannot
  work with older server.  Unless the fixed version is the baseline.
 
 ok.  another point about that: libpq isn't really part of the solution
 anyways since there are other popular fully native protocol consumers,
 including (and especially) jdbc, but also python, node.js etc etc.
 
 that's why I was earlier insisting on a protocol bump, so that we
 could in the new protocol force application version to be advertised.
 v3 would remain caveat emptor for wire formats but v4 would not.

We can bump major/minor anyway to inform clients about new
functionality.  I don't particularly care about that.  What
I'm interested in is what the actual type negotiation looks like.

It might be possible we could get away without bumping anything.
But I have not thought about that angle too deeply yet.

  In the database, we need to maintain outdated send/recv functions
  basically forever and as much as possible try and translate old wire
  format data to and from newer backend structures (maybe in very
  specific cases that will be impossible such that the application is
  SOL, but that should be rare).  All send/recv functions, including
  user created ones need to be stamped with a version token (database
  version?).  With the versions of the application, libpq, and all
  server functions, we can determine all wire formats as long as we
  assume the application's targeted database version represents all the
  wire formats it was using.
 
  My good ideas stop there: the exact mechanics of how the usable set of
  functions are determined, how exactly the adjusted type look ups will
  work, etc. would all have to be sorted out.  Most of the nastier parts
  though (protocol changes notwithstanding) are not in libpq, but the
  server.  There's just no quick fix on the client side I can see.
 
  It does not need to be complex - just bring the version number to
  i/o function and let it decide whether it cares about it or not.
  Most functions will not..  Only those that we want to change in
  compatible manner need to look at it.
 
 well, maybe instead of passing version number around, the server
 installs the proper compatibility send/recv functions just once on
 session start up so your code isn't littered with stuff like
 if(version  n) do this; else do this;?

Seems confusing.  Note that type i/o functions are user-callable.
How should they act then?

Also note that if()s are needed only for types that want to change their
on-wire formatting.  Considering the mess incompatible on-wire format change
can cause, it's good price to pay.

  But seriously - on-wire compatibility is good thing, do not fear it...
 
 sure -- but for postgres I just don't think it's realistic, especially
 for the binary wire formats.  a json based data payload could give it
 to you (and I'm only half 

Re: [HACKERS] psql COPY vs. ON_ERROR_ROLLBACK, multi-command strings

2012-01-25 Thread Alvaro Herrera

Excerpts from Noah Misch's message of sáb ene 14 12:40:02 -0300 2012:
 It has bothered me that psql's \copy ignores the ON_ERROR_ROLLBACK setting.
 Only SendQuery() takes note of ON_ERROR_ROLLBACK, and \copy, like all
 backslash commands, does not route through SendQuery().  Looking into this
 turned up several other weaknesses in psql's handling of COPY.

Interesting.

Committed, thanks.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] WIP patch for parameterized inner paths

2012-01-25 Thread Robert Haas
On Wed, Jan 25, 2012 at 1:24 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Also, you're assuming that the changes have no upside whatsoever, which
 I fondly hope is not the case.  Large join problems tend not to execute
 instantaneously --- so nobody is going to complain if the planner takes
 awhile longer but the resulting plan is enough better to buy that back.
 In my test cases, the planner *is* finding better plans, or at least
 ones with noticeably lower estimated costs.  It's hard to gauge how
 much that translates to in real-world savings, since I don't have
 real data loaded up.  I also think, though I've not tried to measure,
 that I've made planning cheaper for very simple queries by eliminating
 some overhead in those cases.

I had a 34-table join on one of the last applications I maintained
that planned and executed in less than 2 seconds.  That was pushing
it, but I had many joins in the 10-20 table range that planned and
executed in 100-200 ms.  I agree that if you are dealing with a
terabyte table - or even a gigabyte table -  then the growth of
planning time will probably not bother anyone even if you fail to find
a better plan, and will certainly make the user very happy if you do.
But on tables with only a megabyte of data, it's not nearly so
clear-cut.  In an ideal world, I'd like the amount of effort we spend
planning to be somehow tied to the savings we can expect to get, and
deploy optimizations like this only in cases where we have a
reasonable expectation of that effort being repaid.

AIUI, this is mostly going to benefit cases like small LJ (big1 IJ
big2) and, of course, those cases aren't going to arise if your query
only involves small tables, or even if you have something like big IJ
small1 IJ small2 IJ small3 IJ small4 LJ small5 LJ small6 IJ small7,
which is a reasonably common pattern for me.  Now, if you come back
and say, ah, well, those cases aren't the ones that are going to be
harmed by this, then maybe we should have a more detailed conversation
about where the mines are.  Or maybe it is helping in more cases than
I'm thinking about at the moment.

 To be clear, I'd love to have this feature.  But if there is a choice
 between reducing planning time significantly for everyone and NOT
 getting this feature, and increasing planning time significantly for
 everyone and getting this feature, I think we will make more people
 happy by doing the first one.

 We're not really talking about are we going to accept or reject a
 specific feature.  We're talking about whether we're going to decide
 that the last two years worth of planner development were headed in
 the wrong direction and we're now going to reject that and try to
 think of some entirely new concept.  This isn't an isolated patch,
 it's the necessary next step in a multi-year development plan.  The
 fact that it's a bit slower at the moment just means there's still
 work to do.

I'm not proposing that you should never commit this.  I'm proposing
that any commit by anyone that introduces a 35% performance regression
is unwise, and doubly so at the end of the release cycle.  I have
every confidence that you can improve the code further over time, but
the middle of the last CommitFest is not a great time to commit code
that, by your own admission, needs a considerable amount of additional
work.  Sure, there are some things that we're not going to find out
until the code goes into production, but it seems to me that you've
already uncovered a fairly major performance problem that is only
partially fixed.  Once this is committed, it's not coming back out, so
we're either going to have to figure out how to fix it before we
release, or release with a regression in certain cases.  If you got it
down to 10% I don't think I'd be worried, but a 35% regression that we
don't know how to fix seems like a lot.

On another note, nobody besides you has looked at the code yet, AFAIK...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Second thoughts on CheckIndexCompatible() vs. operator families

2012-01-25 Thread Robert Haas
On Wed, Jan 25, 2012 at 3:52 PM, Alvaro Herrera
alvhe...@commandprompt.com wrote:

 Excerpts from Robert Haas's message of mié ene 25 17:32:49 -0300 2012:
 On Sun, Jan 22, 2012 at 12:23 AM, Noah Misch n...@leadboat.com wrote:
  New version that repairs a defective test case.

 Committed.  I don't find this to be particularly good style:

 +       for (i = 0; i < old_natts && ret; i++)
 +               ret = (!IsPolymorphicType(get_opclass_input_type(classObjectId[i])) &&
 +                          irel->rd_att->attrs[i]->atttypid == typeObjectId[i]);

 ...but I am not sure whether we have any formal policy against it, so
 I just committed it as-is for now.  I would have surrounded the loop
 with an if (ret) block and written the body of the loop as if
 (condition) { ret = false; break; }.

 I find that code way too clever.

The way he wrote it, or the way I proposed to write it?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Second thoughts on CheckIndexCompatible() vs. operator families

2012-01-25 Thread Alvaro Herrera

Excerpts from Robert Haas's message of mié ene 25 19:05:44 -0300 2012:
 
 On Wed, Jan 25, 2012 at 3:52 PM, Alvaro Herrera
 alvhe...@commandprompt.com wrote:
 
  Excerpts from Robert Haas's message of mié ene 25 17:32:49 -0300 2012:
  On Sun, Jan 22, 2012 at 12:23 AM, Noah Misch n...@leadboat.com wrote:
   New version that repairs a defective test case.
 
  Committed.  I don't find this to be particularly good style:
 
  +       for (i = 0; i < old_natts && ret; i++)
  +               ret = (!IsPolymorphicType(get_opclass_input_type(classObjectId[i])) &&
  +                          irel->rd_att->attrs[i]->atttypid == typeObjectId[i]);
 
  ...but I am not sure whether we have any formal policy against it, so
  I just committed it as-is for now.  I would have surrounded the loop
  with an if (ret) block and written the body of the loop as if
  (condition) { ret = false; break; }.
 
  I find that code way too clever.
 
 The way he wrote it, or the way I proposed to write it?

The code as committed.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] [v9.2] Fix Leaky View Problem

2012-01-25 Thread Robert Haas
On Sun, Jan 22, 2012 at 5:57 AM, Kohei KaiGai kai...@kaigai.gr.jp wrote:
 This passes installcheck initially.  Then upon second invocation of
 installcheck, it fails.

 It creates the role alice, and doesn't clean it up.  On next
 invocation alice already exists and cases a failure in test
 select_views.

 Thanks for your pointing out.

 The attached patch adds cleaning-up part of object being defined
 within this test;
 includes user alice.

Urp.  I failed to notice this patch and committed a different fix for
the problem pointed out by Jeff.  I'm inclined to think it's OK to
leave the non-shared objects behind - among other things, if people
are testing pg_upgrade using the regression database, this will help
to ensure that pg_dump is handling security_barrier views correctly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



[HACKERS] Different error messages executing CREATE TABLE or ALTER TABLE to create a column xmin

2012-01-25 Thread Giuseppe Sucameli
Hi hackers,

the attached patch fixes the problem I explained in pgsql-bugs (forwarded).
It's a trivial problem, so no initial patch design was required.

Hope to see my patch applied.
Thanks for your work.

Regards.


-- Forwarded message --
From: Giuseppe Sucameli brush.ty...@gmail.com
Date: Sun, Jan 22, 2012 at 2:22 PM
Subject: Different error messages executing CREATE TABLE or ALTER
TABLE to create a column xmin
To: pgsql-b...@postgresql.org


Hi all,

trying to create a table with a column xmin I get the
following error message:

test=> create table lx (xmin int);
ERROR:  column name "xmin" conflicts with a system
column name

Instead I get a different (and less understandable) error
message if I try to add a column named xmin to an
existing table:

test=> create table lx (i int);
CREATE TABLE
test=> alter table lx add xmin int;
ERROR:  column "xmin" of relation "lx" already exists.

The same problem occurs using xmax as column name.

I'm on Ubuntu 11.04.
Tried on both PostgreSQL 8.4.10 and 9.1.2

Regards.

--
Giuseppe Sucameli


-- 
Giuseppe Sucameli


postgresql_syscol_message.diff
Description: Binary data



[HACKERS] PATCH: pg_basebackup (missing exit on error)

2012-01-25 Thread Thomas Ogrisegg
While testing a backup script, I noticed that pg_basebackup exits with
0, even if it had errors while writing the backup to disk when the
backup file is to be sent to stdout. The attached patch should fix this
problem (and some special cases when the last write fails).

Regards,

Thomas
diff -uNr postgresql-9.1.2/src/bin/pg_basebackup/pg_basebackup.c postgresql-9.1.2-patch/src/bin/pg_basebackup/pg_basebackup.c
--- postgresql-9.1.2/src/bin/pg_basebackup/pg_basebackup.c	2011-12-01 22:47:20.0 +0100
+++ postgresql-9.1.2-patch/src/bin/pg_basebackup/pg_basebackup.c	2012-01-23 13:14:16.0 +0100
@@ -410,6 +410,7 @@
 {
 	fprintf(stderr, _("%s: could not write to compressed file \"%s\": %s\n"),
 			progname, filename, get_gz_error(ztarfile));
+	disconnect_and_exit(1);
 }
 			}
 			else
@@ -428,7 +429,14 @@
 #ifdef HAVE_LIBZ
 if (ztarfile)
 	gzclose(ztarfile);
+else
 #endif
+	if (fflush (tarfile) != 0)
+	{
+		fprintf(stderr, _("%s: error flushing stdout: %s\n"),
+				progname, strerror (errno));
+		disconnect_and_exit(1);
+	}
 			}
 			else
 			{
@@ -437,7 +445,14 @@
 	gzclose(ztarfile);
 #endif
 if (tarfile != NULL)
-	fclose(tarfile);
+{
+	if (fclose(tarfile) != 0)
+	{
+		fprintf(stderr, _("%s: error closing file \"%s\": %s\n"),
+				progname, filename, strerror (errno));
+		disconnect_and_exit(1);
+	}
+}
 			}
 
 			break;
@@ -456,6 +471,7 @@
 			{
 fprintf(stderr, _("%s: could not write to compressed file \"%s\": %s\n"),
 		progname, filename, get_gz_error(ztarfile));
+disconnect_and_exit(1);
 			}
 		}
 		else



[HACKERS] Psql names completion.

2012-01-25 Thread Dominik Bylica

Hello.

It's probably not the right place to write, but I guess you are there to 
take care of it.


When I was creating a trigger with command like:
create trigger asdf before update on beginninOfTheNameOfMyTable...
I hit tab and I got:
create trigger asdf before update on SET

That was obviously not what I expected to get.

Regards.



Re: [HACKERS] Vacuum rate limit in KBps

2012-01-25 Thread Greg Smith

On 01/23/2012 11:16 PM, Robert Treat wrote:

I keep thinking Greg has mistaken happiness with the MB based info in
the vacuum patch as being happy without the output format, when really
it is all about increased visibility.


I haven't taken that as anything but evidence I'm at least moving in the 
right direction.  I'm relatively systematic about how I approach these 
things nowadays:  figure out what isn't being logged/accounted for yet, 
add visibility to that thing, iterate on improving its behavior as 
measured by that, then make the best sorts of behavior changes 
automatic.  This suggested feature change, moving around what I see as 
the worst part of the tuning knob UI, is an initial attempt along step 3 
in that path.  There's still plenty of ongoing work adding more 
visibility too.


To more quickly summarize the point I was trying to make, providing a 
meter that trivially shows you good vs. bad is almost the same problem 
as making it fully automatic.  If you can measure exactly when something 
needs to happen and how badly screwed up it is, that's the hard part of 
knowing when to just go fix it.  Right now, the autovacuum measure is 
based on the percent of change to something.  I don't think any improved 
vacuum triggering will go somewhere useful unless it starts with a 
better measure of how real-world bloat accumulates, which you cannot 
extract from the data collected yet.  The free space measurement thread 
and the ideas that have mainly been bouncing between Jaime and Noah 
recently on this subject are directly addressing that, the part that 
I've found to be the single most useful additional thing to monitor.  It 
may not be obvious that's providing information that can be consumed by 
autovacuum, but that was my intention for how these pieces will 
ultimately fit together.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] show Heap Fetches in EXPLAIN for index-only scans

2012-01-25 Thread Robert Haas
On Sat, Jan 21, 2012 at 9:50 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 A review:

 [ review ]

Thanks.  Committed with hopefully-appropriate revisions.

 As a side-note, I noticed that I needed to run vacuum twice in a row
 to get the Heap Fetches to drop to zero.  I vaguely recall that only
 one vacuum was needed when ios first went in (and I had instrumented
 it to elog heap-fetches).  Does anyone know if this the expected
 consequence of one of the recent changes we made to vacuum?

No, that's not expected.  The change we made to vacuum was to skip
pages that are busy - but it shouldn't be randomly skipping pages for
no reason.  I can reproduce what you're observing, though:

[rhaas 16384]$ pg_filedump 16411 | grep TLI.*Flags | grep -v 'Flags: 0x0004'
 TLI: 0x0001  Prune XID: 0x  Flags: 0x0005 (HAS_FREE_LINES)
 TLI: 0x0001  Prune XID: 0x  Flags: 0x0005 (HAS_FREE_LINES)

After updating a row in the table and checkpointing, the page the row
was on is marked full and the page that gets the new version becomes
not-all-visible:

[rhaas 16384]$ pg_filedump 16411 | grep TLI.*Flags | grep -v 'Flags: 0x0004'
 TLI: 0x0001  Prune XID: 0x03fb  Flags: 0x0003 (HAS_FREE_LINES|PAGE_FULL)
 TLI: 0x0001  Prune XID: 0x  Flags: 0x0001 (HAS_FREE_LINES)

Now I vacuum the relation and checkpoint, and the page the *new*
row is on becomes all-visible:

[rhaas 16384]$ pg_filedump 16411 | grep TLI.*Flags | grep -v 'Flags: 0x0004'
 TLI: 0x0001  Prune XID: 0x  Flags: 0x0001 (HAS_FREE_LINES)
 TLI: 0x0001  Prune XID: 0x  Flags: 0x0005 (HAS_FREE_LINES)

Now I vacuum it again and checkpoint, and now the old page also
becomes all-visible:

[rhaas 16384]$ pg_filedump 16411 | grep TLI.*Flags | grep -v 'Flags: 0x0004'
 TLI: 0x0001  Prune XID: 0x  Flags: 0x0005 (HAS_FREE_LINES)
 TLI: 0x0001  Prune XID: 0x  Flags: 0x0005 (HAS_FREE_LINES)

But it seems to me that this is expected (if non-optimal) behavior.
Only the first pass of vacuum knows how to mark pages all-visible.
After the update, the first pass of the first vacuum sees a dead tuple
on the old page and truncates it to a dead line pointer.  When it
comes to the new page, it observes that the page is now all-visible
and marks it so.  It then does index vacuuming and returns to the
first page, marking the dead line pointer unused.  But during this
second visit to the old page, there's no possibility of marking the
page as all-visible, because the code doesn't know how to do that.
The next vacuum's first pass, however, can do so, because there are no
longer any dead tuples on the page.

We could fix this by modifying lazy_vacuum_page(): since we have to
dirty the buffer anyway, we could recheck whether all the remaining
tuples on the page are now all-visible, and if so set the visibility
map bit.  This is probably desirable even apart from index-only scans,
because it will frequently save the next vacuum the cost of reading,
dirtying, and writing extra pages.  There will be some incremental CPU
cost, but that seems likely to be more than repaid by the I/O savings.
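
A rough sketch of where that recheck would sit, at the end of
lazy_vacuum_page() once the dead line pointers have been removed.  The two
helpers are hypothetical placeholders (a real patch would rescan the
remaining tuples, e.g. with HeapTupleSatisfiesVacuum(), and call
visibilitymap_set() with the appropriate arguments); only the placement and
the page-header macros are meant literally:

	/* the buffer is already locked exclusively and has been dirtied above */
	if (!PageIsAllVisible(page) &&
		remaining_tuples_all_visible(onerel, buffer))	/* hypothetical */
	{
		PageSetAllVisible(page);
		MarkBufferDirty(buffer);
		set_visibility_map_bit(onerel, blkno, buffer);	/* hypothetical */
	}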

Thoughts?  Should we do this at all?  If so, should we squeeze it into
9.2 or leave it for 9.3?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Group commit, revised

2012-01-25 Thread Robert Haas
On Wed, Jan 25, 2012 at 3:11 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 I've been thinking, what exactly is the important part of this group commit
 patch that gives the benefit? Keeping the queue sorted isn't all that
 important - XLogFlush() requests for commits will come in almost the correct
 order anyway.

 I also don't much like the division of labour between groupcommit.c and
 xlog.c. XLogFlush() calls GroupCommitWaitForLSN(), which calls back
 DoXLogFlush(), which is a bit hard to follow. (that's in my latest version;
 the original patch had similar division but it also cut across processes,
 with the wal writer actually doing the flush)

 It occurs to me that the WALWriteLock already provides much of the same
 machinery we're trying to build here. It provides FIFO-style queuing, with
 the capability to wake up the next process or processes in the queue.
 Perhaps all we need is some new primitive to LWLock, to make it do exactly
 what we need.

 Attached is a patch to do that. It adds a new mode to
 LWLockConditionalAcquire(), LW_EXCLUSIVE_BUT_WAIT. If the lock is free, it
 is acquired and the function returns immediately. However, unlike normal
 LWLockConditionalAcquire(), if the lock is not free it goes to sleep until
 it is released. But unlike normal LWLockAcquire(), when the lock is
 released, the function returns *without* acquiring the lock.

 I modified XLogFlush() to use that new function for WALWriteLock. It tries
 to get WALWriteLock, but if it's not immediately free, it waits until it is
 released. Then, before trying to acquire the lock again, it rechecks
 LogwrtResult to see if the record was already flushed by the process that
 released the lock.

 This is a much smaller patch than the group commit patch, and in my
 pgbench-tools tests on my laptop, it has the same effect on performance. The
 downside is that it touches lwlock.c, which is a change at a lower level.
 Robert's flexlocks patch probably would've been useful here.

I think you should break this off into a new function,
LWLockWaitUntilFree(), rather than treating it as a new LWLockMode.
Also, instead of adding lwWaitOnly, I would suggest that we generalize
lwWaiting and lwExclusive into a field lwWaitRequest, which can be set
to 1 for exclusive, 2 for shared, 3 for wait-for-free, etc.  That way
we don't have to add another boolean every time someone invents a new
type of wait - not that that should hopefully happen very often.  A
side benefit of this is that you can simplify the test in
LWLockRelease(): keep releasing waiters until you come to an exclusive
waiter, then stop.  This possibly saves one shared memory fetch and
subsequent test instruction, which might not be trivial - all of this
code is extremely hot.  We probably want to benchmark this carefully
to make sure that it doesn't cause a performance regression even just
from this:

+   if (!proc->lwWaitOnly)
+       lock->releaseOK = false;

I know it sounds crazy, but I'm not 100% sure that that additional
test there is cheap enough not to matter.  Assuming it is, though, I
kind of like the concept: I think there must be other places where
these semantics are useful.
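A sketch of what that generalization could look like, purely for illustration (lwWaitRequest is the proposed new field, WakeupWaiter() is a hypothetical stand-in for the existing wake-up bookkeeping; lwWaitLink is the real PGPROC queue link):

/* Sketch only: proposed replacement for the lwWaiting/lwExclusive booleans. */
typedef enum LWWaitRequest
{
    LW_WAIT_NONE = 0,           /* not waiting */
    LW_WAIT_EXCLUSIVE = 1,
    LW_WAIT_SHARED = 2,
    LW_WAIT_UNTIL_FREE = 3      /* wake on release, but don't acquire */
} LWWaitRequest;

static void WakeupWaiter(PGPROC *proc);     /* hypothetical */

static void
release_waiters_sketch(PGPROC *head)
{
    PGPROC     *proc = head;

    /* Release shared and wait-until-free waiters until the first exclusive
     * waiter is reached, then stop; handling of that exclusive waiter is
     * left to the existing releaseOK logic. */
    while (proc != NULL && proc->lwWaitRequest != LW_WAIT_EXCLUSIVE)
    {
        PGPROC     *next = proc->lwWaitLink;

        WakeupWaiter(proc);
        proc = next;
    }
}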

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Should I implement DROP INDEX CONCURRENTLY?

2012-01-25 Thread Robert Haas
On Sat, Jan 21, 2012 at 10:11 AM, Simon Riggs si...@2ndquadrant.com wrote:
 Your caution is wise. All users of an index have already checked
 whether the index is usable at plan time, so although there is much
 code that assumes they can look at the index itself, that is not
 executed until after the correct checks.

 I'll look at VACUUM and other utilities, so thanks for that.

 I don't see much point in having the higher level lock, except perhaps
 as a test this code works.

I thought of another way this can cause a problem: suppose that while
we're dropping the relation with only ShareUpdateExclusiveLock, we get
as far as calling DropRelFileNodeBuffers.  Just after we finish, some
other process that holds AccessShareLock or RowShareLock or
RowExclusiveLock reads and perhaps even dirties a page in the
relation.  Now we've got pages in the buffer pool that might even be
dirty, but the backing file is truncated or perhaps removed
altogether.  If the page is dirty, I think the background writer will
eventually choke trying to write out the page.  If the page is not
dirty, I'm less certain whether anything is going to explode
violently, but it seems mighty sketchy to have pages in the buffer
pool with buffer tags that don't exist any more.  As the comment
says:

 *  It is the responsibility of higher-level code to ensure that the
 *  deletion or truncation does not lose any data that could be needed
 *  later.  It is also the responsibility of higher-level code to ensure
 *  that no other process could be trying to load more pages of the
 *  relation into buffers.

...and of course, the intended mechanism is an AccessExclusiveLock.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Measuring relation free space

2012-01-25 Thread Noah Misch
On Tue, Jan 24, 2012 at 11:24:08AM -0500, Jaime Casanova wrote:
 On Mon, Jan 23, 2012 at 7:18 PM, Noah Misch n...@leadboat.com wrote:
  If someone feels like
  doing it, +1 for making pgstattuple() count non-leaf free space.
 
 actually I agreed that non-leaf pages are irrelevant... I just
 confirmed that in a production system with 300GB, none of the indexes
 in an 84M-row table, nor in a heavily updated one, has more than 1 root
 page; all the rest are deleted, half_dead or leaf. So the possibility
 of bloat coming from non-leaf pages seems very odd

FWIW, the number to look at is internal_pages from pgstatindex():

[local] test=# create table t4 (c) as select * from generate_series(1,1000000);
SELECT 1000000
[local] test=# alter table t4 add primary key(c);
NOTICE:  ALTER TABLE / ADD PRIMARY KEY will create implicit index "t4_pkey" for table "t4"
ALTER TABLE
[local] test=# select * from pgstatindex('t4_pkey');
-[ RECORD 1 ]------+---------
version            | 2
tree_level         | 2
index_size         | 22478848
root_block_no      | 290
internal_pages     | 10
leaf_pages         | 2733
empty_pages        | 0
deleted_pages      | 0
avg_leaf_density   | 90.06
leaf_fragmentation | 0

So, 0.4% of this index.  They appear in proportion to the logarithm of the
total index size.  I agree that bloat centered on them is unlikely.  Counting
them would be justified, but that is a question of formal accuracy rather than
practical importance.

 but the possibility of bloat coming from the meta page doesn't exist,
 AFAIUI at least
 
 we need the most accurate value of usable free space, because the
 idea is to add a sampler mode to the function so we don't scan the
 whole relation. That's why we still need the function.

I doubt we'd add this function solely on the basis that a future improvement
will make it useful.  For the patch to go in now, it needs to be useful now.
(This is not a universal principle, but it mostly holds for low-complexity
patches like this one.)

All my comments below would also apply to such a broader patch.

 btw... pgstattuple also has the problem that it's not using a ring buffer
 
 
 attached are two patches:
 - v5: the same original patch, but it only tracks space in leaf, deleted
 and half_dead pages
 - v5.1: adds the same for all kinds of indexes (the problem is that this is
 inconsistent with the fact that pageinspect only manages btree indexes
 for everything else)

Let's take a step back.  Again, what you're proposing is essentially a faster
implementation of SELECT free_percent FROM pgstattuple(rel).  If this code
belongs in core at all, it belongs in the pgstattuple module.  Share as much
infrastructure as is reasonable between the user-visible functions of that
module.  For example, I'm suspecting that the pgstat_index() call tree should
be shared, with pgstat_index_page() checking a flag to decide whether to
gather per-tuple stats.
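As a rough illustration of that sharing (the struct and function names below are invented for the example, not taken from the patch):

/* Sketch only: one page scanner serving both user-visible functions. */
typedef struct IndexPageStats
{
    uint64      free_space;
    uint64      tuple_count;
} IndexPageStats;

static void
pgstat_index_page_sketch(Page page, bool with_tuple_stats, IndexPageStats *stats)
{
    stats->free_space += PageGetFreeSpace(page);

    if (with_tuple_stats)
    {
        OffsetNumber offnum;
        OffsetNumber maxoff = PageGetMaxOffsetNumber(page);

        for (offnum = FirstOffsetNumber; offnum <= maxoff;
             offnum = OffsetNumberNext(offnum))
        {
            if (ItemIdIsNormal(PageGetItemId(page, offnum)))
                stats->tuple_count++;   /* pgstattuple()-style per-tuple work */
        }
    }
    /* a free-space-only caller would pass with_tuple_stats = false */
}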

Next, compare the bits of code that differ between pgstattuple() and
relation_free_space(), convincing yourself that the differences are justified.
Each difference will yield one of the following conclusions:

1. Your code contains an innovation that would apply to both functions.  Where
not too difficult, merge these improvements into pgstattuple().  In order for
a demonstration of your new code's better performance to be interesting, we
must fix the same low-hanging fruit in its competitor.  One example is the use
of the bulk read strategy.  Another is the support for SP-GiST.

2. Your code is missing an essential behavior of pgstattuple().  Add it to
your code.  One example is the presence of CHECK_FOR_INTERRUPTS() calls.

3. Your code behaves differently from pgstattuple() due to a fundamental
difference in their tasks.  These are the only functional differences that
ought to remain in your finished patch; please point them out in comments.
For example, pgstat_heap() visits every tuple in the heap.  You'll have no
reason to do that; pgstattuple() only needs it to calculate statistics other
than free_percent.

In particular, I call your attention to the fact that pgstattuple() takes
shared buffer content locks before examining pages.  Your proposed patch does
not do so.  I do not know with certainty whether that falls under #1 or #2.
The broad convention is to take such locks, because we elsewhere want an exact
answer.  These functions are already inexact; they make no effort to observe a
consistent snapshot of the table.  If you convince yourself that the error
arising from not locking buffers is reasonably bounded, we can lose the locks
(in both functions -- conclusion #1).  Comments would then be in order.
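For the heap case, the combination of those points (bulk-read strategy, CHECK_FOR_INTERRUPTS(), and a shared content lock per page) might look roughly like this; a sketch only, not the submitted patch:

/* Sketch only: free-space scan of a heap with the conventions discussed. */
static uint64
heap_free_space_sketch(Relation rel)
{
    BlockNumber nblocks = RelationGetNumberOfBlocks(rel);
    BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
    uint64      free_space = 0;
    BlockNumber blkno;

    for (blkno = 0; blkno < nblocks; blkno++)
    {
        Buffer      buf;

        CHECK_FOR_INTERRUPTS();

        buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno,
                                 RBM_NORMAL, bstrategy);
        LockBuffer(buf, BUFFER_LOCK_SHARE);     /* the content lock at issue */
        free_space += PageGetHeapFreeSpace(BufferGetPage(buf));
        UnlockReleaseBuffer(buf);
    }

    return free_space;
}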

With all that done, run some quick benchmarks: see how SELECT free_percent
FROM pgstattuple(rel) fares compared to SELECT relation_free_space(rel) for
a large heap and for a large B-tree index.  If the timing difference is too
small to be interesting to you, remove relation_free_space() and submit your
pgstattuple() improvements alone.  Otherwise, 

Re: [HACKERS] Second thoughts on CheckIndexCompatible() vs. operator families

2012-01-25 Thread Noah Misch
On Wed, Jan 25, 2012 at 03:32:49PM -0500, Robert Haas wrote:
 On Sun, Jan 22, 2012 at 12:23 AM, Noah Misch n...@leadboat.com wrote:
  New version that repairs a defective test case.
 
 Committed.  I don't find this to be particularly good style:

Thanks.

+   for (i = 0; i < old_natts && ret; i++)
+       ret = (!IsPolymorphicType(get_opclass_input_type(classObjectId[i])) &&
+              irel->rd_att->attrs[i]->atttypid == typeObjectId[i]);
 
 ...but I am not sure whether we have any formal policy against it, so
 I just committed it as-is for now.  I would have surrounded the loop
 with an if (ret) block and written the body of the loop as if
 (condition) { ret = false; break; }.

I value the savings in vertical space more than the lost idiomaticness.  This
decision is 90+% subjective, so I cannot blame you for concluding otherwise.
I do know the feeling of looking at PostgreSQL source code and wishing the
author had not attempted to conserve every line.
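For comparison, the same check spelled out in the style Robert describes (functionally equivalent to the committed loop):

if (ret)
{
    for (i = 0; i < old_natts; i++)
    {
        if (IsPolymorphicType(get_opclass_input_type(classObjectId[i])) ||
            irel->rd_att->attrs[i]->atttypid != typeObjectId[i])
        {
            ret = false;
            break;
        }
    }
}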

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Avoid FK validations for no-rewrite ALTER TABLE ALTER TYPE

2012-01-25 Thread Noah Misch
On Wed, Jan 25, 2012 at 11:17:27AM -0300, Alvaro Herrera wrote:
 
 Excerpts from Noah Misch's message of dom ene 22 02:05:31 -0300 2012:
 
  Thanks.  I've attached a new version fixing this problem.
 
 Looks good to me.  Can you please clarify this bit?
 
   * Since we elsewhere require that all collations share the same
   * notion of equality, don't compare collation here.
 
 Since I'm not familiar with this code, it's difficult for me to figure
 out where this "elsewhere" is, and what "same notion of equality"
 means.

Good questions.  See the comment in front of ri_GenerateQualCollation(), the
comment in process_ordered_aggregate_single() at nodeAgg.c:605, the larger
comment in add_sort_column(), and the two XXX comments in
relation_has_unique_index_for().  We should probably document that assumption
in xindex.sgml to keep type implementors apprised.

"Same notion of equality" just means that "a COLLATE x = b COLLATE y" has the
same value for every x, y.

In any event, the patch needed a rebase, so I've attached it rebased and with
that comment edited to reference ri_GenerateQualCollation(), that being the
most-relevant source for the assumption in question.

Thanks for reviewing,
nm
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index cb8ac67..ba20950 100644
*** a/src/backend/commands/tablecmds.c
--- b/src/backend/commands/tablecmds.c
*************** static Oid transformFkeyCheckAttrs(Relation pkrel,
*** 276,281 ****
--- 276,282 ----
  					int numattrs, int16 *attnums,
  					Oid *opclasses);
  static void checkFkeyPermissions(Relation rel, int16 *attnums, int natts);
+ static CoercionPathType findFkeyCast(Oid targetTypeId, Oid sourceTypeId, Oid *funcid);
  static void validateCheckConstraint(Relation rel, HeapTuple constrtup);
  static void validateForeignKeyConstraint(char *conname,
  							 Relation rel, Relation pkrel,
*************** static void ATPostAlterTypeCleanup(List **wqueue, AlteredTableInfo *tab, LOCKMOD
*** 357,362 ****
--- 358,364 ----
  static void ATPostAlterTypeParse(Oid oldId, char *cmd,
  					 List **wqueue, LOCKMODE lockmode, bool rewrite);
  static void TryReuseIndex(Oid oldId, IndexStmt *stmt);
+ static void TryReuseForeignKey(Oid oldId, Constraint *con);
  static void change_owner_fix_column_acls(Oid relationOid,
  							 Oid oldOwnerId, Oid newOwnerId);
  static void change_owner_recurse_to_sequences(Oid relationOid,
*************** ATAddForeignKeyConstraint(AlteredTableInfo *tab, Relation rel,
*** 5591,5596 ****
--- 5593,5600 ----
  				numpks;
  	Oid			indexOid;
  	Oid			constrOid;
+ 	bool		old_check_ok;
+ 	ListCell   *old_pfeqop_item = list_head(fkconstraint->old_conpfeqop);
  
  	/*
  	 * Grab an exclusive lock on the pk table, so that someone doesn't delete
*************** ATAddForeignKeyConstraint(AlteredTableInfo *tab, Relation rel,
*** 5707,5712 ****
--- 5711,5723 ----
  				(errcode(ERRCODE_INVALID_FOREIGN_KEY),
  				 errmsg("number of referencing and referenced columns for foreign key disagree")));
  
+ 	/*
+ 	 * On the strength of a previous constraint, we might avoid scanning
+ 	 * tables to validate this one.  See below.
+ 	 */
+ 	old_check_ok = (fkconstraint->old_conpfeqop != NIL);
+ 	Assert(!old_check_ok || numfks == list_length(fkconstraint->old_conpfeqop));
+ 
  	for (i = 0; i < numpks; i++)
  	{
  		Oid			pktype = pktypoid[i];
*************** ATAddForeignKeyConstraint(AlteredTableInfo *tab, Relation rel,
*** 5721,5726 ****
--- 5732,5738 ----
  		Oid			ppeqop;
  		Oid			ffeqop;
  		int16		eqstrategy;
+ 		Oid			pfeqop_right;
  
  		/* We need several fields out of the pg_opclass entry */
  		cla_ht = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclasses[i]));
*************** ATAddForeignKeyConstraint(AlteredTableInfo *tab, Relation rel,
*** 5763,5772 ****
  		pfeqop = get_opfamily_member(opfamily, opcintype, fktyped,
  									 eqstrategy);
  		if (OidIsValid(pfeqop))
  			ffeqop = get_opfamily_member(opfamily, fktyped, fktyped,
  										 eqstrategy);
  		else
! 			ffeqop = InvalidOid;	/* keep compiler quiet */
  
  		if (!(OidIsValid(pfeqop) && OidIsValid(ffeqop)))
  		{
--- 5775,5791 ----
 

[HACKERS] cursors FOR UPDATE don't return most recent row

2012-01-25 Thread Alvaro Herrera

This is my test case (all in one session):

CREATE TABLE foo (
  key int PRIMARY KEY,
  value   int
);

INSERT INTO foo VALUES (1, 1);

 BEGIN;
 DECLARE foo CURSOR FOR SELECT * FROM foo FOR UPDATE;
 UPDATE foo SET value = 2 WHERE key = 1;
 UPDATE foo SET value = 3 WHERE key = 1;
 FETCH 1 FROM foo;
 COMMIT;


I expected the FETCH to return one row, with the latest data, i.e.
(1, 3), but instead it's returning empty.

If instead I run both UPDATEs in another session, then I do get

alvherre=#  FETCH 1 FROM foo;
 key | value 
-----+-------
   1 |     3
(1 row)


Is this intended?

-- 
Álvaro Herrera alvhe...@alvh.no-ip.org

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL Restore process during recovery

2012-01-25 Thread Fujii Masao
On Tue, Jan 24, 2012 at 6:43 PM, Fujii Masao masao.fu...@gmail.com wrote:
 Cleaned up the points noted, new patch attached in case you wish to
 review further.

 Still has bug, so still with me to fix.

 Thanks! Will review further.

v3 patch contains lots of unrelated code changes like the following.

-   <structfield>pid</structfield> column of the
+   <structfield>procpid</structfield> column of the

You seem to have failed to extract the patch from your repository correctly.
So I used the v2 patch for the review. Sorry if I comment on things which
you've already fixed in the v3 patch.

Here are the comments. They are mostly not serious problems.

+/*
+ * GUC parameters
+ */
+int			WalRestoreDelay = 1;

Did you forget to change guc.c to define wal_restore_delay as a GUC parameter?
Or is just the source code comment incorrect?
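For reference, the guc.c entry that the comment seems to promise would look roughly like the following, modeled on the existing wal_writer_delay entry; the description, unit, and limits here are assumptions, not taken from the patch:

	/* Sketch only: hypothetical ConfigureNamesInt[] entry for the new GUC. */
	{
		{"wal_restore_delay", PGC_SIGHUP, WAL_SETTINGS,
			gettext_noop("WAL restore sleep time between restore attempts."),
			NULL,
			GUC_UNIT_MS
		},
		&WalRestoreDelay,
		1, 1, 10000,
		NULL, NULL, NULL
	},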


+   elog(FATAL, "recovery_restore_command is too long");

Typo: s/recovery_restore_command/restore_command


+   InitLatch(&walrstr->WALRestoreLatch); /* initialize latch used in main loop */

That latch is a shared one. Shouldn't OwnLatch() be called instead of InitLatch()?
If yes, it's better to call DisownLatch() when walrestore exits. Though skipping
DisownLatch() would be harmless, because the latch is never owned by a new
process after walrestore exits.
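A sketch of the ownership pattern being suggested (WalRstr, WalRestoreData and WalRestoreDelay come from the patch; the loop body, the shutdown flag, and the assumption that the delay is in milliseconds are illustrative only):

/* Sketch only: own the shared latch in the walrestore process. */
static volatile sig_atomic_t shutdown_requested = false;    /* illustrative */

static void
WalRestoreMainSketch(void)
{
    /* use volatile pointer to prevent code rearrangement */
    volatile WalRestoreData *walrstr = WalRstr;

    OwnLatch(&walrstr->WALRestoreLatch);        /* instead of InitLatch() */

    while (!shutdown_requested)
    {
        /* ... restore the next WAL segment when asked to ... */

        (void) WaitLatch(&walrstr->WALRestoreLatch,
                         WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                         WalRestoreDelay);
        ResetLatch(&walrstr->WALRestoreLatch);
    }

    DisownLatch(&walrstr->WALRestoreLatch);     /* tidy, even if not strictly needed */
}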


+   {
+   /* use volatile pointer to prevent code rearrangement */
+   volatile WalRestoreData *walrstr = WalRstr;
+
+   nextFileTli = walrstr->nextFileTli;

The declaration of walrstr is not required here because it's already done
at the head of WalRestoreNextFile().


+   if (restoredFromArchive)
+   {
+   /* use volatile pointer to prevent code rearrangement */
+   volatile WalRestoreData *walrstr = WalRstr;

Same as above.


+#define TMPRECOVERYXLOG		"RECOVERYXLOG"

ISTM that it's better to move this definition to an include file and we should
use it in all the places where the fixed value RECOVERYXLOG is still used.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Second thoughts on CheckIndexCompatible() vs. operator families

2012-01-25 Thread Tom Lane
Alvaro Herrera alvhe...@commandprompt.com writes:
 Excerpts from Robert Haas's message of mié ene 25 17:32:49 -0300 2012:
 On Sun, Jan 22, 2012 at 12:23 AM, Noah Misch n...@leadboat.com wrote:
 New version that repairs a defective test case.
 
 Committed.  I don't find this to be particularly good style:
 
 +   for (i = 0; i < old_natts && ret; i++)
 +       ret = (!IsPolymorphicType(get_opclass_input_type(classObjectId[i])) &&
 +              irel->rd_att->attrs[i]->atttypid == typeObjectId[i]);
 
 ...but I am not sure whether we have any formal policy against it, so
 I just committed it as-is for now.  I would have surrounded the loop
 with an if (ret) block and written the body of the loop as if
 (condition) { ret = false; break; }.

 I find that code way too clever.

Not only is that code spectacularly unreadable, but has nobody noticed
that this commit broke the buildfarm?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Avoiding shutdown checkpoint at failover

2012-01-25 Thread Fujii Masao
On Fri, Jan 20, 2012 at 12:33 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Wed, Jan 18, 2012 at 7:15 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Sun, Nov 13, 2011 at 5:13 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Tue, Nov 1, 2011 at 12:11 PM, Simon Riggs si...@2ndquadrant.com wrote:

 When I say skip the shutdown checkpoint, I mean remove it from the
 critical path of required actions at the end of recovery. We can still
 have a normal checkpoint kicked off at that time, but that no longer
 needs to be on the critical path.

 Any problems foreseen? If not, looks like a quick patch.

 Patch attached for discussion/review.

 This feature is what I want, and it is very helpful for shortening the
 failover time in streaming replication.

 Here are the review comments, though I've not yet checked thoroughly whether
 this feature works fine in all recovery patterns.

 LocalSetXLogInsertAllowed() must be called before LogEndOfRecovery().
 LocalXLogInsertAllowed must be set to -1 after LogEndOfRecovery().

 XLOG_END_OF_RECOVERY record is written to the WAL file with new
 assigned timeline ID. But it must be written to the WAL file with old one.
 Otherwise, when re-entering a recovery after failover, we cannot find
 XLOG_END_OF_RECOVERY record at all.

 Before XLOG_END_OF_RECOVERY record is written,
 RmgrTable[rmid].rm_cleanup() might write WAL records. They also
 should be written to the WAL file with old timeline ID.

 When a recovery target is specified, we cannot write new WAL to the file
 with the old timeline, because that would mean valid WAL records in it are
 overwritten with new WAL. So when a recovery target is specified,
 ISTM that we cannot skip the end-of-recovery checkpoint. Or we might need
 to save all information about timelines in the database cluster instead
 of writing the XLOG_END_OF_RECOVERY record, and use it when re-entering
 a recovery.

 LogEndOfRecovery() seems to need to call XLogFlush(). Otherwise,
 what if the server crashes after the new timeline history file is created and
 recovery.conf is removed, but before the XLOG_END_OF_RECOVERY record
 has been flushed to disk?

 During recovery, when we replay XLOG_END_OF_RECOVERY record, we
 should close the currently-opened WAL file and read the WAL file with
 the timeline which XLOG_END_OF_RECOVERY record indicates.
 Otherwise, when re-entering a recovery with old timeline, we cannot
 reach new timeline.



 OK, some bad things there, thanks for the insightful comments.



 I think you're right that we can't skip the checkpoint if xlog_cleanup
 writes WAL records, since that implies at least one and maybe more
 blocks have changed and need to be flushed. That can be improved upon,
 but not now in 9.2. Cleanup WAL is written in either the old or the new
 timeline, depending upon whether we increment it. So we don't need to
 change anything there, IMHO.

 The big problem is how we handle crash recovery after we startup
 without a checkpoint. No quick fixes there.

 So let me rethink this: The idea was that we can skip the checkpoint
 if we promote to normal running during streaming replication.

 WALReceiver has been writing to WAL files, so can write more data
 without all of the problems noted. Rather than write the
 XLOG_END_OF_RECOVERY record via XLogInsert we should write that **from
 the WALreceiver** as a dummy record by direct injection into the WAL
 stream. So the Startup process sees a WAL record that looks like it
 was written by the primary saying promote yourself, although it was
 actually written locally by WALreceiver when requested to shutdown.
 That doesn't damage anything because we know we've received all the
 WAL there is. Most importantly we don't need to change any of the
 logic in a way that endangers the other code paths at end of recovery.

 Writing the record in that way means we would need to calculate the
 new tli slightly earlier, so we can input the correct value into the
 record. That also solves the problem of how to get additional standbys
 to follow the new master. The XLOG_END_OF_RECOVERY record is simply
 the contents of the newly written tli history file.

 If we skip the checkpoint and then crash before the next checkpoint we
 just change timeline when we see XLOG_END_OF_RECOVERY. When we replay
 the XLOG_END_OF_RECOVERY we copy the contents to the appropriate tli
 file and then switch to it.

 So this solves 2 problems: having other standbys follow us when they
 don't have archiving, and avoids the checkpoint.

 Let me know what you think.

Looks good to me.

One thing I would like to ask is why you think walreceiver is more
appropriate for writing the XLOG_END_OF_RECOVERY record than the startup
process. I was thinking the opposite, because if we do so, we might be
able to skip the end-of-recovery checkpoint even in the file-based
log-shipping case.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Online base backup from the hot-standby

2012-01-25 Thread Fujii Masao
On Thu, Jan 26, 2012 at 3:07 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Wed, Jan 25, 2012 at 8:49 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Wed, Jan 25, 2012 at 8:16 AM, Fujii Masao masao.fu...@gmail.com wrote:

 What happens if we shutdown the WALwriter and then issue SIGHUP?

 SIGHUP doesn't affect full_page_writes in that case. Oh, you are concerned
 about the case where smart shutdown kills walwriter but some backends are
 still running? Currently SIGHUP affects full_page_writes and running
 backends use the changed new value of full_page_writes. But in the patch,
 SIGHUP doesn't affect...

 To address the problem, we should either postpone the shutdown of walwriter
 until all backends have gone away, or leave the update of full_page_writes 
 to
 checkpointer process instead of walwriter. Thought?

 checkpointer seems the correct place to me


 Done.

Thanks a lot!!

I proposed another small patch upthread which fixes an issue with an error
message of pg_basebackup. If it's reasonable, could you commit it?
http://archives.postgresql.org/message-id/CAHGQGwENjSDN=f_vdpwvq53qru0cu9+wzkbvwnaexmawj-y...@mail.gmail.com

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Add new replication mode synchronous_commit = 'write'.

2012-01-25 Thread Fujii Masao
On Wed, Jan 25, 2012 at 11:12 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Wed, Jan 25, 2012 at 1:57 PM, Robert Haas robertmh...@gmail.com wrote:

 I think that this would be a lot more clear if we described this as
 synchronous_commit = remote_write rather than just synchronous_commit
 = write.  Actually, the internal constant is named that way already,
 but it's not exposed as such to the user.

 That's something to discuss at the end of the CF when people are less
 busy and we get more input.

 It's an easy change whatever we decide to do.

Added this to 9.2 Open Items.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Pause at end of recovery

2012-01-25 Thread Fujii Masao
On Wed, Dec 28, 2011 at 7:27 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Thu, Dec 22, 2011 at 6:16 AM, Simon Riggs si...@2ndquadrant.com wrote:

 I can see a reason to do this now. I've written patch and will commit
 on Friday. Nudge me if I don't.

 It's hard to write this so it works in all cases and doesn't work in
 the right cases also.

 Basically, we can't get in the way of crash recovery, so the only way
 we can currently tell a crash recovery from an archive recovery is the
 presence of restore_command.

 If you don't have that and you haven't set a recovery target, it won't
 pause and there's nothing I can do, AFAICS.

 Please test this and review before commit.

What if a wrong recovery target is specified and an archive recovery reaches
the end of the WAL files unexpectedly? Even in this case, do we want to pause
recovery at the end? Otherwise, we lose the chance to correct the recovery
target and retry the archive recovery.

One idea: would starting archive recovery with standby_mode=on meet your needs?
When archive recovery reaches the end of the WAL files, regardless of whether a
recovery target is specified or not, recovery pauses at the end. If hot_standby
is enabled, you can check the contents, and if they look OK you can finish
recovery with pg_ctl promote.
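In recovery.conf terms, that workaround would look something like this (the restore_command path is illustrative only):

# recovery.conf -- illustrative values only
standby_mode = 'on'
restore_command = 'cp /path/to/archive/%f "%p"'
# With hot_standby = on in postgresql.conf, you can inspect the data while
# replay waits at the end of the available WAL, and then run
# "pg_ctl promote" once you are satisfied.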

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements

2012-01-25 Thread Mikko Tiihonen

On 01/25/2012 06:40 PM, Tom Lane wrote:

Marko Kreen mark...@gmail.com  writes:

On Wed, Jan 25, 2012 at 10:23:14AM -0500, Tom Lane wrote:

Huh?  How can that work?  If we decide to change the representation of
some other well known type, say numeric, how do we decide whether a
client setting that bit is expecting that change or not?



It sets that bit *and* version code - which means that it is
up-to-date with all well-known type formats in that version.


Then why bother with the bit in the format code?  If you've already done
some other negotiation to establish what datatype formats you will
accept, this doesn't seem to be adding any value.


Basically, I see 2 scenarios here:



1) Client knows the result types and can set the
text/bin/version code safely, without further restrictions.



2) There is generic framework, that does not know query contents
but can be expected to track Postgres versions closely.
Such framework cannot say binary for results safely,
but *could* do it for some well-defined subset of types.


The hole in approach (2) is that it supposes that the client side knows
the specific datatypes in a query result in advance.  While this is
sometimes workable for application-level code that knows what query it's
issuing, it's really entirely untenable for a framework or library.
The only way that a framework can deal with arbitrary queries is to
introduce an extra round trip (Describe step) to see what datatypes
the query will produce so it can decide what format codes to issue
... and that will pretty much eat up any time savings you might get
from a more efficient representation.


This is pretty much what the jdbc driver already does, since it does not have
100% coverage of even the current binary formats. The first time you execute a
query it requests text encoding, but caches the Describe results. The next
time, it sets the binary bits on all returned columns that it knows how to
decode.
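The same pattern, expressed with libpq (where the choice is necessarily all-or-nothing per query rather than per column), would look roughly like this; error handling is omitted and the type whitelist is deliberately tiny:

#include <stdio.h>
#include <libpq-fe.h>

/* Decide whether every result column is in a small, illustrative set of
 * types we know how to decode in binary form. */
static int
binary_ok(PGconn *conn, const char *stmt)
{
	PGresult   *desc = PQdescribePrepared(conn, stmt);
	int			ok = 1;
	int			i;

	for (i = 0; i < PQnfields(desc); i++)
	{
		Oid			t = PQftype(desc, i);

		if (t != 20 && t != 23 && t != 701)		/* int8, int4, float8 */
			ok = 0;
	}
	PQclear(desc);
	return ok;
}

int
main(void)
{
	PGconn	   *conn = PQconnectdb("");
	PGresult   *res;

	PQclear(PQprepare(conn, "q", "SELECT relpages FROM pg_class", 0, NULL));

	/* The extra Describe round trip happens here, once per statement. */
	res = PQexecPrepared(conn, "q", 0, NULL, NULL, NULL,
						 binary_ok(conn, "q") ? 1 : 0);
	printf("%d rows\n", PQntuples(res));

	PQclear(res);
	PQfinish(conn);
	return 0;
}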


You really want to do the negotiation once, at connection setup, and
then be able to process queries without client-side prechecking of what
data types will be sent back.


I think my original minor_version patch tried to do that. It introduced a
per-connection setting for the version. The server GUC_REPORTed the maximum
supported minor_version but defaulted to the baseline wire format.
The jdbc client could bump the minor_version to a higher supported value
(with an error if the value is larger than what the server advertised).

A way was provided for the application using the jdbc driver to
override the requested minor_version in the rare event that something
broke (rare, because the jdbc driver generally does not expose the
wire encoding to applications).

Now if pgbouncer and other pooling solutions reset the minor_version
to 0, then it should work.

Scenarios where the other end is too old to know about the minor_version:
Vserver>Vlibpq  = client does nothing -> use baseline version
Vlibpq>Vserver  = no supported_minor_version in GUC_REPORT -> use baseline

Normal 9.2+ scenarios:
Vserver>Vlibpq  = libpq sets minor_version to the largest that it supports
   -> libpq requested version used
Vlibpq>Vserver  = libpq notices that the server supported value is lower than
   its own, so it sets minor_version to the server supported value
   -> server version used
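From the client side, that negotiation at connection setup could be as simple as the following sketch (supported_minor_version and minor_version are the names used in this proposal, not existing server settings; error handling omitted):

#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

int
main(void)
{
	PGconn	   *conn = PQconnectdb("");
	const char *server_max;

	/* Present only if the server is new enough to GUC_REPORT it. */
	server_max = PQparameterStatus(conn, "supported_minor_version");
	if (server_max != NULL)
	{
		int			use = atoi(server_max);
		char		sql[64];

		/* A real driver would cap this at its own highest supported value. */
		snprintf(sql, sizeof(sql), "SET minor_version = %d", use);
		PQclear(PQexec(conn, sql));
	}

	/* ... run queries; with no report, the baseline format stays in effect. */
	PQfinish(conn);
	return 0;
}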

For the perl driver that exposes the wire format to applications by default,
I can envision that the driver needs to add a new API that applications
need to use to explicitly bump the minor_version up, instead of defaulting
to the largest supported by the driver as in jdbc/libpq.

The reason why I proposed an incrementing minor_version instead of bit flags
of new encodings was that it takes less space and is easier to document and
understand, so that exposing it to applications is possible.

But how to handle postgres extensions that change their wire-format?
Maybe we do need to have oid:minor_version,oid:ver,oid_ver as the
negotiated version variable?

-Mikko

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers