Re: [HACKERS] pgsql_fdw, FDW for PostgreSQL server

2012-02-06 Thread Shigeru Hanada
Thanks for the comments!

(2012/02/06 5:08), Kohei KaiGai wrote:
 Yes. In my opinion, one significant benefit of pgsql_fdw is to execute
 qualifiers on the distributed nodes; that enables us to utilize multiple
 CPU resources efficiently.
 Duplicate checks are a reliable way to keep invisible tuples filtered
 out, indeed. But it would degrade one competitive characteristic of
 pgsql_fdw.
 
 https://github.com/kaigai/pg_strom/blob/master/plan.c#L693
 In my module, qualifiers being executable on the device side are detached
 from baserel-&gt;baserestrictinfo, and the remaining qualifiers are chained
 to the list.
 The is_device_executable_qual() is equivalent to is_foreign_expr() in
 the pgsql_fdw module.

Agreed, I too think that pushed-down qualifiers should not be evaluated
on the local side again, from the viewpoint of performance.

 Of course, it is your decision, and I might miss something.
 
 BTW, what is the undesirable behavior on your previous implementation?

In early development, maybe while testing PREPARE/EXECUTE or DECLARE, I
saw that iterated execution of a foreign scan produced wrong results which
included rows that did NOT match the pushed-down qualifiers.  And, at last,
I could recall what happened at that time.  It was just a trivial bug I
made.  Perhaps I had removed the pushed-down qualifiers in the Path
generation phase, so the generated plan node lost the qualifiers permanently.

In short, I'll remove remote qualifiers from baserestrictinfo, like
pg_strom.

 [Design comment]
 I'm not sure why store_result() uses MessageContext to save
 the Tuplestorestate within PgsqlFdwExecutionState.
 The source code comment says it is used to avoid memory leaks in error
 cases. I also had a similar experience implementing my own FDW module,
 so I understand: the per-scan context is already cleared by the time the
 resource-release callback runs, thus handles to external resources have
 to be saved in a separate memory context.
 In my personal opinion, the right design is to declare a memory context
 for pgsql_fdw itself, instead of abusing an existing memory context.
 (A wiser design is to define a sub-memory-context for each foreign scan,
 then remove that sub-memory-context after releasing the handles.)

 I simply chose a built-in context which has a long enough lifespan, but now I
 think that using MessageContext directly is not a recommended way.  As you
 say, creating new context as child of MessageContext for each scan in
 BeginForeignScan (or first IterateForeignScan) would be better.  Please
 see attached patch.

 One other option is getting rid of the tuplestore by holding result rows
 as a PGresult, and tracking it for error cases which might happen.
 A ResourceOwner callback can be used to release the PGresult on error,
 similar to PGconn.

 If we could have the set of results in a per-query memory context (thus,
 no need to explicitly release them on error), it would be a more ideal design.
 Is it possible to implement that based on the libpq APIs?

Currently no, so I used a tuplestore even though it requires copying the
results.  However, Kyotaro Horiguchi's patch might make it possible. I'm
reading his patch to determine whether it suits pgsql_fdw.

http://archives.postgresql.org/message-id/20120202143057.ga12...@gmail.com

 Please note that the per-query memory context is already released by the
 time the ResourceOwner callback is launched, so it is unavailable
 if libpq requires releasing some resource.

I see.  We need to use a context which has a longer lifetime if we want
to track malloc'ed PGresults.  I already use CacheContext for connection
pooling, so linking PGresults to their source connection would be a solution.

 [Design comment]
 When BEGIN should be issued on the remote-side?
 The connect_pg_server() is the only chance to issue a BEGIN command
 on the remote side, when the connection is opened. However, the connection
 is kept unless an error is raised. Thus, the remote side will
 continue to work within a single transaction block, even if the local side
 iterates a pair of BEGIN and COMMIT.
 I'd like to suggest closing the transaction block at the end of either
 the scan, the transaction, or the sub-transaction.

 Indeed, remote transactions should be terminated at some timing.
 Terminating at the end of a scan seems troublesome because a connection
 might be shared by multiple scans in a query.  I'd prefer aborting the
 remote transaction at the end of the local query.  Please see
 abort_remote_tx in attached patch.

 It seems to me abort_remote_tx in ReleaseConnection() is reasonable.
 However, isn't an ABORT needed in GetConnection() the first time?

Hm, forcing the overhead of aborting a transaction onto all local queries
is unreasonable.  A redundant BEGIN doesn't cause an error but just
generates a WARNING, so I'll remove the abort_remote_tx preceding
begin_remote_tx.

 [Comment to improve]
 It seems to me the design of exprFunction is not future-proof, if we add
 a new node type that contains two or more function calls, because it can
 return an oid of 

[HACKERS] about type cast

2012-02-06 Thread zoulx1982
hi,
there is a problem about type casts that I don't understand; the following is my test.
 
postgres=# select 10::bit(3);
 bit
-----
 010
(1 row)
postgres=# select 10::bit varying(3);
ERROR:  cannot cast type integer to bit varying
LINE 1: select 10::bit varying(3);
 ^
postgres=#
 
my question is: why can int be cast to bit, but not to bit varying?
i want to know the reason.
thank you for your time.

Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-06 Thread Simon Riggs
On Mon, Feb 6, 2012 at 4:48 AM, Bruce Momjian br...@momjian.us wrote:
 On Sun, Feb 05, 2012 at 08:40:09PM +, Simon Riggs wrote:
 On Sun, Feb 5, 2012 at 3:59 AM, Bruce Momjian br...@momjian.us wrote:
  On Sat, Dec 24, 2011 at 03:56:58PM +, Simon Riggs wrote:
   Also, as far as I can see this patch usurps the page version field,
   which I find unacceptably short-sighted.  Do you really think this is
   the last page layout change we'll ever make?
 
  No, I don't. I hope and expect the next page layout change to
  reintroduce such a field.
 
  But since we're agreed now that upgrading is important, changing page
  format isn't likely to be happening until we get an online upgrade
  process. So future changes are much less likely. If they do happen, we
  have some flag bits spare that can be used to indicate later versions.
  It's not the prettiest thing in the world, but it's a small ugliness
  in return for an important feature. If there was a way without that, I
  would have chosen it.
 
  Have you considered the CRC might match a valid page version number?
  Is that safe?

 In the proposed scheme there are two flag bits set on the page to
 indicate whether the field is used as a checksum rather than a version
 number. So its possible the checksum could look like a valid page
 version number, but we'd still be able to tell the difference.

 Thanks.  Clearly we don't need 16 bits to represent our page version
 number because we rarely change it.  The other good thing is I don't
 remember anyone wanting additional per-page storage in the past few
 years except for a checksum.

 I wonder if we should just dedicate 3 page header bits, call that the
 page version number, and set this new version number to 1, and assume
 all previous versions were zero, and have them look in the old page
 version location if the new version number is zero.  I am basically
 thinking of how we can plan ahead to move the version number to a new
 location and have a defined way of finding the page version number using
 old and new schemes.

 I don't want to get to a point where we need a bit per page number
 version.

Agreed, though I think we need at least 2 bits set if we are using the
checksum, so we should start at version 2 to ensure that.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-06 Thread Simon Riggs
On Mon, Feb 6, 2012 at 7:51 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 06.02.2012 05:48, Bruce Momjian wrote:

 On Sun, Feb 05, 2012 at 08:40:09PM +, Simon Riggs wrote:

 In the proposed scheme there are two flag bits set on the page to
 indicate whether the field is used as a checksum rather than a version
 number. So its possible the checksum could look like a valid page
 version number, but we'd still be able to tell the difference.


 Thanks.  Clearly we don't need 16 bits to represent our page version
 number because we rarely change it. The other good thing is I don't
 remember anyone wanting additional per-page storage in the past few
 years except for a checksum.


 There's this idea that I'd like to see implemented:
 http://archives.postgresql.org/pgsql-hackers/2010-05/msg01176.php

As you'll note, adding that field will change the page format and is
therefore ruled out for 9.2.

ISTM there is a different way to handle that anyway. All we need to do
is to record the LSN of the last wraparound in shared memory/control
file. Any block with page LSN older than that has all-frozen rows.
That doesn't use any space nor does it require another field to be
set.

That leaves the same problem that if someone writes to the page you
need to freeze the rows first.

 In any case, I think it's a very bad idea to remove the page version field.
 How are we supposed to ever be able to change the page format again if we
 don't even have a version number on the page? I strongly oppose removing it.

Nobody is removing the version field, nor is anybody suggesting not
being able to tell which page version we are looking at.

 I'm also not very comfortable with the idea of having flags on the page
 indicating whether it has a checksum or not. It's not hard to imagine a
 software or firmware bug or hardware failure that would cause the pd_flags field
 to be zeroed out altogether. It would be more robust if the expected
 bit-pattern was not 0-0, but 1-0 or 0-1, but then you need to deal with that
 at upgrade somehow. And it still feels a bit whacky anyway.

Good idea. Let's use

0-0-0 to represent a page upgraded from a previous version, needs a bit set
0-0-1 to represent the new page version number, no checksum
1-1-1 to represent the new page version number, with checksum

So we have 1 bit dedicated to the page version and 2 bits to the checksum
indicator.

 I wonder if we should just dedicate 3 page header bits, call that the
 page version number, and set this new version number to 1, and assume
 all previous versions were zero, and have them look in the old page
 version location if the new version number is zero.  I am basically
 thinking of how we can plan ahead to move the version number to a new
 location and have a defined way of finding the page version number using
 old and new schemes.


 Three bits seems short-sighted, but yeah, something like 6-8 bits should be
 enough. On the whole, though, I think we should bite the bullet and invent a
 way to extend the page header at upgrade.

There are currently many spare bits. I don't see any need to allocate
them to this specific use ahead of time - especially since that is the
exact decision we took last time when we reserved 16 bits for the
version.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Report: race conditions in WAL replay routines

2012-02-06 Thread Simon Riggs
On Sun, Feb 5, 2012 at 10:23 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Simon Riggs si...@2ndquadrant.com writes:
 Please post the patch rather than fixing directly. There's some subtle
 stuff there and it would be best to discuss first.

 Here's a proposed patch for the issues around unlocked updates of
 shared-memory state.  After going through this I believe that there is
 little risk of any real problems in the current state of the code; this
 is more in the nature of future-proofing against foreseeable changes.
 (One such is that we'd discussed fixing the age() function to work
 during Hot Standby.)  So I suggest applying this to HEAD but not
 back-patching.

All looks very good to me. Agreed.

 Except for one thing.  I realized while looking at the NEXTOID replay
 code that it is completely broken: it only advances
 ShmemVariableCache-&gt;nextOid when that's less than the value in the WAL
 record.  So that comparison fails if the OID counter wraps around during
 replay.  I've fixed this in the attached patch by just forcibly
 assigning the new value instead of trying to be smart, and I think
 probably that aspect of it needs to be back-patched.

Ouch! Well spotted.

Suggest fixing that as a separate patch; looks like backpatch to 8.0.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Report: race conditions in WAL replay routines

2012-02-06 Thread Simon Riggs
On Sun, Feb 5, 2012 at 11:14 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Simon Riggs si...@2ndquadrant.com writes:
 Please post the patch rather than fixing directly. There's some subtle
 stuff there and it would be best to discuss first.

 And here's a proposed patch for not throwing ERROR during replay of DROP
 TABLESPACE.  I had originally thought this would be a one-liner
 s/ERROR/LOG/, but on inspection destroy_tablespace_directories() really
 needs to be changed too, so that it doesn't throw error for unremovable
 directories.

Looks good.

The existing errmsg of "tablespace is not empty" doesn't cover all
reasons why the tablespace was not removed.

The final message should have
errmsg "tablespace not fully removed"
errhint "you should resolve this manually if it causes further problems"

The errdetail is good.


-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-06 Thread Heikki Linnakangas

On 06.02.2012 10:05, Simon Riggs wrote:

On Mon, Feb 6, 2012 at 7:51 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:

On 06.02.2012 05:48, Bruce Momjian wrote:


Thanks.  Clearly we don't need 16 bits to represent our page version
number because we rarely change it. The other good thing is I don't
remember anyone wanting additional per-page storage in the past few
years except for a checksum.


There's this idea that I'd like to see implemented:
http://archives.postgresql.org/pgsql-hackers/2010-05/msg01176.php


As you'll note, adding that field will change the page format and is
therefore ruled out for 9.2.

ISTM there is a different way to handle that anyway. All we need to do
is to record the LSN of the last wraparound in shared memory/control
file. Any block with page LSN older than that has all-frozen rows.
That doesn't use any space nor does it require another field to be
set.


Good idea. However, the followup idea to that discussion was to not only 
avoid the I/O needed to mark tuples as frozen, but to avoid xid 
wraparound altogether, by allowing clog to grow indefinitely. You do 
want to freeze at some point of course, to truncate the clog, but it 
would be nice to not have a hard limit. The way to do that is to store 
an xid epoch in the page header, so that Xids are effectively 64-bits 
wide, even though the xid fields on the tuple header are only 32-bits 
wide. That does require a new field in the page header.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-06 Thread Simon Riggs
On Mon, Feb 6, 2012 at 10:02 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 06.02.2012 10:05, Simon Riggs wrote:

 On Mon, Feb 6, 2012 at 7:51 AM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com  wrote:

 On 06.02.2012 05:48, Bruce Momjian wrote:


 Thanks.  Clearly we don't need 16 bits to represent our page version
 number because we rarely change it. The other good thing is I don't
 remember anyone wanting additional per-page storage in the past few
 years except for a checksum.


 There's this idea that I'd like to see implemented:
 http://archives.postgresql.org/pgsql-hackers/2010-05/msg01176.php


 As you'll note, adding that field will change the page format and is
 therefore ruled out for 9.2.

 ISTM there is a different way to handle that anyway. All we need to do
 is to record the LSN of the last wraparound in shared memory/control
 file. Any block with page LSN older than that has all-frozen rows.
 That doesn't use any space nor does it require another field to be
 set.


 Good idea. However, the followup idea to that discussion was to not only
 avoid the I/O needed to mark tuples as frozen, but to avoid xid wraparound
 altogether, by allowing clog to grow indefinitely. You do want to freeze at
 some point of course, to truncate the clog, but it would be nice to not have
 a hard limit. The way to do that is to store an xid epoch in the page
 header, so that Xids are effectively 64-bits wide, even though the xid
 fields on the tuple header are only 32-bits wide. That does require a new
 field in the page header.

We wouldn't need to do that, would we?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-06 Thread Heikki Linnakangas

On 06.02.2012 11:25, Simon Riggs wrote:

On Mon, Feb 6, 2012 at 10:02 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:

Good idea. However, the followup idea to that discussion was to not only
avoid the I/O needed to mark tuples as frozen, but to avoid xid wraparound
altogether, by allowing clog to grow indefinitely. You do want to freeze at
some point of course, to truncate the clog, but it would be nice to not have
a hard limit. The way to do that is to store an xid epoch in the page
header, so that Xids are effectively 64-bits wide, even though the xid
fields on the tuple header are only 32-bits wide. That does require a new
field in the page header.


We wouldn't need to do that, would we?


Huh? Do you mean that we wouldn't need to implement that feature?

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] double writes using double-write buffer approach [WIP]

2012-02-06 Thread Fujii Masao
On Sat, Jan 28, 2012 at 7:31 AM, Dan Scales sca...@vmware.com wrote:
 Let me know if you have any thoughts/comments, etc.  The patch is
 enclosed, and the README.doublewrites is updated a fair bit.

ISTM that the double-write cannot prevent torn pages in either the
double-write file or the data file during *base backup*, because both the
double-write file and the data file can be backed up while being written.
Is this right? To avoid the torn-page problem, should we write FPIs to WAL
during online backup even if double-writes are enabled?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



[HACKERS] libpq parallel build

2012-02-06 Thread Lionel Elie Mamane
src/interfaces/libpq/fe-misc.c #include-s pg_config_paths.h, but in
the makefile that dependency is not declared.

This breaks the build with 'make -jN' (with a big N) ... sometimes
(non-deterministically).

Patch attached. Made from 9.1.1, but from browsing the git master on
the web interface, the facts above seem to still be true.

-- 
Lionel
diff --recursive -u misc/build/postgresql-9.1.1/src/interfaces/libpq/Makefile misc/build/postgresql-9.1.1.patch/src/interfaces/libpq/Makefile
--- misc/build/postgresql-9.1.1/src/interfaces/libpq/Makefile	2012-02-06 15:11:19.0 +0100
+++ misc/build/postgresql-9.1.1.patch/src/interfaces/libpq/Makefile	2012-02-06 15:02:51.0 +0100
@@ -109,6 +109,7 @@
 libpq.rc: $(top_builddir)/src/Makefile.global
 
 fe-connect.o: fe-connect.c $(top_builddir)/src/port/pg_config_paths.h
+fe-misc.o: fe-misc.c $(top_builddir)/src/port/pg_config_paths.h
 
 $(top_builddir)/src/port/pg_config_paths.h:
 	$(MAKE) -C $(top_builddir)/src/port pg_config_paths.h



Re: [HACKERS] freezing multixacts

2012-02-06 Thread Alvaro Herrera

Excerpts from Robert Haas's message of Thu Feb 02 11:24:08 -0300 2012:
 On Wed, Feb 1, 2012 at 11:33 PM, Alvaro Herrera alvhe...@alvh.no-ip.org 
 wrote:

  If there's only one remaining member, the problem is easy: replace it
  with that transaction's xid, and set the appropriate hint bits.  But if
  there's more than one, the only way out is to create a new multi.  This
  increases multixactid consumption, but I don't see any other option.
 
 Why do we need to freeze anything if the transactions are still
 running?  We certainly don't freeze regular transaction IDs while the
 transactions are still running; it would give wrong answers.  It's
 probably possible to do it for mxids, but why would you need to?

Well, I was thinking that we could keep generating mxids continuously,
and if we didn't freeze the old running ones, we could overflow.  So one
way to deal with the problem would be rewriting the
old ones into new ones.  But it has occurred to me that instead of doing
that we could simply disallow creation of new ones until the oldest ones
have been closed and removed from tables -- which is more in line with
what we do for Xids anyway.

 Suppose you have a tuple A which is locked by a series of transactions
 T0, T1, T2, ...; AIUI, each new locker is going to have to create a
 new mxid with all the existing entries plus a new one for itself.
 But, unless I'm confused, as it's doing so, it can discard any entries
 for locks taken by transactions which are no longer running.

That's correct.  But the problem is a tuple that is locked or updated by
a very old transaction that doesn't commit or rollback, and the tuple is
never locked again.  Eventually the Xid could remain live while the mxid
is in wraparound danger.

 So given
 an mxid with living members, any dead member in that mxid must have
 been living at the time the newest member was added.  Surely we can't
 be consuming mxids anywhere near fast enough for that to be a problem.

Well, the problem is that while it should be rare to consume mxids as
fast as necessary for this problem to show up, it *is* possible --
unless we add some protection that they are not created until the old
ones are frozen (which now means removed).

  However, there are cases where not even that is possible -- consider
  tuple freezing during WAL recovery.  Recovery is going to need to
  replace those multis with other multis, but it cannot create new multis
  itself.  The only solution here appears to be that when multis are
  frozen in the master, replacement multis have to be logged too.  So the
  heap_freeze_tuple Xlog record will have a map of old multi to new.  That
  way, recovery can just determine the new multi to use for any particular
  old multi; since multixact creation is also logged, we're certain that
  the replacement value has already been defined.
 
 This doesn't sound right.  Why would recovery need to create a multi
 that didn't exist on the master?  Any multi it applies to a record
 should be one that it was told to apply by the master; and the master
 should have already WAL-logged the creation of that multi.  I don't
 think that replacement mxids have to be logged; I think that *all*
 mxids have to be logged.  Am I all wet?

Well, yeah, all mxids are logged, in particular those that would have
been used for replacement.  However, I think I've now discarded the idea
of replacement altogether, because that makes things simpler both on the
master and the slave.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] some longer, larger pgbench tests with various performance-related patches

2012-02-06 Thread Robert Haas
On Sat, Feb 4, 2012 at 2:13 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 We really need to nail that down.  Could you post the scripts (on the
 wiki) you use for running the benchmark and making the graph?  I'd
 like to see how much work it would be for me to change it to detect
 checkpoints and do something like color the markers blue during
 checkpoints and red elsewhen.

They're pretty crude - I've attached them here.

 Also, I'm not sure how bad that graph really is.  The overall
 throughput is more variable, and there are a few latency spikes but
 they are few.  The dominant feature is simply that the long-term
 average is less than the initial burst.  Of course the goal is to have
 a high level of throughput with a smooth latency under sustained
 conditions.  But to expect that that long-sustained smooth level of
 throughput be identical to the initial burst throughput sounds like
 more of a fantasy than a goal.

That's probably true, but the drop-off is currently quite extreme.
The fact that disabling full_page_writes causes throughput to increase
by 4x is dismaying, at least to me.

 If we want to accept the lowered
 throughput and work on the what variability/spikes are there, I think
 a good approach would be to take the long term TPS average, and dial
 the number of clients back until the initial burst TPS matches that
 long term average.  Then see if the spikes still exist over the long
 term using that dialed back number of clients.

Hmm, I might be able to do that.

 I don't think the full-page-writes are leading to WALInsert
 contention, for example, because that would probably lead to smooth
 throughput decline, but not those latency spikes in which those entire
 seconds go by without transactions.

Right.

 I doubt it is leading to general
 IO compaction, as the IO at that point should be pretty much
 sequential (the checkpoint has not yet reached the sync stage, the WAL
 is sequential).  So I bet that that is caused by fsyncs occurring at
 xlog segment switches, and the locking that that entails.

That's definitely possible.

 If I
 recall, we can have a segment which is completely written to OS and in
 the process of being fsynced, and we can have another segment which is
 in some state of partially in wal_buffers and partly written out to OS
 cache, but that we can't start reusing the wal_buffers that were
 already written to OS for that segment (and therefore are
 theoretically available for reuse by the upcoming 3rd segment) until
 the previous segment's fsync has completed.  So all WALInserts freeze.
  Or something like that.  This code has changed a bit since last time
 I studied it.

Yeah, I need to better-characterize where the pauses are coming from,
but I'm reluctant to invest too much effort in it until Heikki's xlog
scaling patch goes in, because I think that's going to change things
enough that any work done now will mostly be wasted.

It might be worth trying a run with wal_buffers=32MB or something like
that, just to see whether that mitigates any of the locking pile-ups.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


runtestl
Description: Binary data


makeplot
Description: Binary data



Re: [HACKERS] Dry-run mode for pg_archivecleanup

2012-02-06 Thread Alvaro Herrera

Excerpts from Gabriele Bartolini's message of Tue Jan 31 12:29:51 -0300 2012:
 Hi Robert,
 
 sorry for the delay.
 
 On 27/01/12 15:47, Robert Haas wrote:
  This email thread seems to have trailed off without reaching a 
  conclusion. The patch is marked as Waiting on Author in the CommitFest 
  application, but I'm not sure that's accurate. Can we try to nail this 
  down? 
 Here is my final version which embeds comments from Josh. I have also 
 added debug information to be printed in case '-d' is given.

FWIW I committed this last week, though I changed the debug message
wording slightly -- hope that's OK; it can be adjusted of course.

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=b2e431a4db81a735d1474c4d1565a20b835878c9

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] Dry-run mode for pg_archivecleanup

2012-02-06 Thread Gabriele Bartolini

Hi Alvaro,

On 06/02/12 16:02, Alvaro Herrera wrote:

FWIW I committed this last week, though I changed the debug message
wording slightly -- hope that's OK; it can be adjusted of course.

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=b2e431a4db81a735d1474c4d1565a20b835878c9



Beautiful! :)

Thank you very much.

Cheers,
Gabriele

--
 Gabriele Bartolini - 2ndQuadrant Italia
 PostgreSQL Training, Services and Support
 gabriele.bartol...@2ndquadrant.it | www.2ndQuadrant.it




Re: [HACKERS] freezing multixacts

2012-02-06 Thread Simon Riggs
On Thu, Feb 2, 2012 at 4:33 AM, Alvaro Herrera alvhe...@alvh.no-ip.org wrote:

 However, there are cases where not even that is possible -- consider
 tuple freezing during WAL recovery.  Recovery is going to need to
 replace those multis with other multis, but it cannot create new multis
 itself.  The only solution here appears to be that when multis are
 frozen in the master, replacement multis have to be logged too.  So the
 heap_freeze_tuple Xlog record will have a map of old multi to new.  That
 way, recovery can just determine the new multi to use for any particular
 old multi; since multixact creation is also logged, we're certain that
 the replacement value has already been defined.

Multixacts are ignored during recovery. Why do anything at all?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] freezing multixacts

2012-02-06 Thread Robert Haas
On Mon, Feb 6, 2012 at 9:31 AM, Alvaro Herrera
alvhe...@commandprompt.com wrote:
 Suppose you have a tuple A which is locked by a series of transactions
 T0, T1, T2, ...; AIUI, each new locker is going to have to create a
 new mxid with all the existing entries plus a new one for itself.
 But, unless I'm confused, as it's doing so, it can discard any entries
 for locks taken by transactions which are no longer running.

 That's correct.  But the problem is a tuple that is locked or updated by
 a very old transaction that doesn't commit or rollback, and the tuple is
 never locked again.  Eventually the Xid could remain live while the mxid
 is in wraparound danger.

Ah, I see.  I think we should probably handle that the same way we do
for XIDs: try to force autovac when things get tight, then start
issuing warnings, and finally just refuse to assign any more MXIDs.
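The escalation Robert describes could be sketched like this (a minimal C illustration; the type, function name, and threshold numbers are invented for this sketch and are not PostgreSQL's actual limits):

```c
#include <stdint.h>

/* Illustrative escalation policy for MXID consumption, mirroring the
 * XID scheme described above; the threshold numbers are made up. */
typedef enum
{
    MXID_OK,            /* plenty of headroom */
    MXID_FORCE_AUTOVAC, /* getting tight: trigger anti-wraparound vacuum */
    MXID_WARN,          /* start issuing warnings on each assignment */
    MXID_STOP           /* refuse to assign any more MXIDs */
} MxidPressure;

static MxidPressure
mxid_pressure(uint32_t mxids_remaining)
{
    if (mxids_remaining < 100000)
        return MXID_STOP;
    if (mxids_remaining < 1000000)
        return MXID_WARN;
    if (mxids_remaining < 10000000)
        return MXID_FORCE_AUTOVAC;
    return MXID_OK;
}
```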

Another thing that might make sense, for both XIDs and MXIDs, is to
start killing transactions that are preventing vacuum/autovacuum from
doing their thing.  This could mean either killing the people who are
holding back RecentGlobalXmin, so that we can actually freeze the old
stuff; or killing people who are holding a conflicting lock, using the
recovery-conflict stuff or some adaptation of it.  We've made it
fairly difficult to avoid having autovacuum run at all with the
xidVacLimit/xidStopLimit stuff, but there's still no real defense
against autovacuum running but failing to mitigate the problem, either
because there's a long-running transaction holding a snapshot open, or
because someone is sitting on a relation or buffer lock.  This of
course is off-topic from your patch here...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] ecpglib use PQconnectdbParams

2012-02-06 Thread Marko Kreen
On Fri, Feb 03, 2012 at 01:15:30PM +0100, Christian Ullrich wrote:
 * Peter Eisentraut wrote:
 I noticed ecpglib still uses PQconnectdb() with a craftily assembled
 connection string.  Here is a patch to use PQconnectdbParams() instead.
 
 + const char *conn_keywords[6];
 + const char *conn_values[6];
 
 Shouldn't these be [7]? You can have up to 6 items, plus the terminator.

This bug is now in GIT.
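For the record, the sizing rule at issue can be sketched as follows (an illustration, not the actual ecpglib code): the scan of the keywords array stops at a NULL entry, so that array needs one slot beyond the maximum number of parameters, while the parallel values array does not.

```c
#include <stddef.h>

/* Illustrative sizing for the parallel arrays passed to PQconnectdbParams():
 * the keyword scan stops at a NULL entry in conn_keywords, so that array
 * needs one extra slot for the terminator. */
#define MAX_CONN_PARAMS 6

static const char *conn_keywords[MAX_CONN_PARAMS + 1];  /* + NULL terminator */
static const char *conn_values[MAX_CONN_PARAMS];

static size_t
keyword_slots(void)
{
    return sizeof(conn_keywords) / sizeof(conn_keywords[0]);
}

static size_t
value_slots(void)
{
    return sizeof(conn_values) / sizeof(conn_values[0]);
}
```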

-- 
marko




Re: [HACKERS] about type cast

2012-02-06 Thread Robert Haas
2012/2/6 zoulx1982 zoulx1...@163.com:
 hi,
 there is a problem with type casts that I don't understand; following is my
 test.

 postgres=# select 10::bit(3);
  bit
 -----
  010
 (1 row)
 postgres=# select 10::bit varying(3);
 ERROR:  cannot cast type integer to bit varying
 LINE 1: select 10::bit varying(3);
  ^
 postgres=#

 my question is: why can int be cast to bit, but not to bit varying?
 I want to know the reason.
 Thank you for your time.

Off the top of my head, I'm guessing that it's just a case of nobody
having implemented it.  You can do this:

rhaas=# select 10::bit(3)::varbit;
 varbit
--------
 010
(1 row)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] about type cast

2012-02-06 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 2012/2/6 zoulx1982 zoulx1...@163.com:
 my question is why int can cast to bit , but bot for bit varying?

 Off the top of my head, I'm guessing that it's just a case of nobody
 having implemented it.

I have some vague recollection that we didn't want to be squishy about
the length of the resulting bitstring.  When you cast to bit you're more
or less forced to specify what you want, since the spec-mandated default
length is only 1.  varbit would leave us having to make a decision in
the code.

Anyway, try searching the pgsql-hackers archives if you want some
history.

regards, tom lane



Re: [HACKERS] [GENERAL] pg_dump -s dumps data?!

2012-02-06 Thread Andrew Dunstan



On 01/31/2012 11:10 PM, Andrew Dunstan wrote:



Here's a possible patch for the exclude-table-data problem along the 
lines you suggest.



Should I apply this?

cheers

andrew



Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-06 Thread Bruce Momjian
On Mon, Feb 06, 2012 at 09:05:19AM +, Simon Riggs wrote:
  In any case, I think it's a very bad idea to remove the page version field.
  How are we supposed to ever be able to change the page format again if we
  don't even have a version number on the page? I strongly oppose removing it.
 
 Nobody is removing the version field, nor is anybody suggesting not
 being able to tell which page version we are looking at.

Agreed.  I thought the idea was that we have a 16-bit page _version_
number and 8+ free page _flag_ bits, which are all currently zero.  The
idea was to move the version number from the 16-bit field into the unused
flag bits, and use the 16-bit field for the checksum.  I would like to
have some logic that would allow tools inspecting the page to tell if
they should look for the version number in the bits at the beginning of
the page or at the end.

Specifically, this becomes the checksum:

uint16  pd_pagesize_version;

and this holds the page version, if we have updated the page to the new
format:

uint16  pd_flags;   /* flag bits, see below */

Of the 16 bits of pd_flags, these are the only ones used:

#define PD_HAS_FREE_LINES   0x0001  /* are there any unused line pointers? */
#define PD_PAGE_FULL        0x0002  /* not enough free space for new tuple? */
#define PD_ALL_VISIBLE      0x0004  /* all tuples on page are visible to
                                     * everyone */

#define PD_VALID_FLAG_BITS  0x0007  /* OR of all valid pd_flags bits */


  I'm also not very comfortable with the idea of having flags on the page
  indicating whether it has a checksum or not. It's not hard to imagine a
  software or firmware bug or hardware failure that would cause pd_flags field
  to be zeroed out altogether. It would be more robust if the expected
  bit-pattern was not 0-0, but 1-0 or 0-1, but then you need to deal with that
  at upgrade somehow. And it still feels a bit whacky anyway.

 Good idea. Lets use
 
 0-0-0 to represent upgraded from previous version, needs a bit set
 0-0-1 to represent new version number of page, no checksum
 1-1-1 to represent new version number of page, with checksum
 
 So we have 1 bit dedicated to the page version, 2 bits to the checksum 
 indicator

Interesting point that we would not be guarding against a bit flip from
1 to 0 for the checksum bit; I agree using two bits is the way to go.  I
don't see how upgrade figures into this.

However, I am unclear how Simon's idea above actually works.  We need
two bits for redundancy, both 1, to mark a page as having a checksum.  I
don't think mixing the idea of a new page version and checksum enabled
really makes sense, especially since we have to plan for future page
version changes.

I think we dedicate 2 bits to say we have computed a checksum, and 3
bits to mark up to 8 possible page versions, so the logic is, in
pd_flags, we use bits 0x0008 and 0x0010 to indicate that a checksum is
stored on the page, and we use 0x0020 and later for the page version
number.  We can assume all the remaining bits are for the page version
number until we need to define new bits, and we can start storing them
at the end first, and work forward.  If all the version bits are zero, it
means the page version number is still stored in pd_pagesize_version.
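A minimal C sketch of this proposed pd_flags layout (the new macro names and exact bit positions are illustrative only, not committed PostgreSQL code):

```c
#include <stdint.h>

/* Existing flag bits, as in the excerpt above */
#define PD_HAS_FREE_LINES   0x0001
#define PD_PAGE_FULL        0x0002
#define PD_ALL_VISIBLE      0x0004

/* Proposed (illustrative): two redundant checksum bits, then version bits */
#define PD_CHECKSUM_A       0x0008
#define PD_CHECKSUM_B       0x0010
#define PD_VERSION_MASK     0x00E0  /* 3 bits: up to 8 page versions */
#define PD_VERSION_SHIFT    5

static int
page_has_checksum(uint16_t pd_flags)
{
    /* both bits must be set, so a single bit flip cannot fake a checksum */
    return (pd_flags & (PD_CHECKSUM_A | PD_CHECKSUM_B))
           == (PD_CHECKSUM_A | PD_CHECKSUM_B);
}

static unsigned
page_version(uint16_t pd_flags)
{
    /* zero means: look in the old location, pd_pagesize_version */
    return (pd_flags & PD_VERSION_MASK) >> PD_VERSION_SHIFT;
}
```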

  I wonder if we should just dedicate 3 page header bits, call that the
  page version number, and set this new version number to 1, and assume
  all previous versions were zero, and have them look in the old page
  version location if the new version number is zero.  I am basically
  thinking of how we can plan ahead to move the version number to a new
  location and have a defined way of finding the page version number using
  old and new schemes.
 
 
  Three bits seems short-sighted, but yeah, something like 6-8 bits should be
  enough. On the whole, though. I think we should bite the bullet and invent a
  way to extend the page header at upgrade.
 
 There are currently many spare bits. I don't see any need to allocate
 them to this specific use ahead of time - especially since that is the
 exact decision we took last time when we reserved 16 bits for the
 version.

Right, but I am thinking we should set things up so we can grow the page
version number into the unused bits, rather than box it between bits we
are already using.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] SKIP LOCKED DATA

2012-02-06 Thread Merlin Moncure
On Sat, Feb 4, 2012 at 5:38 AM, Thomas Munro mu...@ip9.org wrote:
 On 16 January 2012 21:30, Josh Berkus j...@agliodbs.com wrote:

 Useful, yes.  Harder than it looks, probably.  I tried to mock up a
 version of this years ago for a project where I needed it, and ran into
 all kinds of race conditions.

 Can you remember any details about those race conditions?

 Anyway, if it could be made to work, this is extremely useful for any
 application which needs queueing behavior (which is a large plurality, if
 not a majority, of applications).

 Ok, based on this feedback I decided to push further and try
 implementing this.  See POC/WIP patch attached.  It seems to work
 for simple examples but I haven't yet tried to break it or see how it
 interacts with more complicated queries or high concurrency levels.
 It probably contains at least a few rookie mistakes!  Any feedback
 gratefully received.

 The approach is described in my original email.  Short version:
 heap_lock_tuple now takes an enum called wait_policy instead of a
 boolean called nowait, with the following values:

  LockWaitBlock: wait for lock (like nowait = false before),

  LockWaitError: error if not immediately lockable (like nowait = true
                 before)

  LockWaitSkip:  give up and return HeapTupleWouldBlock if not
                 immediately lockable (this is a new policy)

 The rest of the patch is about getting the appropriate value down to
 that function call, following the example of the existing nowait
 support, and skipping rows if you said SKIP LOCKED DATA and you got
 HeapTupleWouldBlock.
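A sketch of the enum and a caller-side dispatch (the enum names come from the patch description above; the surrounding function is invented for illustration):

```c
#include <string.h>

/* Wait policies passed down to heap_lock_tuple, per the patch description */
typedef enum
{
    LockWaitBlock,  /* wait for the lock (old nowait = false) */
    LockWaitError,  /* fail immediately if not lockable (old nowait = true) */
    LockWaitSkip    /* report HeapTupleWouldBlock and move on (new) */
} LockWaitPolicy;

/* Hypothetical caller: what the executor does when a tuple is not
 * immediately lockable under each policy. */
static const char *
on_tuple_locked(LockWaitPolicy policy)
{
    switch (policy)
    {
        case LockWaitBlock:
            return "sleep until the conflicting lock is released";
        case LockWaitError:
            return "raise could-not-obtain-lock error";
        case LockWaitSkip:
            return "skip this row and continue the scan";
    }
    return "unreachable";
}
```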

 Compared to one very popular commercial database's implementation, I
 think this is a little bit friendlier for the user who wants to
 distribute work.  Let's say you want to lock one row without lock
 contention, which this patch allows with FETCH FIRST 1 ROW ONLY FOR
 UPDATE SKIP LOCKED DATA in an SQL query.  In that other system, the
 mechanism for limiting the number of rows fetched is done in the WHERE
 clause, and therefore the N rows are counted *before* checking if the
 lock can be obtained, so users sometimes have to resort to stored
 procedures so they can control the FETCH from a cursor imperatively.
 In another popular commercial database from Redmond, you can ask for
 the top (first) N rows while using the equivalent of SKIP LOCKED DATA
 and it has the same effect as this patch as far as I can tell, and
 another large blue system is the same.

 As discussed in another branch of this thread, you can probably get
 the same effect with transactional advisory locks.  But I personally
 like row skipping better as an explicit feature because:

 (1) I think there might be an order-of-evaluation problem with a WHERE
 clause containing both lock testing and row filtering expressions (ie
 it is undefined right?) which you might need subselects to work around
 (ie to be sure to avoid false positive locks, not sure about this).

 (2) The advisory technique requires you to introduce an integer
 identifier if you didn't already have one and then lock that despite
 having already said which rows you want to try to lock in standard row
 filtering expressions.

 (3) The advisory technique requires all users of the table to
 participate in the advisory lock protocol even if they don't want to
 use the option to skip lock.

 (4) It complements NOWAIT (which could also have been done with
 transactional advisory locks, with a function like try_lock_or_fail).

 (5) I like the idea of directly porting applications from other
 databases with this feature (that is what led me here).

Yeah -- your stuff is also going to have much better interactions
with LIMIT.  Your enhancements will beat an advisory lock
implementation all day long.  The real competition is not advisory
locks, but the MVCC tricks played with PGQ.  It's all about
implementing producer-consumer queues (arguments that databases should
not do that are 100% bogus) and I can't help but wonder if that's a
better system.

merlin



Re: [HACKERS] [PATCH] Support for foreign keys with arrays

2012-02-06 Thread Marco Nenciarini
Hi guys,

Please find attached version 3 of our patch. We thoroughly followed your
suggestions and were able to implement EACH foreign key constraints 
with multi-column syntax as well.

EACH foreign key constraints represent PostgreSQL implementation of 
what are also known as Foreign Key Arrays.

Some limitations occur in this release, but as previously agreed these 
can be refined and defined in future release implementations.

This patch adds:

* support for EACH REFERENCES column constraint on array types 
  - e.g. c1 INT[] EACH REFERENCES t1
* support for EACH foreign key table constraints
  - e.g. FOREIGN KEY (EACH c1) REFERENCES t1
* support for EACH foreign keys in multi-column foreign key table 
  constraints
  - e.g. FOREIGN KEY (c1, EACH c2) REFERENCES t1 (u1, u2)
* support for two new referential actions on update/delete operations
  for single-column only EACH foreign keys:
** EACH CASCADE (deletes or updates elements inside the referencing
   array)
** EACH SET NULL (sets to NULL the referencing element inside the foreign
   array)
* support for array_replace and array_remove functions as required by
the above actions

As previously said, we preferred to keep this patch simple for 9.2 and 
forbid EACH CASCADE and EACH SET NULL on multi-column foreign keys.
After all, the majority of use cases is represented by EACH foreign key
column constraints (single-column foreign key arrays), and more
complicated use cases can be discussed for 9.3 - should this patch make
it. :)
Multi-dimensional arrays can be used as referencing columns as well. In
that case though, ON DELETE EACH CASCADE will behave like ON DELETE EACH
SET NULL. This is a safe way of implementing the action. 
We have some ideas on how to implement this, but we feel it is better to
limit the behaviour for now.

As far as documentation is concerned, we:
* added actions and constraint info in the catalog
* added an entire section on EACH foreign key constraints in the data 
definition language chapter (we've simplified the second example, 
greatly following Noah's advice - let us know if this is ok with you)
* added array_remove (currently limited to single-dimensional arrays) 
and array_replace in the array functions chapter
* modified REFERENCES/FOREIGN KEY section in the CREATE TABLE command's 
documentation and added a special section on the EACH REFERENCES clause 
(using square braces as suggested)

Here follows a short list of notes for Noah:

* You proposed these changes: ARRAY CASCADE -> EACH CASCADE and ARRAY
SET NULL -> EACH SET NULL. We stuck with EACH CASCADE and decided to
prepend the EACH keyword to the standard's CASCADE and SET NULL. Grammar
is simpler and the emphasis is on the EACH keyword.
* Multi-dimensional arrays: ON DELETE EACH CASCADE -> ON DELETE EACH SET
NULL. We cannot determine the array's number of dimensions at definition
time as it depends on the actual values. As anticipated above, we have 
some ideas on multi-dimensional element removal, but not for this patch 
for the aforementioned reasons.
* Support of EACH CASCADE/SET NULL in ConvertTriggerToFK(): we decided 
to leave it.

Regards,
Marco

-- 
Marco Nenciarini - 2ndQuadrant Italy
PostgreSQL Training, Services and Support
marco.nenciar...@2ndquadrant.it | www.2ndQuadrant.it 



EACH-foreign-key-constraints-aka-foreign-key-arrays.v3.patch.bz2
Description: application/bzip



Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-06 Thread Bruce Momjian
On Mon, Feb 06, 2012 at 08:51:34AM +0100, Heikki Linnakangas wrote:
 I wonder if we should just dedicate 3 page header bits, call that the
 page version number, and set this new version number to 1, and assume
 all previous versions were zero, and have them look in the old page
 version location if the new version number is zero.  I am basically
 thinking of how we can plan ahead to move the version number to a new
 location and have a defined way of finding the page version number using
 old and new schemes.
 
 Three bits seems short-sighted, but yeah, something like 6-8 bits
 should be enough. On the whole, though. I think we should bite the
 bullet and invent a way to extend the page header at upgrade.

I just emailed a possible layout that allows for future page version
number expansion.  

I don't think there is any magic bullet that will allow for page header
extension by pg_upgrade.  If it is going to be done, it would have to
happen in the backend while the system is running.  Anything that
requires pg_upgrade to do lots of reads or writes basically makes
pg_upgrade useless.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] Report: race conditions in WAL replay routines

2012-02-06 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 The existing errmsg of "tablespace is not empty" doesn't cover all
 reasons why tablespace was not removed.

Yeah, in fact that particular statement is really pretty bogus for the
replay case, because as the comment says we know that the tablespace
*is* empty so far as full-fledged database objects are concerned.

 The final message should have
 errmsg "tablespace not fully removed"
 errhint "you should resolve this manually if it causes further problems"

Planning to go with this:

                        errmsg("directories for tablespace %u could not be removed",
                               xlrec->ts_id),
                        errhint("You can remove the directories manually if necessary.")));

I thought about an errdetail, but the preceding LOG entries from
destroy_tablespace_directories should provide the details reasonably
well.

regards, tom lane



Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-06 Thread Bruce Momjian
On Mon, Feb 06, 2012 at 12:59:59PM -0500, Bruce Momjian wrote:
   I'm also not very comfortable with the idea of having flags on the page
   indicating whether it has a checksum or not. It's not hard to imagine a
   software or firmware bug or hardware failure that would cause pd_flags 
   field
   to be zeroed out altogether. It would be more robust if the expected
   bit-pattern was not 0-0, but 1-0 or 0-1, but then you need to deal with 
   that
   at upgrade somehow. And it still feels a bit whacky anyway.
 
  Good idea. Lets use
  
  0-0-0 to represent upgraded from previous version, needs a bit set
  0-0-1 to represent new version number of page, no checksum
  1-1-1 to represent new version number of page, with checksum
  
  So we have 1 bit dedicated to the page version, 2 bits to the checksum 
  indicator
 
 Interesting point that we would not be guarding against a bit flip from
 1 to 0 for the checksum bit; I agree using two bits is the way to go.  I
 don't see how upgrade figures into this.
 
 However, I am unclear how Simon's idea above actually works.  We need
 two bits for redundancy, both 1, to mark a page as having a checksum.  I
 don't think mixing the idea of a new page version and checksum enabled
 really makes sense, especially since we have to plan for future page
 version changes.
 
 I think we dedicate 2 bits to say we have computed a checksum, and 3
 bits to mark up to 8 possible page versions, so the logic is, in
 pd_flags, we use bits 0x0008 and 0x0010 to indicate that a checksum is
 stored on the page, and we use 0x0020 and later for the page version
 number.  We can assume all the remaining bits are for the page version
 number until we need to define new bits, and we can start storing them
 at the end first, and work forward.  If all the version bits are zero,
 it means the page version number is still stored in pd_pagesize_version.

A simpler solution would be to place two bits for checksum after the
existing page bit usage, and place the page version on the last four
bits of the 16-bit field --- that still leaves us with 7 unused bits.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] Report: race conditions in WAL replay routines

2012-02-06 Thread Simon Riggs
On Mon, Feb 6, 2012 at 6:32 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Simon Riggs si...@2ndquadrant.com writes:
 The existing errmsg of "tablespace is not empty" doesn't cover all
 reasons why tablespace was not removed.

 Yeah, in fact that particular statement is really pretty bogus for the
 replay case, because as the comment says we know that the tablespace
 *is* empty so far as full-fledged database objects are concerned.

 The final message should have
 errmsg "tablespace not fully removed"
 errhint "you should resolve this manually if it causes further problems"

 Planning to go with this:

                         errmsg("directories for tablespace %u could not be removed",
                                xlrec->ts_id),
                         errhint("You can remove the directories manually if necessary.")));

 I thought about an errdetail, but the preceding LOG entries from
 destroy_tablespace_directories should provide the details reasonably
 well.

Looks good.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] ecpglib use PQconnectdbParams

2012-02-06 Thread Michael Meskes
On Fri, Feb 03, 2012 at 01:15:30PM +0100, Christian Ullrich wrote:
 Shouldn't these be [7]? You can have up to 6 items, plus the terminator.

I take it only keywords have to be [7], right? Committed, thanks for spotting 
this.

There seems to be one more problem that I haven't had time to tackle yet. With
this patch the connection regression test (make checktcp) doesn't run through
anymore because one test connection cannot be made. It spits out the error
message:

FATAL:  invalid command-line arguments for server process 
HINT:  Try "postgres --help" for more information.

This connection used to work without the patch and besides the error message is
not really telling in this context.

So if anyone is interested in looking at this fine, if not I will as soon as I
find some spare time.

Michael
-- 
Michael Meskes
Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org)
Michael at BorussiaFan dot De, Meskes at (Debian|Postgresql) dot Org
Jabber: michael.meskes at googlemail dot com
VfL Borussia! Força Barça! Go SF 49ers! Use Debian GNU/Linux, PostgreSQL



Re: [HACKERS] freezing multixacts

2012-02-06 Thread Alvaro Herrera

Excerpts from Robert Haas's message of lun feb 06 13:19:14 -0300 2012:
 
 On Mon, Feb 6, 2012 at 9:31 AM, Alvaro Herrera
 alvhe...@commandprompt.com wrote:
  Suppose you have a tuple A which is locked by a series of transactions
  T0, T1, T2, ...; AIUI, each new locker is going to have to create a
  new mxid with all the existing entries plus a new one for itself.
  But, unless I'm confused, as it's doing so, it can discard any entries
  for locks taken by transactions which are no longer running.
 
  That's correct.  But the problem is a tuple that is locked or updated by
  a very old transaction that doesn't commit or rollback, and the tuple is
  never locked again.  Eventually the Xid could remain live while the mxid
  is in wraparound danger.
 
 Ah, I see.  I think we should probably handle that the same way we do
 for XIDs: try to force autovac when things get tight, then start
 issuing warnings, and finally just refuse to assign any more MXIDs.

Agreed.

 Another thing that might make sense, for both XIDs and MXIDs, is to
 start killing transactions that are preventing vacuum/autovacuum from
 doing their thing.  This could mean either killing the people who are
 holding back RecentGlobalXmin, so that we can actually freeze the old
 stuff; or killing people who are holding a conflicting lock, using the
 recovery-conflict stuff or some adaptation of it.

Yeah -- right now we only emit some innocuous-looking messages, which
I've seen most people ignore until they get bitten by a forced
anti-wraparound vacuum.  It'd be nice to get more aggressive about that
as the situation gets more critical.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-06 Thread Bruce Momjian
On Fri, Jan 27, 2012 at 09:37:37AM -0500, Robert Haas wrote:
 On Fri, Jan 27, 2012 at 9:27 AM, Peter Geoghegan pe...@2ndquadrant.com 
 wrote:
  Well, I don't think it's all that subjective - it's more the case that
  it is just difficult, or it gets that way as you consider more
  specialisations.
 
 Sure it's subjective.  Two well-meaning people could have different
 opinions without either of them being wrong.  If you do a lot of
 small, in-memory sorts, more of this stuff is going to seem worthwhile
 than if you don't.
 
  As for what types/specialisations may not make the cut, I'm
  increasingly convinced that floats (in the following order: float4,
  float8) should be the first to go. Aside from the fact that we cannot
  use their specialisations for anything like dates and timestamps,
  floats are just way less useful than integers in the context of
  database applications, or at least those that I've been involved with.
  As important as floats are in the broad context of computing, it's
  usually only acceptable to store data in a database as floats within
  scientific applications, and only then when their limitations are
  well-understood and acceptable. I think we've all heard anecdotes at
  one time or another, involving their limitations not being well
  understood.
 
 While we're waiting for anyone else to weigh in with an opinion on the
 right place to draw the line here, do you want to post an updated
 patch with the changes previously discussed?

Well, I think we have to ask not only how many people are using
float4/8, but how many people are sorting or creating indexes on them.
I think it would be few, and perhaps those specializations should be
eliminated.

Peter Geoghegan obviously has done some serious work in improving
sorting, and worked well with the community process.  He has done enough
analysis that I am hard-pressed to see how we would get similar
improvement using a different method, so I think it comes down to
whether we want the 28% speedup by adding 55k (1%) to the binary.

I think Peter has shown us how to get that, and what it will cost --- we
just need to decide now whether it is worth it.  What I am saying is
there probably isn't a cheaper way to get that speedup, either now or in
the next few years.  (COPY might need similar help for speedups.)

I believe this is a big win and well worth the increased binary size
because the speed up is significant, and because it is of general
usefulness for a wide range of queries.  Either of these would be enough
to justify the additional 1% size, but both make it an easy decision for
me.  

FYI, I believe COPY needs similar optimizations; we have gotten repeated
complaints about its performance and this method of optimization might
also be our only option.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] Assertion failure in AtCleanup_Portals

2012-02-06 Thread Tom Lane
Pavan Deolasee pavan.deola...@gmail.com writes:
 If I run the following sequence of commands, I get an assertion
 failure in current HEAD.

 postgres=# BEGIN;
 BEGIN
 postgres=# SELECT 1/0;
 ERROR:  division by zero
 postgres=# ROLLBACK TO A;
 ERROR:  no such savepoint
 postgres=# \q

 The process fails when the session is closed and aborted transaction
 gets cleaned at the proc_exit time. The stack trace is as below.

Hmm.  It works fine if you issue an actual ROLLBACK command there,
so my gut reaction is that AbortOutOfAnyTransaction isn't sufficiently
duplicating the full-fledged ROLLBACK code path.  No time to dig further
right now though.

regards, tom lane



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-06 Thread Bruce Momjian
On Mon, Feb 06, 2012 at 04:19:07PM -0500, Bruce Momjian wrote:
 Peter Geoghegan obviously has done some serious work in improving
 sorting, and worked well with the community process.  He has done enough
 analysis that I am hard-pressed to see how we would get similar
 improvement using a different method, so I think it comes down to
 whether we want the 28% speedup by adding 55k (1%) to the binary.
 
 I think Peter has shown us how to get that, and what it will cost --- we
 just need to decide now whether it is worth it.  What I am saying is
 there probably isn't a cheaper way to get that speedup, either now or in
 the next few years.  (COPY might need similar help for speedups.)
 
 I believe this is a big win and well worth the increased binary size
 because the speed up is significant, and because it is of general
 usefulness for a wide range of queries.  Either of these would be enough
 to justify the additional 1% size, but both make it an easy decision for
 me.  
 
 FYI, I believe COPY needs similar optimizations; we have gotten repeated
 complaints about its performance and this method of optimization might
 also be our only option.

Sorry, that was wordy.  What I am saying is that years ago we did
hot-spot optimization for storage and tuple access using macros.  We
have looked for cheap optimizations for sort (and COPY) for years, but
haven't found any.  If we want optimizations in these areas, we are
probably going to need to do the same sort of hot-spot optimization we
did earlier, and Peter's work is an excellent candidate for that.

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-06 Thread Peter Geoghegan
On 6 February 2012 21:19, Bruce Momjian br...@momjian.us wrote:
 Peter Geoghegan obviously has done some serious work in improving
 sorting, and worked well with the community process.

Thank you for acknowledging that.

It's unfortunate that C does not support expressing these kinds of
ideas in a more natural way.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



[HACKERS] semi-PoC: kNN-gist for cubes

2012-02-06 Thread Jay Levitt
I have a rough proof-of-concept for getting nearest-neighbor searches 
working with cubes.  When I say "rough", I mean "I have no idea what I'm 
doing and I haven't written C for 15 years but I hear it got standardized, 
please don't hurt me".  It seems to be about 400x faster for a 3D cube with 
1 million rows, more like 10-30x for a 6D cube with 10 million rows.


The patch adds operator <-> (which is just the existing cube_distance 
function) and support function 8, distance (which is just g_cube_distance, a 
wrapper around cube_distance).


The code is in no way production-quality; it is in fact right around "look! 
it compiles!", complete with pasted-in, commented-out code from something I 
was mimicking.  I thought I'd share at this early stage in the hopes I might 
get some pointers, such as:


- What unintended consequences should I be looking for?
- What benchmarks should I do?
- What kind of edge cases might I consider?
- I'm just wrapping cube_distance and calling it through DirectFunctionCall; 
it's probably more proper to extract out the real function and call it 
from both cube_distance and g_cube_distance. Right?

- What else don't I know?  (Besides C, funny man.)

The patch, such as it is, is at:

https://github.com/jaylevitt/postgres/commit/9cae4ea6bd4b2e582b95d7e1452de0a7aec12857

with an even-messier test at

https://github.com/jaylevitt/postgres/commit/daa33e30acaa2c99fe554d88a99dd7d78ff6c784

I initially thought this patch made inserting and indexing slower, but then 
I realized the fast version was doing 1 million rows, and the slow one did 
10 million rows.  Which means: dinnertime.


Jay Levitt



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-06 Thread Bruce Momjian
On Mon, Feb 06, 2012 at 10:49:10PM +, Peter Geoghegan wrote:
 On 6 February 2012 21:19, Bruce Momjian br...@momjian.us wrote:
  Peter Geoghegan obviously has done some serious work in improving
  sorting, and worked well with the community process.
 
 Thank you for acknowledging that.
 
 It's unfortunate that C does not support expressing these kinds of
 ideas in a more natural way.

Yes, it is a problem, and a benefit.  We have avoided C++ because these
types of trade-offs that we are discussing are often done invisibly, so
we can't make the decision ourselves.

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-06 Thread Bruce Momjian
On Mon, Feb 06, 2012 at 06:43:04PM -0500, Bruce Momjian wrote:
 On Mon, Feb 06, 2012 at 10:49:10PM +, Peter Geoghegan wrote:
  On 6 February 2012 21:19, Bruce Momjian br...@momjian.us wrote:
   Peter Geoghegan obviously has done some serious work in improving
   sorting, and worked well with the community process.
  
  Thank you for acknowledging that.
  
  It's unfortunate that C does not support expressing these kinds of
  ideas in a more natural way.
 
 Yes, it is a problem, and a benefit.  We have avoided C++ because these
 types of trade-offs that we are discussing are often done invisibly, so
 we can't make the decision ourselves.

Let me add that while it is fine for languages like C++ to make these
decisions for application code automatically, operating systems and
high-performance database developers prefer to make such decisions
explicitly, which is what we are discussing now.

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] When do we lose column names?

2012-02-06 Thread Andrew Dunstan



On 11/16/2011 10:38 PM, Tom Lane wrote:

I wrote:

PFC, a patch that does this.

Upon further review, this patch would need some more work even for the
RowExpr case, because there are several places that build RowExprs
without bothering to build a valid colnames list.  It's clearly soluble
if anyone cares to put in the work, but I'm not personally excited
enough to pursue it ...



The patch itself causes a core dump with the current regression tests. 
This test:


   SELECT array_to_json(array_agg(q),false)
     FROM ( SELECT $$a$$ || x AS b, y AS c,
            ARRAY[ROW(x.*,ARRAY[1,2,3]),
                  ROW(y.*,ARRAY[4,5,6])] AS z
            FROM generate_series(1,2) x,
                 generate_series(4,5) y) q;


causes a failure at line 967 of execTuples.c:

   Assert(list_length(exprList) == list_length(namesList));

I'm investigating further.

I've been looking at the other places that build RowExprs. Most of them 
look OK and none seem clearly in need of fixing at first glance. Which 
do you think need attention?


cheers

andrew




Re: [HACKERS] double writes using double-write buffer approach [WIP]

2012-02-06 Thread Dan Scales
I don't know a lot about base backup, but it sounds like full_page_writes must 
be turned on for base backup, in order to deal with the inconsistent reads of 
pages (which you might call torn pages) that can happen when you backup the 
data files while the database is running.  The relevant parts of the WAL log 
are then copied separately (and consistently) once the backup of the data files 
is done, and used to recover the database into a consistent state later.

So, yes, good point -- double writes cannot replace the functionality of 
full_page_writes for base backup.  If double writes were in use, they might be 
automatically switched over to full page writes for the duration of the base 
backup.  And the double write file should not be part of the base backup.

Dan

- Original Message -
From: Fujii Masao masao.fu...@gmail.com
To: Dan Scales sca...@vmware.com
Cc: PG Hackers pgsql-hackers@postgresql.org
Sent: Monday, February 6, 2012 3:08:15 AM
Subject: Re: [HACKERS] double writes using double-write buffer approach [WIP]

On Sat, Jan 28, 2012 at 7:31 AM, Dan Scales sca...@vmware.com wrote:
 Let me know if you have any thoughts/comments, etc.  The patch is
 enclosed, and the README.doublewrites is updated a fair bit.

ISTM that the double-write cannot prevent torn pages in either the double-write
file or the data file during *base backup*, because both the double-write file
and the data file can be backed up while being written. Is this right? To avoid
the torn-page problem, should we write FPIs to WAL during online backup even if
the double-write has been committed?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



[HACKERS] incorrect handling of the timeout in pg_receivexlog

2012-02-06 Thread Fujii Masao
Hi,

When I compiled HEAD with --disable-integer-datetimes and tested
pg_receivexlog, I encountered unexpected replication timeout. As
far as I read the pg_receivexlog code, the cause of this problem is
that pg_receivexlog handles the standby message timeout incorrectly
in --disable-integer-datetimes. The attached patch fixes this problem.
Comments?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index 8ca3882..af4add0 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -62,7 +62,7 @@ open_walfile(XLogRecPtr startpoint, uint32 timeline, char *basedir, char *namebu
 	f = open(fn, O_WRONLY | O_CREAT | PG_BINARY, S_IRUSR | S_IWUSR);
 	if (f == -1)
 	{
-		fprintf(stderr, _("%s: Could not open WAL segment %s: %s\n"),
+		fprintf(stderr, _("%s: could not open WAL segment %s: %s\n"),
 progname, fn, strerror(errno));
 		return -1;
 	}
@@ -190,6 +190,24 @@ localGetCurrentTimestamp(void)
 }
 
 /*
+ * Local version of TimestampDifferenceExceeds(), since we are not
+ * linked with backend code.
+ */
+static bool
+localTimestampDifferenceExceeds(TimestampTz start_time,
+		   TimestampTz stop_time,
+		   int msec)
+{
+	TimestampTz diff = stop_time - start_time;
+
+#ifdef HAVE_INT64_TIMESTAMP
+	return (diff >= msec * INT64CONST(1000));
+#else
+	return (diff * 1000.0 >= msec);
+#endif
+}
+
+/*
  * Receive a log stream starting at the specified position.
  *
  * If sysidentifier is specified, validate that both the system
@@ -301,7 +319,8 @@ ReceiveXlogStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline, char *sysi
 		 */
 		now = localGetCurrentTimestamp();
 		if (standby_message_timeout > 0 &&
-			last_status < now - standby_message_timeout * 1000000)
+			localTimestampDifferenceExceeds(last_status, now,
+			standby_message_timeout * 1000))
 		{
 			/* Time to send feedback! */
 			char		replybuf[sizeof(StandbyReplyMessage) + 1];



Re: [HACKERS] patch for implementing SPI_gettypemod()

2012-02-06 Thread Chetan Suttraway
On Thu, Feb 2, 2012 at 8:11 PM, Robert Haas robertmh...@gmail.com wrote:

 On Wed, Feb 1, 2012 at 5:23 AM, Chetan Suttraway
 chetan.suttra...@enterprisedb.com wrote:
  Hi All,
 
  This is regarding the TODO item :
  Add SPI_gettypmod() to return a field's typemod from a TupleDesc
 
  The related message is:
  http://archives.postgresql.org/pgsql-hackers/2005-11/msg00250.php
 
  This basically talks about having an SPI_gettypemod() which returns the
  typmod of a field of tupdesc
 
  Please refer the attached patch based on the suggested implementation.

 Please add this to the next CommitFest:

 https://commitfest.postgresql.org/action/commitfest_view/open

 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company


At the given link, I am able to choose only "System Administration" under
the commitfest topic.
I think there should also be "Server Features" or "Miscellaneous".


Regards,
Chetan

-- 
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

 Website: www.enterprisedb.com
EnterpriseDB Blog : http://blogs.enterprisedb.com
Follow us on Twitter : http://www.twitter.com/enterprisedb


Re: [HACKERS] incorrect handling of the timeout in pg_receivexlog

2012-02-06 Thread Fujii Masao
On Tue, Feb 7, 2012 at 2:58 PM, Fujii Masao masao.fu...@gmail.com wrote:
 Hi,

 When I compiled HEAD with --disable-integer-datetimes and tested
 pg_receivexlog, I encountered unexpected replication timeout. As
 far as I read the pg_receivexlog code, the cause of this problem is
 that pg_receivexlog handles the standby message timeout incorrectly
 in --disable-integer-datetimes. The attached patch fixes this problem.
 Comments?

receivelog.c
---
timeout.tv_sec = last_status + standby_message_timeout - now - 1;
if (timeout.tv_sec <= 0)
---

Umm.. the above code also handles the timestamp incorrectly. ISTM that the
root cause of these problems is that receivelog.c uses TimestampTz. What
about changing receivelog.c so that it uses time_t instead of TimestampTz?
Which would make the code simpler, I think.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
