Re: [HACKERS] Using pg_upgrade on log-shipping standby servers

2012-07-19 Thread Bruce Momjian
On Wed, Jul 18, 2012 at 09:36:51AM -0400, Robert Haas wrote:
 On Tue, Jul 17, 2012 at 6:02 PM, Bruce Momjian br...@momjian.us wrote:
  However, I have two ideas.  First, I don't know _why_ the
  primary/standby would be any different after pg_upgrade, so I added the
  documentation mention because I couldn't _guarantee_ they were the same.
  Actually, if people can test this, we might be able to say this is safe.
 
  Second, the user files (large) are certainly identical, it is only the
  system tables (small) that _might_ be different, so rsync'ing just those
  would add the guarantee, but I know of no easy way to rsync just the
  system tables.
 
 I'm scratching my head in confusion here.  After pg_upgrade, the
 master is a completely new cluster.  The system catalog contents are
 completely different, and so are things like the database system
 identifier and the WAL position - yeah, the latter is approximately
 the same, but almost doesn't count except in horseshoes.  Obviously
 any attempt to replay WAL from the new cluster on the old cluster is
 doomed to failure, at least unless we do a bunch more engineering here
 that hasn't really been thought about yet.

No, the point is they run pg_upgrade on the stopped primary and stopped
standbys.  Are those the same?  I am not really sure.

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgsql_fdw in contrib

2012-07-19 Thread Kohei KaiGai
Hanada-san,

What is the status of your patch?

Even though the 1st commit-fest is closing soon,
I'd like to put some effort into reviewing so that pgsql_fdw
can be marked ready for committer by the beginning of the
upcoming commit-fest.

Thanks,

2012/7/13 Shigeru HANADA shigeru.han...@gmail.com:
 (2012/07/12 20:48), Kohei KaiGai wrote:
 It seems to me that what postgresql_fdw_validator() is doing looks like
 a function that should be named libpq_fdw_validator().

 What is your opinion? It would help with this namespace conflict.

 I'd prefer dblink_fdw_validator.

 The name libpq_fdw_validator suggests to me that a concrete FDW named
 libpq_fdw exists somewhere and retrieves external data *from* libpq.
 Indeed, postgresql_fdw_validator allows only some of the libpq options at the
 moment, but we won't be able to rename it, for backward compatibility's sake,
 even if it needs to accept non-libpq options in the future.

 IMO each FDW validator should basically be owned by a particular FDW,
 because in most cases the validator needs deep knowledge of the FDW's
 internals.  In addition, it may want to accept new options for new features.

 Besides naming, as mentioned upthread, removing the hard-coded libpq options
 list from dblink and leaving it to the libpq client library would make
 dblink more robust against future libpq option changes.

 Regards,
 --
 Shigeru HANADA


-- 
KaiGai Kohei kai...@kaigai.gr.jp

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] [PATCH] XLogReader v2

2012-07-19 Thread Andres Freund
Hi,

Attached is v2 of the patch.

Changes are:
* more comments
* significantly cleaned/simplified code
* crc validation
* addition of XLogReaderReadOne

Definitely needed are:
* better validation of records
* customizable error handling

The first is just work that needs to be done, nothing complicated. 
The second is a bit more complicated:
- We could have a bool had_error and a static char buffer that contains the 
error message; the caller can handle that as wanted
- We could have a callback for error handling

I think I prefer the callback solution.
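
For illustration, a minimal sketch of what the callback variant could look
like; the type and field names below are hypothetical, not from the attached
patch:

/* hypothetical error-reporting hook for the reader */
typedef void (*XLogReaderErrorCB) (struct XLogReaderState *state,
								   const char *message);

/* a caller would install its handler before starting to read, e.g. */
reader->error_cb = my_error_handler;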


The second attached patch is a very, very preliminary xlog dumping utility 
which currently is more of a debugging facility (as evidenced by the fact that 
it needs an existing /tmp/xlog directory for writing out data) for the 
XLogReader. It reuses the builtin xlog dumping logic and thus has to link with 
backend code. I couldn't find a really sensible way to do this:

xlogdump: $(OBJS) $(shell find ../../backend ../../timezone -name 
objfiles.txt|xargs cat|tr -s " " "\012"|grep -v /main.o|sed 's/^/..\/..\/..
$(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)

Perhaps somebody has a better idea? I think having an xlogdump utility in 
core/contrib would be a good idea now that it can be done without a huge 
amount of code duplication. I plan to check Satoshi-san's version of xlogdump 
to see whether I can crib some of the command-line interface and some code from there.

Greetings,

Andres

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
From c32aa648a63f8ca78c03dd6ac5177afef9f0a8f3 Mon Sep 17 00:00:00 2001
From: Andres Freund and...@anarazel.de
Date: Thu, 7 Jun 2012 11:40:08 +0200
Subject: [PATCH 1/2] Add support for a generic wal reading facility dubbed
 XLogReader

Features:
- streaming reading/writing
- filtering
- reassembly of records

Reusing the ReadRecord infrastructure in situations where the code that wants
to do so is not tightly integrated into xlog.c is rather hard and would require
changes to rather integral parts of the recovery code which doesn't seem to be
a good idea.

Missing:
- compressing the stream when removing uninteresting records
- writing out correct CRCs
- separating reader/writer
---
 src/backend/access/transam/Makefile |2 +-
 src/backend/access/transam/xlogreader.c | 1026 +++
 src/include/access/xlogreader.h |  264 
 3 files changed, 1291 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/access/transam/xlogreader.c
 create mode 100644 src/include/access/xlogreader.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index f82f10e..660b5fc 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -13,7 +13,7 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = clog.o transam.o varsup.o xact.o rmgr.o slru.o subtrans.o multixact.o \
-	twophase.o twophase_rmgr.o xlog.o xlogfuncs.o xlogutils.o
+	twophase.o twophase_rmgr.o xlog.o xlogfuncs.o xlogreader.o xlogutils.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
new file mode 100644
index 000..c689a98
--- /dev/null
+++ b/src/backend/access/transam/xlogreader.c
@@ -0,0 +1,1026 @@
+/*-
+ *
+ * xlogreader.c
+ *		Generic xlog reading facility
+ *
+ * Portions Copyright (c) 2012, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		src/backend/access/transam/readxlog.c
+ *
+ * NOTES
+ *		Documentation about how to use this interface can be found in
+ *		xlogreader.h, more specifically in the definition of the
+ *		XLogReaderState struct where all parameters are documented.
+ *
+ * TODO:
+ * * more extensive validation of read records
+ * * separation of reader/writer
+ * * customizable error response
+ * * usable without backend code around
+ *-
+ */
+
+#include "postgres.h"
+
+#include "access/xlog_internal.h"
+#include "access/transam.h"
+#include "catalog/pg_control.h"
+#include "access/xlogreader.h"
+
+/* If (very) verbose debugging is needed:
+ * #define VERBOSE_DEBUG
+ */
+
+XLogReaderState*
+XLogReaderAllocate(void)
+{
+	XLogReaderState* state = (XLogReaderState*)malloc(sizeof(XLogReaderState));
+	int i;
+
+	if (!state)
+		goto oom;
+
+	memset(state->buf.record, 0, sizeof(XLogRecord));
+	state->buf.record_data_size = XLOG_BLCKSZ*8;
+	state->buf.record_data =
+			malloc(state->buf.record_data_size);
+
+	if (!state->buf.record_data)
+		goto oom;
+
+	memset(state->buf.record_data, 0, state->buf.record_data_size);
+	state->buf.origptr = InvalidXLogRecPtr;
+
+	for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
+	{
+		state->buf.bkp_block_data[i] =
+			malloc(BLCKSZ);
+
+		if (!state->buf.bkp_block_data[i])
+			goto oom;

Re: [HACKERS] [PERFORM] DELETE vs TRUNCATE explanation

2012-07-19 Thread Robert Haas
On Wed, Jul 18, 2012 at 5:17 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I've been chewing on this issue some more, and no longer like my
 previous proposal, which was

 ... What I'm thinking about
 is reducing the hash key to just RelFileNodeBackend + ForkNumber,
 so that there's one hashtable entry per fork, and then storing a
 bitmap to indicate which segment numbers need to be sync'd.  At
 one gigabyte to the bit, I think we could expect the bitmap would
 not get terribly large.  We'd still have a cancel flag in each
 hash entry, but it'd apply to the whole relation fork not each
 segment.

 The reason that's not so attractive is the later observation that what
 we really care about optimizing is FORGET_RELATION_FSYNC for all the
 forks of a relation at once, which we could produce just one request
 for with trivial refactoring of smgrunlink/mdunlink.  The above
 representation doesn't help for that.  So what I'm now thinking is that
 we should create a second hash table, with key RelFileNode only,
 carrying two booleans: a cancel-previous-fsyncs bool and a
 please-unlink-after-checkpoint bool.  (The latter field would allow us
 to drop the separate pending-unlinks data structure.)  Entries would
 be made in this table when we got a FORGET_RELATION_FSYNC or
 UNLINK_RELATION_REQUEST message -- note that in 99% of cases we'd get
 both message types for each relation, since they're both created during
 DROP.  (Maybe we could even combine these request types.)  To use the
 table, as we scan the existing per-fork-and-segment hash table, we'd
 have to do a lookup in the per-relation table to see if there was a
 later cancel message for that relation.  Now this does add a few cycles
 to the processing of each pendingOpsTable entry in mdsync ... but
 considering that the major work in that loop is an fsync call, it is
 tough to believe that anybody would notice an extra hashtable lookup.

Seems a bit complex, but it might be worth it.  Keep in mind that I
eventually want to be able to make an unlogged table logged or vice
versa, which will probably entail unlinking just the init fork (for
the logged -> unlogged direction).

 However, I also came up with an entirely different line of thought,
 which unfortunately seems incompatible with either of the improved
 table designs above.  It is this: instead of having a request queue
 that feeds into a hash table hidden within the checkpointer process,
 what about storing the pending-fsyncs table as a shared hash table
 in shared memory?  That is, ForwardFsyncRequest would not simply
 try to add the request to a linear array, but would do a HASH_ENTER
 call on a shared hash table.  This means the de-duplication occurs
 for free and we no longer need CompactCheckpointerRequestQueue at all.
 Basically, this would amount to saying that the original design was
 wrong to try to micro-optimize the time spent in ForwardFsyncRequest,
 and that we'd rather pay a little more per ForwardFsyncRequest call
 to avoid the enormous response-time spike that will occur when
 CompactCheckpointerRequestQueue has to run.  (Not to mention that
 the checkpointer would eventually have to do HASH_ENTER anyway.)
 I think this would address your observation above that the request
 queue tends to contain an awful lot of duplicates.

I'm not concerned about the queue *containing* a large number of
duplicates; I'm concerned about the large number of duplicate
*requests*.  Under either the current system or this proposal, every
time we write a block, we must take and release CheckpointerCommLock.
Now, I have no evidence that there's actually a bottleneck there, but
if there is, this proposal won't fix it.  In fact, I suspect on the
whole it would make things worse, because while it's true that
CompactCheckpointerRequestQueue is expensive, it shouldn't normally be
happening at all, because the checkpointer should be draining the
queue regularly enough to prevent it from filling.  So except when the
system is in the pathological state where the checkpointer becomes
unresponsive because it's blocked in-kernel on a very long fsync and
there is a large amount of simultaneous write activity, each process
that acquires CheckpointerCommLock holds it for just long enough to
slam a few bytes of data into the queue, which is very cheap.  I
suspect that updating a hash table would be significantly more
expensive, and we'd pay whatever that extra overhead is on every fsync
request, not just in the unusual case where we manage to fill the
queue.  So I don't think this is likely to be a win.

If you think about the case of an UPDATE statement that hits a large
number of blocks in the same relation, it sends an fsync request for
every single block.  Really, it's only necessary to send a new fsync
request if the checkpointer has begun a new checkpoint cycle in the
meantime; otherwise, the old request is still pending and will cover
the new write as well.  But there's no way for the backend doing the
writes to know 

Re: [HACKERS] bgwriter, regression tests, and default shared_buffers settings

2012-07-19 Thread Robert Haas
On Wed, Jul 18, 2012 at 3:59 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 In short, then, the background writer process is entirely useless for
 any database that fits completely into shared buffers.

Or to phrase that a bit more positively, there's no reason to do a
bunch of unnecessary writes if we are lucky enough to encounter the
happy situation where the database fits in shared buffers.  The
background writer's reason for existence is to make buffer eviction
faster by cleaning buffers that will soon be evicted, so if we're not
going to evict any buffers then we needn't clean them either (except
at checkpoint time).

 So that raises two independent sets of questions:

 1. Do we like the fact that the bgwriter isn't doing anything in this
 situation?  It seems arguably OK for writes to happen only for
 checkpointing purposes if there is no memory pressure.  But having the
 bgwriter wake up 5 times a second to decide it has nothing to do seems
 a bit wasteful.  I'm inclined to think maybe it should go into the
 recently added hibernation mode anytime the buffer freelist isn't
 empty.  Or maybe you could argue that this scenario isn't worth any
 optimization effort, but with many-gig RAM becoming more and more
 common, I don't think I agree.

I feel like the hibernation behavior ought to be tied to buffer
eviction, not the freelist.  When there's no buffer eviction
happening, the background writer should hibernate, because there's no
need to clean buffers in preparation for future eviction in that case.
 It is true that when the freelist is non-empty, there's no buffer
eviction occurring, but that will typically only happen at start-up.
It's not uncommon to have a database that is larger than
shared_buffers but whose active portion is smaller than
shared_buffers.  In that case you expect the freelist to converge to
empty (since the only things that put buffers back on the freelist
after startup are relation or database drops) but yet you probably
don't need the background writer working.

Another consideration is that we might actually want to arrange things
so that the free-list remains non-empty on an ongoing basis.  Right
now buffer eviction is a major scalability bottleneck.  Maybe we'll
find some other way to fix that, but then again maybe we won't.

 2. It's rather disturbing that a fairly large swath of functionality
 just stopped getting tested at all by the buildfarm.  Do we want to
 rethink the shared_buffers increase?  Or artificially bloat the
 regression database to make it larger than 128MB?  Or do something else
 to ensure we still exercise the DB-bigger-than-buffers case?

It seems like it could be useful to test with a variety of
shared_buffers settings.  Maybe we should even have one or two
buildfarm animals that run with a REALLY small shared_buffers setting,
like 1MB, just to see if that breaks anything.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] bgwriter, regression tests, and default shared_buffers settings

2012-07-19 Thread Robert Haas
On Wed, Jul 18, 2012 at 5:30 PM, Andrew Dunstan and...@dunslane.net wrote:
 Or we could provide an initdb flag which would set an upper bound on
 shared_buffers, and have make check (at least) use it.

How about a flag that sets the exact value for shared_buffers, rather
than a maximum?  I think a lot of users would like initdb
--shared-buffers=8GB or whatever.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CHECK NO INHERIT syntax

2012-07-19 Thread Robert Haas
On Wed, Jul 18, 2012 at 5:49 PM, Peter Eisentraut pete...@gmx.net wrote:
 Sorry to raise this once again, but I still find this CHECK NO INHERIT
 syntax to be a bit funny.  We are currently using something like

 CHECK NO INHERIT (foo > 0)

 But we already have a different syntax for attaching attributes to
 constraints (NOT DEFERRABLE, NOT VALID,  etc.), so it would make more
 sense to have

 CHECK (foo > 0) NO INHERIT

 Besides consistency, this makes more sense, because the attribute is a
 property of the constraint as a whole, not of the checking.

 This would also extend more easily to other constraint types.  For
 example, when unifying CHECK and NOT NULL constraints, as is planned, or
 when allowing inherited unique constraints, as is planned further down
 the road.

+1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using pg_upgrade on log-shipping standby servers

2012-07-19 Thread Robert Haas
On Thu, Jul 19, 2012 at 2:38 AM, Bruce Momjian br...@momjian.us wrote:
 No, the point is they run pg_upgrade on the stopped primary and stopped
 standbys.  Are those the same?  I am not really sure.

Of course not.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] reminder: 9.2 branch needs building by buildfarm animals

2012-07-19 Thread Andrew Dunstan
Alvaro has reminded me that I haven't sent out a notification to 
buildfarm owners that they might need to add REL9_2_STABLE, which was 
branched some time ago, to their rotations.


If you're using the latest code and running run_branches.pl and you have 
$conf{branches_to_build} = 'ALL' you won't need to worry about this 
because you would have picked this up automatically.


cheers

andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] isolation check takes a long time

2012-07-19 Thread Andrew Dunstan


On 07/17/2012 04:28 PM, Noah Misch wrote:

On Tue, Jul 17, 2012 at 01:56:19PM -0400, Alvaro Herrera wrote:

Excerpts from Andrew Dunstan's message of Sun Jul 15 16:42:22 -0400 2012:

I'm looking into that. But given that the default is to set
max_prepared_transactions to 0, shouldn't we just remove that test from the
normal installcheck schedule?

That's a thought -- AFAIR we do provide a numeric_big test that's not
exercised by the regular regress schedule, for a precedent.

It would be nice to have a pattern for adding tests run less often than every
commit but more often than whenever a human explicitly remembers the test
and invokes it manually.  Perhaps a schedule that the recommended buildfarm
configuration would somehow run every two weeks and before each release
(including betas and branch releases).



We have some support for that sort of thing. The optional_steps feature 
can run with a minimum number of hours between runs. Currently the only 
supported such steps are build_docs and find_typedefs. If there are 
extra tests we'd want run in that fashion then they would need code 
added for them.


Meanwhile, I would like to remove the prepared_transactions test from 
the main isolation schedule, and add a new Make target which runs that 
test explicitly. Is there any objection to that?


cheers

andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PERFORM] DELETE vs TRUNCATE explanation

2012-07-19 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 Seems a bit complex, but it might be worth it.  Keep in mind that I
 eventually want to be able to make an unlogged table logged or vice
 versa, which will probably entail unlinking just the init fork (for
 the logged -> unlogged direction).

Well, as far as that goes, I don't see a reason why you couldn't unlink
the init fork immediately on commit.  The checkpointer should not have
to be involved at all --- there's no reason to send it a FORGET FSYNC
request either, because there shouldn't be any outstanding writes
against an init fork, no?

But having said that, this does serve as an example that we might
someday want the flexibility to kill individual forks.  I was
intending to kill smgrdounlinkfork altogether, but I'll refrain.

 I think this is just over-engineered.  The originally complained-of
 problem was all about the inefficiency of manipulating the
 checkpointer's backend-private data structures, right?  I don't see
 any particular need to mess with the shared memory data structures at
 all.  If you wanted to add some de-duping logic to retail fsync
 requests, you could probably accomplish that more cheaply by having
 each such request look at the last half-dozen or so items in the queue
 and skip inserting the new request if any of them match the new
 request.  But I think that'd probably be a net loss, because it would
 mean holding the lock for longer.

What about checking just the immediately previous entry?  This would
at least fix the problem for bulk-load situations, and the cost ought
to be about negligible compared to acquiring the LWLock.
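
As a rough illustration (a hedged sketch only, not committed code), the check
could look roughly like this inside ForwardFsyncRequest, right after the lock
is taken; the names follow the 9.2 checkpointer code, but the snippet itself is
just an assumption of how it might be wired up:

	/* skip insertion if the newest queued request is identical */
	if (CheckpointerShmem->num_requests > 0)
	{
		CheckpointerRequest *prev =
			&CheckpointerShmem->requests[CheckpointerShmem->num_requests - 1];

		if (RelFileNodeBackendEquals(prev->rnode, rnode) &&
			prev->forknum == forknum &&
			prev->segno == segno)
		{
			LWLockRelease(CheckpointerCommLock);
			return true;		/* duplicate of the newest entry; nothing to do */
		}
	}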

I have also been wondering about de-duping on the backend side, but
the problem is that if a backend remembers its last few requests,
it doesn't know when that cache has to be cleared because of a new
checkpoint cycle starting.  We could advertise the current cycle
number in shared memory, but you'd still need to take a lock to
read it.  (If we had memory fence primitives it could be a bit
cheaper, but I dunno how much.)

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] bgwriter, regression tests, and default shared_buffers settings

2012-07-19 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Wed, Jul 18, 2012 at 5:30 PM, Andrew Dunstan and...@dunslane.net wrote:
 Or we could provide an initdb flag which would set an upper bound on
 shared_buffers, and have make check (at least) use it.

 How about a flag that sets the exact value for shared_buffers, rather
 than a maximum?  I think a lot of users would like initdb
 --shared-buffers=8GB or whatever.

That would be significantly harder to deploy in the buildfarm context.
We don't know that all the animals are capable of coping with 16MB
(or whatever target we settle on for make check) today.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] bgwriter, regression tests, and default shared_buffers settings

2012-07-19 Thread Andrew Dunstan


On 07/19/2012 10:12 AM, Tom Lane wrote:

Robert Haas robertmh...@gmail.com writes:

On Wed, Jul 18, 2012 at 5:30 PM, Andrew Dunstan and...@dunslane.net wrote:

Or we could provide an initdb flag which would set an upper bound on
shared_buffers, and have make check (at least) use it.

How about a flag that sets the exact value for shared_buffers, rather
than a maximum?  I think a lot of users would like initdb
--shared-buffers=8GB or whatever.

That would be significantly harder to deploy in the buildfarm context.
We don't know that all the animals are capable of coping with 16MB
(or whatever target we settle on for make check) today.




Yeah - unless we allow some fallback things could get ugly. I do like 
the idea of allowing a settable ceiling on shared_buffers instead of 
having it completely hardcoded as now.



cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] [PATCH] pg_dump: Sort overloaded functions in deterministic order

2012-07-19 Thread Joel Jacobson
Makes pg_dump sort overloaded functions in deterministic order.

The field proiargs has been added to FuncInfo and is set by getFuncs()
and getAggregates() for all functions and aggregates.

DOTypeNameCompare uses this field to break ties if the name and number of
arguments are the same. This avoids having to default to OID sorting.
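
To illustrate the idea (a hedged sketch only; see the attached patch for the
real code), the tie-break in DOTypeNameCompare might look roughly like this,
once names and argument counts have already compared equal:

	/* functions/aggregates with equal names and arg counts: compare arg types */
	if (obj1->objType == DO_FUNC || obj1->objType == DO_AGG)
	{
		FuncInfo   *fobj1 = *(FuncInfo *const *) p1;
		FuncInfo   *fobj2 = *(FuncInfo *const *) p2;

		cmpval = strcmp(fobj1->proiargs, fobj2->proiargs);
		if (cmpval != 0)
			return cmpval;
	}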

This patch is independent from the ongoing discussion of the pg_dump --split
option. Even if we can't agree on how to do the splitting of objects into
files, it still makes sense to fix the sort order of overloaded functions.


pg_dump_deterministic_order_v4.patch
Description: Binary data


pg_dump_deterministic_order.t
Description: Troff document

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using pg_upgrade on log-shipping standby servers

2012-07-19 Thread Bruce Momjian
On Thu, Jul 19, 2012 at 09:41:29AM -0400, Robert Haas wrote:
 On Thu, Jul 19, 2012 at 2:38 AM, Bruce Momjian br...@momjian.us wrote:
  No, the point is they run pg_upgrade on the stopped primary and stopped
  standbys.  Are those the same?  I am not really sure.
 
 Of course not.

OK, but why?  When the clusters are stopped they are the same, you are
running the same initdb on both machines, and running the same
pg_upgrade.  What would cause the difference, other than the Database
System Identifier, which we can deal with?  I don't think we can
guarantee they are the same, but what would guarantee they are
different?

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PERFORM] DELETE vs TRUNCATE explanation

2012-07-19 Thread Robert Haas
On Thu, Jul 19, 2012 at 10:09 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 Seems a bit complex, but it might be worth it.  Keep in mind that I
 eventually want to be able to make an unlogged table logged or vice
 versa, which will probably entail unlinking just the init fork (for
 the logged -> unlogged direction).

 Well, as far as that goes, I don't see a reason why you couldn't unlink
 the init fork immediately on commit.  The checkpointer should not have
 to be involved at all --- there's no reason to send it a FORGET FSYNC
 request either, because there shouldn't be any outstanding writes
 against an init fork, no?

Well, it gets written when it gets created.  Some of those writes go
through shared_buffers.

 But having said that, this does serve as an example that we might
 someday want the flexibility to kill individual forks.  I was
 intending to kill smgrdounlinkfork altogether, but I'll refrain.

If you want to remove it, it's OK with me.  We can always put it back
later if it's needed.  We have an SCM that allows us to revert
patches.  :-)

 What about checking just the immediately previous entry?  This would
 at least fix the problem for bulk-load situations, and the cost ought
 to be about negligible compared to acquiring the LWLock.

Well, two things:

1. If a single bulk load is the ONLY activity on the system, or more
generally if only one segment in the system is being heavily written,
then that would reduce the number of entries that get added to the
queue, but if you're doing two bulk loads on different tables at the
same time, then it might not do much.  From Greg Smith's previous
comments on this topic, I understand that having two or three entries
alternating in the queue is a fairly common pattern.

2. You say fix the problem but I'm not exactly clear what problem
you think this fixes.  It's true that the compaction code is a lot
slower than an ordinary queue insertion, but I think it generally
doesn't happen enough to matter, and when it does happen the system is
generally I/O bound anyway, so who cares?  One possible argument in
favor of doing something along these lines is that it would reduce the
amount of data that the checkpointer would have to copy while holding
the lock, thus causing less disruption for other processes trying to
insert into the request queue.  But I don't know whether that effect
is significant enough to matter.

 I have also been wondering about de-duping on the backend side, but
 the problem is that if a backend remembers its last few requests,
 it doesn't know when that cache has to be cleared because of a new
 checkpoint cycle starting.  We could advertise the current cycle
 number in shared memory, but you'd still need to take a lock to
 read it.  (If we had memory fence primitives it could be a bit
 cheaper, but I dunno how much.)

Well, we do have those, as of 9.2.  They're not being used for anything
yet, but I've been looking for an opportunity to put them into use.
sinvaladt.c's msgnumLock is an obvious candidate, but the 9.2 changes
to reduce the impact of sinval synchronization work sufficiently well
that I haven't been motivated to tinker with it any further.  Maybe it
would be worth doing just to exercise that code, though.

Or, maybe we can use them here.  But after some thought I can't see
exactly how we'd do it.  Memory barriers prevent a value from being
prefetched too early or written back to main memory too late, relative
to other memory operations by the same process, but the definition of
too early and too late is not quite clear to me here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using pg_upgrade on log-shipping standby servers

2012-07-19 Thread Robert Haas
On Thu, Jul 19, 2012 at 12:02 PM, Bruce Momjian br...@momjian.us wrote:
 On Thu, Jul 19, 2012 at 09:41:29AM -0400, Robert Haas wrote:
 On Thu, Jul 19, 2012 at 2:38 AM, Bruce Momjian br...@momjian.us wrote:
  No, the point is they run pg_upgrade on the stopped primary and stopped
  standbys.  Are those the same?  I am not really sure.

 Of course not.

 OK, but why?  When the clusters are stopped they are the same, you are
 running the same initdb on both machines, and running the same
 pg_upgrade.  What would cause the difference, other than the Database
 System Identifier, which we can deal with?  I don't think we can
 guarantee they are the same, but what would guarantee they are
 different?

There isn't any guarantee that they are different.  There's just no
guarantee that they are the same, which is enough to make this idea a
non-starter.

In general, it's pretty easy to understand that if you perform the
same series of inserts, updates, and deletes on two systems, you might
not end up with the exact same binary contents.  There are a lot of
reasons for this: any concurrent activity whatsoever - even the exact
timing of autovacuum - can cause the same tuples to end up in
different places in the two systems.  Now, admittedly, in the case of
pg_upgrade, you're restoring the dump using a single process with
absolutely no concurrent activity and even autovacuum disabled, so the
chances of ending up with entirely identical binary contents are
probably higher than average.  But even there you could have
checkpoints trigger at slightly different times while restoring the
dumps, and of course checkpoints take buffer locks, and so now a HOT
prune might happen on one machine but get skipped on the other one
because the checkpointer has dropped the lock but not the pin, and now
you're hosed.

Even if you could control for that particular possibility, there are
surely others now and there will be more in the future.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using pg_upgrade on log-shipping standby servers

2012-07-19 Thread Bruce Momjian
On Thu, Jul 19, 2012 at 12:43:23PM -0400, Robert Haas wrote:
 On Thu, Jul 19, 2012 at 12:02 PM, Bruce Momjian br...@momjian.us wrote:
  On Thu, Jul 19, 2012 at 09:41:29AM -0400, Robert Haas wrote:
  On Thu, Jul 19, 2012 at 2:38 AM, Bruce Momjian br...@momjian.us wrote:
   No, the point is they run pg_upgrade on the stopped primary and stopped
   standbys.  Are those the same?  I am not really sure.
 
  Of course not.
 
  OK, but why?  When the clusters are stopped they are the same, you are
  running the same initdb on both machines, and running the same
  pg_upgrade.  What would cause the difference, other than the Database
  System Identifier, which we can deal with?  I don't think we can
  guarantee they are the same, but what would guarantee they are
  different?
 
 There isn't any guarantee that they are different.  There's just no
 guarantee that they are the same, which is enough to make this idea a
 non-starter.
 
 In general, it's pretty easy to understand that if you perform the
 same series of inserts, updates, and deletes on two systems, you might
 not end up with the exact same binary contents.  There are a lot of
 reasons for this: any concurrent activity whatsoever - even the exact
 timing of autovacuum - can cause the same tuples to end up in
 different places in the two systems.  Now, admittedly, in the case of
 pg_upgrade, you're restoring the dump using a single process with
 absolutely no concurrent activity and even autovacuum disabled, so the
 chances of ending up with entirely identical binary contents are
 probably higher than average.  But even there you could have
 checkpoints trigger at slightly different times while restoring the
 dumps, and of course checkpoints take buffer locks, and so now a HOT
 prune might happen on one machine but get skipped on the other one
 because the checkpointer has dropped the lock but not the pin, and now
 you're hosed.
 
 Even if you could control for that particular possibility, there are
 surely others now and there will be more in the future.

I think the checkpoint issue is the ideal killer --- thanks.

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] XLogReader v2

2012-07-19 Thread Satoshi Nagayasu

2012/07/19 19:29, Andres Freund wrote:

Hi,

Attached is v2 of the patch.

Changes are:
* more comments
* significantly cleaned/simplified code
* crc validation
* addition of XLogReaderReadOne

Definitely needed are:
* better validation of records
* customizable error handling

The first is just work that needs to be done, nothing complicated.
The second is a bit more complicated:
- We could have a bool had_error and a static char buffer that contains the error
message; the caller can handle that as wanted
- We could have a callback for error handling

I think I prefer the callback solution.


The second attached patch is a very, very preliminary xlog dumping utility
which currently is more of a debugging facility (as evidenced by the fact that
it needs an existing /tmp/xlog directory for writing out data) for the
XLogReader. It reuses the builtin xlog dumping logic and thus has to link with
backend code. I couldn't find a really sensible way to do this:

xlogdump: $(OBJS) $(shell find ../../backend ../../timezone -name
objfiles.txt|xargs cat|tr -s " " "\012"|grep -v /main.o|sed 's/^/..\/..\/..
 $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)

Perhaps somebody has a better idea? I think having an xlogdump utility in
core/contrib would be a good idea now that it can be done without a huge
amount of code duplication. I plan to check Satoshi-san's version of xlogdump
to see whether I can crib some of the command-line interface and some code from there.


I agree that we need a more sophisticated way to share the code
between the backend and several utilities (including xlogdump),
but AFAIK a contrib module must be able to be built *without* the core
source tree.

Any contrib module must be able to be built with only the header files
and the shared libraries when using PGXS, so it cannot assume
that the core source tree is available. (If we need to assume that, I think
xlogdump needs to be put into the core/bin directory.)

On the other hand, I have an open issue about improving the maintainability
of the duplicated code in the xlogdump project:

Gather all the code which has been copied from the core.
https://github.com/snaga/xlogdump/issues/26

So, I agree that we need another way to share the code
between the backend and the related utilities. Any good ideas?


I have one more concern for putting xlogdump into the core.

xlogdump is intended to deliver new features and enhancements
to all users, not only those running the latest major version
but also those on older major versions maintained by the community, because
xlogdump can be quite an important tool when a DBA needs it.

In fact, the latest xlogdump now supports 5 major versions,
from 8.3 to 9.2.
https://github.com/snaga/xlogdump/blob/master/README.xlogdump

But AFAIK, putting xlogdump into core/contrib would mean that
the source tree of each major version could not receive large modifications
after each release (or each code freeze, actually).

That would mean users on older major versions could not take
advantage of new features and enhancements in the latest xlogdump,
which is not what I want, actually.

Regards,



Greetings,

Andres







--
Satoshi Nagayasu sn...@uptime.jp
Uptime Technologies, LLC. http://www.uptime.jp



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PERFORM] DELETE vs TRUNCATE explanation

2012-07-19 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Thu, Jul 19, 2012 at 10:09 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 What about checking just the immediately previous entry?  This would
 at least fix the problem for bulk-load situations, and the cost ought
 to be about negligible compared to acquiring the LWLock.

 2. You say fix the problem but I'm not exactly clear what problem
 you think this fixes.

What I'm concerned about is that there is going to be a great deal more
fsync request queue traffic in 9.2 than there ever was before, as a
consequence of the bgwriter/checkpointer split.  The design expectation
for this mechanism was that most fsync requests would be generated
locally inside the bgwriter and thus go straight into the hash table
without having to go through the shared-memory queue.  I admit that
we have seen no benchmarks showing that there's a problem, but that's
because up till yesterday the bgwriter was failing to transmit such
messages at all.  So I'm looking for ways to cut the overhead.

But having said that, maybe we should not panic until we actually see
some benchmarks showing the problem.

Meanwhile, we do know there's a problem with FORGET_RELATION_FSYNC.
I have been looking at the two-hash-tables design I suggested before,
and realized that there's a timing issue: if we just stuff forget
requests into a separate table, there is no method for determining
whether a given fsync request arrived before or after a given forget
request.  This is problematic if the relfilenode gets recycled: we
need to be able to guarantee that a previously-posted forget request
won't cancel a valid fsync for the new relation.  I believe this is
soluble though, if we merge the forget requests with unlink requests,
because a relfilenode can't be recycled until we do the unlink.
So as far as the code goes:

1. Convert the PendingUnlinkEntry linked list to a hash table keyed by
RelFileNode.  It acts the same as before, and shouldn't be materially
slower to process, but now we can determine in O(1) time whether there
is a pending unlink for a relfilenode.

2. Treat the existence of a pending unlink request as a relation-wide
fsync cancel; so the loop in mdsync needs one extra hashtable lookup
to determine validity of a PendingOperationEntry.  As before, this
should not matter much considering that we're about to do an fsync().

3. Tweak mdunlink so that it does not send a FORGET_RELATION_FSYNC
message if it is sending an UNLINK_RELATION_REQUEST.  (A side benefit
is that this gives us another 2X reduction in fsync queue traffic,
and not just any queue traffic but the type of traffic that we must
not fail to queue.)

The FORGET_RELATION_FSYNC code path will still exist, and will still
require a full hashtable scan, but we don't care because it isn't
being used in common situations.  It would only be needed for stuff
like killing an init fork.
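
For concreteness, a hedged sketch of what the hash entry in point 1 above
might look like (the field names here are guesses for illustration, not from
any committed patch):

typedef struct
{
	RelFileNode rnode;			/* hash key: relation to unlink after checkpoint */
	CycleCtr	cycle_ctr;		/* checkpoint cycle when the request arrived */
} PendingUnlinkEntry;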

The argument that this is safe involves these points:

* mdunlink cannot send UNLINK_RELATION_REQUEST until it's done
ftruncate on the main fork's first segment, because otherwise that
segment could theoretically get unlinked from under it before it can do
the truncate.  But this is okay since the ftruncate won't cause any
fsync the checkpointer might concurrently be doing to fail.  The
request *will* be sent before we unlink any other files, so mdsync
will be able to recover if it gets an fsync failure due to concurrent
unlink.

* Because a relfilenode cannot be recycled until we process and delete
the PendingUnlinkEntry during mdpostckpt, it is not possible for valid
new fsync requests to arrive while the PendingUnlinkEntry still exists
to cause them to be considered canceled.

* Because we only process and delete PendingUnlinkEntrys that have been
there since before the checkpoint started, we can be sure that any
PendingOperationEntrys referring to the relfilenode will have been
scanned and deleted by mdsync before we remove the PendingUnlinkEntry.

Unless somebody sees a hole in this logic, I'll go make this happen.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] very elaborate mkdir error checking in pg_dump

2012-07-19 Thread Peter Eisentraut
Is there a real point to all this code in pg_backup_directory.c?

static void
createDirectory(const char *dir)
{
	struct stat st;

	/* the directory must not exist yet. */
	if (stat(dir, &st) == 0)
	{
		if (S_ISDIR(st.st_mode))
			exit_horribly(modulename,
						  "cannot create directory %s, it exists already\n",
						  dir);
		else
			exit_horribly(modulename,
						  "cannot create directory %s, a file with this name "
						  "exists already\n", dir);
	}

	/*
	 * Now we create the directory. Note that for some race condition we could
	 * also run into the situation that the directory has been created just
	 * between our two calls.
	 */
	if (mkdir(dir, 0700) < 0)
		exit_horribly(modulename, "could not create directory %s: %s\n",
					  dir, strerror(errno));
}

Couldn't we just call mkdir() and report the strerror(errno) to begin
with, like everyone else does?
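
For comparison, a hedged sketch of the simpler version being suggested,
assuming the existing exit_horribly/modulename machinery in that file:

static void
createDirectory(const char *dir)
{
	/* let mkdir() itself report whatever is wrong with the path */
	if (mkdir(dir, 0700) < 0)
		exit_horribly(modulename, "could not create directory %s: %s\n",
					  dir, strerror(errno));
}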


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] very elaborate mkdir error checking in pg_dump

2012-07-19 Thread Tom Lane
Peter Eisentraut pete...@gmx.net writes:
 Couldn't we just call mkdir() and report the strerror(errno) to begin
 with, like everyone else does?

+1.  It'll provide pretty much the same information anyway.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PERFORM] DELETE vs TRUNCATE explanation

2012-07-19 Thread Robert Haas
On Thu, Jul 19, 2012 at 2:57 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Thu, Jul 19, 2012 at 10:09 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 What about checking just the immediately previous entry?  This would
 at least fix the problem for bulk-load situations, and the cost ought
 to be about negligible compared to acquiring the LWLock.

 2. You say fix the problem but I'm not exactly clear what problem
 you think this fixes.

 What I'm concerned about is that there is going to be a great deal more
 fsync request queue traffic in 9.2 than there ever was before, as a
 consequence of the bgwriter/checkpointer split.  The design expectation
 for this mechanism was that most fsync requests would be generated
 locally inside the bgwriter and thus go straight into the hash table
 without having to go through the shared-memory queue.  I admit that
 we have seen no benchmarks showing that there's a problem, but that's
 because up till yesterday the bgwriter was failing to transmit such
 messages at all.  So I'm looking for ways to cut the overhead.

 But having said that, maybe we should not panic until we actually see
 some benchmarks showing the problem.

+1 for not panicking.  I'm prepared to believe that there could be a
problem here, but I'm not prepared to believe that we've characterized
it well enough to be certain that any changes we choose to make will
make things better not worse.

 Meanwhile, we do know there's a problem with FORGET_RELATION_FSYNC.
 I have been looking at the two-hash-tables design I suggested before,
 and realized that there's a timing issue: if we just stuff forget
 requests into a separate table, there is no method for determining
 whether a given fsync request arrived before or after a given forget
 request.  This is problematic if the relfilenode gets recycled: we
 need to be able to guarantee that a previously-posted forget request
 won't cancel a valid fsync for the new relation.  I believe this is
 soluble though, if we merge the forget requests with unlink requests,
 because a relfilenode can't be recycled until we do the unlink.
 So as far as the code goes:

 1. Convert the PendingUnlinkEntry linked list to a hash table keyed by
 RelFileNode.  It acts the same as before, and shouldn't be materially
 slower to process, but now we can determine in O(1) time whether there
 is a pending unlink for a relfilenode.

 2. Treat the existence of a pending unlink request as a relation-wide
 fsync cancel; so the loop in mdsync needs one extra hashtable lookup
 to determine validity of a PendingOperationEntry.  As before, this
 should not matter much considering that we're about to do an fsync().

 3. Tweak mdunlink so that it does not send a FORGET_RELATION_FSYNC
 message if it is sending an UNLINK_RELATION_REQUEST.  (A side benefit
 is that this gives us another 2X reduction in fsync queue traffic,
 and not just any queue traffic but the type of traffic that we must
 not fail to queue.)

 The FORGET_RELATION_FSYNC code path will still exist, and will still
 require a full hashtable scan, but we don't care because it isn't
 being used in common situations.  It would only be needed for stuff
 like killing an init fork.

 The argument that this is safe involves these points:

 * mdunlink cannot send UNLINK_RELATION_REQUEST until it's done
 ftruncate on the main fork's first segment, because otherwise that
 segment could theoretically get unlinked from under it before it can do
 the truncate.  But this is okay since the ftruncate won't cause any
 fsync the checkpointer might concurrently be doing to fail.  The
 request *will* be sent before we unlink any other files, so mdsync
 will be able to recover if it gets an fsync failure due to concurrent
 unlink.

 * Because a relfilenode cannot be recycled until we process and delete
 the PendingUnlinkEntry during mdpostckpt, it is not possible for valid
 new fsync requests to arrive while the PendingUnlinkEntry still exists
 to cause them to be considered canceled.

 * Because we only process and delete PendingUnlinkEntrys that have been
 there since before the checkpoint started, we can be sure that any
 PendingOperationEntrys referring to the relfilenode will have been
 scanned and deleted by mdsync before we remove the PendingUnlinkEntry.

 Unless somebody sees a hole in this logic, I'll go make this happen.

What if we change the hash table to have RelFileNode as the key and an
array of MAX_FORKNUM bitmapsets as the value?  Then when you get a
forget request, you can just zap all the sets to empty.  That seems
a whole lot simpler than your proposal and I don't see any real
downside.  I can't actually poke a hole in your logic at the moment
but a simpler system that requires no assumptions about filesystem
behavior seems preferable to me.

You can still make an unlink request imply a corresponding
forget-request if you want, but now that's a separate optimization.
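
For illustration, a hedged sketch of such an entry (the names here are made
up for the example, not taken from any patch):

typedef struct
{
	RelFileNode rnode;						/* hash key */
	Bitmapset  *requests[MAX_FORKNUM + 1];	/* segment numbers pending fsync, per fork */
} PendingFsyncEntry;

/* a forget request would then just zap all the sets to empty, e.g.: */
for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
{
	bms_free(entry->requests[forknum]);
	entry->requests[forknum] = NULL;
}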

-- 
Robert Haas
EnterpriseDB: 

[HACKERS] 2GB limit for temp_file_limit on 32bit platform

2012-07-19 Thread Pavel Stehule
Hello

I did a backport of the temp_file_limit feature to 9.1, but when we tested
this patch, we found a very restrictive limit of 2GB.

2GB is nonsense, because this is a per-session limit on temp files, and
these files can easily be larger than 2GB.

Regards

Pavel

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] autocomplete - SELECT fx

2012-07-19 Thread Peter Eisentraut
On Tue, 2012-07-10 at 07:29 -0700, Josh Kupershmidt wrote:
 On Sat, Jul 7, 2012 at 5:43 PM, Noah Misch n...@leadboat.com wrote:
  I like the patch, as far as it goes.  It's the natural addition to the
  completions we already offer; compare the simplistic completion after WHERE.
  Like Pavel and Robert, I think a delightful implementation of tab completion
  for SELECT statements would require radical change.
 
 Thanks for the feedback, Noah. Peter, are you interested in posting an
 updated version of your patch? (The only problems I remember are
 checking attisdropped and function visibility.)

Another problem was excluding functions that are not useful to call
directly, such as functions that are used for type or operator definitions.
There is no simple solution for that.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2GB limit for temp_file_limit on 32bit platform

2012-07-19 Thread Joshua D. Drake


On 07/19/2012 01:04 PM, Pavel Stehule wrote:

I did a backport of the temp_file_limit feature to 9.1, but when we tested
this patch, we found a very restrictive limit of 2GB.

2GB is nonsense, because this is a per-session limit on temp files, and
these files can easily be larger than 2GB.


I haven't read the patch but... don't all 32bit platforms have a 2GB 
limit (by default)?


Sincerely,

jD


--
Command Prompt, Inc. - http://www.commandprompt.com/
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC
@cmdpromptinc - 509-416-6579

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2GB limit for temp_file_limit on 32bit platform

2012-07-19 Thread Christopher Browne
On Thu, Jul 19, 2012 at 4:29 PM, Joshua D. Drake j...@commandprompt.com wrote:

 On 07/19/2012 01:04 PM, Pavel Stehule wrote:

 I did a backport of the temp_file_limit feature to 9.1, but when we tested
 this patch, we found a very restrictive limit of 2GB.

 2GB is nonsense, because this is a per-session limit on temp files, and
 these files can easily be larger than 2GB.


 I haven't read the patch but... don't all 32bit platforms have a 2GB limit
 (by default)?

I don't think so.

LFS got done in the mid-90s, which is long enough ago for people to
start forgetting about it.  Are there any supported platforms that
didn't adopt LFS?

http://en.wikipedia.org/wiki/Large_file_support
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, How would the Lone Ranger handle this?

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PERFORM] DELETE vs TRUNCATE explanation

2012-07-19 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 What if we change the hash table to have RelFileNode as the key and an
 array of MAX_FORKNUM bitmapsets as the value?  Then when you get a
 forget request, you can just zap all the sets to empty.

Hm ... the only argument I can really make against that is that there'll
be no way to move such a table into shared memory; but there's probably
little hope of that anyway, given points made upthread.  The bitmapset
manipulations are a bit tricky but solvable, and I agree there's
something to be said for not tying this stuff so closely to the
mechanism for relfilenode recycling.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2GB limit for temp_file_limit on 32bit platform

2012-07-19 Thread Joshua D. Drake


On 07/19/2012 01:48 PM, Christopher Browne wrote:


On Thu, Jul 19, 2012 at 4:29 PM, Joshua D. Drake j...@commandprompt.com wrote:


On 07/19/2012 01:04 PM, Pavel Stehule wrote:


I did a backport of the temp_file_limit feature to 9.1, but when we tested
this patch, we found a very restrictive limit of 2GB.

2GB is nonsense, because this is a per-session limit on temp files, and
these files can easily be larger than 2GB.



I haven't read the patch but... don't all 32bit platforms have a 2GB limit
(by default)?


I don't think so.

LFS got done in the mid-90s, which is long enough ago for people to
start forgetting about it.  Are there any supported platforms that
didn't adopt LFS?

http://en.wikipedia.org/wiki/Large_file_support


Note: by default :). I know they could support LFS but as I recall you 
had to compile specifically for it (at least on linux and old versions 
of pg).


So I was curious if it was that specific limitation or a limitation 
within the Pg code itself.


Sincerely,

Joshua D. Drake



--
Command Prompt, Inc. - http://www.commandprompt.com/
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC
@cmdpromptinc - 509-416-6579

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2GB limit for temp_file_limit on 32bit platform

2012-07-19 Thread Mark Kirkwood

On 20/07/12 09:08, Joshua D. Drake wrote:


On 07/19/2012 01:48 PM, Christopher Browne wrote:


On Thu, Jul 19, 2012 at 4:29 PM, Joshua D. Drake 
j...@commandprompt.com wrote:


On 07/19/2012 01:04 PM, Pavel Stehule wrote:


I did a backport of the temp_file_limit feature to 9.1, but when we tested
this patch, we found a very restrictive limit of 2GB.

2GB is nonsense, because this is a per-session limit on temp files, and
these files can easily be larger than 2GB.



I haven't read the patch but... don't all 32bit platforms have a 2GB 
limit

(by default)?


I don't think so.

LFS got done in the mid-90s, which is long enough ago for people to
start forgetting about it.  Are there any supported platforms that
didn't adopt LFS?

http://en.wikipedia.org/wiki/Large_file_support


Note: by default :). I know they could support LFS but as I recall 
you had to compile specifically for it (at least on linux and old 
versions of pg).


So I was curious if it was that specific limitation or a limitation 
within the Pg code itself.





It is to do with the datatype of the GUC used for the setting - I 
haven't got the patch in front of me to look at, but recall that going 
larger meant using a float type, which meant you couldn't get nice units 
displayed (MB, GB etc).


I'll take a proper look later.

Cheers

Mark



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2GB limit for temp_file_limit on 32bit platform

2012-07-19 Thread Mark Kirkwood

On 20/07/12 09:58, Mark Kirkwood wrote:

On 20/07/12 09:08, Joshua D. Drake wrote:


On 07/19/2012 01:48 PM, Christopher Browne wrote:


On Thu, Jul 19, 2012 at 4:29 PM, Joshua D. Drake 
j...@commandprompt.com wrote:


On 07/19/2012 01:04 PM, Pavel Stehule wrote:


 I did a backport of the temp_file_limit feature to 9.1, but when we tested
 this patch, we found a very restrictive limit of 2GB.
 
 2GB is nonsense, because this is a session limit on temp files, and
 these files can be larger than 2GB.



I haven't read the patch but... don't all 32bit platforms have a 
2GB limit

(by default)?


I don't think so.

LFS got done in the mid-90s, which is long enough ago for people to
start forgetting about it.  Are there any supported platforms that
didn't adopt LFS?

http://en.wikipedia.org/wiki/Large_file_support


Note: by default :). I know they could support LFS but as I recall 
you had to compile specifically for it (at least on linux and old 
versions of pg).


So I was curious if it was that specific limitation or a limitation 
within the Pg code itself.





 It is to do with the datatype of the GUC used for the setting - I 
 haven't got the patch in front of me to look at, but recall that going 
 larger meant using a float type, which meant you couldn't get nice 
 units displayed (MB, GB etc).


I'll take a proper look later.





From src/backend/utils/misc/guc.c

{
	{"temp_file_limit", PGC_SUSET, RESOURCES_DISK,
		gettext_noop("Limits the total size of all temp files used by each session."),
		gettext_noop("-1 means no limit."),
		GUC_UNIT_KB
	},
	&temp_file_limit,
	-1, -1, INT_MAX,
	NULL, NULL, NULL
},



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] b-tree index search algorithms

2012-07-19 Thread Samuel Vogel

On 18.07.12 23:56, Tom Lane wrote:

Samuel Vogel s...@muel-vogel.de writes:

How would the b-tree know exactly that a value is only a reference? And
even in that case you say that it could get the bits, but make no use of
it, since it does not know what they represent, right?

It has access to the data type's basic storage parameters, which are
typbyval, typlen, and typalign; and we have standard conventions for
identifying the length etc of variable-length values.  It's just the
meaning of the payload data bytes that's data-type-private.


Okay, so with these I know if and how I would have to dereference the 
data.
But how do I get to this info from inside _bt_binsrch? 
RelationGetDescr(rel)->tdtypeid was my closest guess, but I need to get 
a reference to FormData_pg_type somehow.


Regards,
Samuel Vogel

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CHECK NO INHERIT syntax

2012-07-19 Thread David Fetter
On Thu, Jul 19, 2012 at 12:49:37AM +0300, Peter Eisentraut wrote:
 Sorry to raise this once again, but I still find this CHECK NO INHERIT
 syntax to be a bit funny.  We are currently using something like
 
 CHECK NO INHERIT (foo > 0)
 
 But we already have a different syntax for attaching attributes to
 constraints (NOT DEFERRABLE, NOT VALID,  etc.), so it would make more
 sense to have
 
 CHECK (foo > 0) NO INHERIT

How about this?

CHECK (foo > 0) (INHERIT FALSE)

That leaves an obvious place for other options, which will doubtless
come.  EXPLAIN's options inspired this API design.

 Besides consistency, this makes more sense, because the attribute is a
 property of the constraint as a whole, not of the checking.

Good point.  The above change preserves this property.

 This would also extend more easily to other constraint types.  For
 example, when unifying CHECK and NOT NULL constraints, as is
 planned, or when allowing inherited unique constraints, as is
 planned further down the road.
 
 There is also a hole in the current implementation.  Domain
 constraints silently allow NO INHERIT to be specified, even though
 other senseless attributes are rejected.

That's probably a bug.

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] b-tree index search algorithms

2012-07-19 Thread Tom Lane
Samuel Vogel s...@muel-vogel.de writes:
 Am 18.07.12 23:56, schrieb Tom Lane:
 It has access to the data type's basic storage parameters, which are
 typbyval, typlen, and typalign; and we have standard conventions for
 identifying the length etc of variable-length values.  It's just the
 meaning of the payload data bytes that's data-type-private.

 Okay, so with these I know if and how I would have to dereference the 
 data.
 But how do I get to this info from inside _bt_binsrch? 

RelationGetDescr(rel)->attrs[n]->attbyval etc.
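
For illustration, a hedged sketch of how that lookup could be spelled out
inside a routine that, like _bt_binsrch, already has the index Relation in
hand. The helper name is invented and the header paths follow the 9.2-era
sources; this is not code from the tree.

#include "postgres.h"
#include "access/tupdesc.h"
#include "utils/rel.h"

static void
show_first_key_column_storage(Relation rel)
{
	TupleDesc	tupdesc = RelationGetDescr(rel);
	Form_pg_attribute att = tupdesc->attrs[0];	/* first key column */

	elog(DEBUG1, "attbyval=%d attlen=%d attalign=%c",
		 att->attbyval, att->attlen, att->attalign);
}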

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2GB limit for temp_file_limit on 32bit platform

2012-07-19 Thread Tom Lane
Pavel Stehule pavel.steh...@gmail.com writes:
 I did a backport of the temp_file_limit feature to 9.1, but when we tested
 this patch, we found a very restrictive limit of 2GB.

 2GB is nonsense, because this is a session limit on temp files, and
 these files can be larger than 2GB.

This claim is nonsense.  The variable's value is measured in KB, so the
effective limit is actually 2TB not 2GB.
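
A quick back-of-the-envelope check of that figure, since the GUC is an int
counted in kilobytes:

#include <limits.h>
#include <stdio.h>

int
main(void)
{
	/* ceiling is INT_MAX kilobytes, not INT_MAX bytes */
	long long	max_bytes = (long long) INT_MAX * 1024;

	printf("INT_MAX KB = %lld bytes (~%.1f TB)\n",
		   max_bytes, max_bytes / 1e12);	/* prints roughly 2.2 TB */
	return 0;
}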

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] pgbench -i order of vacuum

2012-07-19 Thread Jeff Janes
Is there a reason to vacuum the pgbench_* tables after the indexes on
them are built, rather than before?

Since the indexes are on fresh tables, they can't have anything that
needs to be cleaned.

I don't think the current order accomplishes anything, except to slow
down large initializations by ~25%.

The attached patch moves the vacuums up.

I also made -n skip the vacuums altogether.  Since -n is allowed under
-i, it would be nice if it did something, and there is only one
intuitive thing for it to do.  I don't know what the use case for it is,
but I think I've heard grumbling about it before.
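
As a standalone illustration of the intended ordering (this is not the
attached patch, and the table/DDL details are heavily simplified): load the
data, vacuum, and only then build the index, skipping the vacuum when the
caller asked for none. Build with -lpq; connection settings come from the
usual PG* environment variables.

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

static void
run(PGconn *con, const char *sql)
{
	PGresult   *res = PQexec(con, sql);

	if (PQresultStatus(res) != PGRES_COMMAND_OK)
	{
		fprintf(stderr, "%s failed: %s", sql, PQerrorMessage(con));
		PQclear(res);
		PQfinish(con);
		exit(1);
	}
	PQclear(res);
}

int
main(int argc, char **argv)
{
	bool		no_vacuum = (argc > 1);	/* stand-in for the -n switch */
	PGconn	   *con = PQconnectdb("");

	(void) argv;
	if (PQstatus(con) != CONNECTION_OK)
	{
		fprintf(stderr, "%s", PQerrorMessage(con));
		return 1;
	}

	/* ... data load (COPY into pgbench_accounts etc.) happens here ... */

	if (!no_vacuum)
		run(con, "vacuum analyze pgbench_accounts");	/* before the index */

	run(con, "alter table pgbench_accounts add primary key (aid)");

	PQfinish(con);
	return 0;
}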

Cheers,

Jeff


pgbench_vacuum_order_v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CHECK NO INHERIT syntax

2012-07-19 Thread Tom Lane
David Fetter da...@fetter.org writes:
 On Thu, Jul 19, 2012 at 12:49:37AM +0300, Peter Eisentraut wrote:
 But we already have a different syntax for attaching attributes to
 constraints (NOT DEFERRABLE, NOT VALID,  etc.), so it would make more
 sense to have
 
  CHECK (foo > 0) NO INHERIT

 How about this?

 CHECK (foo > 0) (INHERIT FALSE)

The SQL spec already says what the syntax is for options attached to
constraints, and that's not it.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2GB limit for temp_file_limit on 32bit platform

2012-07-19 Thread Mark Kirkwood

On 20/07/12 12:02, Tom Lane wrote:

Pavel Stehule pavel.steh...@gmail.com writes:

 I did a backport of the temp_file_limit feature to 9.1, but when we tested
 this patch, we found a very restrictive limit of 2GB.
 2GB is nonsense, because this is a session limit on temp files, and
 these files can be larger than 2GB.

This claim is nonsense.  The variable's value is measured in KB, so the
effective limit is actually 2TB not 2GB.




Did you guys perchance pick up one of the earlier patches that had 
MAX_KILOBYTES instead of INT_MAX as the limit? i.e:


{
	{"temp_file_limit", PGC_SUSET, RESOURCES_DISK,
		gettext_noop("Limits the total size of all temp files used by each session."),
		gettext_noop("-1 means no limit."),
		GUC_UNIT_KB
	},
	&temp_file_limit,
	-1, -1, MAX_KILOBYTES,
	NULL, NULL, NULL
},




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] row literal problem

2012-07-19 Thread Tom Lane
Merlin Moncure mmonc...@gmail.com writes:
 here's a cut down example:
 with q as (select max(v) from (select 1 as v) q group by v) select q from q;

I traced through this example, and find that it's another issue in an
area we've hacked at before.  ExecEvalVar, when it finds that it's
dealing with a whole-row Var of type RECORD, just assumes that the tuple
descriptor of the source TupleTableSlot is the correct descriptor for
the whole-row result.  Now, in the case where the whole-row Var has a
named composite type, we have found that we have to go to quite some
lengths to deal with possible discrepancies in the desired and actual
rowtypes; in particular notice this comment:

 * ... Also, we have to allow the case that the slot has
 * more columns than the Var's type, because we might be looking
 * at the output of a subplan that includes resjunk columns. (XXX
 * it would be nice to verify that the extra columns are all
 * marked resjunk, but we haven't got access to the subplan
 * targetlist here...)

I think the way to solve this is to do whatever it takes to get access
to the subplan targetlist.  We could then do something a bit cleaner
than what the named-rowtype code is currently doing: if there are
resjunk columns in the subplan targetlist, use the tlist to create a
JunkFilter, and then pass the tuples through that.  After that we can
insist that the tuples don't have any extra columns.

Getting hold of that tlist is going to be a bit messy, though.  I think
what we can do is create a special ExprState variant for whole-row Vars
(which we'd need anyway to hold the JunkFilter), and have ExecInitExpr
store the parent PlanState pointer into it.  Then at the first call
of ExecEvalVar, dig down to the appropriate tlist depending on what
type of PlanState we find ourselves running in.  This shouldn't be
too painful because a whole-row Var can only appear in a simple scan
node, not an upper-level plan node, so there are not as many cases
to deal with as you might think.
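
As a rough sketch of the kind of state node this would call for (the names
below are invented for illustration, not taken from any committed code), the
whole-row-Var ExprState variant might carry little more than the parent
PlanState pointer and the lazily built JunkFilter:

#include "postgres.h"
#include "executor/executor.h"
#include "nodes/execnodes.h"

typedef struct WholeRowVarExprState
{
	ExprState	xprstate;		/* common expression-state header */
	PlanState  *parent;			/* PlanState we are evaluating within */
	JunkFilter *wrv_junkFilter;	/* built on first ExecEvalVar call, else NULL */
} WholeRowVarExprState;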

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2GB limit for temp_file_limit on 32bit platform

2012-07-19 Thread Pavel Stehule
2012/7/20 Tom Lane t...@sss.pgh.pa.us:
 Pavel Stehule pavel.steh...@gmail.com writes:
 I did a backport of the temp_file_limit feature to 9.1, but when we tested
 this patch, we found a very restrictive limit of 2GB.

 2GB is nonsense, because this is a session limit on temp files, and
 these files can be larger than 2GB.

 This claim is nonsense.  The variable's value is measured in KB, so the
 effective limit is actually 2TB not 2GB.

You are right - it works on 9.2. The problem must be somewhere in the
backport to 9.1.

sorry for false alarm

Regards

Pavel Stěhule


 regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2GB limit for temp_file_limit on 32bit platform

2012-07-19 Thread Pavel Stehule
2012/7/20 Pavel Stehule pavel.steh...@gmail.com:
 2012/7/20 Tom Lane t...@sss.pgh.pa.us:
 Pavel Stehule pavel.steh...@gmail.com writes:
 I did a backport of the temp_file_limit feature to 9.1, but when we tested
 this patch, we found a very restrictive limit of 2GB.

 2GB is nonsense, because this is a session limit on temp files, and
 these files can be larger than 2GB.

 This claim is nonsense.  The variable's value is measured in KB, so the
 effective limit is actually 2TB not 2GB.

 You are right - it works on 9.2. The problem must be somewhere in the
 backport to 9.1.

it works well with 9.1.4, but not with 9.1.3


 sorry for false alarm

 Regards

 Pavel Stěhule


 regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers