Re: [HACKERS] Page Checksums

2012-01-24 Thread jesper
 * Robert Treat:

 Would it be unfair to assert that people who want checksums but aren't
 willing to pay the cost of running a filesystem that provides
 checksums aren't going to be willing to make the cost/benefit trade
 off that will be asked for? Yes, it is unfair of course, but it's
 interesting how small the camp of those using checksummed filesystems
 is.

 Don't checksumming file systems currently come bundled with other
 features you might not want (such as certain vendors)?

I would chip in and say that I would prefer sticking to well-known, proven
filesystems like xfs/ext4 and letting the application do the checksumming.

I don't foresee fully production-ready checksumming filesystems being
readily available in the standard Linux distributions in the near future.

And yes, I would for sure turn such functionality on if it were present.

-- 
Jesper


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums

2012-01-24 Thread Florian Weimer
 I would chip in and say that I would prefer sticking to well-known, proven
 filesystems like xfs/ext4 and letting the application do the checksumming.

Yes, that's a different way of putting my concern.  If you want a proven
file system with checksumming (and an fsck), options are really quite
limited.

 And yes, I would for sure turn such functionality on if it were present.

Same here.  I already use page-level checksums with Berkeley DB.

-- 
Florian Weimer            fwei...@bfk.de
BFK edv-consulting GmbH   http://www.bfk.de/
Kriegsstraße 100  tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99



Re: [HACKERS] WAL Restore process during recovery

2012-01-24 Thread Fujii Masao
On Tue, Jan 24, 2012 at 12:23 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Mon, Jan 23, 2012 at 12:23 PM, Fujii Masao masao.fu...@gmail.com wrote:
 Why does walrestore need to be invoked even when restore_command is
 not specified? It seems to be useless. We invoke walreceiver only when
 primary_conninfo is specified now. Similarly we should invoke walrestore
 only when restore_command is specified?

 walreceiver is shutdown and restarted in case of failed connection.
 That never happens with walrestore because the command is run each
 time - when we issue system(3) a new process is forked to run the
 command. So there is no specific cleanup to perform and so no reason
 for a managed cleanup process.

 So I can't see a specific reason to change that. Do you think it makes
 a difference?

Yes. When restore_command is not specified in recovery.conf, the walrestore
process doesn't do any useful work and just wastes CPU cycles. That might
be harmless for the functionality of recovery, but ISTM it's better not
to start up walrestore in that case and avoid the wasted cycles.

 Cleaned up the points noted, new patch attached in case you wish to
 review further.

 Still has bug, so still with me to fix.

Thanks! Will review further.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] WAL Restore process during recovery

2012-01-24 Thread Simon Riggs
On Tue, Jan 24, 2012 at 9:43 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Tue, Jan 24, 2012 at 12:23 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Mon, Jan 23, 2012 at 12:23 PM, Fujii Masao masao.fu...@gmail.com wrote:
 Why does walrestore need to be invoked even when restore_command is
 not specified? It seems to be useless. We invoke walreceiver only when
 primary_conninfo is specified now. Similarly we should invoke walrestore
 only when restore_command is specified?

 walreceiver is shutdown and restarted in case of failed connection.
 That never happens with walrestore because the command is run each
 time - when we issue system(3) a new process is forked to run the
 command. So there is no specific cleanup to perform and so no reason
 for a managed cleanup process.

 So I can't see a specific reason to change that. Do you think it makes
 a difference?

 Yes. When restore_command is not specified in recovery.conf, the walrestore
 process doesn't do any useful work and just wastes CPU cycles. That might
 be harmless for the functionality of recovery, but ISTM it's better not
 to start up walrestore in that case and avoid the wasted cycles.

It just sleeps on a latch when it has nothing to do, so no wasted cycles.

Asking the postmaster seemed the easier option, I guess I could have
chosen the other way also.

I'll look at this when this is the last thing left to resolve to see
if that improves things.


 Cleaned up the points noted, new patch attached in case you wish to
 review further.

 Still has bug, so still with me to fix.

 Thanks! Will review further.

Much appreciated.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services



Re: [HACKERS] Online base backup from the hot-standby

2012-01-24 Thread Fujii Masao
On Mon, Jan 23, 2012 at 10:11 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Mon, Jan 23, 2012 at 10:29 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Fri, Jan 20, 2012 at 11:34 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Fri, Jan 20, 2012 at 12:54 PM, Fujii Masao masao.fu...@gmail.com wrote:
 Thanks for the review!

 On Fri, Jan 20, 2012 at 8:15 PM, Simon Riggs si...@2ndquadrant.com wrote:
 I'm looking at this patch and wondering why we're doing so many
 press-ups to ensure full_page_writes parameter is on. This will still
 fail if you use a utility that removes the full page writes, but fail
 silently.

 I think it would be beneficial to explicitly check that all WAL
 records have full page writes actually attached to them until we
 achieve consistency.

 I agree that it's worth adding such a safeguard. That can be a self-contained
 feature, so I'll submit a separate patch for that, to keep each patch small.

 Maybe, but you mean do this now as well? Not sure I like silent errors.

 If many people think the patch is not acceptable without such a safeguard,
 I will do that right now. Otherwise, I'd like to take more time to do
 that, i.e., add it to the 9.2dev Open Items.

 I've not come up with a good idea. An ugly idea is to keep track of all
 replays of full-page writes for every buffer page (i.e., prepare a
 1-bit-per-buffer-page table and set the specified bit to 1 when a full-page
 write is applied), and then check whether a full-page write has already
 been applied when replaying a normal WAL record... Do you have any better
 idea?

 Not sure.

 I think the only possible bug here is one introduced by an outside utility.

 In that case, I don't think it should be the job of the backend to go
 too far to protect against such atypical error. So if we can't solve
 it fairly easily and with no overhead then I'd say lets skip it. We
 could easily introduce a bug here just by having faulty checking code.

 So lets add it to 9.2 open items as a non-priority item.

Agreed.

 I'll proceed to commit for this now.

Thanks a lot!

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] WAL Restore process during recovery

2012-01-24 Thread Fujii Masao
On Tue, Jan 24, 2012 at 6:49 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Tue, Jan 24, 2012 at 9:43 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Tue, Jan 24, 2012 at 12:23 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Mon, Jan 23, 2012 at 12:23 PM, Fujii Masao masao.fu...@gmail.com wrote:
 Why does walrestore need to be invoked even when restore_command is
 not specified? It seems to be useless. We invoke walreceiver only when
 primary_conninfo is specified now. Similarly we should invoke walrestore
 only when restore_command is specified?

 walreceiver is shutdown and restarted in case of failed connection.
 That never happens with walrestore because the command is run each
 time - when we issue system(3) a new process is forked to run the
 command. So there is no specific cleanup to perform and so no reason
 for a managed cleanup process.

 So I can't see a specific reason to change that. Do you think it makes
 a difference?

 Yes. When restore_command is not specified in recovery.conf, the walrestore
 process doesn't do any useful work and just wastes CPU cycles. That might
 be harmless for the functionality of recovery, but ISTM it's better not
 to start up walrestore in that case and avoid the wasted cycles.

 It just sleeps on a latch when it has nothing to do, so no wasted cycles.

Right; since the walrestore process wakes up only every 10 seconds, the
waste of cycles is low. But what makes me uncomfortable is that, when
restore_command is not specified, the walrestore process has nothing to do
*from start to end*, yet it's started up anyway. I guess many people would
be surprised at that. Of course, if restore_command could be changed without
restarting the server, I would agree with you, because the walrestore
process might do useful work later. But currently it cannot.

 Asking the postmaster seemed the easier option, I guess I could have
 chosen the other way also.

 I'll look at this when this is the last thing left to resolve to see
 if that improves things.

Okay.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] New replication mode: write

2012-01-24 Thread Fujii Masao
On Mon, Jan 23, 2012 at 9:53 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Mon, Jan 23, 2012 at 10:03 AM, Fujii Masao masao.fu...@gmail.com wrote:

 To make the walreceiver call WaitLatchOrSocket(), we would need to
 merge it and libpq_select() into one function. But the former is a backend
 function and the latter is a frontend one. Now I have no good idea how to
 merge them cleanly.

 We can wait on the socket wherever it comes from. poll/select doesn't
 care how we got the socket.

 So we just need a common handler that calls either
 walreceiver/libpqwalreceiver function as required to handle the
 wakeup.

 I'm afraid I could not understand your idea. Could you explain it in
 more detail?

 We either tell libpqwalreceiver about the latch, or we tell
 walreceiver about the socket used by libpqwalreceiver.

 In either case we share a pointer from one module to another.

The former seems difficult because it's not easy to link libpqwalreceiver.so
to the latch. I will consider the latter.

 If we send back the reply as soon as the Apply pointer is changed, I'm
 afraid quite a lot of reply messages would be sent frequently, which might
 cause a performance problem. This is also one of the reasons why I didn't
 implement the quick-response feature. To address this problem, we might
 need to change the master so that it sends the Wait pointer to the standby,
 and change the standby so that it replies whenever the Apply pointer
 catches up with the Wait one. This can reduce the number of useless
 replies from the standby about the Apply pointer.

 We send back one reply per incoming message. The incoming messages
 don't know request state and checking that has a cost which I don't
 think is an appropriate payment since we only need this info when the
 link goes quiet.

 When the link goes quiet we still need to send replies if we have
 apply mode, but we only need to send apply messages if the lsn has
 changed because of a commit. That will considerably reduce the
 messages sent so I don't see a problem.

 You mean to change the meaning of apply_location? Currently it indicates
 the end + 1 of the last replayed WAL record, regardless of whether it's
 a commit record or not. So too many replies can be sent per incoming
 message because it might contain many WAL records. But you mean to
 change apply_location only when a commit record is replayed?

 There is no change to the meaning of apply_location. The only change
 is that we send that message only when it has an updated value of
 committed lsn.

This means that apply_location might return a different location from
pg_last_xlog_replay_location() on the standby, though in 9.1 they return
the same value. That might confuse users. No?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Online base backup from the hot-standby

2012-01-24 Thread Simon Riggs
On Tue, Jan 24, 2012 at 9:51 AM, Fujii Masao masao.fu...@gmail.com wrote:

 I'll proceed to commit for this now.

 Thanks a lot!

Can I just check a few things?

You say
/*
+* Update full_page_writes in shared memory and write an
+* XLOG_FPW_CHANGE record before resource manager writes cleanup
+* WAL records or checkpoint record is written.
+*/

why does it need to be before the cleanup and checkpoint?

You say
/*
+* Currently only non-exclusive backup can be taken during recovery.
+*/

why?

You mention in the docs
The backup history file is not created in the database cluster backed up.
but we need to explain the bad effect, if any.

You say
If the standby is promoted to the master during online backup, the
backup fails.
but no explanation of why?

I could work those things out, but I don't want to have to, plus we
may disagree if I did.

There are some good explanations in comments of other things, just not
everywhere needed.

What happens if we shutdown the WALwriter and then issue SIGHUP?

Are we sure we want to make the change of file format mandatory? That
means earlier versions of clients such as pg_basebackup will fail
against this version. Should we allow that, if BACKUP FROM is missing,
we assume it was the master?

There is nothing in the main docs to explain that the new feature is
available, or to explain the restrictions.
I expect you will add that later, after commit.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services



Re: [HACKERS] Online base backup from the hot-standby

2012-01-24 Thread Simon Riggs
On Tue, Jan 24, 2012 at 10:54 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Tue, Jan 24, 2012 at 9:51 AM, Fujii Masao masao.fu...@gmail.com wrote:

 I'll proceed to commit for this now.

 Thanks a lot!

 Can I just check a few things?

Just to clarify, not expecting another patch version, just reply here
and I can edit.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services



Re: [HACKERS] New replication mode: write

2012-01-24 Thread Simon Riggs
On Tue, Jan 24, 2012 at 10:47 AM, Fujii Masao masao.fu...@gmail.com wrote:

 I'm afraid I could not understand your idea. Could you explain it in
 more detail?

 We either tell libpqwalreceiver about the latch, or we tell
 walreceiver about the socket used by libpqwalreceiver.

 In either case we share a pointer from one module to another.

 The former seems difficult because it's not easy to link libpqwalreceiver.so
 to the latch. I will consider the latter.

Yes, it might be too hard, but let's look.

 You mean to change the meaning of apply_location? Currently it indicates
 the end + 1 of the last replayed WAL record, regardless of whether it's
 a commit record or not. So too many replies can be sent per incoming
 message because it might contain many WAL records. But you mean to
 change apply_location only when a commit record is replayed?

 There is no change to the meaning of apply_location. The only change
 is that we send that message only when it has an updated value of
 committed lsn.

 This means that apply_location might return a different location from
 pg_last_xlog_replay_location() on the standby, though in 9.1 they return
 the same value. That might confuse users. No?

The two values only match on a quiet system anyway, since both are
moving forwards.

They will still match on a quiet system.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services



Re: GiST for range types (was Re: [HACKERS] Range Types - typo + NULL string constructor)

2012-01-24 Thread Alexander Korotkov
Hi!

A new version of the patch is attached.
On Thu, Dec 22, 2011 at 11:46 AM, Jeff Davis pg...@j-davis.com wrote:

 A few comments:

 * In range_gist_picksplit, it would be nice to have a little bit more
 intuitive description of what's going on with the nonEmptyCount and
 nonInfCount numbers. For instance, it appears to depend on the fact that
 a range must either be in nonEmptyCount or in nonInfCount. Also, can you
 describe the reason you're multiplying by two and taking the absolute
 value? It seems to work, but I am missing the intuition behind those
 operations.

total_count - 2*nonEmptyCount
  = (total_count - nonEmptyCount) - nonEmptyCount
  = emptyCount - nonEmptyCount
So, it's really not evident. I've simplified it.



 * The penalty function is fairly hard to read still. At a high level, I
 think we're trying to accomplish a few things (in order from most to
 least important):
  - Keep normal ranges separate.
  - Avoid broadening the class of the original predicate (e.g. turning
 single-infinite into double-infinite).
  - Avoid broadening (as determined by subtype_diff) the original
 predicate.
  - Favor adding ranges to narrower original predicates.

 Do you agree? If so, perhaps we can attack those more directly and it
 might be a little more readable.

 Additionally, the arbitrary numbers might become a problem. Can we
 choose better constants there? They would still be arbitrary when
 compared with real numbers derived from subtype_diff, but maybe we can
 still do better than what's there.

I've changed some comments and added constants for the penalty values.


 * Regarding the leftover common entries that can go to either side:
 what is the delta measure trying to accomplish? When I work through
 some examples, it seems to favor putting larger common ranges on the
 left (small delta) and smaller common ranges on the right (smaller
 delta). Why is that good? Or did I misread the code?

 Intuitively, I would think that we'd want to push the ranges with lower
 upper bounds to the left and higher lower bounds to the right -- in
 other words, recurse. Obviously, we'd need to make sure it terminated at
 some point, but splitting the common entries does seem like a smaller
 version of the original problem. Thoughts?

That was a bug. Actually, no abs is needed. Indeed, it doesn't affect the
result significantly.

-
With best regards,
Alexander Korotkov.


rangetypegist-0.6.patch.gz
Description: GNU Zip compressed data



Re: [HACKERS] basic pgbench runs with various performance-related patches

2012-01-24 Thread Robert Haas
On Tue, Jan 24, 2012 at 1:26 AM, Tatsuo Ishii is...@postgresql.org wrote:
 ** pgbench, permanent tables, scale factor 100, 300 s **
 1 group-commit-2012-01-21 614.425851 -10.4%
 8 group-commit-2012-01-21 4705.129896 +6.3%
 16 group-commit-2012-01-21 7962.131701 +2.0%
 24 group-commit-2012-01-21 13074.939290 -1.5%
 32 group-commit-2012-01-21 12458.962510 +4.5%
 80 group-commit-2012-01-21 12907.062908 +2.8%

 Interesting. Comparing with this:
 http://archives.postgresql.org/pgsql-hackers/2012-01/msg00804.php
 you achieved a very small enhancement. Can you think of any reason for
 the difference?

My test was run with synchronous_commit=off, so I didn't expect the
group commit patch to have much of an impact.  I included it mostly to
see whether by chance it helped anyway (since it also helps other WAL
flushes, not just commits) or whether it caused any regression.

One somewhat odd thing about these numbers is that, on permanent
tables, all of the patches seemed to show regressions vs. master in
single-client throughput.  That's a slightly difficult result to
believe, though, so it's probably a testing artifact of some kind.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements

2012-01-24 Thread Robert Haas
On Mon, Jan 23, 2012 at 5:49 PM, Merlin Moncure mmonc...@gmail.com wrote:
 I'm not sure that you're getting anything with that user facing
 complexity.  The only realistic case I can see for explicit control of
 wire formats chosen is to defend your application from format changes
 in the server when upgrading the server and/or libpq.   This isn't a
 let's get better compression problem, this is I upgraded my
 database and my application broke problem.

 Fixing this problem in non documentation fashion is going to require a
 full protocol change, period.

Our current protocol allocates a 2-byte integer for the purposes of
specifying the type of each parameter, and another 2-byte integer for
the purpose of specifying the result type... but only one bit is
really needed at present: text or binary.  If we revise the protocol
version at some point, we might want to use some of that bit space to
allow some more fine-grained negotiation of the protocol version.  So,
for example, we might define the top 5 bits as reserved (always pass
zero), the next bit as a text/binary flag, and the remaining 10 bits
as a 10-bit format version number.  When a change like this comes
along, we can bump the highest binary format version recognized by the
server, and clients who request the new version can get it.

Alternatively, we might conclude that a 2-byte integer for each
parameter is overkill and try to cut back... but the point is there's
a bunch of unused bitspace there now.  In theory we could even do
something this without bumping the protocol version since the
documentation seems clear that any value other than 0 and 1 yields
undefined behavior, but in practice that seems like it might be a bit
too edgy.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Inline Extension

2012-01-24 Thread Robert Haas
On Mon, Jan 23, 2012 at 7:59 PM, Daniel Farina dan...@heroku.com wrote:
 Things are still a bit ugly in the more complex cases: consider
 PostGIS's linkage against libproj and other libraries.  In order to
 cover all cases, I feel that what I need is an optional hook (for the
 same of argument, let's say it's another command type hook, e.g.
 archive_command) to be executed when extension (un)installation is
 occurs on a primary or is replayed on a standby whereby I can acquire
 the necessary dependencies for an extension or signal some kind of
 error (as to exactly how that interfaces with the server is delicate,
 should one want to supply good error messages to the user).

There aren't a whole lot of good options for handling errors on the
standby, in general.  If the operation has already occurred on the
master, it's too late to decide that we're not going to have it happen
on the standby as well.

Of course, if we confine ourselves to control and SQL files, then it
doesn't really matter: the catalog entries from the master will make
it to the standby regardless of the absence of the SQL and control
files, and maybe we could simply decide that the operations which
write in pg_extension aren't WAL-logged and can be separately
performed on the standby if you want them there as well.

But, that's a bit unlike how we normally do things.  And if we're
going to WAL-log the writing of the extension files, then I start to
feel like we should go back to putting the data in a system catalog
rather than the filesystem, because inventing a new WAL record type
for this seems like it will make things quite a bit more complicated
for no particular benefit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Removing freelist (was Re: [HACKERS] Should I implement DROP INDEX CONCURRENTLY?)

2012-01-24 Thread Jeff Janes
On Sat, Jan 21, 2012 at 2:29 PM, Jim Nasby j...@nasby.net wrote:
 On Jan 20, 2012, at 11:54 AM, Heikki Linnakangas wrote:

 So, you're proposing that we remove freelist altogether? Sounds reasonable, 
 but that needs to be performance tested somehow. I'm not sure what exactly 
 the test should look like, but it probably should involve an OLTP workload, 
 and large tables being created, populated to fill the cache with pages from 
 the table, and dropped, while the OLTP workload is running. I'd imagine that 
 the worst case scenario looks something like that.

 We should also look at having the freelist do something useful, instead of 
 just dropping it completely. Unfortunately that's probably more work...

If the head and tail are both protected by BufFreelistLock, I'm pretty
sure this will make things worse, not better.

If we change to having head and tail protected by separate spinlocks,
then I don't see how you can remove the last buffer from the list, or
add a buffer to an empty list, without causing havoc.

Does anyone have ideas for implementing a cross-platform, lock-free,
FIFO linked list?  Without that, I don't see how we are going to get
anywhere on this approach.

Cheers,

Jeff



Re: [HACKERS] Page Checksums

2012-01-24 Thread Robert Treat
On Tue, Jan 24, 2012 at 3:02 AM,  jes...@krogh.cc wrote:
 * Robert Treat:

 Would it be unfair to assert that people who want checksums but aren't
 willing to pay the cost of running a filesystem that provides
 checksums aren't going to be willing to make the cost/benefit trade
 off that will be asked for? Yes, it is unfair of course, but it's
 interesting how small the camp of those using checksummed filesystems
 is.

 Don't checksumming file systems currently come bundled with other
 features you might not want (such as certain vendors)?

 I would chip in and say that I would prefer sticking to well-known, proven
 filesystems like xfs/ext4 and letting the application do the checksumming.


*shrug* You could use Illumos or BSD and you'd get generally vendor-free
systems using ZFS, which I'd say offers more well-known and proven
checksumming than anything cooking in Linux land, or than the
yet-to-be-written checksumming in Postgres.

 I don't foresee fully production-ready checksumming filesystems being
 readily available in the standard Linux distributions in the near future.

 And yes, I would for sure turn such functionality on if it were present.


That's nice to say, but most people aren't willing to take a 50%
performance hit. Not saying what we end up with will be that bad, but
I've seen people get upset about performance hits much lower than
that.


Robert Treat
conjecture: xzilla.net
consulting: omniti.com



Re: Removing freelist (was Re: [HACKERS] Should I implement DROP INDEX CONCURRENTLY?)

2012-01-24 Thread Jeff Janes
On Mon, Jan 23, 2012 at 5:06 AM, Robert Haas robertmh...@gmail.com wrote:
 On Mon, Jan 23, 2012 at 12:12 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Sat, Jan 21, 2012 at 5:29 PM, Jim Nasby j...@nasby.net wrote:
 We should also look at having the freelist do something useful, instead of 
 just dropping it completely. Unfortunately that's probably more work...

 That's kinda my feeling as well.  The free list in its current form is
 pretty much useless, but I don't think we'll save much by getting rid
 of it, because that's just a single test.  The expensive part of what
 we do while holding BufFreelistLock is, I think, iterating through
 buffers taking and releasing a spinlock on each one (!).

 Yeah ... spinlocks that, by definition, will be uncontested.

 What makes you think that they are uncontested?

It is automatically partitioned over 131,072 spinlocks if
shared_buffers=1GB, so the chances of contention seem pretty low. If a
few very hot index root blocks are heavily contested, that would still
only be encountered a few times out of the 131,072. I guess we could
count them, similar to your LWLOCK_STATS changes.

 Or for that matter,
 that even an uncontested spinlock operation is cheap enough to do
 while holding a badly contended LWLock?

Is the concern that getting an uncontested spinlock takes lots of
instructions, or that the associated fences are expensive?


 So I think
 it would be advisable to prove rather than just assume that that's a
 problem.

 It's pretty trivial to prove that there is a very serious problem with
 BufFreelistLock.  I'll admit I can't prove what the right fix is just
 yet, and certainly measurement is warranted.


From my own analysis and experiments, the mere act of getting the
BufFreelistLock when it is contended is far more important than
anything actually done while holding that lock.  When the clock sweep
is done often enough that the lock is contended, then it is done often
enough to keep the average usagecount low, and so the average number of
buffers visited during a clock sweep before finding a usable one was
around 2.0.  YMMV.

Can we get some people who run busy high-CPU servers that churn the
cache and think they may be getting hit with this problem post the
results of this query:

select usagecount, count(*) from pg_buffercache group by usagecount;


Cheers,

Jeff



Re: Removing freelist (was Re: [HACKERS] Should I implement DROP INDEX CONCURRENTLY?)

2012-01-24 Thread Simon Riggs
On Mon, Jan 23, 2012 at 4:01 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 The real
 problem there is that BufFreelistLock is also used to protect the
 clock sweep pointer.

Agreed

 I think basically we gotta find a way to allow
 multiple backends to run clock sweeps concurrently.  Or else fix
 things so that the freelist never (well, hardly ever) runs dry.

Agreed.

The only question is what we do now. I'm happy with the thought of jam
tomorrow, but I'd like to see some minor improvements now, given that
when we say "we'll do that even better in the next release" it often
doesn't happen. Which is where those patches come in.

I've posted an improved lock wait analysis patch and have scheduled
some runs on heavily used systems to see what this tells us. Results
from live machines would also be greatly appreciated, since test
systems seem likely to be unrepresentative. Patch copied here again.

This is a key issue since RAM is cheap enough now that people are
swamped by it, so large shared_buffers are very common.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Page Checksums

2012-01-24 Thread Simon Riggs
On Tue, Jan 24, 2012 at 2:49 PM, Robert Treat r...@xzilla.net wrote:
 And yes, I would for sure turn such functionality on if it were present.


 That's nice to say, but most people aren't willing to take a 50%
 performance hit. Not saying what we end up with will be that bad, but
 I've seen people get upset about performance hits much lower than
 that.

When we talk about a 50% hit, are we discussing (1) checksums that are
checked on each I/O, or (2) checksums that are checked each time we
re-pin a shared buffer?  The 50% hit was my estimate of (2) and has
not yet been measured, so it shouldn't be used unqualified when
discussing checksums. The same is true of the "I would use it"
comments, since we're not sure whether you're voting for (1) or (2).

As to whether people will actually use (1), I have no clue. But I do
know that many people request that feature, including people who run
heavy-duty Postgres production systems and who also know about
filesystems. Do people need (2)? It's easy enough to add as an option,
once we have (1) and there is real interest.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services



Re: [HACKERS] Multithread Query Planner

2012-01-24 Thread Robert Haas
On Mon, Jan 23, 2012 at 2:45 PM, Merlin Moncure mmonc...@gmail.com wrote:
 Yes, but OP is proposing to use multiple threads inside the forked
 execution process.  That's a completely different beast.  Many other
 databases support parallel execution of a single query and it might
 very well be better/easier to do that with threads.

I doubt it.  Almost nothing in the backend is thread-safe.  You can't
acquire a heavyweight lock, a lightweight lock, or a spinlock. You
can't do anything that might elog() or ereport().  None of those
things are reentrant.  Consequently, you can't do anything that
involves reading or pinning a buffer, making a syscache lookup, or
writing WAL.  You can't even do something like parallelize the
qsort() of a chunk of data that's already been read into a private
buffer... because you'd have to call the comparison functions for the
data type, and they might elog() or ereport().  Of course, in certain
special cases (like int4) you could make it safe, but it's hard to
imagine anyone wanting to go to that amount of effort for such a small
payoff.

If we're going to do parallel query in PG, and I think we are going to
need to do that eventually, we're going to need a system where large
chunks of work can be handed off, as in the oft-repeated example of
parallelizing an append node by executing multiple branches
concurrently.  That's where the big wins are.  And that means either
overhauling the entire backend to make it thread-safe, or using
multiple backends.  The latter will be hard, but it'll still be a lot
easier than the former.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements

2012-01-24 Thread Merlin Moncure
On Tue, Jan 24, 2012 at 8:26 AM, Robert Haas robertmh...@gmail.com wrote:
 On Mon, Jan 23, 2012 at 5:49 PM, Merlin Moncure mmonc...@gmail.com wrote:
 I'm not sure that you're getting anything with that user facing
 complexity.  The only realistic case I can see for explicit control of
 wire formats is to defend your application from format changes
 in the server when upgrading the server and/or libpq.  This isn't a
 "let's get better compression" problem, this is an "I upgraded my
 database and my application broke" problem.

 Fixing this problem in non documentation fashion is going to require a
 full protocol change, period.

 Our current protocol allocates a 2-byte integer for the purposes of
 specifying the type of each parameter, and another 2-byte integer for
 the purpose of specifying the result type... but only one bit is
 really needed at present: text or binary.  If we revise the protocol
 version at some point, we might want to use some of that bit space to
 allow some more fine-grained negotiation of the protocol version.  So,
 for example, we might define the top 5 bits as reserved (always pass
 zero), the next bit as a text/binary flag, and the remaining 10 bits
 as a 10-bit format version number.  When a change like this comes
 along, we can bump the highest binary format version recognized by the
 server, and clients who request the new version can get it.

 Alternatively, we might conclude that a 2-byte integer for each
 parameter is overkill and try to cut back... but the point is there's
 a bunch of unused bitspace there now.  In theory we could even do
 something like this without bumping the protocol version, since the
 documentation seems clear that any value other than 0 and 1 yields
 undefined behavior, but in practice that seems like it might be a bit
 too edgy.

Yeah.  But again, this isn't a contract between libpq and the server,
but between the application and the server...unless you want libpq to
do format translation to something the application can understand (but
even then the application is still involved).  I'm not very
enthusiastic about encouraging libpq application authors to pass
format #defines for every single parameter and consumed datum to get
future proofing on wire formats.  So I'd vote against any format code
beyond the text/binary switch that currently exists (which, by the
way, while useful, is one of the great sins of libpq that we have to
deal with basically forever).  While wire formatting is granular down
to the type level, applications should not have to deal with that.
They should Just Work.  So who decides what format code to stuff into
the protocol?  Where are the codes defined?

I'm very much in the camp that sometime, presumably during connection
startup, the protocol accepts a non-#defined-in-libpq token (database
version?) from the application that describes to the server what wire
formats can be used and the server sends one back.  There probably has
to be some additional facilities for non-core types but let's put that
aside for the moment.  Those two tokens allow the server to pick the
highest supported wire format (text and binary!) that everybody
understands.  The server's token is useful if we're being fancy and we
want libpq to translate an older server's wire format to a newer one
for the application.  This of course means moving some of the type
system into the client, which is something we might not want to do
since among other things it puts a heavy burden on non-libpq driver
authors (but then again, they can always stay on the v3 protocol,
which can benefit from being frozen in terms of wire formats).

merlin



Re: [HACKERS] Measuring relation free space

2012-01-24 Thread Jaime Casanova
On Mon, Jan 23, 2012 at 7:18 PM, Noah Misch n...@leadboat.com wrote:
 On Mon, Jan 23, 2012 at 04:56:24PM -0300, Alvaro Herrera wrote:

 Hm.  Leaf pages hold as many tuples as non-leaf pages, no?  I mean
 for each page element there's a value and a CTID.  In non-leaf those
 CTIDs point to other index pages, one level down the tree; in leaf pages
 they point to the heap.

 That distinction seemed important when I sent my last message, but now I agree
 that it's largely irrelevant for free space purposes.  If someone feels like
 doing it, +1 for making pgstattuple() count non-leaf free space.


Actually, I agree that non-leaf pages are irrelevant... I just
confirmed that in a 300GB production system, none of the indexes on an
84M-row table, nor on a heavily updated one, has more than one root
page; all the rest are deleted, half_dead or leaf. So the possibility
of bloat coming from non-leaf pages seems very unlikely.

But the possibility of bloat coming from the meta page doesn't exist,
AFAIUI at least.

We need the most accurate value of usable free space we can get,
because the idea is to add a sampling mode to the function so we don't
scan the whole relation. That's why we still need the function.

BTW, pgstattuple also has the problem that it's not using a ring buffer.


Attached are two patches:
- v5: the same original patch, but only tracking space in leaf, deleted
and half_dead pages
- v5.1: adds the same for all kinds of indexes (the problem is that
this is inconsistent with the fact that pageinspect only manages btree
indexes for everything else)

-- 
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación
diff --git a/contrib/pageinspect/Makefile b/contrib/pageinspect/Makefile
new file mode 100644
index 13ba6d3..63fab95
*** a/contrib/pageinspect/Makefile
--- b/contrib/pageinspect/Makefile
*** MODULE_big	= pageinspect
*** 4,10 
  OBJS		= rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o
  
  EXTENSION = pageinspect
! DATA = pageinspect--1.0.sql pageinspect--unpackaged--1.0.sql
  
  ifdef USE_PGXS
  PG_CONFIG = pg_config
--- 4,12 
  OBJS		= rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o
  
  EXTENSION = pageinspect
! DATA = pageinspect--1.0.sql pageinspect--1.1.sql \
!pageinspect--1.0--1.1.sql \
!pageinspect--unpackaged--1.0.sql
  
  ifdef USE_PGXS
  PG_CONFIG = pg_config
diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
new file mode 100644
index dbb2158..9c0b0fb
*** a/contrib/pageinspect/btreefuncs.c
--- b/contrib/pageinspect/btreefuncs.c
***
*** 34,39 
--- 34,40 
  #include "utils/builtins.h"
  #include "utils/rel.h"
  
+ #include "btreefuncs.h"
  
  extern Datum bt_metap(PG_FUNCTION_ARGS);
  extern Datum bt_page_items(PG_FUNCTION_ARGS);
*** GetBTPageStatistics(BlockNumber blkno, B
*** 155,160 
--- 156,216 
  		stat->avg_item_size = 0;
  }
  
+ /*
+  * GetBTRelationFreeSpace
+  *
+  * Get the free space for a btree index.
+  * This is a helper function for relation_free_space()
+  *
+  */
+ float4
+ GetBTRelationFreeSpace(Relation rel)
+ {
+ 	BTPageStat stat;
+ 
+ 	Buffer		buffer;
+ 	BlockNumber blkno;
+ 	BlockNumber totalBlcksInRelation = RelationGetNumberOfBlocks(rel);
+ 	BlockNumber totalBlcksCounted = 0;
+ 	Size 		free_space = 0;
+ 	double		free_percent = 0;
+ 
+ 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+ 
+ 	/* Skip page 0 because it is a metapage */
+ 	for (blkno = 1; blkno < totalBlcksInRelation; blkno++)
+ 	{
+ 		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, bstrategy);
+ 		/* 
+ 		 * get the statistics of the indexes and use that info
+ 		 * to determine free space on the page
+ 		 */
+ 		GetBTPageStatistics(blkno, buffer, stat);
+ 		/* 
+ 		 * Consider pages DELETED and HALF_DEAD as empty,
+ 		 * besides those only consider LEAF pages
+ 		 */
+ 		if (stat.type == 'd' || stat.type == 'e')
+ 		{
+ 			free_space += stat.page_size;
+ 			totalBlcksCounted++;
+ 		}
+ 		else if (stat.type == 'l')
+ 		{
+ 			free_space += stat.free_size;		
+ 			totalBlcksCounted++;
+ 		}
+ 
+ 		ReleaseBuffer(buffer);
+ 	}
+ 
+ 	if (totalBlcksCounted > 0)
+ 		free_percent = ((float4) free_space) / (totalBlcksCounted * BLCKSZ);
+ 
+ 	return free_percent;
+ }
+ 
+ 
  /* ---
   * bt_page()
   *
diff --git a/contrib/pageinspect/btreefuncs.h b/contrib/pageinspect/btreefuncs.h
new file mode 100644
index ...549f878
*** a/contrib/pageinspect/btreefuncs.h
--- b/contrib/pageinspect/btreefuncs.h
***
*** 0 
--- 1,5 
+ /*
+  * contrib/pageinspect/btreefuncs.h
+  */
+ 
+ float4 GetBTRelationFreeSpace(Relation);
diff --git a/contrib/pageinspect/heapfuncs.c b/contrib/pageinspect/heapfuncs.c
new file mode 100644
index 260ccff..8e3961d
*** a/contrib/pageinspect/heapfuncs.c
--- b/contrib/pageinspect/heapfuncs.c

Re: [HACKERS] Multithread Query Planner

2012-01-24 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 I doubt it.  Almost nothing in the backend is thread-safe.  You can't
 acquire a heavyweight lock, a lightweight lock, or a spinlock. You
 can't do anything that might elog() or ereport().  None of those
 things are reentrant.

Not to mention palloc, another extremely fundamental and non-reentrant
subsystem.

Possibly we could work on making all that stuff re-entrant, but it would
be a huge amount of work for a distant and uncertain payoff.

regards, tom lane



Re: [HACKERS] Multithread Query Planner

2012-01-24 Thread Robert Haas
On Tue, Jan 24, 2012 at 11:25 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 I doubt it.  Almost nothing in the backend is thread-safe.  You can't
 acquire a heavyweight lock, a lightweight lock, or a spinlock. You
 can't do anything that might elog() or ereport().  None of those
 things are reentrant.

 Not to mention palloc, another extremely fundamental and non-reentrant
 subsystem.

 Possibly we could work on making all that stuff re-entrant, but it would
 be a huge amount of work for a distant and uncertain payoff.

Right.  I think it makes more sense to try to get parallelism working
first with the infrastructure we have.  Converting to use threading,
if we ever do it at all, should be something we view as a later
performance optimization.  But I suspect we won't want to do it
anyway; I think there will be easier ways to get where we want to be.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] lots of unused variable warnings in assert-free builds

2012-01-24 Thread Robert Haas
On Sat, Jan 21, 2012 at 1:06 PM, Peter Eisentraut pete...@gmx.net wrote:
 So, here are three patches that could solve this issue.

 pg-cassert-unused-attribute.patch, the one I already showed, using
 __attribute__((unused).

 pg-cassert-unused-ifdef.patch, using only additional #ifdef
 USE_ASSERT_CHECKING.  This does have additional documentation value, but
 you can see that it gets bulky and complicated.  (I introduced several
 bugs merely while creating this patch.)

 pg-cassert-unused-void.patch is an alternative approach that avoids the
 warnings by casting the arguments of Assert() to void if assertions are
 disabled.  This has the least code impact, but it changes the
 traditional semantics of asserts, which is that they disappear
 completely when not enabled.  You can see how this creates a problem in
 src/backend/replication/syncrep.c, where the nontrivial call to
 SyncRepQueueIsOrderedByLSN() now becomes visible even in non-assert
 builds.  I played around with some other options like
 __attribute__((pure)) to make the compiler optimize the function away in
 that case, but that didn't appear to work.  So this might not be a
 workable solution (and it would be GCC-specific anyway).

I think the third approach is unacceptable from a performance point of view.

Some of these problems can be fixed without resorting to as much
hackery as you have here.  For example, in nodeWorkTableScan.c, you
could easily fix the problem by getting rid of the estate variable and
replacing its single use within the assertion with the expression
used to initialize it on the previous line.  I think this might be the
cleanest solution in general.

I'm not sure what to do about the cases where that isn't practical.
Spraying the code with __attribute__((unused)) is somewhat undesirable
because it could mask a failure to properly initialize the variable in
an assert-enabled build.  We could have a macro
PG_UNUSED_IF_NO_ASSERTS or something, but that doesn't have any
obvious advantage over just getting rid of the variable altogether in
such cases.  I lean toward the view that variables not needed in
assertion-free builds should be #ifdef'd out altogether, as in your
second patch, but we should try to minimize the number of places where
we need to do that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] lots of unused variable warnings in assert-free builds

2012-01-24 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 Spraying the code with __attribute__((unused)) is somewhat undesirable
 because it could mask a failure to properly initialize the variable in
 an assert-enabled build.

Ouch.  Is it really true that __attribute__((unused)) disables detection
of use of uninitialized variables?  That would be nasty, and it's not
obvious to me why it should need to work like that.  But if it is true,
then I agree that that makes this approach not too tenable.

regards, tom lane



Re: [HACKERS] lots of unused variable warnings in assert-free builds

2012-01-24 Thread Robert Haas
On Tue, Jan 24, 2012 at 12:12 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 Spraying the code with __attribute__((unused)) is somewhat undesirable
 because it could mask a failure to properly initialize the variable in
 an assert-enabled build.

 Ouch.  Is it really true that __attribute__((unused)) disables detection
 of use of uninitialized variables?  That would be nasty, and it's not
 obvious to me why it should need to work like that.  But if it is true,
 then I agree that that makes this approach not too tenable.

Oh, I think maybe I am confused.  The downsides of disabling *unused*
variable detection are obviously much less than the downsides of
disabling *uninitialized* variable detection... although neither is
ideal.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Page Checksums

2012-01-24 Thread Jim Nasby
On Jan 24, 2012, at 9:15 AM, Simon Riggs wrote:
 On Tue, Jan 24, 2012 at 2:49 PM, Robert Treat r...@xzilla.net wrote:
 And yes, I would for sure turn such functionality on if it were present.
 
 
 That's nice to say, but most people aren't willing to take a 50%
 performance hit. Not saying what we end up with will be that bad, but
 I've seen people get upset about performance hits much lower than
 that.
 When we talk about a 50% hit, are we discussing (1) checksums that are
 checked on each I/O, or (2) checksums that are checked each time we
 re-pin a shared buffer?  The 50% hit was my estimate of (2) and has
 not yet been measured, so it shouldn't be used unqualified when
 discussing checksums. The same is true of the "I would use it"
 comments, since we're not sure whether you're voting for (1) or (2).
 
 As to whether people will actually use (1), I have no clue. But I do
 know is that many people request that feature, including people that
 run heavy duty Postgres production systems and who also know about
 filesystems. Do people need (2)? It's easy enough to add as an option,
 once we have (1) and there is real interest.

Some people will be able to take a 50% hit and will happily turn on
checksumming every time a page is pinned. But I suspect a lot of folks
can't afford that kind of hit, yet would really like to have their
filesystem cache protected (we're certainly in the latter camp).

As for checksumming filesystems, I didn't see any answers about whether
the filesystem *cache* is also protected by the filesystem checksum.
Even if it is, the choice of checksumming filesystems is certainly
limited... ZFS is the only one that seems to have real traction, but
that forces you off of Linux, which is a problem for many shops.
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net





[HACKERS] PgNext: CFP

2012-01-24 Thread Joshua D. Drake


Hello,

The call for papers for PgNext (the old PgWest/PgEast) is now open:

January 19th: Talk submission opens
April 15th: Talk submission closes
April 30th: Speaker notification

Submit: https://www.postgresqlconference.org/talk_types

Sincerely,

JD

--
Command Prompt, Inc. - http://www.commandprompt.com/
PostgreSQL Support, Training, Professional Services and Development
The PostgreSQL Conference - http://www.postgresqlconference.org/
@cmdpromptinc - @postgresconf - 509-416-6579



Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements

2012-01-24 Thread Robert Haas
On Tue, Jan 24, 2012 at 11:16 AM, Merlin Moncure mmonc...@gmail.com wrote:
 Our current protocol allocates a 2-byte integer for the purposes of
 specifying the type of each parameter, and another 2-byte integer for
 the purpose of specifying the result type... but only one bit is
 really needed at present: text or binary.  If we revise the protocol
 version at some point, we might want to use some of that bit space to
 allow some more fine-grained negotiation of the protocol version.  So,
 for example, we might define the top 5 bits as reserved (always pass
 zero), the next bit as a text/binary flag, and the remaining 10 bits
 as a 10-bit format version number.  When a change like this comes
 along, we can bump the highest binary format version recognized by the
 server, and clients who request the new version can get it.

 Alternatively, we might conclude that a 2-byte integer for each
 parameter is overkill and try to cut back... but the point is there's
 a bunch of unused bitspace there now.  In theory we could even do
 something like this without bumping the protocol version, since the
 documentation seems clear that any value other than 0 and 1 yields
 undefined behavior, but in practice that seems like it might be a bit
 too edgy.

 Yeah.  But again, this isn't a contract between libpq and the server,
 but between the application and the server...

I don't see how this is relevant.  The text/binary format flag is
there in both libpq and the underlying protocol.

  So I'd vote against any format code
 beyond the text/binary switch that currently exists (which, by the
 way, while useful, is one of the great sins of libpq that we have to
 deal with basically forever).  While wire formatting is granular down
 to the type level, applications should not have to deal with that.
 They should Just Work.  So who decides what format code to stuff into
 the protocol?  Where are the codes defined?

 I'm very much in the camp that sometime, presumably during connection
 startup, the protocol accepts a non-#defined-in-libpq token (database
 version?) from the application that describes to the server what wire
 formats can be used and the server sends one back.  There probably has
 to be some additional facilities for non-core types but let's put that
 aside for the moment.  Those two tokens allow the server to pick the
 highest supported wire format (text and binary!) that everybody
 understands.  The server's token is useful if we're being fancy and we
 want libpq to translate an older server's wire format to a newer one
 for the application.  This of course means moving some of the type
 system into the client, which is something we might not want to do
 since among other things it puts a heavy burden on non-libpq driver
 authors (but then again, they can always stay on the v3 protocol,
 which can benefit from being frozen in terms of wire formats).

I think it's sensible for the server to advertise a version to the
client, but I don't see how you can dismiss add-on types so blithely.
The format used to represent any given type is logically a property of
that type, and only for built-in types is that associated with the
server version.

I do wonder whether we are making a mountain out of a mole-hill here,
though.  If I properly understand the proposal on the table, which
it's possible that I don't, but if I do, the new format is
self-identifying: when the optimization is in use, it sets a bit that
previously would always have been clear.  So if we just go ahead and
change this, clients that have been updated to understand the new
format will work just fine.  The server uses the proposed optimization
only for arrays that meet certain criteria, so any properly updated
client must still be able to handle the case where that bit isn't set.
On the flip side, clients that aren't expecting the new optimization
might break.  But that's, again, no different than what happened when
we changed the default bytea output format.  If you get bit, you
either update your client or shut off the optimization and deal with
the performance consequences of so doing.  In fact, the cases are
almost perfectly analogous, because in each case the proposal was
based on the size of the output format being larger than necessary,
and wanting to squeeze it down to a smaller size for compactness.

And more generally, does anyone really expect that we're never going
to change the output format of any type we support ever again, without
retaining infinite backward compatibility?  I didn't hear any screams
of outrage when we updated the hyphenation rules for contrib/isbn -
well, ok, there were some howls, but that was because the rules were
still incomplete and US-centric, not so much because people thought it
was unacceptable for the hyphenation rules to be different in major
release N+1 than they were in major release N.  If the IETF goes and
defines a new standard for formatting IPv6 addresses, we're likely to
eventually support it via the inet and 

Re: [HACKERS] lots of unused variable warnings in assert-free builds

2012-01-24 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Tue, Jan 24, 2012 at 12:12 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Ouch. Is it really true that __attribute__((unused)) disables detection
 of use of uninitialized variables?

 Oh, I think maybe I am confused.  The downsides of disabling *unused*
 variable detection are obviously much less than the downsides of
 disabling *uninitialized* variable declaration... although neither is
 ideal.

OK.  MHO is that __attribute__((unused)) is probably less annoying than
#ifdef overall.  Also, it occurs to me that an intermediate macro
PG_USED_FOR_ASSERTS_ONLY would be a good idea, first because it
documents *why* you want to mark the variable as possibly unused,
and second because changing the macro definition would provide an easy way
to check for totally-unused variables, in case we wanted to periodically
make such checks.

This is all modulo the question of what pgindent will do with it,
which I would still like to see tested before we commit to a method.

regards, tom lane



Re: [HACKERS] lots of unused variable warnings in assert-free builds

2012-01-24 Thread Tom Lane
I wrote:
 Also, it occurs to me that an intermediate macro
 PG_USED_FOR_ASSERTS_ONLY would be a good idea, first because it
 documents *why* you want to mark the variable as possibly unused,
 and second because changing the macro definition would provide an easy way
 to check for totally-unused variables, in case we wanted to periodically
 make such checks.

Uh, wait a second.  Why not

#ifdef USE_ASSERT_CHECKING
#define PG_USED_FOR_ASSERTS_ONLY
#else
#define PG_USED_FOR_ASSERTS_ONLY __attribute__((unused))
#endif

Then, when you build with asserts on, you *automatically* get told
if the variable is entirely unused.

regards, tom lane



Re: [HACKERS] lots of unused variable warnings in assert-free builds

2012-01-24 Thread Robert Haas
On Tue, Jan 24, 2012 at 1:03 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I wrote:
 Also, it occurs to me that an intermediate macro
 PG_USED_FOR_ASSERTS_ONLY would be a good idea, first because it
 documents *why* you want to mark the variable as possibly unused,
 and second because changing the macro definition would provide an easy way
 to check for totally-unused variables, in case we wanted to periodically
 make such checks.

 Uh, wait a second.  Why not

 #ifdef USE_ASSERT_CHECKING
 #define PG_USED_FOR_ASSERTS_ONLY
 #else
 #define PG_USED_FOR_ASSERTS_ONLY __attribute__((unused))
 #endif

 Then, when you build with asserts on, you *automatically* get told
 if the variable is entirely unused.

Yes, that's what I meant when I suggested it originally.  I'm just not
sure it's any nicer than adding ifdefs for USE_ASSERT_CHECKING.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] patch : Allow toast tables to be moved to a different tablespace

2012-01-24 Thread Robert Haas
On Sun, Jan 22, 2012 at 11:04 AM, Julien Tachoires jul...@gmail.com wrote:
 2011/12/15 Alvaro Herrera alvhe...@commandprompt.com:

 Uhm, surely you could compare the original toast tablespace to the heap
 tablespace, and if they differ, handle appropriately when creating the
 new toast table?  Just pass down the toast tablespace into
 AlterTableCreateToastTable, instead of having it assume that
 rel->rd_rel->relnamespace is sufficient.  This should be done in all
 cases where a toast tablespace is created, which shouldn't be more than
 a handful of them.

 Thank you, that way seems right.
 Now, I distinguish before each creation of a TOAST table with
 AlterTableCreateToastTable() : if it will create a new one or recreate
 an existing one.
 Thus, in create_toast_table() when toastTableSpace is equal to
 InvalidOid, we are able :
 - to fallback to the main table tablespace in case of new TOAST table creation
 - to keep it previous tablespace in case of recreation.

 Here's a new version rebased against HEAD.

To ask more directly the question that's come up a few times upthread,
why do *you* think this is useful?  What motivated you to want this
behavior, and/or how do you think it could benefit other PostgreSQL
users?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] lots of unused variable warnings in assert-free builds

2012-01-24 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 Yes, that's what I meant when I suggested it originally.  I'm just not
 sure it's any nicer than adding ifdefs for USE_ASSERT_CHECKING.

I'm inclined to think that it probably is nicer, just because of less
vertical space used.  But again, this opinion is contingent on what it
will look like after pgindent gets done with it ...

regards, tom lane



Re: [HACKERS] PgNext: CFP

2012-01-24 Thread Dave Page
What date & venue?

On Tuesday, January 24, 2012, Joshua D. Drake j...@commandprompt.com wrote:

 Hello,

 The call for papers for PgNext (the old PgWest/PgEast) is now open:

 January 19th: Talk submission opens
 April 15th: Talk submission closes
 April 30th: Speaker notification

 Submit: https://www.postgresqlconference.org/talk_types

 Sincerely,

 JD

 --
 Command Prompt, Inc. - http://www.commandprompt.com/
 PostgreSQL Support, Training, Professional Services and Development
 The PostgreSQL Conference - http://www.postgresqlconference.org/
 @cmdpromptinc - @postgresconf - 509-416-6579



-- 
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] controlling the location of server-side SSL files

2012-01-24 Thread Peter Eisentraut
On Thu, 2012-01-19 at 15:44 -0500, Robert Haas wrote:
 On Sat, Jan 14, 2012 at 8:40 AM, Peter Eisentraut pete...@gmx.net wrote:
  On Mon, 2012-01-02 at 06:32 +0200, Peter Eisentraut wrote:
  I think I would like to have a set of GUC parameters to control the
  location of the server-side SSL files.
 
  Here is the patch for this.
 
  One thing that is perhaps worth thinking about:  Currently, we just
  ignore missing root.crt and root.crl files.  With this patch, we still
  do this, even if the user has given a specific nondefault location.
  That seems a bit odd, but I can't think of a simple way to do it better.
 
 There's a review in the CF app for this finding only minor issues, so
 I'm marking this patch therein as Ready for Committer.

OK, no one had any concerns about the missing file behavior I described
above?  If not, then I'll commit it soon.
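For reference, the committed form of this feature exposes the locations as ordinary settings; a sketch of a postgresql.conf fragment (parameter names as committed for 9.2, paths invented):

```
# Nondefault locations for the server-side SSL files
ssl_cert_file = '/etc/postgresql/server.crt'
ssl_key_file  = '/etc/postgresql/server.key'
ssl_ca_file   = '/etc/postgresql/root.crt'
ssl_crl_file  = '/etc/postgresql/root.crl'
```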





Re: [HACKERS] PgNext: CFP

2012-01-24 Thread Stefan Kaltenbrunner
On 01/24/2012 07:36 PM, Dave Page wrote:
 What date & venue?
 
 On Tuesday, January 24, 2012, Joshua D. Drake j...@commandprompt.com wrote:

 Hello,

 The call for papers for PgNext (the old PgWest/PgEast) is now open:

 January 19th: Talk submission opens
 April 15th: Talk submission closes
 April 30th: Speaker notification

 Submit: https://www.postgresqlconference.org/talk_types


the url has, in big letters, "Denver Convention Center" on June
26th-29th, 2012...


Stefan



Re: [HACKERS] New replication mode: write

2012-01-24 Thread Simon Riggs
On Tue, Jan 24, 2012 at 11:00 AM, Simon Riggs si...@2ndquadrant.com wrote:

 Yes, it might be too hard, but lets look.

Your committer has timed out ;-)

committed write mode only

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



[HACKERS] some longer, larger pgbench tests with various performance-related patches

2012-01-24 Thread Robert Haas
Early yesterday morning, I was able to use Nate Boley's test machine
to do a single 30-minute pgbench run at scale factor 300 using a variety
of trees built with various patches, and with the -l option added to
track latency on a per-transaction basis.  All tests were done using
32 clients and permanent tables.  The configuration was otherwise
identical to that described here:

http://archives.postgresql.org/message-id/CA+TgmoboYJurJEOB22Wp9RECMSEYGNyHDVFv5yisvERqFw=6...@mail.gmail.com

By doing this, I hoped to get a better understanding of (1) the
effects of a scale factor too large to fit in shared_buffers, (2) what
happens on a longer test run, and (3) how response time varies
throughout the test.  First, here are the raw tps numbers:

background-clean-slru-v2: tps = 2027.282539 (including connections establishing)
buffreelistlock-reduction-v1: tps = 2625.155348 (including connections
establishing)
buffreelistlock-reduction-v1-freelist-ok-v2: tps = 2468.638149
(including connections establishing)
freelist-ok-v2: tps = 2467.065010 (including connections establishing)
group-commit-2012-01-21: tps = 2205.128609 (including connections establishing)
master: tps = 2200.848350 (including connections establishing)
removebufmgrfreelist-v1: tps = 2679.453056 (including connections establishing)
xloginsert-scale-6: tps = 3675.312202 (including connections establishing)

Obviously these numbers are fairly noisy, especially since this is
just one run, so the increases and decreases might not be all that
meaningful.  Time permitting, I'll try to run some more tests to get
my hands around that situation a little better.

Graphs are here:

http://wiki.postgresql.org/wiki/Robert_Haas_9.2CF4_Benchmark_Results

There are two graphs for each branch.  The first is a scatter plot of
latency vs. transaction time.  I found that graph hard to understand,
though; I couldn't really tell what I was looking at.  So I made a
second set of graphs which graph number of completed transactions in a
given second of the test against time.  The results are also included
on the previous page, below the latency graphs, and I find them much
more informative.

A couple of things stand out at me from these graphs.  First, some of
these transactions had really long latency.  Second, there are a
remarkable number of seconds throughout the test during which no
transactions at all manage to complete, sometimes several seconds in a
row.  I'm not sure why.  Third, all of the tests initially start off
processing transactions very quickly, and get slammed down very hard,
probably because the very high rate of transaction processing early on
causes a checkpoint to occur around 200 s.  I didn't actually log when
the checkpoints were occurring, but it seems like a good guess.  It's
also interesting to wonder whether the checkpoint I/O itself causes
the drop-off, or the ensuing full page writes.  Fourth,
xloginsert-scale-6 helps quite a bit; in fact, it's the only patch
that actually changes the whole shape of the tps graph.  I'm
speculating here, but that may be because it blunts the impact of full
page writes by allowing backends to copy their full page images into
the write-ahead log in parallel.

One thing I also noticed while running the tests is that the system
was really not using much CPU time.  It was mostly idle.  That could
be because waiting for I/O leads to waiting for locks, or it could be
fundamental lock contention.  I don't know which.

A couple of obvious further tests suggest themselves: (1) rerun some
of the tests with full_page_writes=off, and (2) repeat this test with
the remaining performance-related patches.  It would be especially
interesting, I think, to see what effect the checkpoint-related
patches have on these graphs.  But I plan to drop
buffreelistlock-reduction-v1 and freelist-ok-v2 from future test runs
based on Simon's comments elsewhere.  I'm including the results here
just because these tests were already running when he made those
comments.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements

2012-01-24 Thread Merlin Moncure
On Tue, Jan 24, 2012 at 11:55 AM, Robert Haas robertmh...@gmail.com wrote:
 I do wonder whether we are making a mountain out of a mole-hill here,
 though.  If I properly understand the proposal on the table, which
 it's possible that I don't, but if I do, the new format is
 self-identifying: when the optimization is in use, it sets a bit that
 previously would always have been clear.  So if we just go ahead and
 change this, clients that have been updated to understand the new
 format will work just fine.  The server uses the proposed optimization
 only for arrays that meet certain criteria, so any properly updated
 client must still be able to handle the case where that bit isn't set.
  On the flip side, clients that aren't expecting the new optimization
 might break.  But that's, again, no different than what happened when
 we changed the default bytea output format.  If you get bit, you
 either update your client or shut off the optimization and deal with
 the performance consequences of so doing.

Well, the bytea experience was IMNSHO a complete disaster (It was
earlier mentioned that jdbc clients were silently corrupting bytea
datums) and should be held up as an example of how *not* to do things;
it's better to avoid having to depend on the GUC or defensive
programmatic intervention to prevent further occurrences of
application failure since the former doesn't work and the latter won't
be reliably done.  Waiting for applications to break in the field only
to point affected users at the GUC is weak sauce.  It's creating a
user culture that is terrified of database upgrades which hurts
everybody.

Database apps tend to have long lives in computer terms such that they
can greatly outlive the service life of a particular postgres dot
release or even the programmers who originally wrote the application.
I'm not too concerned about the viability of a programming department
with Robert Haas at the helm, but what about when he leaves?  What
about critical 3rd party software that is no longer maintained?

In regards to the array optimization, I think it's great -- but if you
truly want to avoid blowing up user applications, it needs to be
disabled automatically.

merlin



Re: [HACKERS] some longer, larger pgbench tests with various performance-related patches

2012-01-24 Thread Simon Riggs
On Tue, Jan 24, 2012 at 8:53 PM, Robert Haas robertmh...@gmail.com wrote:

 do a single 30-minute pgbench run at scale factor 300 using a variety

Nice

A minor but necessary point: Repeated testing of the Group commit
patch when you have synch commit off is clearly pointless, so
publishing numbers for that without saying very clearly that's what's
happening doesn't really help.

 Graphs are here:

 http://wiki.postgresql.org/wiki/Robert_Haas_9.2CF4_Benchmark_Results

Nicer

 There are two graphs for each branch.  The first is a scatter plot of
 latency vs. transaction time.  I found that graph hard to understand,
 though; I couldn't really tell what I was looking at.  So I made a
 second set of graphs which graph number of completed transactions in a
 given second of the test against time.  The results are also included
 on the previous page, below the latency graphs, and I find them much
 more informative.

 A couple of things stand out at me from these graphs.  First, some of
 these transactions had really long latency.  Second, there are a
 remarkable number of seconds throughout the test during which no
 transactions at all manage to complete, sometimes several seconds in a
 row.  I'm not sure why.  Third, all of the tests initially start off
 processing transactions very quickly, and get slammed down very hard,
 probably because the very high rate of transaction processing early on
 causes a checkpoint to occur around 200 s.

Check

I'm happy that this exposes characteristics I've seen and have been
discussing for a while now.

It would be useful to calculate the slow-down contribution of the
longer txns. What % of total time is given over to slowest 10% of
transactions. Looking at that, I think you'll see why we should care
about sorting out what happens in the worst cases.

  I didn't actually log when
 the checkpoints were occurring, but it seems like a good guess.  It's
 also interesting to wonder whether the checkpoint I/O itself causes
 the drop-off, or the ensuing full page writes.  Fourth,
 xloginsert-scale-6 helps quite a bit; in fact, it's the only patch
 that actually changes the whole shape of the tps graph.  I'm
 speculating here, but that may be because it blunts the impact of full
 page writes by allowing backends to copy their full page images into
 the write-ahead log in parallel.

I think we should be working to commit XLogInsert and then Group
Commit, then come back to the discussion.

There's no way we're committing other patches but not those, so
everything needs to be viewed with those two in. i.e. commit and then
re-baseline.

So I'd say no more tests just yet, but then lots of testing next week++.

 But I plan to drop
 buffreelistlock-reduction-v1 and freelist-ok-v2 from future test runs
 based on Simon's comments elsewhere.  I'm including the results here
 just because these tests were already running when he made those
 comments.

Yep

One you aren't testing is clog_history, which is designed to work in
conjunction with background_clean_slru. That last one clearly needs a
little tuning though...

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



[HACKERS] Different error messages executing CREATE TABLE or ALTER TABLE to create a column xmin

2012-01-24 Thread Vik Reykja
I took my first stab at hacking the sources to fix the bug reported here:

http://archives.postgresql.org/pgsql-bugs/2012-01/msg00152.php

It's such a simple patch but it took me several hours with Google and IRC
and I'm sure I did many things wrong.  It seems to work as advertised,
though, so I'm submitting it.

I suppose since I have now submitted a patch, it is my moral obligation to
review at least one.  I'm not sure how helpful I'll be, but I'll go bite
the bullet and sign myself up anyway.
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
new file mode 100644
index cb8ac67..7555d19
*** a/src/backend/commands/tablecmds.c
--- b/src/backend/commands/tablecmds.c
*** ATExecAddColumn(List **wqueue, AlteredTa
*** 4235,4240 ****
--- 4235,4241 ----
  	List	   *children;
  	ListCell   *child;
  	AclResult	aclresult;
+ 	HeapTuple	attTuple;
  
  	/* At top level, permission check was done in ATPrepCmd, else do it */
  	if (recursing)
*** ATExecAddColumn(List **wqueue, AlteredTa
*** 4314,4326 ****
  	 * this test is deliberately not attisdropped-aware, since if one tries to
  	 * add a column matching a dropped column name, it's gonna fail anyway.
  	 */
! 	if (SearchSysCacheExists2(ATTNAME,
! 							  ObjectIdGetDatum(myrelid),
! 							  PointerGetDatum(colDef->colname)))
! 		ereport(ERROR,
! 				(errcode(ERRCODE_DUPLICATE_COLUMN),
! 				 errmsg("column \"%s\" of relation \"%s\" already exists",
! 						colDef->colname, RelationGetRelationName(rel))));
  
  	/* Determine the new attribute's number */
  	if (isOid)
--- 4315,4341 ----
  	 * this test is deliberately not attisdropped-aware, since if one tries to
  	 * add a column matching a dropped column name, it's gonna fail anyway.
  	 */
! 	attTuple = SearchSysCache2(ATTNAME,
! 							   ObjectIdGetDatum(myrelid),
! 							   PointerGetDatum(colDef->colname));
! 
! 	if (HeapTupleIsValid(attTuple))
! 	{
! 		int			attnum;
! 
! 		attnum = ((Form_pg_attribute) GETSTRUCT(attTuple))->attnum;
! 		if (attnum <= 0)
! 			ereport(ERROR,
! 					(errcode(ERRCODE_DUPLICATE_COLUMN),
! 					 errmsg("column \"%s\" conflicts with a system column name",
! 							colDef->colname)));
! 		else
! 			ereport(ERROR,
! 					(errcode(ERRCODE_DUPLICATE_COLUMN),
! 					 errmsg("column \"%s\" of relation \"%s\" already exists",
! 							colDef->colname, RelationGetRelationName(rel))));
! 
! 		ReleaseSysCache(attTuple);
! 	}
  
  	/* Determine the new attribute's number */
  	if (isOid)
diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out
new file mode 100644
index e992549..8385afb
*** a/src/test/regress/expected/alter_table.out
--- b/src/test/regress/expected/alter_table.out
*** COMMENT ON TABLE tmp_wrong IS 'table com
*** 7,12 ****
--- 7,14 ----
  ERROR:  relation "tmp_wrong" does not exist
  COMMENT ON TABLE tmp IS 'table comment';
  COMMENT ON TABLE tmp IS NULL;
+ ALTER TABLE tmp ADD COLUMN xmin integer; -- fails
+ ERROR:  column "xmin" conflicts with a system column name
  ALTER TABLE tmp ADD COLUMN a int4 default 3;
  ALTER TABLE tmp ADD COLUMN b name;
  ALTER TABLE tmp ADD COLUMN c text;
diff --git a/src/test/regress/sql/alter_table.sql b/src/test/regress/sql/alter_table.sql
new file mode 100644
index d9bf08d..d4e4c49
*** a/src/test/regress/sql/alter_table.sql
--- b/src/test/regress/sql/alter_table.sql
*** COMMENT ON TABLE tmp_wrong IS 'table com
*** 9,14 ****
--- 9,16 ----
  COMMENT ON TABLE tmp IS 'table comment';
  COMMENT ON TABLE tmp IS NULL;
  
+ ALTER TABLE tmp ADD COLUMN xmin integer; -- fails
+ 
  ALTER TABLE tmp ADD COLUMN a int4 default 3;
  
  ALTER TABLE tmp ADD COLUMN b name;



Re: [HACKERS] basic pgbench runs with various performance-related patches

2012-01-24 Thread Tatsuo Ishii
 My test was run with synchronous_commit=off, so I didn't expect the
 group commit patch to have much of an impact.  I included it mostly to
 see whether by chance it helped anyway (since it also helps other WAL
 flushes, not just commits) or whether it caused any regression.

Oh, I see.

 One somewhat odd thing about these numbers is that, on permanent
 tables, all of the patches seemed to show regressions vs. master in
 single-client throughput.  That's a slightly difficult result to
 believe, though, so it's probably a testing artifact of some kind.

Maybe kernel cache effect?
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp



Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements

2012-01-24 Thread Tom Lane
Merlin Moncure mmonc...@gmail.com writes:
 On Tue, Jan 24, 2012 at 11:55 AM, Robert Haas robertmh...@gmail.com wrote:
 I do wonder whether we are making a mountain out of a mole-hill here,
 though.  If I properly understand the proposal on the table, which
 it's possible that I don't, but if I do, the new format is
 self-identifying: when the optimization is in use, it sets a bit that
 previously would always have been clear.  So if we just go ahead and
 change this, clients that have been updated to understand the new
 format will work just fine.  The server uses the proposed optimization
 only for arrays that meet certain criteria, so any properly updated
 client must still be able to handle the case where that bit isn't set.
  On the flip side, clients that aren't expecting the new optimization
 might break.  But that's, again, no different than what happened when
 we changed the default bytea output format.  If you get bit, you
 either update your client or shut off the optimization and deal with
 the performance consequences of so doing.

 Well, the bytea experience was IMNSHO a complete disaster (It was
 earlier mentioned that jdbc clients were silently corrupting bytea
 datums) and should be held up as an example of how *not* to do things;

Yeah.  In both cases, the (proposed) new output format is
self-identifying *to clients that know what to look for*.  Unfortunately
it would only be the most anally-written pre-existing client code that
would be likely to spit up on the unexpected variations.  What's much
more likely to happen, and did happen in the bytea case, is silent data
corruption.  The lack of redundancy in binary data makes this even more
likely, and the documentation situation makes it even worse.  If we had
had a clear binary-data format spec from day one that told people that
they must check for unexpected contents of the flag field and fail, then
maybe we could get away with considering not doing so to be a
client-side bug ... but I really don't think we have much of a leg to
stand on given the poor documentation we've provided.

 In regards to the array optimization, I think it's great -- but if you
 truly want to avoid blowing up user applications, it needs to be
 disabled automatically.

Right.  We need to fix things so that this format will not be sent to
clients unless the client code has indicated ability to accept it.
A GUC is a really poor proxy for that.

regards, tom lane



Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements

2012-01-24 Thread Robert Haas
On Tue, Jan 24, 2012 at 8:13 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Well, the bytea experience was IMNSHO a complete disaster (It was
 earlier mentioned that jdbc clients were silently corrupting bytea
 datums) and should be held up as an example of how *not* to do things;

 Yeah.  In both cases, the (proposed) new output format is
 self-identifying *to clients that know what to look for*.  Unfortunately
 it would only be the most anally-written pre-existing client code that
 would be likely to spit up on the unexpected variations.  What's much
 more likely to happen, and did happen in the bytea case, is silent data
 corruption.  The lack of redundancy in binary data makes this even more
 likely, and the documentation situation makes it even worse.  If we had
 had a clear binary-data format spec from day one that told people that
 they must check for unexpected contents of the flag field and fail, then
 maybe we could get away with considering not doing so to be a
 client-side bug ... but I really don't think we have much of a leg to
 stand on given the poor documentation we've provided.

 In regards to the array optimization, I think it's great -- but if you
 truly want to avoid blowing up user applications, it needs to be
 disabled automatically.

 Right.  We need to fix things so that this format will not be sent to
 clients unless the client code has indicated ability to accept it.
 A GUC is a really poor proxy for that.

OK.  It seems clear to me at this point that there is no appetite for
this patch in its present form:

https://commitfest.postgresql.org/action/patch_view?id=715

Furthermore, while we haven't settled the question of exactly what a
good negotiation facility would look like, we seem to agree that a GUC
isn't it.  I think that means this isn't going to happen for 9.2, so
we should mark this patch Returned with Feedback and return to this
topic for 9.3.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



[HACKERS] Fix for pg_upgrade tablespace function usage

2012-01-24 Thread Bruce Momjian
I have applied the attached patch to git head to fix the new SQL tablespace
location function usage in pg_upgrade to properly check cluster version
numbers, and to fix a missing table alias.

I found this problem during testing.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +
diff --git a/contrib/pg_upgrade/info.c b/contrib/pg_upgrade/info.c
new file mode 100644
index e8361ce..692cdc2
*** a/contrib/pg_upgrade/info.c
--- b/contrib/pg_upgrade/info.c
*** get_db_infos(ClusterInfo *cluster)
*** 204,210 ****
  			 /* we don't preserve pg_database.oid so we sort by name */
  			 "ORDER BY 2",
  			 /* 9.2 removed the spclocation column */
! 			 (GET_MAJOR_VERSION(old_cluster.major_version) <= 901) ?
  			 "t.spclocation" : "pg_catalog.pg_tablespace_location(t.oid) AS spclocation");
  
  	res = executeQueryOrDie(conn, "%s", query);
--- 204,210 ----
  			 /* we don't preserve pg_database.oid so we sort by name */
  			 "ORDER BY 2",
  			 /* 9.2 removed the spclocation column */
! 			 (GET_MAJOR_VERSION(cluster->major_version) <= 901) ?
  			 "t.spclocation" : "pg_catalog.pg_tablespace_location(t.oid) AS spclocation");
  
  	res = executeQueryOrDie(conn, "%s", query);
*** get_rel_infos(ClusterInfo *cluster, DbIn
*** 287,293 ****
  	/* we preserve pg_class.oid so we sort by it to match old/new */
  			 "ORDER BY 1;",
  	/* 9.2 removed the spclocation column */
! 			 (GET_MAJOR_VERSION(old_cluster.major_version) <= 901) ?
  			 "t.spclocation" : "pg_catalog.pg_tablespace_location(t.oid) AS spclocation",
  	/* see the comment at the top of old_8_3_create_sequence_script() */
  			 (GET_MAJOR_VERSION(old_cluster.major_version) <= 803) ?
--- 287,293 ----
  	/* we preserve pg_class.oid so we sort by it to match old/new */
  			 "ORDER BY 1;",
  	/* 9.2 removed the spclocation column */
! 			 (GET_MAJOR_VERSION(cluster->major_version) <= 901) ?
  			 "t.spclocation" : "pg_catalog.pg_tablespace_location(t.oid) AS spclocation",
  	/* see the comment at the top of old_8_3_create_sequence_script() */
  			 (GET_MAJOR_VERSION(old_cluster.major_version) <= 803) ?
diff --git a/contrib/pg_upgrade/tablespace.c b/contrib/pg_upgrade/tablespace.c
new file mode 100644
index 11fd9d0..6b61f4b
*** a/contrib/pg_upgrade/tablespace.c
--- b/contrib/pg_upgrade/tablespace.c
*** get_tablespace_paths(void)
*** 53,59 ****
  			 "		spcname != 'pg_global'",
  	/* 9.2 removed the spclocation column */
  			 (GET_MAJOR_VERSION(old_cluster.major_version) <= 901) ?
! 			 "t.spclocation" : "pg_catalog.pg_tablespace_location(oid) AS spclocation");
  
  	res = executeQueryOrDie(conn, "%s", query);
  
--- 53,59 ----
  			 "		spcname != 'pg_global'",
  	/* 9.2 removed the spclocation column */
  			 (GET_MAJOR_VERSION(old_cluster.major_version) <= 901) ?
! 			 "spclocation" : "pg_catalog.pg_tablespace_location(oid) AS spclocation");
  
  	res = executeQueryOrDie(conn, "%s", query);
  
  



Re: [HACKERS] pg_upgrade with plpython is broken

2012-01-24 Thread Bruce Momjian
On Fri, Jan 20, 2012 at 07:01:46AM +0200, Peter Eisentraut wrote:
 On Thu, 2012-01-19 at 17:04 -0500, Bruce Momjian wrote:
  For that reason, I wonder if I should just hard-code the plpython
  rename into the pg_upgrade test in check_loadable_libraries().
 
 Yes, I haven't come up with a better solution either.
 
 If this becomes a general problem, we might need to add a command line
 option to ignore certain names or something.  But not yet.

Well, the problem is a little more complex than reported.  It turns out
in PG 9.0 we kept the plpython.so file and symlinked plpython2.so to it.
In PG 9.1, we removed plpython.so, and only have plpython2.so, so the
problem is with PG >= 9.1, and does not affect 9.0, which explains why
we didn't get any 9.0 reports of a problem.

I have applied the attached patch to PG head and 9.1 to fix the library
checking problem.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +
diff --git a/contrib/pg_upgrade/function.c b/contrib/pg_upgrade/function.c
new file mode 100644
index 54f139a..4505e51
*** a/contrib/pg_upgrade/function.c
--- b/contrib/pg_upgrade/function.c
*** check_loadable_libraries(void)
*** 228,235 ****
  		char	   *cmd = (char *) pg_malloc(8 + 2 * llen + 1);
  		PGresult   *res;
  
  		strcpy(cmd, "LOAD '");
! 		PQescapeStringConn(conn, cmd + 6, lib, llen, NULL);
  		strcat(cmd, "'");
  
  		res = PQexec(conn, cmd);
--- 228,251 ----
  		char	   *cmd = (char *) pg_malloc(8 + 2 * llen + 1);
  		PGresult   *res;
  
+ 		/*
+ 		 *	In Postgres 9.0, Python 3 support was added, and to do that, a
+ 		 *	plpython2u language was created with library name plpython2.so
+ 		 *	as a symbolic link to plpython.so.  In Postgres 9.1, only the
+ 		 *	plpython2.so library was created, and both plpythonu and
+ 		 *	plpython2u pointing to it.  For this reason, any reference to
+ 		 *	library name "plpython" in an old PG <= 9.1 cluster must look
+ 		 *	for "plpython2" in the new cluster.
+ 		 */
+ 		if (GET_MAJOR_VERSION(old_cluster.major_version) < 901 &&
+ 			strcmp(lib, "$libdir/plpython") == 0)
+ 		{
+ 			lib = "$libdir/plpython2";
+ 			llen = strlen(lib);
+ 		}
+ 
  		strcpy(cmd, "LOAD '");
! 		PQescapeStringConn(conn, cmd + strlen(cmd), lib, llen, NULL);
  		strcat(cmd, "'");
  
  		res = PQexec(conn, cmd);



Re: [HACKERS] Group commit, revised

2012-01-24 Thread Simon Riggs
On Fri, Jan 20, 2012 at 10:30 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:

 I spent some time cleaning this up. Details below, but here are the
 highlights:

 * Reverted the removal of wal_writer_delay
 * Removed heuristic on big flushes

No contested viewpoints on anything there.

 * Doesn't rely on WAL writer. Any process can act as the leader now.
 * Uses PGSemaphoreLock/Unlock instead of latches

Thanks for producing an alternate version, it will allow us to comment
on various approaches.

There is much yet to discuss so please don't think about committing
anything yet.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] some longer, larger pgbench tests with various performance-related patches

2012-01-24 Thread Pavan Deolasee
On Wed, Jan 25, 2012 at 2:23 AM, Robert Haas robertmh...@gmail.com wrote:
 Early yesterday morning, I was able to use Nate Boley's test machine
 to do a single 30-minute pgbench run at scale factor 300 using a variety
 of trees built with various patches, and with the -l option added to
 track latency on a per-transaction basis.  All tests were done using
 32 clients and permanent tables.  The configuration was otherwise
 identical to that described here:

 http://archives.postgresql.org/message-id/CA+TgmoboYJurJEOB22Wp9RECMSEYGNyHDVFv5yisvERqFw=6...@mail.gmail.com

 By doing this, I hoped to get a better understanding of (1) the
 effects of a scale factor too large to fit in shared_buffers, (2) what
 happens on a longer test run, and (3) how response time varies
 throughout the test.  First, here are the raw tps numbers:

 background-clean-slru-v2: tps = 2027.282539 (including connections 
 establishing)
 buffreelistlock-reduction-v1: tps = 2625.155348 (including connections
 establishing)
 buffreelistlock-reduction-v1-freelist-ok-v2: tps = 2468.638149
 (including connections establishing)
 freelist-ok-v2: tps = 2467.065010 (including connections establishing)
 group-commit-2012-01-21: tps = 2205.128609 (including connections 
 establishing)
 master: tps = 2200.848350 (including connections establishing)
 removebufmgrfreelist-v1: tps = 2679.453056 (including connections 
 establishing)
 xloginsert-scale-6: tps = 3675.312202 (including connections establishing)

 Obviously these numbers are fairly noisy, especially since this is
 just one run, so the increases and decreases might not be all that
 meaningful.  Time permitting, I'll try to run some more tests to get
 my hands around that situation a little better,


This is nice. I am sure long-running tests will point out many more
issues. If we are doing these tests, it might be more effective to run
even longer, so that each test covers at least 3-4
checkpoints/vacuums/analyzes (and other such events which can impact
the final numbers either way). Otherwise, one patch may stand out
simply because it avoids, say, one checkpoint.

It would definitely help to log the checkpoint,
auto-vacuum/auto-analyze details and plot them on the graph to see if
the drop in performance has anything to do with these activities. It
might also be a good idea to collect pg statistics such as relation
sizes at the end of the run.
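Along those lines, here is a minimal sketch of post-processing the -l
per-transaction log into per-second buckets, ready for plotting against
checkpoint timestamps. It assumes the 9.1-era six-field line format
(client_id, xact_no, latency in microseconds, file_no, epoch seconds,
microseconds); the function name and sample values are made up.

```python
from collections import defaultdict

def bucket_pgbench_log(lines):
    """Aggregate per-transaction pgbench log lines (written by -l) into
    per-second buckets of (transaction count, mean latency in ms)."""
    buckets = defaultdict(lambda: [0, 0])   # second -> [count, total_us]
    for line in lines:
        fields = line.split()
        latency_us = int(fields[2])         # per-transaction latency, in us
        second = int(fields[4])             # completion time, epoch seconds
        buckets[second][0] += 1
        buckets[second][1] += latency_us
    return {
        sec: (count, total / count / 1000.0)   # (txns that second, mean ms)
        for sec, (count, total) in sorted(buckets.items())
    }

# Three made-up log lines: two transactions in one second, one in the next.
sample = [
    "0 1 1500 0 1327400000 123456",
    "1 1 2500 0 1327400000 223456",
    "0 2 2000 0 1327400001 323456",
]
print(bucket_pgbench_log(sample))
# -> {1327400000: (2, 2.0), 1327400001: (1, 2.0)}
```

Overlaying the resulting per-second tps/latency series with the
timestamps from log_checkpoints and the autovacuum log lines would show
directly whether the periodic drops line up with those events.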

Thanks,
Pavan
-- 
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com



Re: [HACKERS] PgNext: CFP

2012-01-24 Thread Dave Page
On Tuesday, January 24, 2012, Stefan Kaltenbrunner ste...@kaltenbrunner.cc
wrote:
 On 01/24/2012 07:36 PM, Dave Page wrote:
 What date  venue?

 On Tuesday, January 24, 2012, Joshua D. Drake j...@commandprompt.com
wrote:

 Hello,

 The call for papers for PgNext (the old PgWest/PgEast) is now open:

 January 19th: Talk submission opens
 April 15th: Talk submission closes
 April 30th: Speaker notification

 Submit: https://www.postgresqlconference.org/talk_types


 the URL has, in big letters, Denver Convention Center on June
 26th-29th, 2012...

Not too visible on my phone...

-- 
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] [COMMITTERS] pgsql: Add new replication mode synchronous_commit = 'write'.

2012-01-24 Thread Fujii Masao
On Wed, Jan 25, 2012 at 5:35 AM, Jaime Casanova ja...@2ndquadrant.com wrote:
 On Tue, Jan 24, 2012 at 3:22 PM, Simon Riggs si...@2ndquadrant.com wrote:
 Add new replication mode synchronous_commit = 'write'.
 Replication occurs only to memory on standby, not to disk,
 so provides additional performance if user wishes to
 reduce durability level slightly. Adds concept of multiple
 independent sync rep queues.

 Fujii Masao and Simon Riggs


 i guess, you should add the new value in postgresql.conf too.

Yes, I forgot to do that. Patch attached.
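For reference, the sample-file line might look something like this (a
sketch only; the exact comment wording in the committed patch may
differ):

```
#synchronous_commit = on    # synchronization level; on, write, local, or off
```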

 just a question, why not add a flush value and make it behave like on?
 actually sync rep is on in both cases and having different names
 makes it easier to understand what is happening.
 i only want to keep on for compatibility but i wouldn't mind if we
 remove it in favor of more descriptive names.

Also, should we add async as an alias for off?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center





Re: [HACKERS] New replication mode: write

2012-01-24 Thread Fujii Masao
On Wed, Jan 25, 2012 at 5:28 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Tue, Jan 24, 2012 at 11:00 AM, Simon Riggs si...@2ndquadrant.com wrote:

 Yes, it might be too hard, but lets look.

 Your committer has timed out ;-)

 committed write mode only

Thanks for the commit!

The apply mode is attractive, but I need more time to implement it
completely, and I might not be able to finish within this CF. So
committing the write mode only is the right decision, I think. If I
have time after all of the patches I'm interested in have been
committed, I will try the apply mode again, but maybe for 9.3dev.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
