Re: [HACKERS] Removing PD_ALL_VISIBLE

2013-01-17 Thread Simon Riggs
On 17 January 2013 03:02, Jeff Davis pg...@j-davis.com wrote:

 Rebased patch attached. No significant changes.

Jeff, can you summarise/collate why we're doing this, what concerns it
raises and how you've dealt with them? That will help decide whether
to commit.

Thanks

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] CF3+4

2013-01-17 Thread Magnus Hagander
On Jan 17, 2013 8:15 AM, Abhijit Menon-Sen a...@2ndquadrant.com wrote:

 At 2013-01-17 16:05:05 +0900, michael.paqu...@gmail.com wrote:
 
  Is it really necessary to create a new commit fest just to move the
  items? The patches that are considered too late for 9.3 should just be
  returned with feedback.

 Opening 2013-03 is not so much to move existing patches, but to give
 people a place to submit *new* post-9.3 patches without interrupting
 the 2013-01 CF.

Yeah, and +1 for doing that. The sooner the better. By whichever of the
time frames for the CF that have been discussed, it should certainly no
longer be open for accepting new patches for 9.3.

/Magnus


Re: [HACKERS] Removing PD_ALL_VISIBLE

2013-01-17 Thread Pavan Deolasee
On Thu, Jan 17, 2013 at 2:11 PM, Simon Riggs si...@2ndquadrant.com wrote:

 On 17 January 2013 03:02, Jeff Davis pg...@j-davis.com wrote:

  Rebased patch attached. No significant changes.

 Jeff, can you summarise/collate why we're doing this, what concerns it
 raises and how you've dealt with them? That will help decide whether
 to commit.


+1. On another thread, "Set visibility map bit after HOT prune", Robert
mentioned that it's not such a good idea to remove the PD_ALL_VISIBLE
flag, because it helps us reduce contention on the VM page, especially
when we need to reset the VM bit. Here is an excerpt from Robert's
comment on that thread:

Sure, but you're zipping rather blithely past the disadvantages of
such an approach.  Jeff Davis recently proposed getting rid of
PD_ALL_VISIBLE, and Tom and I both expressed considerable skepticism
about that; this proposal has the same problems.  One of the major
benefits of PD_ALL_VISIBLE is that, when it isn't set, inserts,
updates, and deletes to the page can ignore the visibility map.  That
means that a server under heavy concurrency is much less likely to
encounter contention on the visibility map blocks.  Now, maybe that's
not really a problem, but I sure haven't seen enough evidence to make
me believe it.  If it's really true that PD_ALL_VISIBLE needn't fill
this role, then Heikki wasted an awful lot of time implementing it,
and I wasted an awful lot of time keeping it working when I made the
visibility map crash-safe for IOS.  That could be true, but I tend to
think it isn't.

Maybe you've already addressed that concern with proven performance
numbers, but I'm not sure.

Thanks,
Pavan

--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee




Re: [HACKERS] CF3+4 (was Re: Parallel query execution)

2013-01-17 Thread Magnus Hagander
On Thu, Jan 17, 2013 at 8:19 AM, Pavan Deolasee
pavan.deola...@gmail.com wrote:

 On Wed, Jan 16, 2013 at 1:51 PM, Abhijit Menon-Sen a...@2ndquadrant.com
 wrote:

 At 2013-01-16 02:07:29 -0500, t...@sss.pgh.pa.us wrote:
 
  In case you hadn't noticed, we've totally lost control of
  the CF process.

 What can we do to get it back on track?

 I know various people (myself included) have been trying to keep CF3
 moving, e.g. sending followup mail, adjusting patch status, etc.

 I want to help, but I don't know what's wrong. What are the committers
 working on, and what is the status of the "Ready for Committer" patches?
 Is the problem that the patches marked "Ready" aren't, in fact, ready? Or
 is it lack of feedback from authors? Or something else?


 ISTM that even committers are often overwhelmed with the work, their own as
 well as that of reviewing others' patches and committing them. Especially
 when a patch is large or touches core areas, I can feel the significant work
 that the committer has to do even after some help and initial review from CF
 reviewers. On the other hand, if the patches are not committed in time, we
 lose context, interest dies down, and when the patch is finally picked up
 by a committer who is often not involved in the original discussion, many
 points need to be revisited and reworked.

 Would it help to step up a few developers and create a second line of
 committers? Commits by the second-line committers would still be reviewed
 by the first-line committers before they make it into the product, but
 perhaps at a later stage or when we are in beta. I even thought of
 suggesting that the master branch contain only commits by the first-line
 committers. We would then maintain a secondary branch which also has
 commits from the second-line committers, in addition to all commits from
 the master branch. The first-line committers could then cherry-pick from
 the secondary branch at some later stage. But I'm not sure whether this
 would add more overhead than it removes and defeat the purpose.

While we can certainly do that, it would probably help just to have a
second line of reviewers. Basically a set of more senior reviewers -
so a patch would go submission -> reviewer -> senior reviewer ->
committer. With the second line of reviewers focusing more on the
whole "how to do things", etc.

Like you, I'm not sure if it creates more overhead than it solves, but
it might be worth a try. In a way it already exists, because I'm sure
committers pay slightly more attention to reviews by people who have
been doing it a lot and are known to process those things, than to new
entries. All that would be needed is for those people to realize it
would be helpful for them to do a second-stage review even if somebody
else has done the first one.

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/




Re: [HACKERS] Removing PD_ALL_VISIBLE

2013-01-17 Thread Abhijit Menon-Sen
At 2013-01-17 08:41:37 +, si...@2ndquadrant.com wrote:

 Jeff, can you summarise/collate why we're doing this, what concerns it
 raises and how you've dealt with them?

Since I was just looking at the original patch and discussion, and since
Pavan has posted an excerpt from one objection to it, here's an excerpt
from Jeff's original post, titled "Do we need so many hint bits?":

http://www.postgresql.org/message-id/1353026577.14335.91.camel@sussancws0025

Also, I am wondering about PD_ALL_VISIBLE. It was originally
introduced in the visibility map patch, apparently as a way to know
when to clear the VM bit when doing an update. It was then also used
for scans, which showed a significant speedup. But I wonder: why not
just use the visibilitymap directly from those places? It can be
used for the scan because it is crash safe now (not possible
before). And since it's only one lookup per scanned page, then I
don't think it would be a measurable performance loss there.
Inserts/updates/deletes also do a significant amount of work, so
again, I doubt it's a big drop in performance there -- maybe under a
lot of concurrency or something.

The benefit of removing PD_ALL_VISIBLE would be significantly
higher. It's quite common to load a lot of data, and then do some
reads for a while (setting hint bits and flushing them to disk), and
then do a VACUUM a while later, setting PD_ALL_VISIBLE and writing
all of the pages again. Also, if I remember correctly, Robert went
to significant effort when making the VM crash-safe to keep the
PD_ALL_VISIBLE and VM bits consistent. Maybe this was all discussed
before?
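As a rough illustration of what Jeff is suggesting in the scan path (a
sketch with approximate names, not the actual heapam.c code):

/*
 * Sketch only: consult the visibility map directly instead of the
 * page-level PD_ALL_VISIBLE bit.  visibilitymap_test() keeps the VM
 * page pinned in vmbuffer, so repeated lookups for nearby heap pages
 * are cheap -- one VM page covers tens of thousands of heap pages.
 */
if (visibilitymap_test(scan->rs_rd, blockno, &vmbuffer))
    all_visible = true;     /* skip per-tuple visibility checks */
else
    all_visible = false;    /* fall back to checking each tuple */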

There was considerable discussion after this (accessible through the
archives link above), which I won't attempt to summarise.

-- Abhijit




Re: [HACKERS] review: pgbench - aggregation of info written into log

2013-01-17 Thread Magnus Hagander
On Thu, Jan 17, 2013 at 2:35 AM, Tatsuo Ishii is...@postgresql.org wrote:
 This might be way more than we want to do, but there is an article
 that describes some techniques for doing what seems to be missing
 (AIUI):

 http://msdn.microsoft.com/en-us/magazine/cc163996.aspx
 Even if this would be doable, I'm afraid it may not fit in 9.3 if we
 think about the current status of the CF. So our choices would be:

 1) Postpone the patch to 9.4

 2) Commit the patch in 9.3 without Windows support

 I personally am OK with #2. We traditionally avoid platform-specific
 features in PostgreSQL.  However, I think the policy could be loosened
 for contrib stuff. Also, pgbench is just a client program. We could
 always use pgbench on UNIX/Linux if we truly need the feature.

 What do you think?

 Fair enough, I was just trying to point out alternatives. We have
 committed platform-specific features before now. I hope it doesn't
 just get left like this, though.

We have committed platform-specific features before, but generally
only when it's not *possible* to do them for all platforms. For
example the posix_fadvise stuff isn't available on Windows at all, so
there isn't much we can do there.


 Yeah, I hope someone picks this up and proposes it as a TODO item. In the
 mean time, I'm going to commit the patch without Windows support
 unless there's objection.

Perhaps we should hold off until someone commits to actually getting
it fixed in the next version? If we do have that, then we can commit
it as a partial feature, but if we just hope someone picks it up,
that's leaving it very loose...

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/




Re: [HACKERS] Removing PD_ALL_VISIBLE

2013-01-17 Thread Pavan Deolasee
On Thu, Jan 17, 2013 at 2:57 PM, Abhijit Menon-Sen a...@2ndquadrant.com wrote:



 There was considerable discussion after this (accessible through the
 archives link above), which I won't attempt to summarise.


I thought Robert made those comments after considerable discussion of
Jeff's approach. So he probably still stands by his objections, or at
least is not satisfied with, or has not seen, the numbers.

Now that I look at the patch, I wonder if there is another fundamental
issue with it. Since the patch removes WAL logging for the VM-set
operation, this can happen:

1. Vacuum kicks in, clears all dead tuples in a page, and decides that
it's all-visible
2. Vacuum WAL-logs the cleanup activity and marks the page dirty
3. Vacuum sets the visibility bit and marks the VM page dirty
4. Say the VM page gets written to disk. The heap page has not yet been
written, and neither has the WAL record for the cleanup operation
5. CRASH

After recovery, the VM bit will remain set because the VM page got written
before the crash. But since the heap page's cleanup WAL never made it to
disk, those operations won't be replayed. The heap page will be left with
not-all-visible tuples in that case, and that's not a good state to be in.

The original code does not have this problem because the VM-set WAL record
is written after the heap page's cleanup WAL record. So it's guaranteed
that the VM bit will be set during recovery only if the cleanup WAL is
replayed too (there is more magic than meets the eye, and I think it's not
fully documented).
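
To make the ordering concrete, here is a rough sketch of the invariant
the original code relies on (illustrative names, not the actual
vacuumlazy.c code):

/*
 * Sketch only: the heap-cleanup WAL record must be emitted before the
 * record that sets the VM bit, so that replay can never see the bit
 * set without having replayed the cleanup first.
 */
XLogRecPtr  recptr;

START_CRIT_SECTION();
prune_page(heapbuf);                        /* remove dead tuples */
MarkBufferDirty(heapbuf);
recptr = log_heap_cleanup(rel, heapbuf);    /* cleanup WAL goes first */
PageSetLSN(BufferGetPage(heapbuf), recptr);
END_CRIT_SECTION();

/*
 * Only now may the VM bit be set; its WAL record follows the cleanup
 * record in the stream, so recovery replays them in the same order.
 */
set_vm_bit_and_log(rel, blkno, heapbuf, vmbuffer);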

Thanks,
Pavan

-- 
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee


Re: [HACKERS] review: pgbench - aggregation of info written into log

2013-01-17 Thread Dave Page
On Thu, Jan 17, 2013 at 9:36 AM, Magnus Hagander mag...@hagander.net wrote:
 On Thu, Jan 17, 2013 at 2:35 AM, Tatsuo Ishii is...@postgresql.org wrote:
 This might be way more than we want to do, but there is an article
 that describes some techniques for doing what seems to be missing
 (AIUI):

 http://msdn.microsoft.com/en-us/magazine/cc163996.aspx
 Even if this would be doable, I'm afraid it may not fit in 9.3 if we
 think about the current status of the CF. So our choices would be:

 1) Postpone the patch to 9.4

 2) Commit the patch in 9.3 without Windows support

 I personally am OK with #2. We traditionally avoid platform-specific
 features in PostgreSQL.  However, I think the policy could be loosened
 for contrib stuff. Also, pgbench is just a client program. We could
 always use pgbench on UNIX/Linux if we truly need the feature.

 What do you think?

 Fair enough, I was just trying to point out alternatives. We have
 committed platform-specific features before now. I hope it doesn't
 just get left like this, though.

 We have committed platform-specific features before, but generally
 only when it's not *possible* to do them for all platforms. For
 example the posix_fadvise stuff isn't available on Windows at all, so
 there isn't much we can do there.

Right - having platform-specific features for other reasons, like lack
of time, is a slippery slope in my opinion. We should not get into such
a habit, or Windows will quickly become a second-class platform as far
as PostgreSQL features are concerned.

--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] review: pgbench - aggregation of info written into log

2013-01-17 Thread Magnus Hagander
On Thu, Jan 17, 2013 at 11:09 AM, Dave Page dp...@pgadmin.org wrote:
 On Thu, Jan 17, 2013 at 9:36 AM, Magnus Hagander mag...@hagander.net wrote:
 On Thu, Jan 17, 2013 at 2:35 AM, Tatsuo Ishii is...@postgresql.org wrote:
 This might be way more than we want to do, but there is an article
 that describes some techniques for doing what seems to be missing
 (AIUI):

 http://msdn.microsoft.com/en-us/magazine/cc163996.aspx
 Even if this would be doable, I'm afraid it may not fit in 9.3 if we
 think about the current status of the CF. So our choices would be:

 1) Postpone the patch to 9.4

 2) Commit the patch in 9.3 without Windows support

 I personally am OK with #2. We traditionally avoid platform-specific
 features in PostgreSQL.  However, I think the policy could be loosened
 for contrib stuff. Also, pgbench is just a client program. We could
 always use pgbench on UNIX/Linux if we truly need the feature.

 What do you think?

 Fair enough, I was just trying to point out alternatives. We have
 committed platform-specific features before now. I hope it doesn't
 just get left like this, though.

 We have committed platform-specific features before, but generally
 only when it's not *possible* to do them for all platforms. For
 example the posix_fadvise stuff isn't available on Windows at all, so
 there isn't much we can do there.

 Right - having platform-specific features for other reasons, like lack
 of time, is a slippery slope in my opinion. We should not get into such
 a habit, or Windows will quickly become a second-class platform as far
 as PostgreSQL features are concerned.

Especially since there is no lack of time - the functionality is
there, it just looks (significantly) different.

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/




Re: [HACKERS] Event Triggers: adding information

2013-01-17 Thread Dimitri Fontaine
Robert Haas robertmh...@gmail.com writes:
 - adds ddl_command_trace and ddl_command_end events
 - causes those events to be called not only for actual SQL commands
 but also for things that happen, at present, to go through the same
 code path
 - adds additional magic variables to PL/pgsql to expose new
 information not previously exposed
 - adds CONTEXT as an additional event-trigger-filter variable, along with TAG

 In my mind, that's four or five separate patches.  You don't have to
 agree, but that's what I think, and I'm not going to apologize for
 thinking it.  Moreover, I think you will struggle to find examples of
 patches committed in the last three years that made as many different,
 only loosely-related changes as you're proposing for this one.

So, say I do that. Now we have 5 patches to review. They are all against
the same set of files, so there's a single possible order in which to
apply them, and so to review them. When we reach an agreement out of
order (we will), it takes a considerable amount of time to rebuild the
patch set in a different order. I don't think we have the time to do that.

Now, if that's what it takes, I'll spend time on it. In which exact
order do you want to be reviewing and applying that series of patches?

 As for the patch not being well-thought-out, I'm sticking by that,
 too.  Alvaro's comments about lock escalations should be enough to
 tickle your alarm bells, but there are plenty of other problems here,
 too.

Here's the comment in the code where the lock problems are discussed:

/*
 * Fill in the EventTriggerTargetOid for a DROP command and a
 * ddl_command_start Event Trigger: the lookup didn't happen yet and we
 * still want to provide the user with that information when possible.
 *
 * We only do the lookup if the command contains a single element, which is
 * often forced to be the case as per the grammar (see the drop_type:
 * production for a list of cases where more than one object per command is
 * allowed).
 *
 * We take no exclusive lock here, the main command will have to do the
 * name lookup all over again and take appropriate locks down after the
 * User Defined Code runs anyway, and the commands executed by the User
 * Defined Code will take their own locks. We only lock the objects here so
 * that they don't disapear under us in another concurrent transaction,
 * hence using ShareUpdateExclusiveLock.  XXX this seems prone to lock
 * escalation troubles.
 */

My thinking is that the user code can do anything, even run a DROP
command against the very object that we are preparing to drop. I don't
think we can avoid doing the lookup twice.

The only way to avoid any lookup in our code is not to report the OID
of the object being dropped. Then the event trigger code will certainly
do that lookup as its first action; the lookup will happen either way.

   RAISE NOTICE 'drop table % (%)', tg_objectname::regclass,
tg_objectname::regclass::oid;

That is, at least, the angle I considered when crafting this patch. I'm
happy to change that angle if needed, and to adjust for things I missed
that you tell me about.

 I think you might want to review the use case behind ddl_command_trace,
 which was asked for by users who wanted to see their use case supported
 in some easier way than what you're talking about here.

 […]
 Being able to do
 that in one command instead of two is not a sufficient reason to add
 another event type.

That feature was proposed last October (with a patch) and received no
comment, so I went ahead with it. The performance cost of this patch is
another couple of cache lookups when running a DDL command, which I saw
as acceptable.

  http://www.postgresql.org/message-id/m28vbgcc30@2ndquadrant.fr

I'm not so attached to it that I will put the rest of the patch in
danger over it; I'm fine with removing that part if we have a good
reason to. And yes, you (politely) saying "I don't like it" is a good
enough reason.

Our process for a non-committer proposal, it seems to me, is that you
have to send a patch showing what you're talking about so that a thread
can form and the idea can get some comments, either rejecting it or
reaching a consensus about it. So the opening of that thread was in
October and the discussion is happening now: that's fine, and I could
be happy and thankful about this, given some subtle changes in the tone
of the messages.

 I'm very much opposed to that proposal, as I am to your proposal to
 expose internal and generated events to users.  Recursing into
 ProcessUtility is a nasty, messy hook that is responsible for subtle
 bugs and locking problems in our current DDL implementation.  We

We are publishing 2 pieces of information tied to the implementation in
the current patch:

 - the parse tree, only available to C code, and without any API
   stability promises

 - the fact that some commands are running nested commands: serial does
   a create sequence, create schema create 

Re: [HACKERS] Multiple --table options for other commands

2013-01-17 Thread Magnus Hagander
On Fri, Dec 14, 2012 at 4:14 PM, Karl O. Pinc k...@meme.com wrote:
 On 12/13/2012 11:02:56 PM, Josh Kupershmidt wrote:
 On Thu, Dec 13, 2012 at 9:03 PM, Karl O. Pinc k...@meme.com wrote:

  Sorry to be so persnickety, and unhelpful until now.
  It seemed like it should be doable, but something
  was going wrong between keyboard and chair.  I guess
  I should be doing this when better rested.

 No problem, here is v5 with changed synopses.

 The clusterdb synopsis had tabs in it, which I understand
 is frowned upon in the docs.  I've fixed this.
 It looks good to me, passes check and so forth.

 Attached is a v6 patch, with no tabs in docs and based
 off the latest head.

 I'm marking it ready for committer.

Thanks. Applied, with only some small whitespace changes.

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/




Re: [HACKERS] Hot Standby conflict resolution handling

2013-01-17 Thread Andres Freund
On 2013-01-17 01:38:31 -0500, Tom Lane wrote:
 But having said that ... are we sure this code is not actually broken?
 ISTM that if we dare not interrupt for fear of confusing OpenSSL, we
 cannot safely attempt to send an error message to the client either;
 but ereport(FATAL) will try exactly that.

You're absolutely right.

ISTM that to fix it we would have to either provide a COMMERROR-like
facility for FATAL errors, or just remove the requirement of raising an
error exactly there.
If I remember correctly what I tried before, the latter seems to involve
setting OpenSSL into non-blocking mode, checking for the saved error in
the EINTR loop in be-secure.c, and re-raising the error from there.

Do we want to backport either fix? There hasn't been any report that I
could link to this right now, but on the other hand it could possibly
cause rather ugly problems (data leakage, segfaults, and code execution
aren't all that improbable).

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




[HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

2013-01-17 Thread Andres Freund
On 2013-01-17 13:47:41 +0900, Michael Paquier wrote:
 Hi all,
 
 There is a strange bug with the latest master head (commit 7fcbf6a).
 When the WAL stream with a master is cut on a slave, the slave returns a
 FATAL (well, normal...), but then enters the recovery process and
 automatically promotes.
 Here are more details about the logs on slave (I simply killed the master
 manually):
 FATAL:  could not receive data from WAL stream:
 cp: cannot stat
 ‘/home/michael/bin/pgsql/archive/master/00010004’: No such
 file or directory
 LOG:  record with zero length at 0/401E1B8
 LOG:  redo done at 0/401E178
 LOG:  last completed transaction was at log time 2013-01-17
 20:27:53.180971+09
 cp: cannot stat ‘/home/michael/bin/pgsql/archive/master/0002.history’:
 No such file or directory
 LOG:  selected new timeline ID: 2
 cp: cannot stat ‘/home/michael/bin/pgsql/archive/master/0001.history’:
 No such file or directory
 LOG:  archive recovery complete
 DEBUG:  resetting unlogged relations: cleanup 0 init 1
 LOG:  database system is ready to accept connections
 LOG:  autovacuum launcher started
 DEBUG:  archived transaction log file 00010004
 DEBUG:  archived transaction log file 0002.history
 LOG:  statement: create table bn (a int);
 DEBUG:  autovacuum: processing database postgres
 
 The slave no longer tries to reconnect to the master with messages of the type:
 FATAL:  could not connect to the primary server
 
 I also noticed that there is some delay until modifications on the master
 are visible on the slave.
 For example, run a simple CREATE TABLE and the new table is not
 
 [some bisecting later...]
 
 I think this bug was introduced by commit 7fcbf6a.
 Before xlog reading was split out as a separate facility, things worked
 correctly.
 There were also no delay problems before this commit.
 
 Has someone else noticed that?

No, at least I haven't, but it got committed only a rather short time
ago ;)
Looking. I have an inkling of where the problem could be.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] review: pgbench - aggregation of info written into log

2013-01-17 Thread Tomas Vondra

Dne 17.01.2013 10:36, Magnus Hagander napsal:
 On Thu, Jan 17, 2013 at 2:35 AM, Tatsuo Ishii is...@postgresql.org wrote:
 This might be way more than we want to do, but there is an article
 that describes some techniques for doing what seems to be missing
 (AIUI):

 http://msdn.microsoft.com/en-us/magazine/cc163996.aspx

 Even if this would be doable, I'm afraid it may not fit in 9.3 if we
 think about the current status of the CF. So our choices would be:

 1) Postpone the patch to 9.4

 2) Commit the patch in 9.3 without Windows support

 I personally am OK with #2. We traditionally avoid platform-specific
 features in PostgreSQL. However, I think the policy could be loosened
 for contrib stuff. Also, pgbench is just a client program. We could
 always use pgbench on UNIX/Linux if we truly need the feature.

 What do you think?

 Fair enough, I was just trying to point out alternatives. We have
 committed platform-specific features before now. I hope it doesn't
 just get left like this, though.

 We have committed platform-specific features before, but generally
 only when it's not *possible* to do them for all platforms. For
 example the posix_fadvise stuff isn't available on Windows at all, so
 there isn't much we can do there.

Maybe, although this platform dependence already exists in pgbench to
some extent (the Windows branch is unable to log the timestamps of
transactions).

It would certainly be better if pgbench were able to handle Windows and
Linux equally, but that was not the aim of this patch.

 Yeah, I hope someone picks this up and proposes it as a TODO item. In
 the mean time, I'm going to commit the patch without Windows support
 unless there's objection.

 Perhaps we should hold off until someone commits to actually getting
 it fixed in the next version? If we do have that, then we can commit
 it as a partial feature, but if we just hope someone picks it up,
 that's leaving it very loose...

Well, given that I'm an author of that patch, that 'someone' would have
to be me. The problem is I have access to absolutely no Windows
machines, not to mention the development tools (and that I have no clue
about them).

I vaguely remember there were people on this list doing Windows
development on a virtual machine or something. Any interest in working
on this / giving me some help?

Tomas




Re: [HACKERS] review: pgbench - aggregation of info written into log

2013-01-17 Thread Tomas Vondra

Dne 17.01.2013 11:16, Magnus Hagander napsal:
 On Thu, Jan 17, 2013 at 11:09 AM, Dave Page dp...@pgadmin.org wrote:
 On Thu, Jan 17, 2013 at 9:36 AM, Magnus Hagander mag...@hagander.net wrote:

 We have committed platform-specific features before, but generally
 only when it's not *possible* to do them for all platforms. For
 example the posix_fadvise stuff isn't available on Windows at all, so
 there isn't much we can do there.

 Right - having platform-specific features for other reasons, like lack
 of time, is a slippery slope in my opinion. We should not get into such
 a habit, or Windows will quickly become a second-class platform as far
 as PostgreSQL features are concerned.

 Especially since there is no lack of time - the functionality is
 there, it just looks (significantly) different.

Really? Any link to relevant docs or something?

When doing some research in this field, the only option I was able to
come up with was combining gettimeofday() with the timing functionality,
and doing something like this (see the sketch below):

  1) call gettimeofday() at thread start, giving a common unix timestamp

  2) measure the time from the thread start using the counters (for each
     transaction)

  3) combine those values

This might of course give up to a second of difference compared to the
actual time (because of the gettimeofday precision), but IMHO that's
fine.

An even simpler option would omit (1), so the timestamps would start
at 0.

Or is there a better way?
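
A minimal sketch of that idea, assuming QueryPerformanceCounter() and
QueryPerformanceFrequency() as the counter (illustrative names, not
pgbench's actual code):

#include <windows.h>
#include <stdint.h>

static int64_t base_usec;           /* Unix-epoch microseconds at thread start */
static LARGE_INTEGER base_count;    /* counter value at thread start */
static LARGE_INTEGER count_freq;    /* counter ticks per second */

/* step 1: latch a wall-clock timestamp and the counter together */
static void
timestamp_init(void)
{
    FILETIME ft;
    ULARGE_INTEGER t;

    GetSystemTimeAsFileTime(&ft);   /* 100ns ticks since 1601-01-01 */
    t.LowPart = ft.dwLowDateTime;
    t.HighPart = ft.dwHighDateTime;
    /* convert to microseconds since the Unix epoch */
    base_usec = (int64_t) (t.QuadPart / 10) - 11644473600000000LL;

    QueryPerformanceFrequency(&count_freq);
    QueryPerformanceCounter(&base_count);
}

/* steps 2 and 3: per-transaction timestamp = base + counter delta
 * (multiply-before-divide can overflow on very long runs with a
 * high-frequency counter; good enough for a sketch) */
static int64_t
timestamp_usec(void)
{
    LARGE_INTEGER now;

    QueryPerformanceCounter(&now);
    return base_usec +
        (now.QuadPart - base_count.QuadPart) * 1000000 / count_freq.QuadPart;
}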

Tomas




Re: [HACKERS] Materialized views WIP patch

2013-01-17 Thread Robert Haas
On Wed, Jan 16, 2013 at 1:38 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Where I really need someone to hit me upside the head with a
 clue-stick is the code I added to the bottom of RelationBuildDesc()
 in relcache.c. The idea is that on first access to an unlogged MV,
 to detect that the heap has been replaced by the init fork, set
 relisvalid to false, and make the heap look normal again.

 Hmm.  I agree that relcache.c has absolutely no business doing that,
 but not sure what else to do instead.  Seems like it might be better
 done at first touch of the MV in the parser, rewriter, or planner ---
 but the fact that I can't immediately decide which of those is right
 makes me feel that it's still too squishy.

I think we shouldn't be doing that at all.  The whole business of
transferring the relation-is-invalid information from the relation to
a pg_class flag seems like a Rube Goldberg device to me.  I'm still
not convinced that we should have a relation-is-invalid flag at all,
but can we at least not have two?

It seems perfectly adequate to detect that the MV is invalid when we
actually try to execute a plan - that is, when we first access the
heap or one of its indexes.  So the bit can just live in the
file-on-disk, and there's no need to have a second copy of it in
pg_class.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] review: pgbench - aggregation of info written into log

2013-01-17 Thread Andrew Dunstan


On 01/17/2013 06:11 AM, Tomas Vondra wrote:
 Dne 17.01.2013 11:16, Magnus Hagander napsal:
 On Thu, Jan 17, 2013 at 11:09 AM, Dave Page dp...@pgadmin.org wrote:
 On Thu, Jan 17, 2013 at 9:36 AM, Magnus Hagander mag...@hagander.net wrote:

 We have committed platform-specific features before, but generally
 only when it's not *possible* to do them for all platforms. For
 example the posix_fadvise stuff isn't available on Windows at all, so
 there isn't much we can do there.

 Right - having platform-specific features for other reasons, like lack
 of time, is a slippery slope in my opinion. We should not get into such
 a habit, or Windows will quickly become a second-class platform as far
 as PostgreSQL features are concerned.

 Especially since there is no lack of time - the functionality is
 there, it just looks (significantly) different.

 Really? Any link to relevant docs or something?

 When doing some research in this field, the only option I was able to
 come up with was combining gettimeofday() with the timing functionality,
 and doing something like this:

   1) call gettimeofday() at thread start, giving a common unix timestamp

   2) measure the time from the thread start using the counters (for
      each transaction)

   3) combine those values

 This might of course give up to a second of difference compared to the
 actual time (because of the gettimeofday precision), but IMHO that's
 fine.


The link I posted showed a technique which essentially uses edge 
detection on gettimeofday() to get an accurate correspondence between 
the time of day and the performance counter, following which you can 
supposedly calculate the time of day with a high degree of accuracy just 
from the performance counter. You should be able to do this just once, 
at program start. If you don't care that much about the initial 
precision then your procedure should work fine, I think.
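
For what it's worth, here is a sketch of that calibration step under the
same assumptions as the earlier sketch, using GetSystemTimeAsFileTime()
as the coarse clock: spin until the clock ticks over, then latch the
counter at that edge.

/*
 * Sketch only: base_usec/base_count/count_freq as above.  After the
 * loop, ft and base_count correspond to (nearly) the same instant.
 */
static void
timestamp_init_calibrated(void)
{
    FILETIME ft0, ft;
    ULARGE_INTEGER t;

    GetSystemTimeAsFileTime(&ft0);
    do
    {
        GetSystemTimeAsFileTime(&ft);
    } while (ft.dwLowDateTime == ft0.dwLowDateTime &&
             ft.dwHighDateTime == ft0.dwHighDateTime);

    QueryPerformanceCounter(&base_count);   /* latch at the clock edge */

    t.LowPart = ft.dwLowDateTime;
    t.HighPart = ft.dwHighDateTime;
    base_usec = (int64_t) (t.QuadPart / 10) - 11644473600000000LL;
    QueryPerformanceFrequency(&count_freq);
}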


cheers

andrew




[HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

2013-01-17 Thread Andres Freund
On 2013-01-17 13:47:41 +0900, Michael Paquier wrote:
 Hi all,
 
 There is a strange bug with the latest master head (commit 7fcbf6a).
 When the WAL stream with a master is cut on a slave, the slave returns a
 FATAL (well, normal...), but then enters the recovery process and
 automatically promotes.
 Here are more details about the logs on slave (I simply killed the master
 manually):
 FATAL:  could not receive data from WAL stream:
 cp: cannot stat
 ‘/home/michael/bin/pgsql/archive/master/00010004’: No such
 file or directory
 LOG:  record with zero length at 0/401E1B8
 LOG:  redo done at 0/401E178
 LOG:  last completed transaction was at log time 2013-01-17
 20:27:53.180971+09
 cp: cannot stat ‘/home/michael/bin/pgsql/archive/master/0002.history’:
 No such file or directory
 LOG:  selected new timeline ID: 2
 cp: cannot stat ‘/home/michael/bin/pgsql/archive/master/0001.history’:
 No such file or directory
 LOG:  archive recovery complete
 DEBUG:  resetting unlogged relations: cleanup 0 init 1
 LOG:  database system is ready to accept connections
 LOG:  autovacuum launcher started
 DEBUG:  archived transaction log file 00010004
 DEBUG:  archived transaction log file 0002.history
 LOG:  statement: create table bn (a int);
 DEBUG:  autovacuum: processing database postgres
 
 The slave no longer tries to reconnect to the master with messages of the type:
 FATAL:  could not connect to the primary server
 
 I also noticed that there is some delay until modifications on the master
 are visible on the slave.
 For example, run a simple CREATE TABLE and the new table is not
 
 [some bisecting later...]
 
 I think this bug was introduced by commit 7fcbf6a.
 Before xlog reading was split out as a separate facility, things worked
 correctly.
 There were also no delay problems before this commit.

OK, my inkling proved to be correct; it's two related issues:

a ) The error handling in XLogReadRecord is inconsistent, it doesn't
always reset the necessary things.

b) ReadRecord: We cannot break out of the retry loop in ReadRecord
just like that; simply removing the break seems correct.

c) ReadRecord (minor): We should log an error even if errormsg is not
set, otherwise we won't jump out far enough.

I think at least a) and b) are the result of merges between the
development of different people; sorry for that.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] CF3+4

2013-01-17 Thread Craig Ringer
On 01/17/2013 04:43 PM, Magnus Hagander wrote:


 On Jan 17, 2013 8:15 AM, Abhijit Menon-Sen a...@2ndquadrant.com wrote:
 
  At 2013-01-17 16:05:05 +0900, michael.paqu...@gmail.com wrote:
  
   Is it really necessary to create a new commit fest just to move the
   items? The patches that are considered too late for 9.3 should just be
   returned with feedback.
 
  Opening 2013-03 is not so much to move existing patches, but to give
  people a place to submit *new* post-9.3 patches without interrupting
  the 2013-01 CF.

 Yeah, and +1 for doing that. The sooner the better. By whichever of
 the time frames for the CF that have been discussed, it should
 certainly no longer be open for accepting new patches for 9.3.

I've moved all pending patches from 2012-11 to 2013-01. I'll go through
and poke them for aliveness and start chasing things up; in the mean
time, any chance of closing 2012-11?

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] review: pgbench - aggregation of info written into log

2013-01-17 Thread Andrew Dunstan


On 01/17/2013 06:04 AM, Tomas Vondra wrote:

 The problem is I have access to absolutely no Windows machines, not to
 mention the development tools (and that I have no clue about them).

 I vaguely remember there were people on this list doing Windows
 development on a virtual machine or something. Any interest in working
 on this / giving me some help?





One of the items on my very long TODO list (see stuff elsewhere about
committers not doing enough reviewing and committing) is to create a
package that can easily be run to set up Windows Postgres development
environments for MSVC, Mingw and Cygwin on Amazon, GoGrid etc.

Once I get that done I'll be a lot less sympathetic to "I don't have
access to Windows" pleas.


Windows does run in a VM very well, but if you're going to set it up in
your own VM environment (e.g. VirtualBox or KVM/qemu), you need your own
legal copy to install there. If you don't already have one, that will
set you back about $140.00 (for W7 Pro) in the USA. Note that that's
significantly better than the situation with OSX, which you can't run at
all except on Apple hardware.


cheers

andrew





Re: [HACKERS] review: pgbench - aggregation of info written into log

2013-01-17 Thread Dave Page
On Thu, Jan 17, 2013 at 1:29 PM, Andrew Dunstan and...@dunslane.net wrote:

 On 01/17/2013 06:04 AM, Tomas Vondra wrote:

 The problem is I have access to absolutely no Windows machines,
 not to mention the development tools (and that I have no clue about them).

 I vaguely remember there were people on this list doing Windows
 development
 on a virtual machine or something. Any interest in working on this /
 giving
 me some help?




 One of the items on my very long TODO list (see stuff elsewhere about
 committers not doing enough reviewing and committing) is to create a package
 that can easily be run to set up Windows Postgres development environments for
 MSVC, Mingw and Cygwin on Amazon, GoGrid etc.

FYI, I have a similar item on my TODO list, though I was looking
primarily at VC++ on AWS. It did get close to the top last week, but
then I got busy with other things :-/. Anyway, I'd suggest we ping
each other if either of us actually gets started, to avoid duplication
of effort.


--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] review: pgbench - aggregation of info written into log

2013-01-17 Thread Magnus Hagander
On Thu, Jan 17, 2013 at 2:29 PM, Andrew Dunstan and...@dunslane.net wrote:

 On 01/17/2013 06:04 AM, Tomas Vondra wrote:

 The problem is I have access to absolutely no Windows machines,
 not to mention the development tools (and that I have no clue about them).

 I vaguely remember there were people on this list doing Windows
 development
 on a virtual machine or something. Any interest in working on this /
 giving
 me some help?




 One of the items on my very long TODO list (see stuff elsewhere about
 committers not doing enough reviewing and committing) is to create a package
 that can easily be run to set up Windows Postgres development environments for
 MSVC, Mingw and Cygwin on Amazon, GoGrid etc.

 Once I get that done I'll be a lot less sympathetic to "I don't have access
 to Windows" pleas.

 Windows does run in a VM very well, but if you're going to set it up in your
 own VM environment (e.g. VirtualBox or KVM/qemu), you need your own legal
 copy to install there. If you don't already have one, that will set you back
 about $140.00 (for w7 Pro) in the USA. Note that that's significantly better
 than the situation with OSX, which you can't run at all except on Apple
 hardware.

Yeah. I used to have an AMI with the VS environment preinstalled on
Amazon, but I managed to fat-finger things and delete it at some point
and haven't really had time to rebuild it.

Having a script that would download and install all the prerequisites
on such a box would be *great*. Then you could get up and going pretty
quickly, getting a Windows box up and running for a few hours there is
almost free, and you don't have to deal with licensing hassles.

(Of course, the AMI method doesn't work all the way since you'd be
distributing Visual Studio, but if we can have a script that
auto-downloads-and-installs it as necessary we can get around that)


--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/




Re: [HACKERS] small pg_basebackup display bug

2013-01-17 Thread Magnus Hagander
On Sun, Dec 16, 2012 at 7:20 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Magnus Hagander mag...@hagander.net writes:
 On Sat, Dec 15, 2012 at 2:24 PM, Erik Rijkers e...@xs4all.nl wrote:
 That would make such a truncation less frequent, and after all a
 truncated display is not particularly useful.

 Agreed - it's useful during testing, but not in typical production
 use. It might actually be more useful if it's truncated at the
 other end (keeping the last 30 instead of the first 30 chars).

 +1 for truncating from the left.  I think pg_upgrade already does that
 in its progress messages.

Fixed. I also fixed the output of the size parameter to be a fixed
length, so the whole row doesn't shift left and right depending on how
far along the process is.
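
For illustration, the left-truncating progress line could look something
like this (a sketch, not the committed pg_basebackup code):

#include <stdio.h>
#include <string.h>

#define FILENAME_DISPLAY_LEN 30

/*
 * Sketch only: keep the *last* 30 chars of the path, and pad the size
 * to a fixed width so the row doesn't shift while the backup runs.
 */
static void
print_progress(unsigned long done_kb, const char *filename)
{
    size_t len = strlen(filename);

    if (len > FILENAME_DISPLAY_LEN)
        fprintf(stderr, "%10lu kB  ...%s\r",
                done_kb, filename + len - FILENAME_DISPLAY_LEN);
    else
        fprintf(stderr, "%10lu kB  %-*s\r",
                done_kb, FILENAME_DISPLAY_LEN + 3, filename);
}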

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/




Re: [HACKERS] default SSL compression (was: libpq compression)

2013-01-17 Thread Magnus Hagander
On Wed, Jan 2, 2013 at 3:17 PM, Magnus Hagander mag...@hagander.net wrote:
 On Wed, Jan 2, 2013 at 3:15 PM, Noah Misch n...@leadboat.com wrote:
 On Wed, Jan 02, 2013 at 02:03:20PM +0100, Magnus Hagander wrote:
 On Wed, Jan 2, 2013 at 1:15 AM, Tom Lane t...@sss.pgh.pa.us wrote:
  So +1 for changing it to DEFAULT from me, too.  There's no reason to
  think we know more about this than the OpenSSL authors.

 The DEFAULT value in OpenSSL 1.0 means ALL:!aNULL:!eNULL.

 Researching some more, this might cause a problem actually, which
 would explain some of the things that are in our default. For example,
 an ADH algorithm doesn't use certificates - but it uses DH parameters,
 so it likely won't work anyway. EDH uses certs, but also requires DH
 parameters.

 Maybe what we need is DEFAULT:!ADH:@STRENGTH as the default?

 I understand aNULL to include ADH.

 Hmm. Seems you're right when I run a test on it - I was reading it wrong.


 The other difference is that our current string denies 40- and 56-bit
 encryption (low and export strengths). Do we still want to do that?

 On the one hand, those seem bad to permit by default in 2013.  On the other
 hand, if so, why hasn't OpenSSL removed them from DEFAULT?  Perhaps it has
 backward-compatibility concerns that wouldn't apply to us by virtue of having
 disabled them for some time.  Sounds reasonable to continue disabling them.

 So then the default would be DEFAULT:!LOW:!EXP:@STRENGTH

 (the @STRENGTH part is the sorting key for preference, which the
 default one seems not to have)

 The biggest difference being that we start from DEFAULT rather than ALL.

I've applied a change that does this, including still rejecting MD5.
Meaning that the difference is we start from DEFAULT instead of ALL
(and the ADH rule is removed, since !aNULL is already in the default).
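
For reference, the committed default corresponds to something like this
in postgresql.conf (shown here as an illustration, not copied from the
commit):

# start from OpenSSL's DEFAULT (which already excludes aNULL), then
# drop low/export strengths and MD5, and sort by strength
ssl = on
ssl_ciphers = 'DEFAULT:!LOW:!EXP:!MD5:@STRENGTH'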

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/




[HACKERS] could not create directory ...: File exists

2013-01-17 Thread Stephen Frost
Greetings,

  We've come across this rather annoying error happening during our
  builds:

  ERROR:  could not create directory "pg_tblspc/25120/PG_9.3_201212081/231253": File exists

  It turns out that this is coming from copydir() when called by
  createdb() during a CREATE DATABASE .. FROM TEMPLATE where the
  template has tables in tablespaces.

  It turns out that createdb() currently only takes an AccessShareLock
  on pg_tablespace when scanning it with SnapshotNow, making it possible
  for a concurrent process to make some uninteresting modification to a
  tablespace (such as an ACL change) which will cause the heap scan in
  createdb() to see a given tablespace multiple times.  copydir() will
  then, rightfully, complain that it's being asked to create a directory
  which already exists.

  Given that this is during createdb(), I'm guessing we don't have any
  option but to switch the scan to using ShareLock, since there isn't a
  snapshot available to do an MVCC scan with (I'm guessing that there
  could be other issues trying to do that anyway).

  Attached is a patch which does this and corrects the problem for us
  (of course, we're now going to go rework our build system to not
  modify tablespace ACLs, since with this patch concurrent builds end up
  blocking on each other, which is annoying).

Thanks,

Stephen
colordiff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
new file mode 100644
index 1f6e02d..60a5099
*** a/src/backend/commands/dbcommands.c
--- b/src/backend/commands/dbcommands.c
*** createdb(const CreatedbStmt *stmt)
*** 549,556 
  		/*
  		 * Iterate through all tablespaces of the template database, and copy
  		 * each one to the new database.
  		 */
! 		rel = heap_open(TableSpaceRelationId, AccessShareLock);
  		scan = heap_beginscan(rel, SnapshotNow, 0, NULL);
  		while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
  		{
--- 549,563 
  		/*
  		 * Iterate through all tablespaces of the template database, and copy
  		 * each one to the new database.
+ 		 *
+ 		 * We need to use ShareLock to prevent other processes from updating a
+ 		 * tablespace (such as changing an ACL, for example) or we will end up
+ 		 * running into the same tablespace multiple times during our heap scan
+ 		 * (the prior-to-update tuple and then the new tuple after the update)
+ 		 * and copydir() will, rightfully, complain that the directory already
+ 		 * exists.
  		 */
! 		rel = heap_open(TableSpaceRelationId, ShareLock);
  		scan = heap_beginscan(rel, SnapshotNow, 0, NULL);
  		while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
  		{
*** createdb(const CreatedbStmt *stmt)
*** 607,613 
  			}
  		}
  		heap_endscan(scan);
! 		heap_close(rel, AccessShareLock);
  
  		/*
  		 * We force a checkpoint before committing.  This effectively means
--- 614,620 
  			}
  		}
  		heap_endscan(scan);
! 		heap_close(rel, ShareLock);
  
  		/*
  		 * We force a checkpoint before committing.  This effectively means




Re: [HACKERS] Removing PD_ALL_VISIBLE

2013-01-17 Thread Robert Haas
On Thu, Jan 17, 2013 at 3:49 AM, Pavan Deolasee
pavan.deola...@gmail.com wrote:
 Maybe you've already addressed that concern with proven
 performance numbers, but I'm not sure.

It would be nice to hear what Heikki's reasons were for adding
PD_ALL_VISIBLE in the first place.

Jeff's approach of holding the VM pins for longer certainly mitigates
some of the damage, in the sense that it reduces buffer lookups and
pin/unpin cycles - and might be worth doing independently of the rest
of the patch if we think it's a win.  Index-only scans already use a
similar optimization, so extending it to inserts, updates, and deletes
is surely worth considering.  The main question in my mind is whether
there are any negative consequences to holding a VM buffer pin for
that long without interruption.  The usual consideration - namely,
blocking vacuum - doesn't apply here because vacuum does not take a
cleanup lock on the visibility map page, only the heap page, but I'm
not sure if there are others.

The other part of the issue is cache pressure. I don't think I can say
it better than what Tom already wrote:

# I'd be worried about the case of a lot of sessions touching a lot of
# unrelated tables.  This change implies doubling the number of buffers
# that are held pinned by any given query, and the distributed overhead
# from that (eg, adding cycles to searches for free buffers) is what you
# ought to be afraid of.

I agree that we ought to be afraid of that.  A pgbench test isn't
going to find a problem in this area because there you have a bunch of
sessions all accessing the same small group of tables.  To find a
problem of the type above, you'd need lots of backends accessing lots
of different, small tables.  That's not the use case we usually
benchmark, but I think there are a fair number of people doing things
like that - for example, imagine a hosting provider or web application
with many databases or schemas on a single instance.  AFAICS, Jeff
hasn't tried to test this scenario.

Now, on the flip side, we should also be thinking about what we would
gain from this patch, and what other ways there might be to achieve
the same goals.  As far as I can see, the main gain is that if you
bulk-load a table, don't vacuum it right away, get all the hint bits
set by some other mechanism, and then vacuum the table, you'll only
read the whole table instead of rewriting the whole table.  So ISTM
that, for example, if we adopted the idea of making HOT prune set
visibility-map bits, most of the benefit of this change evaporates,
but whatever costs it may have will remain.  There are other possible
ways of getting the same benefit as well - for example, I've been
thinking for a while now that we should try to find some judicious way
of vacuuming insert-only tables, perhaps only in small chunks when
there's nothing else going on.  That wouldn't as clearly obsolete this
patch, but it would address a very similar use case and would also
preset hint bits, which would help a lot of people.  Some of the ideas
we've had about a HEAP_XMIN_FREEZE intersect with this idea, too - if
we can do an early freeze without losing forensic information, we can
roll setting the hint bit, setting PD_ALL_VISIBLE, and freezing the
page into a single write.

All of which is to say that I think this patch is premature.  If we
adopt something like this, we're likely never going to revert back to
the way we do it now, and whatever cache-pressure or other costs this
approach carries will be here to stay - so we had better think awfully
carefully before we do that.  And, FWIW, I don't believe that there is
sufficient time in this release cycle to carefully test this patch and
the other alternative designs that aim toward the same ends.  Even if
there were, this is exactly the sort of thing that should be committed
at the beginning of a release cycle, not the end, so as to allow
adequate time for discovery of unforeseen consequences before the code
ends up out in the wild.

Of course, there's another issue here too, which is that as Pavan
points out, the patch throws crash-safety out the window, which breaks
index-only scans (if you have a crash).  HEAP_XLOG_VISIBLE is intended
principally to protect the VM bit, not PD_ALL_VISIBLE, but the patch
rips it out even though its purpose is to remove the latter, not the
former.  Removing PD_ALL_VISIBLE eliminates the need to keep the
visibility-map bit consistent with PD_ALL_VISIBLE, but it does not
eliminate the need to keep the visibility-map bit consistent with the
page contents.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] CF3+4

2013-01-17 Thread Robert Haas
On Thu, Jan 17, 2013 at 8:17 AM, Craig Ringer cr...@2ndquadrant.com wrote:
 I've moved all pending patches from 2012-11 to 2013-01. I'll go through and
 poke them for aliveness and start chasing things up; in the mean time, any
 chance of closing 2012-11?

Done.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Teaching pg_receivexlog to follow timeline switches

2013-01-17 Thread Robert Haas
On Wed, Jan 16, 2013 at 11:08 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 I'd prefer to leave the .partial suffix in place, as the segment really
 isn't complete. It doesn't make a difference when you recover to the latest
 timeline, but if you have a more complicated scenario with multiple
 timelines that are still alive, ie. there's a server still actively
 generating WAL on that timeline, you'll easily get confused.

 As an example, imagine that you have a master server, and one standby. You
 maintain a WAL archive for backup purposes with pg_receivexlog, connected to
 the standby. Now, for some reason, you get a split-brain situation and the
 standby server is promoted with new timeline 2, while the real master is
 still running. The DBA notices the problem, and kills the standby and
 pg_receivexlog. He deletes the XLOG files belonging to timeline 2 in
 pg_receivexlog's target directory, and re-points pg_receivexlog to the master
 while he re-builds the standby server from backup. At that point,
 pg_receivexlog will start streaming from the end of the zero-padded segment,
 not knowing that it was partial, and you have a hole in the archived WAL
 stream. Oops.

 The DBA could avoid that by also removing the last WAL segment on timeline
 1, the one that was partial. But it's really not obvious that there's
 anything wrong with that segment. Keeping the .partial suffix makes it
 clear.

I shudder at the idea that the DBA is manually involved in any of this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Teaching pg_receivexlog to follow timeline switches

2013-01-17 Thread Heikki Linnakangas

On 17.01.2013 16:56, Robert Haas wrote:

On Wed, Jan 16, 2013 at 11:08 AM, Heikki Linnakangas
hlinnakan...@vmware.com  wrote:

I'd prefer to leave the .partial suffix in place, as the segment really
isn't complete. It doesn't make a difference when you recover to the latest
timeline, but if you have a more complicated scenario with multiple
timelines that are still alive, ie. there's a server still actively
generating WAL on that timeline, you'll easily get confused.

As an example, imagine that you have a master server, and one standby. You
maintain a WAL archive for backup purposes with pg_receivexlog, connected to
the standby. Now, for some reason, you get a split-brain situation and the
standby server is promoted with new timeline 2, while the real master is
still running. The DBA notices the problem, and kills the standby and
pg_receivexlog. He deletes the XLOG files belonging to timeline 2 in
pg_receivexlog's target directory, and re-points pg_receivexlog to the master
while he re-builds the standby server from backup. At that point,
pg_receivexlog will start streaming from the end of the zero-padded segment,
not knowing that it was partial, and you have a hole in the archived WAL
stream. Oops.

The DBA could avoid that by also removing the last WAL segment on timeline
1, the one that was partial. But it's really not obvious that there's
anything wrong with that segment. Keeping the .partial suffix makes it
clear.


I shudder at the idea that the DBA is manually involved in any of this.


The scenario I described is that you screwed up your failover
environment and ended up with a split-brain situation by accident. The
DBA certainly needs to be involved to recover from that.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Removing PD_ALL_VISIBLE

2013-01-17 Thread Heikki Linnakangas

On 17.01.2013 16:53, Robert Haas wrote:

On Thu, Jan 17, 2013 at 3:49 AM, Pavan Deolasee
pavan.deola...@gmail.com  wrote:

May be you've already addressed that concern with the proven
performance numbers, but I'm not sure.


It would be nice to hear what Heikki's reasons were for adding
PD_ALL_VISIBLE in the first place.


The idea was to avoid clearing the bit in the VM page on every update, 
when the bit is known to not be set, ie. when the PD_ALL_VISIBLE flag is 
not set. I assumed the traffic and contention on the VM page would be a 
killer otherwise. I don't remember if I ever actually tested that 
though. Maybe I was worrying about nothing and hitting the VM page on 
every update is ok.
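
To make the fast path concrete, here is a minimal sketch of what I mean,
loosely based on what a heap-update-style code path can do when the flag
exists; the names (relation, buffer, vmbuffer) are simplified stand-ins,
not the actual backend code:

	/* Only touch the VM page when the heap page says the bit may be set. */
	Page	page = BufferGetPage(buffer);

	if (PageIsAllVisible(page))
	{
		PageClearAllVisible(page);
		visibilitymap_clear(relation, BufferGetBlockNumber(buffer), vmbuffer);
	}
	/* else: skip the visibility map entirely -- no extra pin or lock */

Without PD_ALL_VISIBLE, every insert, update, or delete would have to
consult the map itself to know whether anything needs clearing.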


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Teaching pg_receivexlog to follow timeline switches

2013-01-17 Thread Robert Haas
On Thu, Jan 17, 2013 at 9:59 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 The scenario I described is that you screwed up your failover environment
 and ended up with a split-brain situation by accident. The DBA certainly needs
 to be involved to recover from that.

OK, I agree, but I still think a lot of DBAs would have no idea how to
handle that situation.  I agree with your proposal, don't get me wrong
- I just think there's still an awful lot of room for operator error
in these more complex replication scenarios.  I don't have a clue how
to fix that, and it's certainly not the purpose of this thread to fix
that; I'm just venting.

Actually, I'm really glad to see all the work you've done to improve
the way that some of these scenarios work and eliminate various bugs
and other surprising failure modes over the last couple of months.
It's great stuff.  Alas, I think we're still some distance from being
able to provide an 'easy button'.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

2013-01-17 Thread Heikki Linnakangas

On 17.01.2013 15:05, Andres Freund wrote:

On 2013-01-17 13:47:41 +0900, Michael Paquier wrote:

I think that bug has been introduced by commit 7fcbf6a.
Before splitting xlog reading as a separate facility things worked
correctly.
There are also no delay problems before this commit.


Ok, my inkling proved to be correct, it's two related issues:

a) The error handling in XLogReadRecord is inconsistent; it doesn't
always reset the necessary things.

b) ReadRecord: We cannot just break out of the retry loop in readRecord
like that; simply removing the break seems correct.

c) ReadRecord (minor): We should log an error even if errormsg is not
set, otherwise we won't jump out far enough.

I think at least a) and b) is the result of merges between development
of different people, sorry for that.


Got a patch?

- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby conflict resolution handling

2013-01-17 Thread Tom Lane
Pavan Deolasee pavan.deola...@gmail.com writes:
 On Thu, Jan 17, 2013 at 12:08 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 ISTM that if we dare not interrupt for fear of confusing OpenSSL, we
 cannot safely attempt to send an error message to the client either;
 but ereport(FATAL) will try exactly that.

 I thought since FATAL will force the backend to exit, we don't care much
 about corrupted OpenSSL state. I even thought that's why we raise ERROR to
 FATAL so that the backend can start in a clean state. But clearly I'm
 missing a point here because you don't think that way.

If we were to simply exit(1), leaving the kernel to close the client
socket, it'd be safe enough because control would never have returned to
OpenSSL.  But this code doesn't do that.  What we're looking at is that
we've interrupted OpenSSL at some arbitrary point, and now we're going
to make fresh calls to it to try to pump the FATAL error message out to
the client.  It seems fairly unlikely that that's safe.  I'm not sure
I credit Andres' worry of arbitrary code execution, but I do fear that
OpenSSL could get confused to the point of freezing up, or even more
likely that it would transmit garbage to the client, which rather
defeats the purpose.

Don't see a nice fix.  The COMMERROR approach (ie, don't try to send
anything to the client, only the log) is not nice at all since the
client would get the impression that the server crashed.  On the other
hand, anything else requires waiting till we get control back from
OpenSSL, which might be a long time, and meanwhile we're still holding
locks that prevent WAL recovery from proceeding.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

2013-01-17 Thread Andres Freund
On 2013-01-17 17:18:14 +0200, Heikki Linnakangas wrote:
 On 17.01.2013 15:05, Andres Freund wrote:
 On 2013-01-17 13:47:41 +0900, Michael Paquier wrote:
 I think that bug has been introduced by commit 7fcbf6a.
 Before splitting xlog reading as a separate facility things worked
 correctly.
 There are also no delay problems before this commit.
 
 Ok, my inkling proved to be correct, it's two related issues:
 
 a) The error handling in XLogReadRecord is inconsistent; it doesn't
 always reset the necessary things.
 
 b) ReadRecord: We cannot just break out of the retry loop in readRecord
 like that; simply removing the break seems correct.
 
 c) ReadRecord (minor): We should log an error even if errormsg is not
 set, otherwise we won't jump out far enough.
 
 I think at least a) and b) is the result of merges between development
 of different people, sorry for that.
 
 Got a patch?

Yes, I am just testing some scenarios with it, will send it after that.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

2013-01-17 Thread Andres Freund
On 2013-01-17 16:23:44 +0100, Andres Freund wrote:
 On 2013-01-17 17:18:14 +0200, Heikki Linnakangas wrote:
  On 17.01.2013 15:05, Andres Freund wrote:
  On 2013-01-17 13:47:41 +0900, Michael Paquier wrote:
  I think that bug has been introduced by commit 7fcbf6a.
  Before splitting xlog reading as a separate facility things worked
  correctly.
  There are also no delay problems before this commit.
  
  Ok, my inkling proved to be correct, it's two related issues:
  
  a) The error handling in XLogReadRecord is inconsistent; it doesn't
  always reset the necessary things.
  
  b) ReadRecord: We cannot just break out of the retry loop in readRecord
  like that; simply removing the break seems correct.
  
  c) ReadRecord (minor): We should log an error even if errormsg is not
  set, otherwise we won't jump out far enough.
  
  I think at least a) and b) is the result of merges between development
  of different people, sorry for that.
  
  Got a patch?
 
 Yes, I am just testing some scenarios with it, will send it after that.

Ok, the attached patch seems to fix a) and b). c) above is bogus, as
explained in a comment in the patch.  I also noticed that the TLI check
didn't mark the last source as failed.

Not a real issue, and it's independent from this patch, but I found that
when promoting from streaming rep the code first fails over to archive
recovery and only then to recovering from pg_xlog.  Is that intended?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 70cfabc..e33bd7a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3209,12 +3209,6 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 		EndRecPtr = xlogreader->EndRecPtr;
 		if (record == NULL)
 		{
-			/* not all failures fill errormsg; report those that do */
-			if (errormsg && errormsg[0] != '\0')
-				ereport(emode_for_corrupt_record(emode,
-								 RecPtr ? RecPtr : EndRecPtr),
-						(errmsg_internal("%s", errormsg) /* already translated */));
-
 			lastSourceFailed = true;
 
 			if (readFile >= 0)
@@ -3222,7 +3216,23 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 				close(readFile);
 				readFile = -1;
 			}
-			break;
+
+			/*
+			 * We only end up here without a message when XLogPageRead() failed
+			 * - in that case we already logged something.
+			 * In StandbyMode that only happens if we have been triggered, so
+			 * we shouldn't loop anymore in that case.
+			 */
+			if (errormsg == NULL)
+				break;
+
+
+			ereport(emode_for_corrupt_record(emode,
+							 RecPtr ? RecPtr : EndRecPtr),
+					(errmsg_internal("%s", errormsg) /* already translated */));
+
+			/* don't make a timeline check */
+			continue;
 		}
 
 		/*
@@ -3234,6 +3244,8 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 			XLogSegNo segno;
 			int32 offset;
 
+			lastSourceFailed = true;
+
 			XLByteToSeg(xlogreader->latestPagePtr, segno);
 			offset = xlogreader->latestPagePtr % XLogSegSize;
 			XLogFileName(fname, xlogreader->readPageTLI, segno);
@@ -3243,7 +3255,7 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 							xlogreader->latestPageTLI,
 							fname,
 							offset)));
-			return false;
+			continue;
 		}
 	} while (StandbyMode && record == NULL);
 
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index ff871a3..75164f6 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -222,11 +222,7 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 	readOff = ReadPageInternal(state, targetPagePtr, SizeOfXLogRecord);
 
 	if (readOff < 0)
-	{
-		if (state->errormsg_buf[0] != '\0')
-			*errormsg = state->errormsg_buf;
-		return NULL;
-	}
+		goto err;
 
 	/*
 	 * ReadPageInternal always returns at least the page header, so we can
@@ -246,8 +242,7 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 	{
 		report_invalid_record(state, "invalid record offset at %X/%X",
							  (uint32) (RecPtr >> 32), (uint32) RecPtr);
-		*errormsg = state->errormsg_buf;
-		return NULL;
+		goto err;
 	}
 
 	if (((XLogPageHeader) state->readBuf)->xlp_info & XLP_FIRST_IS_CONTRECORD)
 	{
 		report_invalid_record(state, "contrecord is requested by %X/%X",
							  (uint32) (RecPtr >> 32), (uint32) RecPtr);
-		*errormsg = state->errormsg_buf;
-		return NULL;
+		goto err;
 	}
 
 	/* ReadPageInternal has verified the page header */
@@ -270,11 +264,7 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 			   targetPagePtr,
 		  Min(targetRecOff + SizeOfXLogRecord, XLOG_BLCKSZ));
 	if (readOff < 0)
-	{
-		if 
-		if 

Re: [HACKERS] could not create directory ...: File exists

2013-01-17 Thread Tom Lane
Stephen Frost sfr...@snowman.net writes:
   It turns out that createdb() currently only takes an AccessShareLock
   on pg_tablespace when scanning it with SnapshotNow, making it possible
   for a concurrent process to make some uninteresting modification to a
   tablespace (such as an ACL change) which will cause the heap scan in
   createdb() to see a given tablespace multiple times.  copydir() will
   then, rightfully, complain that it's being asked to create a directory
   which already exists.

Ugh.  Still another problem with non-MVCC catalog scans.

   Given that this is during createdb(), I'm guessing we don't have any
   option but to switch the scan to using ShareLock, since there isn't a
   snapshot available to do an MVCC scan with (I'm guessing that there
   could be other issues trying to do that anyway).

It seems that the only thing we actually use from each tuple is the OID.
So there are other ways to fix it, of which probably the minimum-change
one is to keep a list of already-processed tablespace OIDs and skip a
tuple if we get a match in the list.  This would be O(N^2) in the number
of tablespaces, but I rather doubt that's a problem.

[ thinks for a bit... ]  Ugh, no, because the *other* risk you've got
here is not seeing a row at all, which would be really bad.

I'm not sure that transiently increasing the lock here is enough,
either.  The concurrent transactions you're worried about probably
aren't holding their locks till commit, so you could get this lock
and still see tuples with unstable committed-good states.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby conflict resolution handling

2013-01-17 Thread Andres Freund
On 2013-01-17 10:19:23 -0500, Tom Lane wrote:
 Pavan Deolasee pavan.deola...@gmail.com writes:
  On Thu, Jan 17, 2013 at 12:08 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  ISTM that if we dare not interrupt for fear of confusing OpenSSL, we
  cannot safely attempt to send an error message to the client either;
  but ereport(FATAL) will try exactly that.
 
  I thought since FATAL will force the backend to exit, we don't care much
  about corrupted OpenSSL state. I even thought that's why we raise ERROR to
  FATAL so that the backend can start in a clean state. But clearly I'm
  missing a point here because you don't think that way.
 
 If we were to simply exit(1), leaving the kernel to close the client
 socket, it'd be safe enough because control would never have returned to
 OpenSSL.  But this code doesn't do that.  What we're looking at is that
 we've interrupted OpenSSL at some arbitrary point, and now we're going
 to make fresh calls to it to try to pump the FATAL error message out to
 the client.  It seems fairly unlikely that that's safe.  I'm not sure
 I credit Andres' worry of arbitrary code execution, but I do fear that
 OpenSSL could get confused to the point of freezing up, or even more
 likely that it would transmit garbage to the client, which rather
 defeats the purpose.

I don't think it's likely either; I seem to remember it copying around
function pointers though, so it seems possible with some bad luck.

 Don't see a nice fix.  The COMMERROR approach (ie, don't try to send
 anything to the client, only the log) is not nice at all since the
 client would get the impression that the server crashed.  On the other
 hand, anything else requires waiting till we get control back from
 OpenSSL, which might be a long time, and meanwhile we're still holding
 locks that prevent WAL recovery from proceeding.

I think we can make openssl return pretty much immediately if we assume
recv() can reliably be interrupted by signals, possibly by setting the
socket to nonblocking in the signal handler.
We just need to tell openssl not to retry immediately and we should be
fine. Given that quite some people use openssl with nonblocking sockets,
that code path should be reasonably safe.

That still requires ugliness around saving the error and reraising it
after returning from openssl though...
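
For what it's worth, the mechanics could look roughly like this (sketch
only; 'client_sock' and the handler name are illustrative stand-ins, not
the backend's real symbols). fcntl() is async-signal-safe, so flipping
O_NONBLOCK from the handler should make a blocked SSL_read() come back
with SSL_ERROR_WANT_READ:

	#include <fcntl.h>
	#include <signal.h>

	static int client_sock;                        /* the client connection */
	static volatile sig_atomic_t conflict_pending = 0;

	static void
	recovery_conflict_handler(int signo)
	{
		int		flags = fcntl(client_sock, F_GETFL, 0);

		conflict_pending = 1;
		/* make the in-progress recv()/SSL_read() return instead of blocking */
		if (flags >= 0)
			(void) fcntl(client_sock, F_SETFL, flags | O_NONBLOCK);
	}

The caller would then see SSL_ERROR_WANT_READ, notice conflict_pending,
and raise the saved error once control is safely back out of OpenSSL.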

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Materialized views WIP patch

2013-01-17 Thread Thom Brown
On 16 January 2013 17:25, Thom Brown t...@linux.com wrote:

 On 16 January 2013 17:20, Kevin Grittner kgri...@mail.com wrote:

 Thom Brown wrote:

  Some weirdness:
 
  postgres=# CREATE VIEW v_test2 AS SELECT 1 moo;
  CREATE VIEW
  postgres=# CREATE MATERIALIZED VIEW mv_test2 AS SELECT moo, 2*moo FROM
  v_test2 UNION ALL SELECT moo, 3*moo FROM v_test2;
  SELECT 2
  postgres=# \d+ mv_test2
           Materialized view "public.mv_test2"
    Column  |  Type   | Modifiers | Storage | Stats target | Description
  ----------+---------+-----------+---------+--------------+-------------
   moo      | integer |           | plain   |              |
   ?column? | integer |           | plain   |              |
  View definition:
   SELECT "*SELECT* 1".moo, "*SELECT* 1"."?column?";

 You are very good at coming up with these, Thom!

 Will investigate.

 Can you confirm that *selecting* from the MV works as you would
 expect; it is just the presentation in \d+ that's a problem?


 Yes, nothing wrong with using the MV, or refreshing it:

 postgres=# TABLE mv_test2;
  moo | ?column?
 -----+----------
    1 |        2
    1 |        3
 (2 rows)

 postgres=# SELECT * FROM mv_test2;
  moo | ?column?
 -----+----------
    1 |        2
    1 |        3
 (2 rows)

 postgres=# REFRESH MATERIALIZED VIEW mv_test2;
 REFRESH MATERIALIZED VIEW

 But a pg_dump of the MV has the same issue as the view definition:

 --
 -- Name: mv_test2; Type: MATERIALIZED VIEW; Schema: public; Owner: thom;
 Tablespace:
 --

 CREATE MATERIALIZED VIEW mv_test2 (
     moo,
     "?column?"
 ) AS
  SELECT "*SELECT* 1".moo, "*SELECT* 1"."?column?"
   WITH NO DATA;


A separate issue is with psql tab-completion:

postgres=# COMMENT ON MATERIALIZED VIEW <TAB> IS

This should be offering MV names instead of prematurely providing the IS
keyword.

-- 
Thom


Re: [HACKERS] Event Triggers: adding information

2013-01-17 Thread Robert Haas
On Thu, Jan 17, 2013 at 5:18 AM, Dimitri Fontaine
dimi...@2ndquadrant.fr wrote:
 Now, if that's what it takes, I'll spend time on it. In which exact
 order do you want to be reviewing and applying that series of patches?

Let's agree on which things we even want to do first.  Here's my take:

- adds ddl_command_trace and ddl_command_end events

I think ddl_command_end is OK and I'm willing to commit that if
extracted as its own patch.  I think ddl_command_trace is unnecessary
syntactic sugar.

- causes those events to be called not only for actual SQL commands
but also for things that happen, at present, to go through the same
code path

I think this is a bad idea, not only because, as I said before, it
exposes internal implementation details, but also because it's
probably not safe.  As Noah (I believe, or maybe Tom) observed in a
previous post, we've gone to a lot of work to make sure that nothing
you do inside a trigger can crash the server, corrupt the catalogs,
trigger elog() messages, etc.  That applies in spades to DDL.

Despite Tom's skepticism, I'm pretty sure that it's safe for us to
execute trigger function calls either completely before or completely
after we execute the DDL command.  It should be no different than
executing them as separate commands.  But doing something in the
middle of a command is something else altogether.  Consider, for
example, that ALTER TABLE does a bunch of crap internally to itself,
and then recurses into ProcessUtility to do some more stuff.  This is
a terrible design, but it's what we've got.  Are you absolutely sure
that the intermediate state that exists after that first bunch of
stuff is done and before those other things are done is a safe place
to execute arbitrary user-provided code?  I'm not.  And the more
places we add that recurse into ProcessUtility, or indeed, add event
triggers in general, the more places we're going to add new failure
modes.

Fixing that is part of the price of adding a new event trigger and I'm
OK with that.  But I'm not OK with baking into the code the assumption
that any current or future callers of ProcessUtility() should be
prepared for arbitrary code execution.  The code wasn't written with
that in mind, and it will not end well.

- adds additional magic variables to PL/pgsql to expose new
information not previously exposed

I'm not sure that I agree with the implementation of this, but I think
I'm OK with the concept.  I will review it when it is submitted in a
form not intertwined with other things.

- adds CONTEXT as an additional event-trigger-filter variable, along with TAG

I'm OK with adding new event-trigger-filter variables, provided that
they are done without additional run-time overhead for people who are
not using event triggers; I think it's nice that we can support that.
But I think this particular proposal is DOA because nothing except
TOPLEVEL is actually safe.

 As for the patch not being well-thought-out, I'm sticking by that,
 too.  Alvaro's comments about lock escalations should be enough to
 tickle your alarm bells, but there are plenty of other problems here,
 too.

 Here's the comment in the code where the lock problems are discussed:

 /*
  * Fill in the EventTriggerTargetOid for a DROP command and a
  * ddl_command_start Event Trigger: the lookup didn't happen yet and we
  * still want to provide the user with that information when possible.
  *
  * We only do the lookup if the command contains a single element, which is
  * often forced to be the case as per the grammar (see the drop_type:
  * production for a list of cases where more than one object per command is
  * allowed).
  *
  * We take no exclusive lock here, the main command will have to do the
  * name lookup all over again and take appropriate locks down after the
  * User Defined Code runs anyway, and the commands executed by the User
  * Defined Code will take their own locks. We only lock the objects here so
 * that they don't disappear under us in another concurrent transaction,
  * hence using ShareUpdateExclusiveLock.  XXX this seems prone to lock
  * escalation troubles.
  */

 My thinking is that the user code can do anything, even run a DROP
 command against the object that we are preparing ourselves to be
 dropping. I don't think we can bypass doing the lookup twice.

Sure, but there's also the issue of getting the right answer.  For
example, suppose you are planning to drop a function.  You do a name
lookup on the name and call the event trigger.  Meanwhile, someone
else renames the function and creates a new one with the same name.
After the event trigger returns, the actual drop code latches onto the
one with the new name.  Now, the event trigger ran with OID A and the
actual object dropped was one with OID B.  It's possible that KaiGai's
RENAME refactoring in 9.3 has strengthened the locking enough that
this particular failure mode is no longer possible in that specific
scenario, but I am fairly confident it would have been an 

Re: [HACKERS] Event Triggers: adding information

2013-01-17 Thread Andres Freund
On 2013-01-17 11:15:24 -0500, Robert Haas wrote:
 As a further example, suppose that in 9.4 (or 9.17) we add a command
 DROP TABLES IN SCHEMA fred WHERE name LIKE 'bob%'.  Well, the
 logging trigger still just works (because it's only writing the
 statement, without caring about its contents).  And the replication
 trigger still just works too (because it will still get a callback for
 every table that gets dropped, even though it knows nothing about the
 new syntax).  That's very powerful.  Of course, there will still be
 cases where things have to change, but you can minimize it by clean
 definitions.
 
 It pains me that I've evidently failed to communicate this concept
 clearly despite a year or more of trying.  Does that make sense?  Is
 there some way I can make this more clear?  The difference seems very
 clear-cut to me, but evidently I'm not conveying it well.

Well, I didn't manage to follow the topic with the level of detail I
would have liked to/should have, so it's very well possible that I missed
a proposal of how to implement that concept in a way you like?
With some squinting you can actually see the proposed ddl_trace callsite
as exactly that kind of replication trigger, but with a worse name...

Without piggy-backing on the internal commands generated - which
currently seems to be achieved by passing everything through
ProcessUtility, but could work in a quite different way - and
reconstructing sql from that I don't immediately see how you want to do
this?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

2013-01-17 Thread Heikki Linnakangas

On 17.01.2013 17:42, Andres Freund wrote:

Ok, the attached patch seems to fix a) and b). c) above is bogus, as
explained in a comment in the patch.  I also noticed that the TLI check
didn't mark the last source as failed.


This looks fragile:


/*
 * We only end up here without a message when 
XLogPageRead() failed
 * - in that case we already logged something.
 * In StandbyMode that only happens if we have been 
triggered, so
 * we shouldn't loop anymore in that case.
 */
if (errormsg == NULL)
break;


I don't like relying on the presence of an error message to control 
logic like that. Should we throw in an explicit CheckForStandbyTrigger() 
check in the condition of that loop?


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

2013-01-17 Thread Andres Freund
On 2013-01-17 18:33:42 +0200, Heikki Linnakangas wrote:
 On 17.01.2013 17:42, Andres Freund wrote:
 Ok, the attached patch seems to fix a) and b). c) above is bogus, as
 explained in a comment in the patch.  I also noticed that the TLI check
 didn't mark the last source as failed.
 
 This looks fragile:
 
  /*
   * We only end up here without a message when 
  XLogPageRead() failed
   * - in that case we already logged something.
   * In StandbyMode that only happens if we have been 
  triggered, so
   * we shouldn't loop anymore in that case.
   */
  if (errormsg == NULL)
  break;
 
 I don't like relying on the presence of an error message to control logic
 like that. Should we throw in an explicit CheckForStandbyTrigger() check in
 the condition of that loop?

I agree, I wasn't too happy about that either. But in some sense it's
only propagating state from XLogPageRead, which has already dealt with
the error and decided it's ok.
It's the solution closest to what happened in the old implementation;
that's why I thought it would be halfway to acceptable.

Adding the CheckForStandbyTrigger() in the condition would mean
promotion would happen before all the available records are processed,
and it would increase the number of stat() calls tremendously.
So I don't really like that either.

Don't really have a good idea :(

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Latex longtable format

2013-01-17 Thread Bruce Momjian
On Sat, Jan 12, 2013 at 07:09:31PM -0500, Bruce Momjian wrote:
 I have received several earnest requests over the years for LaTeX
 'longtable' output, and I have just implemented it based on a sample
 LaTeX longtable output file.
 
 I have called it 'latex-longtable' and implemented all the behaviors of
 ordinary latex mode.   One feature is that in latex-longtable mode,
 'tableattr' allows control over the column widths --- that seemed to be
 very important to the users.  One requested change I made to the
 ordinary latex output was to suppress the line under the table title if
 border = 0 (default is border = 1).
 
 Patch and sample output attached.  I would like to apply this for PG
 9.3.

Modified patch applied.  I didn't need to modify our existing 'latex'
output format, except to add border=3 support.  It was tempting to
suggest renaming our 'latex' output format to 'latex-tabular', because
that is what it uses, but 'tabular' is a package included by default in
Latex, while 'longtable' requires two additional packages to be
specified, so I resisted suggesting it.  I did document that 'latex'
uses 'tabular'.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

2013-01-17 Thread Heikki Linnakangas

On 17.01.2013 18:42, Andres Freund wrote:

On 2013-01-17 18:33:42 +0200, Heikki Linnakangas wrote:

On 17.01.2013 17:42, Andres Freund wrote:

Ok, the attached patch seems to fix a) and b). c) above is bogus, as
explained in a comment in the patch.  I also noticed that the TLI check
didn't mark the last source as failed.


This looks fragile:


/*
 * We only end up here without a message when 
XLogPageRead() failed
 * - in that case we already logged something.
 * In StandbyMode that only happens if we have been 
triggered, so
 * we shouldn't loop anymore in that case.
 */
if (errormsg == NULL)
break;


I don't like relying on the presence of an error message to control logic
like that. Should we throw in an explicit CheckForStandbyTrigger() check in
the condition of that loop?


I agree, I wasn't too happy about that either. But in some sense it's
only propagating state from XLogPageRead, which has already dealt with
the error and decided it's ok.
It's the solution closest to what happened in the old implementation;
that's why I thought it would be halfway to acceptable.

Adding the CheckForStandbyTrigger() in the condition would mean
promotion would happen before all the available records are processed,
and it would increase the number of stat() calls tremendously.
So I don't really like that either.


I was thinking of the attached. As long as we check for 
CheckForStandbyTrigger() after the record == NULL check, we won't 
perform extra stat() calls on successful reads, only when we're polling 
after reaching the end of valid WAL. That seems acceptable. If we want 
to avoid even that, we could move the static 'triggered' variable from 
CheckForStandbyTrigger() out of that function, and check that in the loop.


- Heikki
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 70cfabc..ac2b26b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3180,7 +3180,8 @@ RestoreBackupBlock(XLogRecPtr lsn, XLogRecord *record, int block_index,
  * try to read a record just after the last one previously read.
  *
  * If no valid record is available, returns NULL, or fails if emode is PANIC.
- * (emode must be either PANIC, LOG)
+ * (emode must be either PANIC, LOG). In standby mode, retries until a valid
+ * record is available.
  *
  * The record is copied into readRecordBuf, so that on successful return,
  * the returned record pointer always points there.
@@ -3209,12 +3210,6 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 		EndRecPtr = xlogreader->EndRecPtr;
 		if (record == NULL)
 		{
-			/* not all failures fill errormsg; report those that do */
-			if (errormsg && errormsg[0] != '\0')
-				ereport(emode_for_corrupt_record(emode,
-								 RecPtr ? RecPtr : EndRecPtr),
-						(errmsg_internal("%s", errormsg) /* already translated */));
-
 			lastSourceFailed = true;
 
 			if (readFile >= 0)
@@ -3222,7 +3217,20 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 				close(readFile);
 				readFile = -1;
 			}
-			break;
+
+			/*
+			 * We only end up here without a message when XLogPageRead() failed
+			 * - in that case we already logged something.
+			 * In StandbyMode that only happens if we have been triggered, so
+			 * we shouldn't loop anymore in that case.
+			 */
+			if (errormsg)
+				ereport(emode_for_corrupt_record(emode,
+								 RecPtr ? RecPtr : EndRecPtr),
+						(errmsg_internal("%s", errormsg) /* already translated */));
+
+			/* Give up, or retry if we're in standby mode. */
+			continue;
 		}
 
 		/*
@@ -3234,6 +3242,8 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 			XLogSegNo segno;
 			int32 offset;
 
+			lastSourceFailed = true;
+
 			XLByteToSeg(xlogreader->latestPagePtr, segno);
 			offset = xlogreader->latestPagePtr % XLogSegSize;
 			XLogFileName(fname, xlogreader->readPageTLI, segno);
@@ -3243,9 +3253,10 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 							xlogreader->latestPageTLI,
 							fname,
 							offset)));
-			return false;
+			record = NULL;
+			continue;
 		}
-	} while (StandbyMode && record == NULL);
+	} while (StandbyMode && record == NULL && !CheckForStandbyTrigger());
 
 	return record;
 }
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index ff871a3..75164f6 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -222,11 +222,7 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 	readOff = ReadPageInternal(state, targetPagePtr, SizeOfXLogRecord);
 
 	if (readOff < 0)
-	{
-		if (state->errormsg_buf[0] != '\0')
-			*errormsg = state->errormsg_buf;
-		return NULL;
-	}
+		goto err;
 
 	/*
 	 * 

[HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

2013-01-17 Thread Andres Freund
On 2013-01-17 18:50:35 +0200, Heikki Linnakangas wrote:
 On 17.01.2013 18:42, Andres Freund wrote:
 On 2013-01-17 18:33:42 +0200, Heikki Linnakangas wrote:
 On 17.01.2013 17:42, Andres Freund wrote:
 Ok, the attached patch seems to fix a) and b). c) above is bogus, as
 explained in a comment in the patch.  I also noticed that the TLI check
 didn't mark the last source as failed.
 
 This looks fragile:
 
/*
 * We only end up here without a message when 
  XLogPageRead() failed
 * - in that case we already logged something.
 * In StandbyMode that only happens if we have been 
  triggered, so
 * we shouldn't loop anymore in that case.
 */
if (errormsg == NULL)
break;
 
 I don't like relying on the presence of an error message to control logic
 like that. Should we throw in an explicit CheckForStandbyTrigger() check in
 the condition of that loop?
 
 I agree, I wasn't too happy about that either. But in some sense it's
 only propagating state from XLogPageRead, which has already dealt with
 the error and decided it's ok.
 It's the solution closest to what happened in the old implementation;
 that's why I thought it would be halfway to acceptable.
 
 Adding the CheckForStandbyTrigger() in the condition would mean
 promotion would happen before all the available records are processed,
 and it would increase the number of stat() calls tremendously.
 So I don't really like that either.
 
 I was thinking of the attached. As long as we check for
 CheckForStandbyTrigger() after the record == NULL check, we won't perform
 extra stat() calls on successful reads, only when we're polling after
 reaching the end of valid WAL. That seems acceptable. 

Looks good to me.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Event Triggers: adding information

2013-01-17 Thread Dimitri Fontaine
Robert Haas robertmh...@gmail.com writes:
 - adds ddl_command_trace and ddl_command_end events

 I think ddl_command_end is OK and I'm willing to commit that if
 extracted as its own patch.  I think ddl_command_trace is unnecessary
 syntactic sugar.

Ok. Will prepare a non-controversial patch for ddl_command_end. After
having played with the patch, I happen to quite like ddl_command_trace,
but I won't try to protect it from its death any more than that.

 - causes those events to be called not only for actual SQL commands
 but also for things that happen, at present, to go through the same
 code path

 I think this is a bad idea, not only because, as I said before, it
 exposes internal implementation details, but also because it's
 probably not safe.  As Noah (I believe, or maybe Tom), observed in a
 previous post, we've gone to a lot of work to make sure that nothing
 you do inside a trigger can crash the server, corrupt the catalogs,
 trigger elog() messages, etc.  That applies in spades to DDL.

I naively thought that doing CommandCounterIncrement() before running the
event trigger would go a long enough way towards making the operation
safe when the current code is calling back into ProcessUtility().

 Despite Tom's skepticism, I'm pretty sure that it's safe for us to
 execute trigger function calls either completely before or completely
 after we execute the DDL command.  It should be no different than
 executing them as separate commands.  But doing something in the
 middle of a command is something else altogether.  Consider, for
 example, that ALTER TABLE does a bunch of crap internally to itself,
 and then recurses into ProcessUtility to do some more stuff.  This is
 a terrible design, but it's what we've got.  Are you absolutely sure
 that the intermediate state that exists after that first bunch of
 stuff is done and before those other things are done is a safe place
 to execute arbitrary user-provided code?  I'm not.  And the more
 places we add that recurse into ProcessUtility, or indeed, add event
 triggers in general, the more places we're going to add new failure
 modes.

While I hear you, I completely fail to understand which magic is going
to make your other idea safer than this one. Namely, calling an event
trigger named sql_drop rather than ddl_command_start from the same place
in the code is certainly not helping at all.

If your answer to that is how to implement it (e.g. keeping a queue of
events for which to fire Event Triggers once we reach a safe point in
the code), then why couldn't we do exactly the same thing under the
other name?

 Fixing that is part of the price of adding a new event trigger and I'm
 OK with that.  But I'm not OK with baking into the code the assumption
 that any current or future callers of ProcessUtility() should be
 prepared for arbitrary code execution.  The code wasn't written with
 that in mind, and it will not end well.

Again, while I agree with the general principle, it happens that the
calls to the arbitrary code are explicit in ProcessUtility(), so that if
the place is known to be unsafe it's easy enough to refrain from calling
the user code. We do that already for CREATE INDEX CONCURRENTLY at
ddl_command_end.

 - adds additional magic variables to PL/pgsql to expose new
 information not previously exposed

 I'm not sure that I agree with the implementation of this, but I think
 I'm OK with the concept.  I will review it when it is submitted in a
 form not intertwined with other things.

Fair enough.

 - adds CONTEXT as an additional event-trigger-filter variable, along with TAG

 I'm OK with adding new event-trigger-filter variables, provided that
 they are done without additional run-time overhead for people who are
 not using event triggers; I think it's nice that we can support that.
 But I think this particular proposal is DOA because nothing except
 TOPLEVEL is actually safe.

Yeah, let's first see about that.

 My thinking is that the user code can do anything, even run a DROP
 command against the object that we are preparing ourselves to be
 dropping. I don't think we can bypass doing the lookup twice.

 Sure, but there's also the issue of getting the right answer.  For
 example, suppose you are planning to drop a function.  You do a name
 lookup on the name and call the event trigger.  Meanwhile, someone
 else renames the function and creates a new one with the same name.
 After the event trigger returns, the actual drop code latches onto the
 one with the new name.  Now, the event trigger ran with OID A and the
 actual object dropped was one with OID B.  It's possible that KaiGai's

Is your scenario even possible when we take a ShareUpdateExclusiveLock
on the object before running the Event Trigger, in order to protect
against exactly that scenario, as said in the comment? Maybe another
lock should be taken. Which one?

 There's another issue here, too, which is that this change
 reintroduces a failure mode that I spent a lot of time 

[HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

2013-01-17 Thread Heikki Linnakangas

On 17.01.2013 18:55, Andres Freund wrote:

On 2013-01-17 18:50:35 +0200, Heikki Linnakangas wrote:

I was thinking of the attached. As long as we check for
CheckForStandbyTrigger() after the record == NULL check, we won't perform
extra stat() calls on successful reads, only when we're polling after
reaching the end of valid WAL. That seems acceptable.


Looks good to me.


Ok, committed.

- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

2013-01-17 Thread Andres Freund
On 2013-01-17 13:47:41 +0900, Michael Paquier wrote:
 Slave does not try anymore to reconnect to master with messages of the type:
 FATAL:  could not connect to the primary server

 I also noticed that there is some delay until modifications on master are
 visible on slave.

 I think that bug has been introduced by commit 7fcbf6a.
 Before splitting xlog reading as a separate facility things worked
 correctly.
 There are also no delay problems before this commit.

Heikki committed a fix for at least the promotion issue, I didn't notice
any problem with an increased delay, could you check again if you still
see it?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] could not create directory ...: File exists

2013-01-17 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
 Ugh.  Still another problem with non-MVCC catalog scans.

Indeed.

 It seems that the only thing we actually use from each tuple is the OID.

Yes, that's true.

 So there are other ways to fix it, of which probably the minimum-change
 one is to keep a list of already-processed tablespace OIDs and skip a
 tuple if we get a match in the list.  This would be O(N^2) in the number
 of tablespaces, but I rather doubt that's a problem.

I did consider this option also but it felt like a lot of additional
code to write and I wasn't entirely convinced it'd be worth it.  It's
very frustrating that we can't simply get a list of *currently valid*
tablespaces with a guarantee that no one else is going to mess with that
list while we process it.  That's what MVCC would give us, but we can't
rely on that yet...

If we agree that keeping a list and then skipping over anything already
seen is the way to go, I can go ahead and code that up; a rough sketch of
what I have in mind follows.
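
Roughly (illustrative only, using the stock list helpers; the scan setup
and the actual copydir() call are elided):

	/* Skip tablespace OIDs we've already copied, tolerating duplicate
	 * tuples from the SnapshotNow scan.  O(N^2) in the number of
	 * tablespaces, which should be harmless for realistic N. */
	List	   *processed = NIL;
	HeapTuple	tuple;

	while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
	{
		Oid		tsoid = HeapTupleGetOid(tuple);

		if (list_member_oid(processed, tsoid))
			continue;			/* already handled this tablespace */
		processed = lappend_oid(processed, tsoid);

		/* ... copydir() of the template database's directory here ... */
	}

Though per your "[ thinks for a bit... ]" point below, this only covers
the duplicate-visit half of the problem, not the missed-row case.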

 [ thinks for a bit... ]  Ugh, no, because the *other* risk you've got
 here is not seeing a row at all, which would be really bad.

I'm not sure that I see how that could happen..?  I agree that it'd be
really bad if it did though.  Or are you thinking if we were to take a
snapshot and then walk the table?

 I'm not sure that transiently increasing the lock here is enough,
 either.  The concurrent transactions you're worried about probably
 aren't holding their locks till commit, so you could get this lock
 and still see tuples with unstable committed-good states.

If there are other transactions which have open locks against the table,
wouldn't that prevent my being able to acquire ShareLock on it?  Or put
another way, how would they not hold their locks till commit or
rollback?  We do only look at tablespaces which exist for the database
we're copying, as I recall, so any newly created tablespaces shouldn't
impact this.

Thanks,

Stephen




Re: [HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

2013-01-17 Thread Fujii Masao
On Fri, Jan 18, 2013 at 2:35 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-01-17 13:47:41 +0900, Michael Paquier wrote:
 Slave does not try anymore to reconnect to master with messages of the type:
 FATAL:  could not connect to the primary server

 I also noticed that there is some delay until modifications on master are
 visible on slave.

 I think that bug has been introduced by commit 7fcbf6a.
 Before splitting xlog reading as a separate facility things worked
 correctly.
 There are also no delay problems before this commit.

 Heikki committed a fix for at least the promotion issue, I didn't notice
 any problem with an increased delay, could you check again if you still
 see it?

I encountered a problem where the timeline switch is not performed as expected.
I set up one master, one standby and one cascade standby. All the servers
share the archive directory. restore_command is specified in the recovery.conf
in those two standbys.

I shut down the master, and then promoted the standby. In this case, the
cascade standby should switch to new timeline and replication should be
successfully restarted. But the timeline was never changed, and the following
log messages were kept outputting.

sby2 LOG:  restarted WAL streaming at 0/300 on timeline 1
sby2 LOG:  replication terminated by primary server
sby2 DETAIL:  End of WAL reached on timeline 1
sby2 LOG:  restarted WAL streaming at 0/300 on timeline 1
sby2 LOG:  replication terminated by primary server
sby2 DETAIL:  End of WAL reached on timeline 1
sby2 LOG:  restarted WAL streaming at 0/300 on timeline 1
sby2 LOG:  replication terminated by primary server
sby2 DETAIL:  End of WAL reached on timeline 1


Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

2013-01-17 Thread Andres Freund
On 2013-01-18 03:05:47 +0900, Fujii Masao wrote:
 On Fri, Jan 18, 2013 at 2:35 AM, Andres Freund and...@2ndquadrant.com wrote:
  On 2013-01-17 13:47:41 +0900, Michael Paquier wrote:
  Slave does not try anymore to reconnect to master with messages of the 
  type:
  FATAL:  could not connect to the primary server
 
  I also noticed that there is some delay until modifications on master are
  visible on slave.
 
  I think that bug has been introduced by commit 7fcbf6a.
  Before splitting xlog reading as a separate facility things worked
  correctly.
  There are also no delay problems before this commit.
 
  Heikki committed a fix for at least the promotion issue, I didn't notice
  any problem with an increased delay, could you check again if you still
  see it?
 
 I encountered a problem where the timeline switch is not performed as
 expected.
 I set up one master, one standby and one cascade standby. All the servers
 share the archive directory. restore_command is specified in the recovery.conf
 in those two standbys.
 
 I shut down the master, and then promoted the standby. In this case, the
 cascade standby should switch to new timeline and replication should be
 successfully restarted. But the timeline was never changed, and the following
 log messages were kept outputting.
 
 sby2 LOG:  restarted WAL streaming at 0/300 on timeline 1
 sby2 LOG:  replication terminated by primary server
 sby2 DETAIL:  End of WAL reached on timeline 1
 sby2 LOG:  restarted WAL streaming at 0/300 on timeline 1
 sby2 LOG:  replication terminated by primary server
 sby2 DETAIL:  End of WAL reached on timeline 1
 sby2 LOG:  restarted WAL streaming at 0/300 on timeline 1
 sby2 LOG:  replication terminated by primary server
 sby2 DETAIL:  End of WAL reached on timeline 1
 

Is that after the commit or before? Because in passing I think I
noticed/fixed a bug that could cause exactly that problem...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tuplesort memory usage: grow_memtuples

2013-01-17 Thread Tom Lane
Peter Geoghegan pe...@2ndquadrant.com writes:
 On 8 December 2012 14:41, Andres Freund and...@2ndquadrant.com wrote:
 Is anybody planning to work on this? There hasn't been any activity
 since the beginning of the CF and it doesn't look like there is much
 work left?

 I took another look at this.

Applied with some changes:

* Use float8 arithmetic to prevent intermediate-result overflow, as I
suggested last night.

* Rearrange the subsequent checks so that they provide bulletproof
clamping behavior on out-of-range newmemtupsize values; this allows
dropping the very ad-hoc range limiting used in the previous patch (and
in what I had last night).

* Fix the check against availMem; it was supposed to be testing that the
increment in the array size was within availMem.  (In the original
coding the increment was always the same as the original size, but not
so much anymore.)

* Allow the array size to grow to the MaxAllocSize limit, instead of
punting if we'd exceed that.  If we're attempting to use as much of
work_mem as we possibly can, I don't see why that should be a reject
case.

* Improve the comments, including the existing one about the availMem
check, since that evidently wasn't clear enough.

* Copy the whole mess into tuplestore.c too.
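
For reference, the clamping pattern now looks roughly like this
(illustrative variable names and values, not the committed code verbatim):

	int		memtupsize = 1024;				/* current array length */
	double	growthRatio = 2.0;				/* whatever availMem permits */
	double	newSize = (double) memtupsize * growthRatio;	/* float8 math */
	int		newmemtupsize;

	/* clamp so repalloc never sees a request above MaxAllocSize */
	if (newSize > (double) (MaxAllocSize / sizeof(SortTuple)))
		newSize = (double) (MaxAllocSize / sizeof(SortTuple));
	newmemtupsize = (int) newSize;

Doing the multiply in float8 means the intermediate product cannot wrap
around, and the clamp provides the bulletproof out-of-range behavior.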

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH]Tablesample Submission

2013-01-17 Thread Josh Berkus
On 11/04/2012 07:22 PM, Qi Huang wrote:
 Dear hackers,
 
 Sorry for not replying to the patch review. I didn't see the review until
 recently, as my mail box is full of Postgres mails and I didn't notice the
 one for me (a mail box configuration problem). I am still kind of busy with
 my university final year project. I shall not have time to work on updating
 the patch until this semester finishes, which is December. I will work on
 it then. Thanks.
 
 Best Regards,
 Huang Qi Victor
 Computer Science of National University of Singapore

Did you ever do the update of the patch?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Removing PD_ALL_VISIBLE

2013-01-17 Thread Jeff Davis
On Thu, 2013-01-17 at 15:25 +0530, Pavan Deolasee wrote:
 Now that I look at the patch, I wonder if there is another fundamental
 issue with the patch. Since the patch removes WAL logging for the VM
 set operation, this can happen:
 
Thank you. I think I was confused by this comment here:

When we *set* a visibility map during VACUUM, we must write WAL.  This
may seem counterintuitive, since the bit is basically a hint: if it is
clear, it may still be the case that every tuple on the page is visible
to all transactions; we just don't know that for certain.  The
difficulty is that there are two bits which are typically set together:
the PD_ALL_VISIBLE bit on the page itself, and the visibility map bit.
If a crash occurs after the visibility map page makes it to disk and
before the updated heap page makes it to disk, redo must set the bit on
the heap page. Otherwise, the next insert, update, or delete on the heap
page will fail to realize that the visibility map bit must be cleared,
possibly causing index-only scans to return wrong answers.

Which led me to believe that I could just rip out the WAL-related code
if PD_ALL_VISIBLE goes away, which is incorrect. But the incorrectness
doesn't have to do with the WAL directly, it's because the VM page's LSN
is not bumped past the LSN of the related heap page cleanup, so it can
be written too early.

Of course, the way to bump the LSN is to write WAL for the
visibilitymap_set operation. But that would be a very simple WAL
routine, rather than the complex one that exists without the patch.

I suppose we could even try to bump the LSN without writing WAL somehow,
but it doesn't seem worth reasoning through that (setting a VM bit is
rare enough).
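
Roughly, the simple version would just be (sketch; the rdata chain setup
and the in-page bit manipulation are elided):

	XLogRecPtr	recptr;

	START_CRIT_SECTION();
	/* ... set the bit in the VM page itself ... */
	MarkBufferDirty(vmBuf);
	recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_VISIBLE, rdata);
	PageSetLSN(BufferGetPage(vmBuf), recptr);
	END_CRIT_SECTION();

Stamping the VM page with the record's LSN is what keeps the buffer
manager from flushing it ahead of the WAL.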

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] could not create directory ...: File exists

2013-01-17 Thread Tom Lane
Stephen Frost sfr...@snowman.net writes:
 * Tom Lane (t...@sss.pgh.pa.us) wrote:
 [ thinks for a bit... ]  Ugh, no, because the *other* risk you've got
 here is not seeing a row at all, which would be really bad.

 I'm not sure that I see how that could happen..?  I agree that it'd be
 really bad if it did though.  Or are you thinking if we were to take a
 snapshot and then walk the table?

The case where you see a tuple twice is if an update drops a new version
of a row beyond your seqscan, and then commits before you get to the new
version.  But if it drops the new version of the row *behind* your
seqscan, and then commits before you get to the original tuple, you'll
not see either row version as committed.  Both of these cases are
inherent risks in SnapshotNow scans.

 I'm not sure that transiently increasing the lock here is enough,
 either.  The concurrent transactions you're worried about probably
 aren't holding their locks till commit, so you could get this lock
 and still see tuples with unstable committed-good states.

  If there are other transactions which have open locks against the table,
 wouldn't that prevent my being able to acquire ShareLock on it?

What I'm worried about is that we very commonly release locks on
catalogs as soon as we're done with the immediate operation --- not only
read locks, but update locks as well.  So there are lots of cases where
locks are released before commit.  It's possible that that never happens
in operations applied to pg_tablespace, but I'd not want to bet on it.

I wonder whether it's practical to look at the on-disk directories
instead of relying on pg_tablespace for this particular purpose.
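
(A sketch of what that directory-based approach might look like, using
the backend's directory primitives; src_dboid and the actual copy step
are assumed to come from the surrounding createdb() code:)

    /*
     * Sketch only: discover the template database's per-tablespace
     * directories by walking pg_tblspc on disk, instead of scanning
     * pg_tablespace under SnapshotNow.
     */
    DIR        *dir = AllocateDir("pg_tblspc");
    struct dirent *de;

    while ((de = ReadDir(dir, "pg_tblspc")) != NULL)
    {
        char        path[MAXPGPATH];

        /* tablespace entries are symlinks named by OID; skip . and .. */
        if (strspn(de->d_name, "0123456789") != strlen(de->d_name))
            continue;

        snprintf(path, sizeof(path), "pg_tblspc/%s/%s/%u",
                 de->d_name, TABLESPACE_VERSION_DIRECTORY, src_dboid);

        /* if path exists, copy that directory for the new database */
    }
    FreeDir(dir);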

Or maybe we should just write this off as a case we can't realistically
fix before we have MVCC catalog scans.  It seems that any other fix is
going to be hopelessly ugly.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tuplesort memory usage: grow_memtuples

2013-01-17 Thread Peter Geoghegan
On 17 January 2013 18:22, Tom Lane t...@sss.pgh.pa.us wrote:
 Applied with some changes:

Thank you. That feedback is useful.

-- 
Peter Geoghegan   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] could not create directory ...: File exists

2013-01-17 Thread Andres Freund
On 2013-01-17 13:46:44 -0500, Tom Lane wrote:
 Or maybe we should just write this off as a case we can't realistically
 fix before we have MVCC catalog scans.  It seems that any other fix is
 going to be hopelessly ugly.

ISTM we could just use an MVCC snapshot in this specific case?

Andres
--
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] could not create directory ...: File exists

2013-01-17 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
 The case where you see a tuple twice is if an update drops a new version
 of a row beyond your seqscan, and then commits before you get to the new
 version.  But if it drops the new version of the row *behind* your
 seqscan, and then commits before you get to the original tuple, you'll
 not see either row version as committed.  Both of these cases are
 inherent risks in SnapshotNow scans.

Ah, I see.

  If there are other transactions which have open locks against the table,
  wouldn't that prevent my being able to acquire ShareLock on it?
 
 What I'm worried about is that we very commonly release locks on
 catalogs as soon as we're done with the immediate operation --- not only
 read locks, but update locks as well.  So there are lots of cases where
 locks are released before commit.  It's possible that that never happens
 in operations applied to pg_tablespace, but I'd not want to bet on it.

Fair enough.

 I wonder whether it's practical to look at the on-disk directories
 instead of relying on pg_tablespace for this particular purpose.

So we'd traverse all of the tablespace directories and copy any
directories we come across that have the OID of the template
database...?  Sounds pretty reasonable, though I'm not sure that we'd
want to back-patch a large change like that.

 Or maybe we should just write this off as a case we can't realistically
 fix before we have MVCC catalog scans.  It seems that any other fix is
 going to be hopelessly ugly.

I feel like we should be able to do better than what we have now, at
least.  Using ShareLock prevented the specific case that we were
experiencing and is therefore MUCH better (for us, anyway) than
released versions where we ran into the error on a regular basis.

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] [PATCH]Tablesample Submission

2013-01-17 Thread Simon Riggs
On 17 January 2013 18:32, Josh Berkus j...@agliodbs.com wrote:
 On 11/04/2012 07:22 PM, Qi Huang wrote:
 Dear hackers, sorry for not replying to the patch review. I didn't see the
 review until recently, as my mailbox is full of Postgres mail and I didn't
 notice the one addressed to me (a mailbox configuration problem on my
 part). I am still kind of busy with my university final year project. I
 shall not have time to work on updating the patch until this semester
 finishes, which is December. I will work on it then. Thanks.
 Best Regards, Huang Qi Victor, Computer Science, National University of
 Singapore

 Did you ever do the update of the patch?

We aren't just waiting for a rebase; we're waiting for Hitoshi's
comments to be addressed.

I would add to them by saying I am very uncomfortable with the idea of
allowing a TABLESAMPLE clause on an UPDATE or a DELETE. If you really
want that you can use a sub-select.

Plus the patch contains zero documentation.

So I can't see this going anywhere for 9.3. I've moved it to CF1 of
9.4, marked "Waiting on Author".

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: optimized DROP of multiple tables within a transaction

2013-01-17 Thread Alvaro Herrera
Tomas Vondra wrote:
 Hi,
 
 attached is a patch that improves performance when dropping multiple
 tables within a transaction. Instead of scanning the shared buffers for
 each table separately, the patch evicts all the tables
 in a single pass through shared buffers.

Made some tweaks and pushed (added comments to new functions, ensure
that we never try to palloc(0), renamed DropRelFileNodeAllBuffers to
plural, made the use bsearch logic a bit simpler).

 Our system creates a lot of working tables (even 100,000) and we need
 to perform garbage collection (dropping obsolete tables) regularly. This
 often took ~ 1 hour, because we're using big AWS instances with lots of
 RAM (which tends to be slower than RAM on bare hw). After applying this
 patch and dropping tables in groups of 100, the gc runs in less than 4
 minutes (i.e. a 15x speed-up).

I'm curious -- why would you drop tables in groups of 100 instead of
just doing the 100,000 in a single transaction?  Maybe that's faster
now, because you'd do a single scan of the buffer pool instead of 1000?
(I'm assuming that "in groups of" means you do each group in a separate
transaction)
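
(For reference, the single-pass approach amounts to roughly the
following sketch; buffer-header locking and local relations are
deliberately elided here, and InvalidateBuffer is bufmgr-internal:)

    /*
     * Sort the dropped relfilenodes once, then make a single sweep over
     * shared buffers, using bsearch() per buffer.  The cost becomes one
     * pass of O(NBuffers log n) rather than n separate O(NBuffers)
     * scans.
     */
    qsort(nodes, n, sizeof(RelFileNode), rnode_comparator);

    for (i = 0; i < NBuffers; i++)
    {
        volatile BufferDesc *bufHdr = &BufferDescriptors[i];

        if (bsearch((const void *) &bufHdr->tag.rnode,
                    nodes, n, sizeof(RelFileNode),
                    rnode_comparator) != NULL)
            InvalidateBuffer(bufHdr);
    }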

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Removing PD_ALL_VISIBLE

2013-01-17 Thread Jeff Davis
On Thu, 2013-01-17 at 10:39 -0800, Jeff Davis wrote:
 On Thu, 2013-01-17 at 15:25 +0530, Pavan Deolasee wrote:
  Now that I look at the patch, I wonder if there is another fundamental
  issue with the patch. Since the patch removes WAL logging for the VM
  set operation, this can happen:
  
 Thank you. 

New patch attached with simple WAL logging.

Regards,
Jeff Davis



rm-pd-all-visible-20130117.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] could not create directory ...: File exists

2013-01-17 Thread Tom Lane
Stephen Frost sfr...@snowman.net writes:
 * Tom Lane (t...@sss.pgh.pa.us) wrote:
 Or maybe we should just write this off as a case we can't realistically
 fix before we have MVCC catalog scans.  It seems that any other fix is
 going to be hopelessly ugly.

 I feel like we should be able to do better than what we have now, at
 least.  Using ShareLock prevented the specific case that we were
 experiencing and is therefore MUCH better (for us, anyway) than
 released versions where we ran into the error on a regular basis.

If it actually *reliably* fixed the problem, rather than just reducing
the probabilities, that would mean that the updates your other sessions
were doing didn't release RowExclusiveLock early.  Have you dug into the
code to see if that's true?  (Or even more to the point, if it's still
true in HEAD?  I have no idea if all the recent refactoring has changed
this behavior for GRANT.)

My thought about a localized fix is similar to Andres' --- maybe we
could take a snapshot and use a real MVCC scan.
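
(Something like this sketch, if the problem scan is the pg_tablespace
seqscan in createdb(); the per-tablespace copy step is elided:)

    /*
     * Sketch: scan pg_tablespace under a registered MVCC snapshot
     * instead of SnapshotNow, so a concurrent commit can't make us see
     * a row twice, or miss it entirely, while we copy subdirectories
     * between tuples.
     */
    Relation     rel = heap_open(TableSpaceRelationId, AccessShareLock);
    Snapshot     snapshot = RegisterSnapshot(GetLatestSnapshot());
    HeapScanDesc scan = heap_beginscan(rel, snapshot, 0, NULL);
    HeapTuple    tuple;

    while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
    {
        Oid     spcoid = HeapTupleGetOid(tuple);

        /* copy the template database's directory in tablespace spcoid */
    }

    heap_endscan(scan);
    UnregisterSnapshot(snapshot);
    heap_close(rel, AccessShareLock);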

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: optimized DROP of multiple tables within a transaction

2013-01-17 Thread Tom Lane
Alvaro Herrera alvhe...@2ndquadrant.com writes:
 Made some tweaks and pushed (added comments to new functions, ensure
 that we never try to palloc(0), renamed DropRelFileNodeAllBuffers to
 plural, made the use bsearch logic a bit simpler).

FWIW, there's nothing particularly wrong with palloc(0) ...

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Event Triggers: adding information

2013-01-17 Thread Simon Riggs
On 17 January 2013 16:15, Robert Haas robertmh...@gmail.com wrote:

 As a further example, suppose that in 9.4 (or 9.17) we add a command
 DROP TABLES IN SCHEMA fred WHERE name LIKE 'bob%'.  Well, the
 logging trigger still just works (because it's only writing the
 statement, without caring about its contents).  And the replication
 trigger still just works too (because it will still get a callback for
 every table that gets dropped, even though it knows nothing about the
 new syntax).  That's very powerful.  Of course, there will still be
 cases where things have to change, but you can minimize it by clean
 definitions.

 It pains me that I've evidently failed to communicate this concept
 clearly despite a year or more of trying.  Does that make sense?  Is
 there some way I can make this more clear?  The difference seems very
 clear-cut to me, but evidently I'm not conveying it well.

 That problem is going to happen in some way or another. It's not about
 avoiding the problem completely, but choosing a set of compromises. I'm
 yet to understand what your set of compromises is in detailed enough
 practical implementation terms (not just "it's another set of events")
 and how it's a better compromise over all (I trust your judgement on
 that, I still want to understand).

 After all, the whole point of the Event Trigger patch is to expose some
 level of implementation details.

 I don't really agree with that.  I think the point is to expose what
 the system is doing to the DBA.  I'm OK with exposing the fact that
 creating a table with a serial column also creates a sequence.  There
 is no problem with that - it's all good.  What we DON'T want to do is
 set up a system where the fact that it creates a sequence is exposed
 because that happens to go through ProcessUtility() and the fact that
 it creates a constraint is not because that happens not to go through
 ProcessUtility().  That is not the sort of quality that our users have
 come to expect from PostgreSQL.


The point, i.e. the main use case, is to replicate the DDL in a useful form.

Discussions and reasoning need to focus on the main use case, not on
weird futures or qualitative points. Yes, the discussion has gone on
for years and it really should come to a useful conclusion sometime
soon.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Teaching pg_receivexlog to follow timeline switches

2013-01-17 Thread Alvaro Herrera
Robert Haas wrote:

 Actually, I'm really glad to see all the work you've done to improve
 the way that some of these scenarios work and eliminate various bugs
 and other surprising failure modes over the last couple of months.
 It's great stuff.

+1

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] could not create directory ...: File exists

2013-01-17 Thread Tom Lane
I wrote:
 Stephen Frost sfr...@snowman.net writes:
 I feel like we should be able to do better than what we have now, at
 least.  Using ShareLock prevented the specific case that we were
 experiencing and is therefore MUCH better (for us, anyway) than
 released versions where we ran into the error on a regular basis.

 My thought about a localized fix is similar to Andres' --- maybe we
 could take a snapshot and use a real MVCC scan.

BTW, it strikes me that the reason this particular SnapshotNow scan is a
big problem for you is that, because we stop and copy a subdirectory
after each tuple, the elapsed time for the seqscan is several orders of
magnitude more than it is for most (perhaps all) other cases where
SnapshotNow is used.  So the window for trouble with concurrent updates
is that much bigger.  This seems to provide a reasonably principled
argument why we might want to fix this case with a localized use of an
MVCC scan before we have such a fix globally.

Not that I wouldn't still want to mark it with a "REVERT_ME_SOMEDAY"
kind of annotation.  We know we need the SnapshotNow scan fix.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

2013-01-17 Thread Heikki Linnakangas

On 17.01.2013 20:08, Andres Freund wrote:

On 2013-01-18 03:05:47 +0900, Fujii Masao wrote:

I encountered the problem that the timeline switch is not performed as expected.
I set up one master, one standby and one cascade standby. All the servers
share the archive directory. restore_command is specified in the recovery.conf
in those two standbys.

I shut down the master, and then promoted the standby. In this case, the
cascade standby should switch to new timeline and replication should be
successfully restarted. But the timeline was never changed, and the following
log messages were kept outputting.

sby2 LOG:  restarted WAL streaming at 0/300 on timeline 1
sby2 LOG:  replication terminated by primary server
sby2 DETAIL:  End of WAL reached on timeline 1
sby2 LOG:  restarted WAL streaming at 0/300 on timeline 1
sby2 LOG:  replication terminated by primary server
sby2 DETAIL:  End of WAL reached on timeline 1
sby2 LOG:  restarted WAL streaming at 0/300 on timeline 1
sby2 LOG:  replication terminated by primary server
sby2 DETAIL:  End of WAL reached on timeline 1



That's after the commit or before? Because in passing I think I
noticed/fixed a bug that could cause exactly that problem...


I think I broke that with the "teach pg_receivexlog to cross timelines" 
patch. Will take a look...


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Removing PD_ALL_VISIBLE

2013-01-17 Thread Simon Riggs
On 17 January 2013 15:14, Heikki Linnakangas hlinnakan...@vmware.com wrote:
 On 17.01.2013 16:53, Robert Haas wrote:

 On Thu, Jan 17, 2013 at 3:49 AM, Pavan Deolasee
 pavan.deola...@gmail.com  wrote:

 May be you've already addressed that concern with the proven
 performance numbers, but I'm not sure.


 It would be nice to hear what Heikki's reasons were for adding
 PD_ALL_VISIBLE in the first place.


 The idea was to avoid clearing the bit in the VM page on every update, when
 the bit is known to not be set, i.e. when the PD_ALL_VISIBLE flag is not set.
 I assumed the traffic and contention on the VM page would be a killer
 otherwise. I don't remember if I ever actually tested that though. Maybe I
 was worrying about nothing and hitting the VM page on every update is ok.

Presumably we remember the state of the VM so we can skip the re-visit
after every write?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Removing PD_ALL_VISIBLE

2013-01-17 Thread Jeff Davis
On Thu, 2013-01-17 at 17:14 +0200, Heikki Linnakangas wrote:
 I don't remember if I ever actually tested that 
 though. Maybe I was worrying about nothing and hitting the VM page on 
 every update is ok.

I tried, but was unable to show really anything at all, even without
keeping the VM page pinned. I think the bottleneck is elsewhere,
although I am keeping the page pinned in this patch to prevent it from
becoming a bottleneck.
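
(The pin-caching pattern in question is roughly this; the helper name
is made up, but visibilitymap_pin_ok() and visibilitymap_pin() are the
existing interfaces:)

    /*
     * Sketch: keep the VM buffer pinned across operations, re-pinning
     * only when we move to a heap block covered by a different VM page.
     */
    static void
    pin_vm_for_block(Relation rel, BlockNumber heapBlk, Buffer *vmbuf)
    {
        if (!BufferIsValid(*vmbuf) ||
            !visibilitymap_pin_ok(heapBlk, *vmbuf))
        {
            if (BufferIsValid(*vmbuf))
                ReleaseBuffer(*vmbuf);
            visibilitymap_pin(rel, heapBlk, vmbuf);
        }
    }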

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH]Tablesample Submission

2013-01-17 Thread Josh Berkus

 So I can't see this going anywhere for 9.3. I've moved it to CF1 of
 9.4, marked "Waiting on Author".

Agreed.  I wish I'd noticed that it got lost earlier.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.3 Pre-proposal: Range Merge Join

2013-01-17 Thread Stefan Keller
Hi Jeff

2012/4/19 Jeff Davis pg...@j-davis.com:
 On Wed, 2012-04-18 at 01:21 -0400, Tom Lane wrote:
(...)
 This is just handwaving of course.  I think some digging in the
 spatial-join literature would likely find ideas better than any of
 these.

 I will look in some more detail. The merge-like approach did seem to be
 represented in the paper referenced by Alexander (the external plane
 sweep), but it also refers to several methods based on partitioning.

 I'm beginning to think that more than one of these ideas has merit.

 Regards,
 Jeff Davis

I'm perhaps really late in this discussion, but I was just made aware
of it via the tweet from Josh Berkus about "PostgreSQL 9.3: Current
Feature Status".

What is the reason to dig into spatial joins when there is PostGIS,
a bullet-proof and fast implementation?

Yours, Stefan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Event Triggers: adding information

2013-01-17 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 On 17 January 2013 16:15, Robert Haas robertmh...@gmail.com wrote:
 It pains me that I've evidently failed to communicate this concept
 clearly despite a year or more of trying.  Does that make sense?  Is
 there some way I can make this more clear?  The difference seems very
 clear-cut to me, but evidently I'm not conveying it well.

 The point, i.e. the main use case, is to replicate the DDL in a useful form.

 Discussions and reasoning need to focus on the main use case, not on
 weird futures or qualitative points.

I have to completely disagree with that.  If we don't want PostgreSQL
to soon subside into an "unfixable morass", as I think Brooks puts it,
we must *not* simply patch things in a way that expediently provides
an approximation of some desired feature.  We have to consider, and put
substantial weight on, having a clean and maintainable system design.

This is particularly the case if we're talking about a feature that
is going to expose backend internals to users to any degree.  We're
either going to be constrained to not change those internals any more,
or expect to break users' applications when we do change them, and
neither result is very appealing.  Especially not when we're talking
about the structure of Postgres' DDL code, which can most charitably
be described as "it just grew", not as something that has any clear
architecture to it.

Now if we could quantify these things, that would be great, but
AFAICS "qualitative points" is all we've got to go on.

I think that we're not realistically going to be able to introduce
event triggers in very many of the places we'd like to have them
without first doing a lot of fundamental refactoring.  And that is
hard, dangerous, slow work that tends to break things in itself.
As we've been repeatedly reminded in connection with KaiGai-san's
refactoring patches.

So my opinion is that most of what we'd like to have here is material
for 9.4 or 9.5 or even further out.  If we can get the event trigger
catalog infrastructure and some *very* basic events, like
end-of-toplevel-command, into place for 9.3, we'll have done well.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Event Triggers: adding information

2013-01-17 Thread Robert Haas
On Thu, Jan 17, 2013 at 12:06 PM, Dimitri Fontaine
dimi...@2ndquadrant.fr wrote:
 Ok. Will prepare a non-controversial patch for ddl_command_end.

Thanks.  I will make a forceful effort to review that in a timely
fashion when it's posted.

 I think this is a bad idea, not only because, as I said before, it
 exposes internal implementation details, but also because it's
 probably not safe.  As Noah (I believe, or maybe Tom), observed in a
 previous post, we've gone to a lot of work to make sure that nothing
 you do inside a trigger can crash the server, corrupt the catalogs,
 trigger elog() messages, etc.  That applies in spades to DDL.

 I naively thought that doing CommandCounterIncrement() before running the
 event trigger would go a long enough way towards making the operation
 safe when the current code is calling back into ProcessUtility().

It's something, but I'd be shocked if it covers all the bases.

Let me try to give a concrete example of how I think another firing
point could be made to work along the lines I'm suggesting.  I'm not
going to use DROP as an example because I think upon reflection that
will be a particularly challenging case to make work correctly.
Instead, let me use the example of ALTER.

Goal: Every time an ALTER command is used on an object *that actually
exists*, we will call some user-defined function and pass the object
type, the OID of the object, and some details about what sort of
alteration the user has requested.

Where shall we put the code to do this?  Clearly, ProcessUtility is
too early, because at that point we have not done the name lookup,
locked the object, or checked permissions.  So the OID of the target
object is not and cannot be known at that point.  We cannot do an
extra name lookup at that point without taking a lock, because the
answer might change, or, due to non-MVCC catalog access, the lookup
might fail to find an object altogether.  And we can't take a lock
without checking permissions first.  And the permissions-checking
logic can't be done inside ProcessUtility(), because it's object-type
specific and we don't want to duplicate it.

So, instead, what we need to do is go into each function that
implements ALTER, and modify it so that after it does the dance where
we check permissions, take locks, and look up names, we call the
trigger function.  That's a bit of a nuisance, because we probably
have a bunch of call sites to go fix, but in theory it should be
doable.  The problem of course, is that it's not intrinsically safe to
call user-defined code there.  If it drops the object and creates a
new one, say, hilarity will probably ensue.

But that should be a fixable problem.  The only things we've done at
this point are check permissions, lock, and look up names.  I think we
can assume that if the event trigger changes permissions, it's safe
for the event trigger to ignore that.  The trigger cannot have
released the lock, so we're OK there.  The only thing it can really
have done that's problematic is change the results of the name lookup,
say by dropping or renaming the object.  But that's quite simple to
protect against: after firing the trigger, we do a
CommandCounterIncrement() and repeat the name lookup, and if we get a
different answer, then we bail out with an error and tell the user
they're getting far too cute.  Then we're safe.
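
(In code, that recheck might be as small as this sketch;
EventTriggerAlterStart is a made-up firing point, lockmode comes from
the surrounding command, and the rest are existing interfaces:)

    /*
     * Sketch: fire a hypothetical event trigger after name lookup and
     * locking, then verify the lookup still resolves to the same
     * object.
     */
    Oid         relid = RangeVarGetRelid(stmt->relation, lockmode, false);

    EventTriggerAlterStart(relid);      /* user-defined code runs here */

    CommandCounterIncrement();
    if (RangeVarGetRelid(stmt->relation, NoLock, true) != relid)
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("event trigger changed the target of the command")));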

Now, there is a further problem: all of the information about the
ALTER statement that we want to provide to the trigger may not be
available at that point.  Simple cases like ALTER .. RENAME won't
require anything more than that, but things like ALTER TABLE .. ADD
FOREIGN KEY get tricky, because while at this point we have a solid
handle on the identity of the relation to which we're adding the
constraint, we do NOT yet have knowledge of or a lock on the TARGET
relation, whose OID could still change on us up to much later in the
computation.  To get reliable information about *that*, we'll need to
refactor the sequence of events inside ALTER TABLE.

Hopefully you can see what I'm driving toward here.  In a design like
this, we can pretty much prove - after returning from the event
trigger - that we're in a state no different from what would have been
created by executing a series of commands - lock table, then SELECT
event_trigger_func(), then the actual ALTER in sequence.  Maybe
there's a flaw in that analysis - that's what patch review is for -
but it sounds pretty darn tight to me.

I could write more, but I have to take the kids to dance class, and
this email may be too long already.  Does that help at all to clarify
the lines along which I am thinking?  Note that the above is clearly
not ddl_command_start but something else, and the different firing
point and additional safety cross-checks are what enable us to pass
additional information to the trigger without fear of breaking either
the trigger or the world.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [sepgsql 1/3] add name qualified creation label

2013-01-17 Thread Kohei KaiGai
2013/1/16 Robert Haas robertmh...@gmail.com:
 On Tue, Jan 15, 2013 at 3:02 PM, Kohei KaiGai kai...@kaigai.gr.jp wrote:
 This patch adds to sepgsql the feature of name-qualified creation labels.

 Background: on creation of a certain database object, sepgsql assigns
 a default security label according to the security policy, which has a
 set of rules to determine the label of a new object.
 Usually, a new object inherits its parent object's label (e.g., a table
 is the parent of a column), unless the policy has a particular
 type_transition rule for it.
 A type_transition rule allows a particular security label to be
 described as the default label of a new object for a given pair of
 client and parent object.
 For example, the rule below says that columns constructed under a table
 labeled sepgsql_table_t by a client with staff_t will have
 staff_column_t, instead of the table's label.
   TYPE_TRANSITION staff_t sepgsql_table_t:db_column staff_column_t;

 Recently, this rule was enhanced to take a 5th argument for the object
 name, which enables exceptional cases to be handled specially.
 It was originally designed to describe default security labels for
 files in the /etc directory, because many applications put their own
 configuration files there, and the traditional type_transition rule was
 too weak to describe all the needed defaults.
 On the other hand, we can port this concept to the database system as
 well. One example is temporary objects constructed under the pg_temp
 schema. If we could assign a special default label to these, it would
 allow unprivileged users (who cannot create persistent tables) to
 create temporary tables with no risk of information leaking to other
 users. We might also be able to assign a special security label to
 system columns, and so on.
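
(With the name-qualified form, a rule for the pg_temp case might look
like the following; the schema-related type names are hypothetical:)

   TYPE_TRANSITION staff_t sepgsql_schema_t:db_schema staff_temp_schema_t "pg_temp";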

 From the perspective of the implementation on the sepgsql side, all we
 need to do is replace the old security_compute_create_raw() interface
 with the new security_compute_create_name_raw().
 If there are no name-qualified type_transition rules, it behaves like
 the existing API, so there is no backward-compatibility issue.

 This patch can be applied on the latest master branch.

 This looks OK on a quick once-over, but should it update the
 documentation somehow?

The documentation does not describe type_transition rules in much
detail, so I just modified the relevant description a bit to mention
that a type_transition rule may optionally take the new object name as
an argument.
In addition, I had forgotten to update the minimum required version of
libselinux (this also requires a change to the configure script).
These two are the points updated in the documentation.

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp


sepgsql-v9.3-creation-label-with-name.v2.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Event Triggers: adding information

2013-01-17 Thread Dimitri Fontaine
Robert Haas robertmh...@gmail.com writes:
 Goal: Every time an ALTER command is used on an object *that actually
 exists*, we will call some user-defined function and pass the object
 type, the OID of the object, and some details about what sort of
 alteration the user has requested.

Ok, in current terms of the proposed patch, it's a ddl_command_end event
trigger, and you can choose when to fire it in utility.c. The current
place is chosen to be just after the AlterTable() call, because that
sounded logical if you believe in CommandCounterIncrement.

 So, instead, what we need to do is go into each function that
 implements ALTER, and modify it so that after it does the dance where
 we check permissions, take locks, and look up names, we call the
 trigger function.  That's a bit of a nuisance, because we probably
 have a bunch of call sites to go fix, but in theory it should be
 doable.  The problem of course, is that it's not intrinsically safe to
 call user-defined code there.  If it drops the object and creates a
 new one, say, hilarity will probably ensue.

You're trying to find a dangerous point at which to fire the trigger here,
rather than trying to solve the problem you're describing. For the
problem you're describing, either you want the Object Specific
Information and the trigger fires at ddl_command_end, or you don't and
you can register your trigger at ddl_command_start.

If you want something else, well, it won't ship in 9.3.

 Now, there is a further problem: all of the information about the
 ALTER statement that we want to provide to the trigger may not be
 available at that point.  Simple cases like ALTER .. RENAME won't
 require anything more than that, but things like ALTER TABLE .. ADD
 FOREIGN KEY get tricky, because while at this point we have a solid
 handle on the identity of the relation to which we're adding the
 constraint, we do NOT yet have knowledge of or a lock on the TARGET
 relation, whose OID could still change on us up to much later in the
 computation.  To get reliable information about *that*, we'll need to
 refactor the sequence of events inside ALTER TABLE.

The only valid answer here has already been given by Tom. You can only
provide the information if it's available in the catalogs. So again,
it's a ddl_command_end event. It cannot happen before.

 Hopefully you can see what I'm driving toward here.  In a design like
 this, we can pretty much prove - after returning from the event
 trigger - that we're in a state no different from what would have been
 created by executing a series of commands - lock table, then SELECT
 event_trigger_func(), then the actual ALTER in sequence.  Maybe
 there's a flaw in that analysis - that's what patch review is for -
 but it sounds pretty darn tight to me.

The only case when we need to do a lookup BEFORE actually running the
command is when that command is a DROP, because that's the only time
when the information we want is still in the catalogs.

So that's the only case where we do a double object lookup, and we take
a ShareUpdateExclusiveLock lock on the object when doing so, so that we
can't lose it from another concurrent activity.
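
(Concretely, that pre-DROP lookup is roughly this sketch; objname and
objargs are the usual name lists from the parse tree:)

    /*
     * Sketch: resolve the object about to be dropped, taking
     * ShareUpdateExclusiveLock so a concurrent session cannot drop or
     * rename it between this lookup and the actual DROP.
     */
    ObjectAddress addr;
    Relation      rel = NULL;

    addr = get_object_address(stmt->removeType, objname, objargs,
                              &rel, ShareUpdateExclusiveLock,
                              stmt->missing_ok);

    /* addr.classId and addr.objectId feed the event-trigger info */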

 Note that the above is clearly
 not ddl_command_start but something else, and the different firing
 point and additional safety cross-checks are what enable us to pass
 additional information to the trigger without fear of breaking either
 the trigger or the world.

That's the reason why we are only going to have ddl_command_start and
ddl_command_end in 9.3, and also the reason why the extra information is
only provided when this very information still exists in the catalogs at
the moment when we want to expose it.

And that's the reason why ddl_command_trace is attractive too: you won't
ever get the object OID, name, and schema from an event where the object
does not exist, and maybe you want an easy way to tell the system
you're interested in an event *with* the information, and be done with
it.

Again, I don't care much about keeping ddl_command_trace because it's
not of much interest for implementing DDL support in logical
replication or my other use cases, but still, I thought I would explain
it now that we're talking about it.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] HS locking broken in HEAD

2013-01-17 Thread Andres Freund
Hi,


While checking whether I could reproduce the replication delay reported
by Michael Paquier I found this very nice tidbit:

In a pretty trivial replication setup of only streaming replication I
can currently easily reproduce this:

standby# BEGIN; SELECT * FROM foo;
BEGIN
 id | data 
----+------
(0 rows)

standby=# SELECT relation::regclass, locktype, mode FROM pg_locks;
 relation |  locktype  |      mode       
----------+------------+-----------------
 pg_locks | relation   | AccessShareLock
 foo_pkey | relation   | AccessShareLock
 foo      | relation   | AccessShareLock
          | virtualxid | ExclusiveLock
          | virtualxid | ExclusiveLock
(5 rows)

primary# DROP TABLE foo;
DROP TABLE


standby# SELECT relation::regclass, pid, locktype, mode FROM pg_locks;
 relation |  pid  |  locktype  |        mode         
----------+-------+------------+---------------------
 pg_locks | 28068 | relation   | AccessShareLock
 foo_pkey | 28068 | relation   | AccessShareLock
 foo      | 28068 | relation   | AccessShareLock
          | 28068 | virtualxid | ExclusiveLock
          | 28048 | virtualxid | ExclusiveLock
 foo      | 28048 | relation   | AccessExclusiveLock
(6 rows)

standby# SELECT * FROM foo;
ERROR:  relation "foo" does not exist
LINE 1: select * from foo;
                      ^
Note the conflicting locks held on relation foo by 28048 and 28068.

I don't immediately know which patch to blame here? Looks a bit like
broken fastpath locking, but I don't immediately see anything in
c00dc337b87 that should cause this?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

2013-01-17 Thread Heikki Linnakangas

On 17.01.2013 21:57, Heikki Linnakangas wrote:

On 17.01.2013 20:08, Andres Freund wrote:

On 2013-01-18 03:05:47 +0900, Fujii Masao wrote:

I encountered the problem that the timeline switch is not performed
as expected.
I set up one master, one standby and one cascade standby. All the
servers
share the archive directory. restore_command is specified in the
recovery.conf
in those two standbys.

I shut down the master, and then promoted the standby. In this case, the
cascade standby should switch to new timeline and replication should be
successfully restarted. But the timeline was never changed, and the
following
log messages were kept outputting.

sby2 LOG: restarted WAL streaming at 0/300 on timeline 1
sby2 LOG: replication terminated by primary server
sby2 DETAIL: End of WAL reached on timeline 1
sby2 LOG: restarted WAL streaming at 0/300 on timeline 1
sby2 LOG: replication terminated by primary server
sby2 DETAIL: End of WAL reached on timeline 1
sby2 LOG: restarted WAL streaming at 0/300 on timeline 1
sby2 LOG: replication terminated by primary server
sby2 DETAIL: End of WAL reached on timeline 1



That's after the commit or before? Because in passing I think I
noticed/fixed a bug that could cause exactly that problem...


I think I broke that with the "teach pg_receivexlog to cross timelines"
patch. Will take a look...


Ok, there were a couple of issues. First, as I suspected above, I added a 
new result set after copy has ended in START_STREAMING command, but 
forgot to teach walreceiver about it. Fixed that.


Secondly, there's an interesting interaction between the new xlogreader 
code and streaming replication and timeline switches:


The first call to the page_read callback in xlogreader is always made 
with size SizeOfXLogRecord, counting from the beginning of the page. 
That's bogus; there can be no WAL record in the very beginning of page, 
because of the page header, so I think that was supposed to be 
SizeXLogShortPHD. But that won't fix the issue.


The problem is that XLogPageRead callback uses the page address and 
requested length to decide what timeline to stream from. For example, 
imagine that there's a timeline switch at offset 1000 in a page, and we 
have already read all the WAL up to that point. When xlogreader first 
asks to fetch the first 32 bytes from the page (the bogus 
SizeOfXLogRecord), XLogPageRead deduces that that byte range is still on 
the old timeline, and starts streaming from the old timeline. Next, 
xlogreader needs the rest of the page, up to 1000 + SizeOfPageHeader, to 
read the first record it's actually interested in, XLogPageRead will 
return an error because that range is not on the timeline that's 
currently streamed. And we loop back to retry, and run into the same 
problem again.


This interaction is a bit too subtle for my taste, but the 
straightforward fix is to just modify xlogreader so that the first 
read_page call requests all the bytes up to the record we're actually 
interested in. That seems like a smart thing to do anyway.
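
(The fix amounts to clamping the first request to cover the record's
offset, something like this sketch; ReadPageInternal is xlogreader's
internal page fetcher, and state/RecPtr are the surrounding
XLogReadRecord variables:)

    /*
     * Sketch: instead of asking for a fixed SizeOfXLogRecord from the
     * start of the page, ask for everything up to and including the
     * header of the record we actually want.
     */
    uint32      targetRecOff = RecPtr % XLOG_BLCKSZ;
    XLogRecPtr  targetPagePtr = RecPtr - targetRecOff;
    int         readOff;

    readOff = ReadPageInternal(state, targetPagePtr,
                               Min(targetRecOff + SizeOfXLogRecord,
                                   XLOG_BLCKSZ));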


I committed a patch for that second issue too, but please take a look in 
case you come up with a better idea.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Event Triggers: adding information

2013-01-17 Thread Dimitri Fontaine
Tom Lane t...@sss.pgh.pa.us writes:
 I have to completely disagree with that.  If we don't want PostgreSQL
 to soon subside into an "unfixable morass", as I think Brooks puts it,
 we must *not* simply patch things in a way that expediently provides
 an approximation of some desired feature.  We have to consider, and put
 substantial weight on, having a clean and maintainable system design.

When in Ottawa next May, I will have to buy you a glass of your favorite
wine and explain to you my main use case and vision. You will then
realize that all this time I've been spending on Event Triggers is a
sideways route to building something else entirely, because I know that it's the
only way to get the feature I want in core PostgreSQL.

So yes, I know about the clean and maintainable system design
constraint. I've a lot to learn still, and I appreciate your help very
much. Rest assured that the path you describe is the one I'm taking.

 I think that we're not realistically going to be able to introduce
 event triggers in very many of the places we'd like to have them
 without first doing a lot of fundamental refactoring.

We're only talking about ddl_command_start and ddl_command_end now. The
table_rewrite event is still on the horizon, but is not realistic to get
in 9.3 anymore, I think :(

So we're talking about adding some call points only in utility.c, only
before or after a ddl command is run, including nested sub-commands.

 So my opinion is that most of what we'd like to have here is material
 for 9.4 or 9.5 or even further out.  If we can get the event trigger
 catalog infrastructure and some *very* basic events, like
 end-of-toplevel-command, into place for 9.3, we'll have done well.

My feeling is that I'm sending patches that only implement the *very*
basic events here, and nothing more. Most of the heat of this thread
came from a discussion about a feature that's very hard to design and
that is not in my patch series to review.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ALTER command reworks

2013-01-17 Thread Alvaro Herrera
Kohei KaiGai wrote:

 This attached patch is the rebased one towards the latest master branch.

Great, thanks.  I played with it a bit and it looks almost done to me.
The only issue I can find is that it lets you rename an aggregate by
using ALTER FUNCTION, which is supposed to be forbidden.  (Funnily
enough, renaming a non-agg function with ALTER AGGREGATE does raise an
error).  Didn't immediately spot the right place to add a check.

I think these two error cases ought to have regression tests of their
own.

I attach a version with my changes.

 I noticed that your proposed design also allows unifying the ALTER code
 of OPERATOR CLASS / FAMILY, which take an index access method in
 addition to name and namespace.

Yeah, I had noticed that and was planning on implementing it later.
Thanks for saving me the work.

 So, AlterObjectRename_internal and AlterObjectNamespace_internal have
 four pieces of special-case handling to check for name / namespace conflicts.

Right.  (I wonder if it would make sense to encapsulate all those checks
in a single function for both to call.)

 The latest master looks up the syscache within the special-case handling
 block, as follows:
 [code]
 
 But we already pull the relevant tuple from the syscache at the top of
 AlterObjectNamespace_internal, so I removed the cache lookup code here.

Silly me.  Thanks.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


pgsql-v9.3-alter-reworks.3-rename.v10.patch.gz
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

2013-01-17 Thread Andres Freund
On 2013-01-17 23:49:22 +0200, Heikki Linnakangas wrote:
 On 17.01.2013 21:57, Heikki Linnakangas wrote:
 On 17.01.2013 20:08, Andres Freund wrote:
 On 2013-01-18 03:05:47 +0900, Fujii Masao wrote:
 I encountered the problem that the timeline switch is not performed
 as expected.
 I set up one master, one standby and one cascade standby. All the
 servers
 share the archive directory. restore_command is specified in the
 recovery.conf
 in those two standbys.
 
 I shut down the master, and then promoted the standby. In this case, the
 cascade standby should switch to new timeline and replication should be
 successfully restarted. But the timeline was never changed, and the
 following
 log messages were kept outputting.
 
 sby2 LOG: restarted WAL streaming at 0/300 on timeline 1
 sby2 LOG: replication terminated by primary server
 sby2 DETAIL: End of WAL reached on timeline 1
 sby2 LOG: restarted WAL streaming at 0/300 on timeline 1
 sby2 LOG: replication terminated by primary server
 sby2 DETAIL: End of WAL reached on timeline 1
 sby2 LOG: restarted WAL streaming at 0/300 on timeline 1
 sby2 LOG: replication terminated by primary server
 sby2 DETAIL: End of WAL reached on timeline 1
 
 
 That's after the commit or before? Because in passing I think I
 noticed/fixed a bug that could cause exactly that problem...
 
 I think I broke that with the "teach pg_receivexlog to cross timelines"
 patch. Will take a look...

 Ok, there were a couple of issues. First, as I suspected above, I added a new
 result set after copy has ended in START_STREAMING command, but forgot to
 teach walreceiver about it. Fixed that.

 Secondly, there's an interesting interaction between the new xlogreader code
 and streaming replication and timeline switches:

 The first call to the page_read callback in xlogreader is always made with
 size SizeOfXLogRecord, counting from the beginning of the page. That's
 bogus; there can be no WAL record at the very beginning of a page, because of
 the page header, so I think that was supposed to be SizeOfXLogShortPHD. But
 that won't fix the issue.

Yea, that's clearly a typo.

 The problem is that XLogPageRead callback uses the page address and
 requested length to decide what timeline to stream from. For example,
 imagine that there's a timeline switch at offset 1000 in a page, and we have
 already read all the WAL up to that point. When xlogreader first asks to
 fetch the first 32 bytes from the page (the bogus SizeOfXLogRecord),
 XLogPageRead deduces that that byte range is still on the old timeline, and
 starts streaming from the old timeline. Next, xlogreader needs the rest of
 the page, up to 1000 + SizeOfPageHeader, to read the first record it's
 actually interested in, XLogPageRead will return an error because that range
 is not on the timeline that's currently streamed. And we loop back to retry,
 and run into the same problem again.

 This interaction is a bit too subtle for my taste, but the straightforward
 fix is to just modify xlogreader so that the first read_page call requests
 all the bytes up to the record we're actually interested in. That seems like a
 smart thing to do anyway.

Yuck. Reading from targetRecOff onwards seems like a good idea, yes. But
ISTM that the code isn't really safe now: assume targetRecOff is 0
(which is valid); then we read up to SizeOfXLogRecord but assume that we
can read both the xlog record and the page header.
Then you also need to factor in that the page header can vary in size.

And yes, it's too subtle :(

Do you want to fix that or shall I?

Greetings,

Andres Freund

--
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Event Triggers: adding information

2013-01-17 Thread Simon Riggs
On 17 January 2013 20:24, Tom Lane t...@sss.pgh.pa.us wrote:
 Simon Riggs si...@2ndquadrant.com writes:
 On 17 January 2013 16:15, Robert Haas robertmh...@gmail.com wrote:
 It pains me that I've evidently failed to communicate this concept
 clearly despite a year or more of trying.  Does that make sense?  Is
 there some way I can make this more clear?  The difference seems very
 clear-cut to me, but evidently I'm not conveying it well.

 The point, i.e. the main use case, is to replicate the DDL in a useful 
 form.

 Discussions and reasoning need to focus on the main use case, not on
 weird futures or qualitative points.

 I have to completely disagree with that.  If we don't want PostgreSQL
 to soon subside into an "unfixable morass", as I think Brooks puts it,
 we must *not* simply patch things in a way that expediently provides
 an approximation of some desired feature.  We have to consider, and put
 substantial weight on, having a clean and maintainable system design.

So why does focusing on a use case make this turn into an unfixable
morass? The two things are completely unrelated.

If the patch is anywhere close to an unfixable morass, let's just
reject it now. But Robert's arguments were much more tenuous than
that.

My comments were in response to this:

 I don't really agree with that.  I think the point is to expose what
 the system is doing to the DBA.  I'm OK with exposing the fact that
 creating a table with a serial column also creates a sequence.  There
 is no problem with that - it's all good.  What we DON'T want to do is
 set up a system where the fact that it creates a sequence is exposed
 because that happens to go through ProcessUtility() and the fact that
 it creates a constraint is not because that happens not to go through
 ProcessUtility().  That is not the sort of quality that our users have
 come to expect from PostgreSQL.

The above functionality is sufficient to allow DDL replication. What
else do we want to do that requires some other capability?

Discussing things in terms of what we will actually use a feature for
is how we know we've got it right. And if it does that, we don't need
anything else.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Event Triggers: adding information

2013-01-17 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 Let me try to give a concrete example of how I think another firing
 point could be made to work along the lines I'm suggesting.
 [ snip description of how an event trigger might safely be fired just
 after identification and locking of the target object for an ALTER ]

Reading that and reflecting on my gripe about the lack of any clear
architecture for our DDL code, I had the beginnings of an idea.

Perhaps it would improve matters if we refactored DDL processing so that
there were separate parse analysis and execution phases, where parse
analysis is (perhaps among other responsibilities) responsible for
identifying and locking all objects to be used in the command.  I note
that locking the referenced tables is the responsibility of the parse
analysis step in DML processing, so there's solid precedent for this.
Also, we have some of this approach already for certain commands such
as CREATE TABLE, cf parse_utilcmd.c.

If we did that, then it'd be feasible to fire event triggers after the
parse analysis step, and the rechecking that Robert describes could be
encapsulated as "redo the parse analysis and see if the result changed".
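
(A purely hypothetical shape for such a split; none of these function
or type names exist today, this only illustrates the control flow:)

    /*
     * Parse analysis resolves names and takes locks; the trigger fires
     * on the resolved form; the recheck is "redo the analysis and
     * compare".
     */
    ResolvedAlterStmt *r1;
    ResolvedAlterStmt *r2;

    r1 = AnalyzeAlterStmt(parsetree);       /* name lookup + locking */
    EventTriggerFire(EVT_DDL_RESOLVED, r1); /* user code runs here */

    CommandCounterIncrement();
    r2 = AnalyzeAlterStmt(parsetree);       /* redo the parse analysis */
    if (!equalResolvedAlterStmt(r1, r2))
        elog(ERROR, "command target changed during event trigger");

    ExecAlterStmt(r2);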

It's not clear to me just how this ought to extend to the cascaded-DROP
or inherited-table-ALTER cases, but hey, it's only the beginnings of
an idea.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Event Triggers: adding information

2013-01-17 Thread Tom Lane
Dimitri Fontaine dimi...@2ndquadrant.fr writes:
 Tom Lane t...@sss.pgh.pa.us writes:
 I think that we're not realistically going to be able to introduce
 event triggers in very many of the places we'd like to have them
 without first doing a lot of fundamental refactoring.

 We're only talking about ddl_command_start and ddl_command_end now. The
 table_rewrite event is still on the horizon, but is not realistic to get
 in 9.3 anymore, I think :(

 So we're talking about adding some call points only in utility.c, only
 before or after a ddl command is run, including nested sub-commands.

Well, that's already a problem, because as Robert keeps saying, what
goes through utility.c and what doesn't is pretty random right at the
moment, and we shouldn't expose that behavior to users for fear of not
being able to change it later.

I think we could possibly ship event triggers defined as start and end
of a *top level* command and have them working reliably in 9.3.  If you
want users to be looking at subcommands, there is a lot more work to do
there than we have any chance of getting done for 9.3 (if it is to ship
in 2013, that is).

Alternatively, if you want to get something into 9.3 that has not
necessarily got a long-term-stable API, I'd be inclined to suggest that
we forget about a SQL-level event trigger facility for now, and just put
in some hook call points.  It's pretty well established that we're
willing to change the API seen by hook functions across major releases.

TBH this might be the best thing anyway if your long-term goals have to
do with replication, because it'd be a lot cheaper to get to the point
where you could write the replication code and see if it all actually
works.  I'm fairly concerned that we might spend many man-months getting
event triggers with definitions A,B,C into the system, only to find out
later that what is really needed for logical replication is definitions
D,E,F.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Event Triggers: adding information

2013-01-17 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 On 17 January 2013 20:24, Tom Lane t...@sss.pgh.pa.us wrote:
 My comments were in response to this

 I don't really agree with that.  I think the point is to expose what
 the system is doing to the DBA.  I'm OK with exposing the fact that
 creating a table with a serial column also creates a sequence.  There
 is no problem with that - it's all good.  What we DON'T want to do is
 set up a system where the fact that it creates a sequence is exposed
 because that happens to go through ProcessUtility() and the fact that
 it creates a constraint is not because that happens not to go through
 ProcessUtility().  That is not the sort of quality that our users have
 come to expect from PostgreSQL.

 The above functionality is sufficient to allow DDL replication. What
 else do we want to do that requires some other capability?

I think you and Robert are talking past each other.  Whether it is or
is not sufficient for DDL replication is not what he is concerned about;
what he's worried about is whether we can refactor the backend in future
without causing a user-visible change in the events that event triggers
see.  I think that that is an entirely valid concern, and it's not going
to go away on the strength of "but this is enough to let us do
replication".

The other point I'd make here is what I just said to Dimitri: it's far
from proven, AFAIK, that this actually *is* sufficient to allow DDL
replication.  I'd be a lot happier if we had a working proof of that
before we lock down a SQL-level facility that we're going to have a hard
time changing the details of.  Maybe we should just introduce some
C-level hooks and let you guys go off and code replication using the
hooks, before we put in a lot of expensive infrastructure that might
turn out to be Not Quite The Right Thing.  After we *know* the hooks
are in the right places and supply the right information, we can look at
replacing them with some more formalized facility.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Event Triggers: adding information

2013-01-17 Thread Dimitri Fontaine
Tom Lane t...@sss.pgh.pa.us writes:
 Well, that's already a problem, because as Robert keeps saying, what
 goes through utility.c and what doesn't is pretty random right at the
 moment, and we shouldn't expose that behavior to users for fear of not
 being able to change it later.

It didn't feel that random to me. In most cases, the rule is that if the
system wants to execute a command that you could have typed as a user at
the prompt, then it hacks together a parse node and calls ProcessUtility.

The main exception to that rule is the CASCADE implementation, for
reasons that you detailed in that thread earlier.

 I think we could possibly ship event triggers defined as start and end
 of a *top level* command and have them working reliably in 9.3.  If you
 want users to be looking at subcommands, there is a lot more work to do
 there than we have any chance of getting done for 9.3 (if it is to ship
 in 2013, that is).

By default, you only get the top level in what I've been cooking. I think
the other contexts are needed for the logical replication, and maybe it
would be good enough if they are restricted to either only internal code
or event triggers coded in C. What do you think?
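
For illustration, a C-coded event trigger is just a function called with
an EventTriggerData context instead of the usual TriggerData. A minimal
sketch (the function name is invented, and the struct details should be
treated as approximate for the current development tree):

#include "postgres.h"
#include "fmgr.h"
#include "commands/event_trigger.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(replication_ddl_trigger);

Datum
replication_ddl_trigger(PG_FUNCTION_ARGS)
{
	EventTriggerData *trigdata;

	if (!CALLED_AS_EVENT_TRIGGER(fcinfo))
		elog(ERROR, "not called by event trigger manager");

	trigdata = (EventTriggerData *) fcinfo->context;

	/* trigdata->parsetree is the raw parse tree; queueing it (or some
	 * normalized form of it) for the replica would happen here */
	elog(DEBUG1, "event \"%s\", command tag \"%s\"",
		 trigdata->event, trigdata->tag);

	PG_RETURN_NULL();
}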

 Alternatively, if you want to get something into 9.3 that has not
 necessarily got a long-term-stable API, I'd be inclined to suggest that
 we forget about a SQL-level event trigger facility for now, and just put
 in some hook call points.  It's pretty well established that we're
 willing to change the API seen by hook functions across major releases.

Or just a hook. That would want to have about the exact same amount of
information as the Event Trigger anyway, so I'm thinking we could maybe
do that the same way as the parsetree exposure?

Another idea would be yet another GUC, a la allow_system_table_mods, so
that only intrepid users can have at it. Well, as long as the logical
replication use case is covered, I'm done here, so I want to hear from
Simon and Andres on that (and maybe the Slony and Londiste guys, etc),
and from you to triage what is possible and what is crazy.

 TBH this might be the best thing anyway if your long-term goals have to
 do with replication, because it'd be a lot cheaper to get to the point
 where you could write the replication code and see if it all actually
 works.  I'm fairly concerned that we might spend many man-months getting
 event triggers with definitions A,B,C into the system, only to find out
 later that what is really needed for logical replication is definitions
 D,E,F.

I'm worried about that too, and that's why I'm trying to remain general
and quite transparent about the way the system actually works.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support




Re: [HACKERS] Removing PD_ALL_VISIBLE

2013-01-17 Thread Jeff Davis
On Thu, 2013-01-17 at 19:58 +, Simon Riggs wrote:
 Presumably we remember the state of the VM so we can skip the re-visit
 after every write?

That was not a part of my patch, although I remember that you mentioned
that previously and I thought it could be a good way to mitigate a
problem if it ever came up.
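
Concretely, the mitigation could be as small as a one-entry memo in
front of the real lookup (a hypothetical sketch; invalidating the memo
when the backend itself clears the bit is omitted here):

#include "postgres.h"
#include "access/visibilitymap.h"
#include "storage/bufmgr.h"

static BlockNumber cached_blkno = InvalidBlockNumber;
static bool cached_all_visible = false;

/* Remember the last VM answer so repeated writes to the same heap
 * page skip the re-visit. */
static bool
vm_test_cached(Relation rel, BlockNumber heapBlk, Buffer *vmbuffer)
{
	if (heapBlk != cached_blkno)
	{
		cached_all_visible = visibilitymap_test(rel, heapBlk, vmbuffer);
		cached_blkno = heapBlk;
	}
	return cached_all_visible;
}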

However, the tests I did didn't show any problem there. The tests were
somewhat noisy, so perhaps I was doing something wrong, but it didn't
appear that looking at the VM page for every update was a bottleneck.

Regards,
Jeff Davis






Re: [HACKERS] Removing PD_ALL_VISIBLE

2013-01-17 Thread Jeff Davis
On Thu, 2013-01-17 at 09:53 -0500, Robert Haas wrote:
 The main question in my mind is whether
 there are any negative consequences to holding a VM buffer pin for
 that long without interruption.  The usual consideration - namely,
 blocking vacuum - doesn't apply here because vacuum does not take a
 cleanup lock on the visibility map page, only the heap page, but I'm
 not sure if there are others.

If the "without interruption" part becomes a practical problem, it seems
fairly easy to fix: drop the pin and pick it up again once every K
pages. Unless I'm missing something, this is a minor concern.
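
A rough sketch of that, with a hypothetical K and using the
visibilitymap_pin()/ReleaseBuffer() calls a scan already has available:

#include "postgres.h"
#include "access/visibilitymap.h"
#include "storage/bufmgr.h"

#define VM_REPIN_INTERVAL 16384		/* hypothetical K: heap pages per pin */

/* Keep the VM page pinned across a long scan, but surrender the pin
 * briefly every K heap pages so it is never held without interruption. */
static void
vm_pin_with_breaks(Relation rel, BlockNumber heap_blkno, Buffer *vmbuffer)
{
	if (BufferIsValid(*vmbuffer) &&
		heap_blkno % VM_REPIN_INTERVAL == 0)
	{
		ReleaseBuffer(*vmbuffer);	/* drop the pin... */
		*vmbuffer = InvalidBuffer;
	}
	/* ...and (re)acquire the pin covering this heap block */
	visibilitymap_pin(rel, heap_blkno, vmbuffer);
}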

 The other part of the issue is cache pressure. I don't think I can say
 it better than what Tom already wrote:
 
 # I'd be worried about the case of a lot of sessions touching a lot of
 # unrelated tables.  This change implies doubling the number of buffers
 # that are held pinned by any given query, and the distributed overhead
 # from that (eg, adding cycles to searches for free buffers) is what you
 # ought to be afraid of.
 
 I agree that we ought to be afraid of that.

It's a legitimate concern, but I think being afraid goes too far (more
below).

 A pgbench test isn't
 going to find a problem in this area because there you have a bunch of
 sessions all accessing the same small group of tables.  To find a
 problem of the type above, you'd need lots of backends accessing lots
 of different, small tables.  That's not the use case we usually
 benchmark, but I think there are a fair number of people doing things
 like that - for example, imagine a hosting provider or web application
 with many databases or schemas on a single instance.  AFAICS, Jeff
 hasn't tried to test this scenario.

The concern here is over a lot of different, small tables (e.g.
multi-tenancy or something similar) as you say. If we're talking about
nearly empty tables, that's easy to fix: just don't use the VM on tables
less than N pages.
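
A sketch of that guard, with an invented threshold (the real N would
need measurement):

#include "postgres.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

#define VM_MIN_REL_PAGES 32			/* hypothetical N */

/* Consult the VM only for relations big enough for it to pay off;
 * tiny tables just take the extra heap check. */
static bool
relation_uses_vm(Relation rel)
{
	return RelationGetNumberOfBlocks(rel) >= VM_MIN_REL_PAGES;
}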

You could say that small tables are really 1-10MB each, and you could
have a zillion of those. I will try to create a worst-case here and see
what numbers come out. Perhaps the extra time to look for a free buffer
does add up.

Test plan:

  1. Take the current patch (without the "skip VM check for small
tables" optimization mentioned above).
  2. Create 500 tables each about 1MB.
  3. VACUUM them all.
  4. Start 500 connections (one for each table)
  5. Time the running of a loop that executes a COUNT(*) on that
connection's table 100 times.

I think shared_buffers=64MB is probably appropriate. We want some memory
pressure so that it has to find and evict pages to satisfy the queries.
But we don't want it to be totally exhausted and unable to even pin a
new page; that really doesn't tell us much except that max_connections
is too high.

Sound reasonable?

 Now, on the flip side, we should also be thinking about what we would
 gain from this patch, and what other ways there might be to achieve
 the same goals.

One thing to keep in mind is that the current code to maintain a
crash-safe PD_ALL_VISIBLE and VM bit is quite complex and doesn't play
by the normal rules. If you want to talk about distributed costs, that
has some very real distributed costs in terms of development effort. For
instance, my checksums patch took me extra time to write (despite the
patch being the simplest checksums design on the table) and will take
others extra time to review.

So, if the only things keeping that code in place are theoretical fears,
let's take them one by one and see if they are real problems or not.

 All of which is to say that I think this patch is premature.  If we
 adopt something like this, we're likely never going to revert back to
 the way we do it now, and whatever cache-pressure or other costs this
 approach carries will be here to stay - so we had better think awfully
 carefully before we do that.

What about this patch makes it hard to undo/rework in the future?

  Even if
 there were, this is exactly the sort of thing that should be committed
 at the beginning of a release cycle, not the end, so as to allow
 adequate time for discovery of unforeseen consequences before the code
 ends up out in the wild.

I'm concerned that such a comment at this stage will cut review short,
which could also prevent it from being committed early in 9.4.

 Of course, there's another issue here too, which is that as Pavan
 points out, the patch throws crash-safety out the window

My mistake. I believe that is already fixed, and certainly not a
fundamental issue.

Regards,
Jeff Davis






Re: [HACKERS] HS locking broken in HEAD

2013-01-17 Thread Andres Freund
On 2013-01-17 22:46:21 +0100, Andres Freund wrote:
 Hi,
 
 
 While checking whether I could reproduce the replication delay reported
 by Michael Paquier I found this very nice tidbit:
 
 In a pretty trivial replication setup of only streaming replication I
 can currently easily reproduce this:
 
 standby# BEGIN;SELECT * FROM foo;
 BEGIN
  id | data 
 ----+------
 
 
 standby=# SELECT relation::regclass, locktype, mode FROM pg_locks;
  relation |  locktype  |  mode   
 ----------+------------+-----------------
  pg_locks | relation   | AccessShareLock
  foo_pkey | relation   | AccessShareLock
  foo  | relation   | AccessShareLock
   | virtualxid | ExclusiveLock
   | virtualxid | ExclusiveLock
 (5 rows)
 
 primary# DROP TABLE foo;
 DROP TABLE
 
 
 standby# SELECT relation::regclass, pid, locktype, mode FROM pg_locks;
  relation |  pid  |  locktype  |mode 
 ----------+-------+------------+---------------------
  pg_locks | 28068 | relation   | AccessShareLock
  foo_pkey | 28068 | relation   | AccessShareLock
  foo  | 28068 | relation   | AccessShareLock
   | 28068 | virtualxid | ExclusiveLock
   | 28048 | virtualxid | ExclusiveLock
  foo  | 28048 | relation   | AccessExclusiveLock
 (6 rows)
 
 standby# SELECT * FROM foo;
 ERROR:  relation "foo" does not exist
 LINE 1: select * from foo;
   ^
 Note the conflicting locks held on relation foo by 28048 and 28068.
 
 I don't immediately know which patch to blame here? Looks a bit like
 broken fastpath locking, but I don't immediately see anything in
 c00dc337b87 that should cause this?

Rather scarily this got broken in
96cc18eef6489cccefe351baa193f32f12018551. Yes, nearly half a year ago,
including in 9.2.1+. How the heck could this go unnoticed so long?

Not sure yet what the cause of this is.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Event Triggers: adding information

2013-01-17 Thread Tom Lane
Dimitri Fontaine dimi...@2ndquadrant.fr writes:
 Tom Lane t...@sss.pgh.pa.us writes:
 Well, that's already a problem, because as Robert keeps saying, what
 goes through utility.c and what doesn't is pretty random right at the
 moment, and we shouldn't expose that behavior to users for fear of not
 being able to change it later.

 It didn't feel that random to me. In most cases, the rule is that if the
 system wants to execute a command that you could have typed as a user at
 the prompt, then it hacks together a parse node and calls ProcessUtility.

Uh, no, there's no such rule.  In some places it was convenient to do
that so people did it --- but often it's easier to just call the
appropriate function directly, especially if you have any special
locking or permissions requirements.  I don't think there's any great
consistency to it.

To be a little more concrete, I looked through backend/catalog and
backend/commands, which ought to pretty much cover all the places where
commands do things that could be considered subcommands.  I find three
uses of ProcessUtility:

extension.c: uses ProcessUtility to handle non-DML commands read from an
extension script.

schemacmds.c: CreateSchemaCommand uses ProcessUtility for any subcommand
of a CREATE SCHEMA statement.  This is really just direct execution of
something the user typed, it's not the system creating a subcommand.

trigger.c: ConvertTriggerToFK uses ProcessUtility to execute a consed-up
ALTER TABLE ADD FOREIGN KEY command in place of a series of legacy CREATE
CONSTRAINT TRIGGER commands.  Now this is the very definition of ugly,
and shouldn't drive our design decisions --- but I think that any user
event trigger would be darn surprised to find this being emitted,
especially when the two preceding CREATE CONSTRAINT TRIGGERs hadn't done
any such thing, and indeed hadn't created any constraint trigger.

AFAICT, everything else in those directories is using direct calls and
not going through utility.c.  So the only other cases where a trigger in
ProcessUtility would trap subcommands are where those subcommands are
things that parse_utilcmd.c generated during expansion of a CREATE TABLE
or such.  And that whole area is something that I feel strongly is
implementation detail we don't particularly want to expose to users.
For instance, the fact that a UNIQUE clause in a CREATE TABLE works that
way and not at some lower level is nothing but implementation artifact.

[ pokes around a bit more... ]  Having now actually looked at every call
point of ProcessUtility in the current code, I find myself agreeing very
much with Robert: we do *not* want to expose that exact set of events to
users.  Except for the original top-level commands, it's almost entirely
implementation artifacts that can and will change from release to
release.

regards, tom lane




Re: [HACKERS] HS locking broken in HEAD

2013-01-17 Thread Andres Freund
On 2013-01-17 23:56:16 +0100, Andres Freund wrote:
 On 2013-01-17 22:46:21 +0100, Andres Freund wrote:
  ^
  Note the conflicting locks held on relation foo by 28048 and 28068.
  
  I don't immediately know which patch to blame here? Looks a bit like
  broken fastpath locking, but I don't immediately see anything in
  c00dc337b87 that should cause this?
 
 Rather scarily this got broken in
 96cc18eef6489cccefe351baa193f32f12018551. Yes, nearly half a year ago,
 including in 9.2.1+. How the heck could this go unnoticed so long?

That only made the bug more visible - the real bug is somewhere
else. Which makes it even scarier: the bug has been in the fast path
locking patch from the start...

It assumes conflicting fast path locks can only ever be in the same
database as the backend transferring the locks to itself. But that's
obviously not true for the startup process, which is not connected to
any specific database.
I think it might also be a dangerous assumption for shared objects?

A patch minimally addressing this is attached, but it only addresses part
of the problem.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index f2cf5c6..b5240da 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -2458,8 +2458,13 @@ FastPathTransferRelationLocks(LockMethod lockMethodTable, const LOCKTAG *locktag
 		 * less clear that our backend is certain to have performed a memory
 		 * fencing operation since the other backend set proc->databaseId.  So
 		 * for now, we test it after acquiring the LWLock just to be safe.
+		 *
+		 * But if we are the startup process we don't belong to a database but
+		 * still need to lock objects in other databases, so we can do this
+		 * optimization only in case we belong to a database.
+		 * XXX: explain why this is safe for shared tables.
 		 */
-		if (proc->databaseId != MyDatabaseId)
+		if (MyDatabaseId != InvalidOid && proc->databaseId != MyDatabaseId)
 		{
 			LWLockRelease(proc->backendLock);
 			continue;



Re: [HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

2013-01-17 Thread Michael Paquier
On Fri, Jan 18, 2013 at 3:05 AM, Fujii Masao masao.fu...@gmail.com wrote:

  I encountered the problem that the timeline switch is not performed
 as expected.
 I set up one master, one standby and one cascade standby. All the servers
 share the archive directory. restore_command is specified in the
 recovery.conf
 in those two standbys.

 I shut down the master, and then promoted the standby. In this case, the
 cascade standby should switch to new timeline and replication should be
 successfully restarted. But the timeline was never changed, and the
 following
 log messages kept being output.

 sby2 LOG:  restarted WAL streaming at 0/300 on timeline 1
 sby2 LOG:  replication terminated by primary server
 sby2 DETAIL:  End of WAL reached on timeline 1
 sby2 LOG:  restarted WAL streaming at 0/300 on timeline 1
 sby2 LOG:  replication terminated by primary server
 sby2 DETAIL:  End of WAL reached on timeline 1
 sby2 LOG:  restarted WAL streaming at 0/300 on timeline 1
 sby2 LOG:  replication terminated by primary server
 sby2 DETAIL:  End of WAL reached on timeline 1

I am seeing similar issues with master at 88228e6.
This is easily reproducible by setting up 2 slaves under a master, then
killing the master. Promote slave 1 and reconnect slave 2 to slave 1, and
you will notice that the timeline jump does not happen.

I don't know if Masao tried to put the slave that reconnects to the
promoted slave in sync, but in this case slave2 gets stuck in the
"potential" state. That is due to the timeline not having changed on
slave2, but better to let you know...

The replication delays are still here.
-- 
Michael Paquier
http://michael.otacoo.com


Re: [HACKERS] could not create directory ...: File exists

2013-01-17 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
 This seems to provide a reasonably principled
 argument why we might want to fix this case with a localized use of an
 MVCC scan before we have such a fix globally.

I had discussed that idea a bit with Andres on IRC, and my only concern
was whether there's some reason that acquiring a snapshot during createdb()
would be problematic. It doesn't appear to be currently, but I wasn't sure
if there'd be any issues.

I'll start working on a patch for that.
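
As a sketch of the shape that patch might take (the function here is
hypothetical; the point is just swapping SnapshotNow for a registered
MVCC snapshot around the scan):

#include "postgres.h"
#include "access/heapam.h"
#include "access/relscan.h"
#include "utils/snapmgr.h"

static void
scan_catalog_with_mvcc_snapshot(Relation rel)
{
	Snapshot	snap = RegisterSnapshot(GetLatestSnapshot());
	HeapScanDesc scan = heap_beginscan(rel, snap, 0, NULL);
	HeapTuple	tup;

	while ((tup = heap_getnext(scan, ForwardScanDirection)) != NULL)
	{
		/* examine the tuple just as the SnapshotNow scan does today */
		(void) tup;
	}

	heap_endscan(scan);
	UnregisterSnapshot(snap);
}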

 Not that I wouldn't still want to mark it with a REVERT_ME_SOMEDAY
 kind of annotation.  We know we need the SnapshotNow scan fix.

Agreed.

Thanks,

Stephen




[HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

2013-01-17 Thread Andres Freund
On 2013-01-18 08:24:31 +0900, Michael Paquier wrote:
 On Fri, Jan 18, 2013 at 3:05 AM, Fujii Masao masao.fu...@gmail.com wrote:
 
   I encountered the problem that the timeline switch is not performed
  as expected.
  I set up one master, one standby and one cascade standby. All the servers
  share the archive directory. restore_command is specified in the
  recovery.conf
  in those two standbys.
 
  I shut down the master, and then promoted the standby. In this case, the
  cascade standby should switch to new timeline and replication should be
  successfully restarted. But the timeline was never changed, and the
  following
  log messages kept being output.
 
  sby2 LOG:  restarted WAL streaming at 0/300 on timeline 1
  sby2 LOG:  replication terminated by primary server
  sby2 DETAIL:  End of WAL reached on timeline 1
  sby2 LOG:  restarted WAL streaming at 0/300 on timeline 1
  sby2 LOG:  replication terminated by primary server
  sby2 DETAIL:  End of WAL reached on timeline 1
  sby2 LOG:  restarted WAL streaming at 0/300 on timeline 1
  sby2 LOG:  replication terminated by primary server
  sby2 DETAIL:  End of WAL reached on timeline 1
 
 I am seeing similar issues with master at 88228e6.
 This is easily reproducible by setting up 2 slaves under a master, then
 killing the master. Promote slave 1 and reconnect slave 2 to slave 1, and
 you will notice that the timeline jump does not happen.

Can you reproduce that one with 7fcbf6a^ (i.e. before xlogreader got
split off)?

 The replication delays are still here.

That one is caused by this nice bug, courtesy of yours truly:
diff --git a/src/backend/access/transam/xlog.c 
b/src/backend/access/transam/xlog.c
index 90ba32e..1174493 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8874,7 +8874,7 @@ retry:
 	/* See if we need to retrieve more data */
 	if (readFile < 0 ||
 		(readSource == XLOG_FROM_STREAM &&
-		 receivedUpto <= targetPagePtr + reqLen))
+		 receivedUpto < targetPagePtr + reqLen))
 	{
 		if (StandbyMode)
 		{

I didn't notice because I had a testscript inserting stuff continuously
and it cause at most lagging by one record...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] PATCH: optimized DROP of multiple tables within a transaction

2013-01-17 Thread Tomas Vondra
On 17.1.2013 20:19, Alvaro Herrera wrote:
 
 I'm curious -- why would you drop tables in groups of 100 instead of
 just doing the 100,000 in a single transaction?  Maybe that's faster
 now, because you'd do a single scan of the buffer pool instead of 1000?
 (I'm assuming that "in groups of" means you do each group in a separate
 transaction)

There's a limited number of locks, and each DROP acquires a lock
(possibly more than one).
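
For a rough sense of scale (assuming default settings, per the
documented sizing rule): the shared lock table has room for about
max_locks_per_transaction * (max_connections + max_prepared_transactions)
= 64 * (100 + 0) = 6400 locks, so one transaction dropping 100,000
tables (at least one lock each) would exhaust it, while batches of 100
stay comfortably under the limit.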

Tomas





Re: [HACKERS] could not create directory ...: File exists

2013-01-17 Thread Tom Lane
Stephen Frost sfr...@snowman.net writes:
 * Tom Lane (t...@sss.pgh.pa.us) wrote:
 This seems to provide a reasonably principled
 argument why we might want to fix this case with a localized use of an
 MVCC scan before we have such a fix globally.

 I had discussed that idea a bit with Andres on IRC and my only concern
 was if there's some reason that acquiring a snapshot during createdb()
 would be problematic.  It doesn't appear to currently and I wasn't sure
 if there'd be any issues.

Don't see what.  The main reason we've not yet attempted a global fix is
that the most straightforward way (take a new snapshot each time we
start a new SnapshotNow scan) seems too expensive.  But CREATE DATABASE
is so expensive that the cost of an extra snapshot there ain't gonna
matter.

regards, tom lane




[HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

2013-01-17 Thread Michael Paquier
On Fri, Jan 18, 2013 at 8:48 AM, Andres Freund and...@2ndquadrant.comwrote:

 Can you reproduce that one with 7fcbf6a^ (i.e. before xlogreader got
 split off)?

Yes, it is reproducible before the xlog reader split.
Just an additional report: the master jumps correctly to the new timeline.


  The replication delays are still here.

 That one is caused by this nice bug, courtesy of yours truly:
 diff --git a/src/backend/access/transam/xlog.c
 b/src/backend/access/transam/xlog.c
 index 90ba32e..1174493 100644
 --- a/src/backend/access/transam/xlog.c
 +++ b/src/backend/access/transam/xlog.c
 @@ -8874,7 +8874,7 @@ retry:
 /* See if we need to retrieve more data */
 	if (readFile < 0 ||
 		(readSource == XLOG_FROM_STREAM &&
 -		 receivedUpto <= targetPagePtr + reqLen))
 +		 receivedUpto < targetPagePtr + reqLen))
 {
 if (StandbyMode)
 {

 I didn't notice because I had a test script inserting stuff continuously,
 and it caused at most a lag of one record...

This fix is indeed working.
-- 
Michael Paquier
http://michael.otacoo.com


[HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

2013-01-17 Thread Andres Freund
On 2013-01-18 08:24:31 +0900, Michael Paquier wrote:
 On Fri, Jan 18, 2013 at 3:05 AM, Fujii Masao masao.fu...@gmail.com wrote:
 
   I encountered the problem that the timeline switch is not performed
  as expected.
  I set up one master, one standby and one cascade standby. All the servers
  share the archive directory. restore_command is specified in the
  recovery.conf
  in those two standbys.
 
  I shut down the master, and then promoted the standby. In this case, the
  cascade standby should switch to new timeline and replication should be
  successfully restarted. But the timeline was never changed, and the
  following
  log messages kept being output.
 
  sby2 LOG:  restarted WAL streaming at 0/300 on timeline 1
  sby2 LOG:  replication terminated by primary server
  sby2 DETAIL:  End of WAL reached on timeline 1
  sby2 LOG:  restarted WAL streaming at 0/300 on timeline 1
  sby2 LOG:  replication terminated by primary server
  sby2 DETAIL:  End of WAL reached on timeline 1
  sby2 LOG:  restarted WAL streaming at 0/300 on timeline 1
  sby2 LOG:  replication terminated by primary server
  sby2 DETAIL:  End of WAL reached on timeline 1
 
 I am seeing similar issues with master at 88228e6.
 This is easily reproducible by setting up 2 slaves under a master, then
 killing the master. Promote slave 1 and reconnect slave 2 to slave 1, and
 you will notice that the timeline jump does not happen.
 
 I don't know if Masao tried to put the slave that reconnects to the
 promoted slave in sync, but in this case slave2 gets stuck in the
 "potential" state. That is due to the timeline not having changed on
 slave2, but better to let you know...

Ok, I know what's causing this now. Rather ugly.

Whenever accessing a page in a segment we haven't accessed before, we
read the first page to do an extra bit of validation, as the first page
in a segment contains more information.

Suppose timeline 1 ends at 0/6087088: xlog.c notices that WAL ends
there, wants to read the new timeline, and requests record
0/06087088. xlogreader wants to do its validation and goes back to the
first page in the segment, which triggers xlog.c to re-request timeline
1 to be transferred...

Heikki, any ideas?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] review: pgbench - aggregation of info written into log

2013-01-17 Thread Tatsuo Ishii
 On Thu, Jan 17, 2013 at 2:29 PM, Andrew Dunstan and...@dunslane.net wrote:

 On 01/17/2013 06:04 AM, Tomas Vondra wrote:

 The problem is I have access to absolutely no Windows machines,
 not to mention the development tools (and that I have no clue about them).

 I vaguely remember there were people on this list doing Windows
 development
 on a virtual machine or something. Any interest in working on this /
 giving
 me some help?




 One of the items on my very long TODO list (see stuff elsewhere about
 committers not doing enough reviewing and committing) is to create a package
 that can easily be run to set up Windows Postgres development environments for
 MSVC, Mingw and Cygwin on Amazon, GoGrid etc.

 Once I get that done I'll be a lot less sympathetic to "I don't have access
 to Windows" pleas.

 Windows does run in a VM very well, but if you're going to set it up on your
 own VM environment (e.g. VirtualBox or KVM/qemu), you need your own legal
 copy to install there. If you don't already have one, that will set you back
 about $140.00 (for w7 Pro) in the USA. Note that that's significantly better
 than the situation with OSX, which you can't run at all except on Apple
 hardware.
 
 Yeah. I used to have an AMI with the VS environment preinstalled on
 Amazon, but I managed to fat finger things and delete it at some point
 and haven't really had time to rebuild it.
 
 Having a script that would download and install all the pre-requisites
 on such a box would be *great*. Then you could get up and going pretty
 quickly, and getting a Windows box up and running for a few hours there is
 almost free, and you don't have to deal with licensing hassles.
 
 (Of course, the AMI method doesn't work all the way since you'd be
 distributing Visual Studio, but if we can have a script that
 auto-downloads-and-installs it as necessary we can get around that)

So if my understanding is correct, 1) Tomas Vondra commits to work on
Windows support for 9.4, 2) on the assumption that one of Andrew
Dunstan, Dave Page or Magnus Hagander will help him with Windows
development.

Ok? If so, I can commit the patch for 9.3 without Windows support. If
not, I will move the patch to the next CF (for 9.4).

Please correct me if I am wrong.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp




Re: [HACKERS] patch to add \watch to psql

2013-01-17 Thread Daniel Farina
I have adjusted this patch a little bit to take care of the review
issues, along with just doing a bit of review myself.

On Thu, Oct 25, 2012 at 2:25 AM, Will Leinweber w...@heroku.com wrote:
 Thanks for the reviews and comments. Responses inline:
 On Sat, Oct 20, 2012 at 9:19 AM, Abhijit Menon-Sen a...@2ndquadrant.com 
 wrote:
 Maybe you should call it \repeat or something. I'm sure people would get
 around to using \watch that way eventually. :-)


 While I agree that clearing the screen would be nicer, I feel that
 watching the bottom of the screen instead of the top gets you 95% of
 the value of unix watch(1), and having the same name will greatly
 enhance discoverability. Perhaps later on if ncurses is linked for
 some other reason, this could take advantage of it then.

 That said, I'm not that strongly attached to the name one way or the other.

The name \repeat has grown on me, but I haven't bothered renaming it
for the time being.  I think sameness with the familiar 'watch'
program may not be such a big deal as I thought originally, but
'repeat' sounds a lot more like a kind of flow control for scripts,
whereas \watch is more clearly for humans, which is the idea.

 On Wed, Oct 24, 2012 at 2:55 PM, Peter Eisentraut pete...@gmx.net wrote:
 This doesn't handle multiline queries:

 => \watch select 1 +
 ERROR:  42601: syntax error at end of input
 LINE 1:  select +
  ^

 I think to make it cooperate better with psql syntax, put the \watch at
 the end, as a replacement for \g, like

I have implemented some kind of multi-line support.  The rough part is
in this part of the patch:

+ if (query_buf && query_buf->len > 0)
+ {
+ /*
+ * Check that the query in query_buf has been terminated.  This
+ * is mostly consistent with behavior from commands like \g.
+ * The reason this is here is to prevent terrible things from
+ * occurring from submitting incomplete input of statements
+ * like:
+ *
+ * DELETE FROM foo
+ * \watch
+ * WHERE
+ *
+ * Wherein \watch will go ahead and send whatever has been
+ * submitted so far.  So instead, insist that the user
+ * terminate the query with a semicolon to be safe.
+ */
+ if (query_buf->data[query_buf->len - 1] == ';')

What I found myself reaching for when giving up and writing this hack
was a way to thread through the last lexer state of query_buf, which
it seems could stand to accrue a bit more information than being just
a byte buffer.  But this is the simplest possible thing, so I'll
let others comment...
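
If the trailing-semicolon test turns out to be too strict about
whitespace, one possible refinement (a hypothetical helper, not part of
the patch) would be:

#include <ctype.h>
#include <stdbool.h>

/* Like the check above, but tolerate trailing whitespace after the
 * terminating semicolon. */
static bool
ends_with_semicolon(const char *buf, int len)
{
	while (len > 0 && isspace((unsigned char) buf[len - 1]))
		len--;
	return len > 0 && buf[len - 1] == ';';
}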

--
fdr


psql-watch-v2.patch.gz
Description: GNU Zip compressed data



Re: [HACKERS] Event Triggers: adding information

2013-01-17 Thread Robert Haas
On Thu, Jan 17, 2013 at 4:43 PM, Dimitri Fontaine
dimi...@2ndquadrant.fr wrote:
 Robert Haas robertmh...@gmail.com writes:
 Goal: Every time an ALTER command is used on an object *that actually
 exists*, we will call some user-defined function and pass the object
 type, the OID of the object, and some details about what sort of
 alteration the user has requested.

 Ok, in current terms of the proposed patch, it's a ddl_command_end event
 trigger, and you can choose when to fire it in utility.c. The current
 place is chosen to be just after the AlterTable() call,

/me scratches my head.

Depends. What I described would fire before the command was executed,
not after, so it's not quite the same thing, but it might be just as
good for your purposes.  It wasn't so much intended as "we should do
this exact thing" as much as "this is the kind of thought process that
you need to go through if you want to safely run an event trigger in
the middle of another command".

Now, maybe we don't actually need to do that to serve your purposes.
As already discussed, it seems like passing information to
ddl_command_end is for most purposes sufficient for what you need -
and we can do that without any special gymnastics, because that way it
runs completely after the command, not in the middle.  If I understand
correctly, and I may not, there are basically two problems with that
approach:

(1) It's not exactly clear how to pass the information about the
statement through to the ddl_command_end trigger.  I know you've
proposed a couple of things, but none of them seem completely
satisfying.  At least not to me.

(2) If the method of transmission is the OIDs of the affected
objects, which I believe is your current proposal, the
ddl_command_end trigger can't go look those objects up in the catalog,
because the catalog entries are already gone by that point.

It's possible we could solve both of those problems without running
event triggers in the middle of commands at all.  In which case, my
whole example is moot for now.  But I think it would be smart to get
ddl_command_end (without any additional information) committed first,
and then argue about these details, so that we at least make some
definable progress here.

 because that
 sounded logical if you believe in CommandCounterIncrement.

I'm somewhat bemused by this comment.  I don't think
CommandCounterIncrement() is an article of faith.  If you execute a
command to completion, and do a CommandCounterIncrement(), then
whatever you do next will look just like a new command, so it should
be safe to run user-provided code there.  But if you stop in the
middle of a command, do a CommandCounterIncrement(), run some
user-provided code, and then continue on, the
CommandCounterIncrement() by itself protects you from nothing. If the
code is not expecting arbitrary operations to be executed at that
point, and you execute them, you get to keep both pieces.  Indeed,
there are some places in the code where inserting a
CommandCounterIncrement() *by itself* could be unsafe.  I don't
believe that's a risk in ProcessUtility(), but that doesn't mean there
aren't any risks in ProcessUtility().

 So, instead, what we need to do is go into each function that
 implements ALTER, and modify it so that after it does the dance where
 we check permissions, take locks, and look up names, we call the
 trigger function.  That's a bit of a nuisance, because we probably
 have a bunch of call sites to go fix, but in theory it should be
 doable.  The problem of course, is that it's not intrinsically safe to
 call user-defined code there.  If it drops the object and creates a
 new one, say, hilarity will probably ensue.

 You're trying to find a dangerous point when to fire the trigger here,

Yes, I am.  I'm not asking you to implement what I proposed there -
I'm just giving an example of how to make a dangerous place safe.
You've ALSO found a dangerous point when to fire the trigger.  The
difference is that my example dangerous point is probably fixable to
be safe, whereas your actual dangerous point is probably unfixable,
because there are many current paths of execution that go through that
function in an extremely ad-hoc fashion, and there may be more in the
future.  ddl_command_start and ddl_command_end are safe enough *when
restricted to toplevel commands*, but that's about as far as it goes.
Perhaps there are other special cases that are also safe, but just
throwing everything into the pot with no analysis and no
future-proofing certainly isn't.

 Now, there is a further problem: all of the information about the
 ALTER statement that we want to provide to the trigger may not be
 available at that point.  Simple cases like ALTER .. RENAME won't
 require anything more than that, but things like ALTER TABLE .. ADD
 FOREIGN KEY get tricky, because while at this point we have a solid
 handle on the identity of the relation to which we're adding the
 constraint, we do NOT yet have knowledge of or a lock on the 
