Re: [HACKERS] Removing PD_ALL_VISIBLE

2013-01-17 Thread Simon Riggs
On 17 January 2013 03:02, Jeff Davis  wrote:

> Rebased patch attached. No significant changes.

Jeff, can you summarise/collate why we're doing this, what concerns it
raises and how you've dealt with them? That will help decide whether
to commit.

Thanks

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] CF3+4

2013-01-17 Thread Magnus Hagander
On Jan 17, 2013 8:15 AM, "Abhijit Menon-Sen"  wrote:
>
> At 2013-01-17 16:05:05 +0900, michael.paqu...@gmail.com wrote:
> >
> > Is it really necessary to create a new commit fest just to move the
> > items? The patches that are considered too late for 9.3 should just
> > be returned with feedback.
>
> Opening 2013-03 is not so much to move existing patches, but to give
> people a place to submit *new* post-9.3 patches without interrupting
> the 2013-01 CF.

Yeah, and +1 for doing that. The sooner the better. By whichever of the
time frames for the CF that have been discussed, it should certainly no
longer be open for accepting new patches for 9.3.

/Magnus


Re: [HACKERS] Removing PD_ALL_VISIBLE

2013-01-17 Thread Pavan Deolasee
On Thu, Jan 17, 2013 at 2:11 PM, Simon Riggs  wrote:
>
> On 17 January 2013 03:02, Jeff Davis  wrote:
>
> > Rebased patch attached. No significant changes.
>
> Jeff, can you summarise/collate why we're doing this, what concerns it
> raises and how you've dealt with them? That will help decide whether
> to commit.
>

+1. On another thread, "Set visibility map bit after HOT prune", Robert
mentioned that it's not such a good idea to remove the PD_ALL_VISIBLE
flag because it helps us reduce contention on the VM page, especially
when we need to reset the VM bit. Here is an excerpt from Robert's
comment on that thread:

"Sure, but you're zipping rather blithely past the disadvantages of
such an approach.  Jeff Davis recently proposed getting rid of
PD_ALL_VISIBLE, and Tom and I both expressed considerable skepticism
about that; this proposal has the same problems.  One of the major
benefits of PD_ALL_VISIBLE is that, when it isn't set, inserts,
updates, and deletes to the page can ignore the visibility map.  That
means that a server under heavy concurrency is much less likely to
encounter contention on the visibility map blocks.  Now, maybe that's
not really a problem, but I sure haven't seen enough evidence to make
me believe it.  If it's really true that PD_ALL_VISIBLE needn't fill
this role, then Heikki wasted an awful lot of time implementing it,
and I wasted an awful lot of time keeping it working when I made the
visibility map crash-safe for IOS.  That could be true, but I tend to
think it isn't."

Maybe you've already addressed that concern with proven performance
numbers, but I'm not sure.

Thanks,
Pavan

--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee




Re: [HACKERS] CF3+4 (was Re: Parallel query execution)

2013-01-17 Thread Magnus Hagander
On Thu, Jan 17, 2013 at 8:19 AM, Pavan Deolasee
 wrote:
>
> On Wed, Jan 16, 2013 at 1:51 PM, Abhijit Menon-Sen 
> wrote:
>>
>> At 2013-01-16 02:07:29 -0500, t...@sss.pgh.pa.us wrote:
>> >
>> > In case you hadn't noticed, we've totally lost control of
>> > the CF process.
>>
>> What can we do to get it back on track?
>>
>> I know various people (myself included) have been trying to keep CF3
>> moving, e.g. sending followup mail, adjusting patch status, etc.
>>
>> I want to help, but I don't know what's wrong. What are the committers
>> working on, and what is the status of the "Ready for committer" patches?
>> Is the problem that the patches marked Ready aren't, in fact, ready? Or
>> is it lack of feedback from authors? Or something else?
>>
>
> ISTM that even committers are often overwhelmed with the work, their own
> as well as reviewing others' patches and committing them. Especially when
> a patch is large or touches core areas, I can see the significant work
> that the committer has to do even after some help and initial review from
> CF reviewers. On the other hand, if the patches are not committed in
> time, we lose context, interest dies down, and when the patch is actually
> picked up by a committer who is often not involved in the original
> discussion, many points need to be revisited and reworked.
>
> Would it help to step up a few developers and create a second line of
> committers? The commits by the second-line committers would still be
> reviewed by the first-line committers before they make it into the
> product, but maybe at a later stage or when we are in beta. I thought of
> even suggesting that the master branch contain only commits by the
> first-line committers. We would then maintain a secondary branch which
> also has commits from the second-line committers in addition to all
> commits from the master branch. The first-line committers could then
> cherry-pick from the secondary branch at some later stage. But I'm not
> sure if this would add more overhead than it saves and defeat the purpose.

While we can certainly do that, it would probably help just to have a
"second line of reviewers". Basically a set of more senior reviewers -
so a patch would go submission -> reviewer -> senior reviewer ->
committer, with the second line of reviewers focusing more on the
whole "how to do things", etc.

Like you, I'm not sure if it creates more overhead than it solves, but
it might be worth a try. In a way it already exists, because I'm sure
committers pay slightly more attention to reviews by people who have
been doing it a lot and are known to process those things than to new
entries. All that would be needed is for those people to realize it
would be helpful to do a second-stage review even if somebody else has
done the first one.

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/




Re: [HACKERS] Removing PD_ALL_VISIBLE

2013-01-17 Thread Abhijit Menon-Sen
At 2013-01-17 08:41:37 +, si...@2ndquadrant.com wrote:
>
> Jeff, can you summarise/collate why we're doing this, what concerns it
> raises and how you've dealt with them?

Since I was just looking at the original patch and discussion, and since
Pavan has posted an excerpt from one objection to it, here's an excerpt
from Jeff's original post titled "Do we need so many hint bits?"

http://www.postgresql.org/message-id/1353026577.14335.91.camel@sussancws0025

Also, I am wondering about PD_ALL_VISIBLE. It was originally
introduced in the visibility map patch, apparently as a way to know
when to clear the VM bit when doing an update. It was then also used
for scans, which showed a significant speedup. But I wonder: why not
just use the visibilitymap directly from those places? It can be
used for the scan because it is crash safe now (not possible
before). And since it's only one lookup per scanned page, then I
don't think it would be a measurable performance loss there.
Inserts/updates/deletes also do a significant amount of work, so
again, I doubt it's a big drop in performance there -- maybe under a
lot of concurrency or something.

The benefit of removing PD_ALL_VISIBLE would be significantly
higher. It's quite common to load a lot of data, and then do some
reads for a while (setting hint bits and flushing them to disk), and
then do a VACUUM a while later, setting PD_ALL_VISIBLE and writing
all of the pages again. Also, if I remember correctly, Robert went
to significant effort when making the VM crash-safe to keep the
PD_ALL_VISIBLE and VM bits consistent. Maybe this was all discussed
before?

There was considerable discussion after this (accessible through the
archives link above), which I won't attempt to summarise.

-- Abhijit




Re: [HACKERS] review: pgbench - aggregation of info written into log

2013-01-17 Thread Magnus Hagander
On Thu, Jan 17, 2013 at 2:35 AM, Tatsuo Ishii  wrote:
 This might be way more than we want to do, but there is an article
 that describes some techniques for doing what seems to be missing
 (AIUI):

 
>>> Even if this would be doable, I'm afraid it may not fit in 9.3 if we
>>> think about the current status of the CF. So our choices would be:
>>>
>>> 1) Postpone the patch to 9.4
>>>
>>> 2) Commit the patch in 9.3 without Windows support
>>>
>>> I personally am ok with #2. We traditionally avoid platform-specific
>>> features in PostgreSQL.  However I think the policy could be loosened
>>> for contrib stuff. Also, pgbench is just a client program. We could
>>> always use pgbench on UNIX/Linux if we truly need the feature.
>>>
>>> What do you think?
>>
>> Fair enough, I was just trying to point out alternatives. We have
>> committed platform-specific features before now. I hope it doesn't
>> just get left like this, though.

We have committed platform-specific features before, but generally
only when it's not *possible* to do them for all platforms. For
example the posix_fadvise stuff isn't available on Windows at all, so
there isn't much we can do there.


> Yeah, I hope someone picks this up and proposes it as a TODO item. In
> the meantime, I'm going to commit the patch without Windows support
> unless there's objection.

Perhaps we should actually hold off until someone commits to actually
getting it fixed in the next version? If we do have that, then we can
commit it as a partial feature, but if we just "hope someone picks it
up", that's leaving it very loose...

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/




Re: [HACKERS] Removing PD_ALL_VISIBLE

2013-01-17 Thread Pavan Deolasee
On Thu, Jan 17, 2013 at 2:57 PM, Abhijit Menon-Sen wrote:

>
>
> There was considerable discussion after this (accessible through the
> archives link above), which I won't attempt to summarise.
>
>
I thought Robert made those comments after considerable discussion of
Jeff's approach. So he probably still stands by his objections, or at
least isn't satisfied with / hasn't seen the numbers.

Now that I look at the patch, I wonder if there is another fundamental
issue with it. Since the patch removes WAL logging for the VM-set
operation, this can happen:

1. Vacuum kicks in, clears all dead tuples in a page, and decides that
it's all-visible
2. Vacuum WAL-logs the cleanup activity and marks the page dirty
3. Vacuum sets the visibility bit and marks the VM page dirty
4. Say the VM page gets written to disk. Neither the heap page nor the
WAL corresponding to the cleanup operation has been written yet
5. CRASH

After recovery, the VM bit will remain set because the VM page got
written before the crash. But since the heap page's cleanup WAL did not
make it to disk, those operations won't be replayed. The heap page will
be left with not-all-visible tuples in that case, and that's not a good
state to be in.

The original code does not have this problem because the VM-set WAL
record gets written after the heap page cleanup WAL. So it's guaranteed
that the VM bit will be set during recovery only if the cleanup WAL is
replayed too (there is more magic than meets the eye, and I think it's
not fully documented).
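
To make the ordering concrete, here is a self-contained toy model of the
failure (my illustration in plain C, not PostgreSQL code; it assumes, as
in step 4, that the heap page itself never reached disk):

#include <stdbool.h>
#include <stdio.h>

/* recovery sees only what was flushed before the crash */
static void
crash_and_recover(bool cleanup_wal_flushed, bool vm_page_flushed)
{
    bool heap_page_clean = cleanup_wal_flushed; /* cleanup replays from WAL */
    bool vm_bit_set = vm_page_flushed;          /* VM bit survives on disk  */

    if (vm_bit_set && !heap_page_clean)
        puts("BROKEN: VM bit set, heap page still has dead tuples");
    else
        puts("ok");
}

int
main(void)
{
    /* patched behavior: the VM set is not WAL-logged, so the VM page
     * can reach disk while the cleanup WAL record has not */
    crash_and_recover(false, true);     /* prints BROKEN */

    /* original behavior: the VM-set WAL record follows the cleanup
     * record, so the VM bit is replayed only if the cleanup is too */
    crash_and_recover(true, true);      /* prints ok */
    return 0;
}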

Thanks,
Pavan

-- 
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee


Re: [HACKERS] review: pgbench - aggregation of info written into log

2013-01-17 Thread Dave Page
On Thu, Jan 17, 2013 at 9:36 AM, Magnus Hagander  wrote:
> On Thu, Jan 17, 2013 at 2:35 AM, Tatsuo Ishii  wrote:
> This might be way more than we want to do, but there is an article
> that describes some techniques for doing what seems to be missing
> (AIUI):
>
> 
 Even if this would be doable, I'm afraid it may not fit in 9.3 if we
 think about the current status of the CF. So our choices would be:

 1) Postpone the patch to 9.4

 2) Commit the patch in 9.3 without Windows support

 I personally am ok with #2. We traditionally avoid platform-specific
 features in PostgreSQL.  However I think the policy could be loosened
 for contrib stuff. Also, pgbench is just a client program. We could
 always use pgbench on UNIX/Linux if we truly need the feature.

 What do you think?
>>>
>>> Fair enough, I was just trying to point out alternatives. We have
>>> committed platform-specific features before now. I hope it doesn't
>>> just get left like this, though.
>
> We have committed platform-specific features before, but generally
> only when it's not *possible* to do them for all platforms. For
> example the posix_fadvise stuff isn't available on Windows at all, so
> there isn't much we can do there.

Right - having platform-specific features for other reasons, like lack
of time, is a slippery slope in my opinion. We should not get into such
a habit or Windows will quickly become a second-class platform as far
as PostgreSQL features are concerned.

--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] review: pgbench - aggregation of info written into log

2013-01-17 Thread Magnus Hagander
On Thu, Jan 17, 2013 at 11:09 AM, Dave Page  wrote:
> On Thu, Jan 17, 2013 at 9:36 AM, Magnus Hagander  wrote:
>> On Thu, Jan 17, 2013 at 2:35 AM, Tatsuo Ishii  wrote:
>> This might be way more than we want to do, but there is an article
>> that describes some techniques for doing what seems to be missing
>> (AIUI):
>>
>> 
> Even if this would be doable, I'm afraid it may not fit in 9.3 if we
> think about the current status of the CF. So our choices would be:
>
> 1) Postpone the patch to 9.4
>
> 2) Commit the patch in 9.3 without Windows support
>
> I personally am ok with #2. We traditionally avoid platform-specific
> features in PostgreSQL.  However I think the policy could be loosened
> for contrib stuff. Also, pgbench is just a client program. We could
> always use pgbench on UNIX/Linux if we truly need the feature.
>
> What do you think?

 Fair enough, I was just trying to point out alternatives. We have
 committed platform-specific features before now. I hope it doesn't
 just get left like this, though.
>>
>> We have committed platform-specific features before, but generally
>> only when it's not *possible* to do them for all platforms. For
>> example the posix_fadvise stuff isn't available on Windows at all, so
>> there isn't much we can do there.
>
> Right - having platform-specific features for other reasons, like lack
> of time, is a slippery slope in my opinion. We should not get into such
> a habit or Windows will quickly become a second-class platform as far
> as PostgreSQL features are concerned.

Especially since there is no lack of time - the functionality is
there, it just looks (significantly) different.

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/




Re: [HACKERS] Event Triggers: adding information

2013-01-17 Thread Dimitri Fontaine
Robert Haas  writes:
> - adds ddl_command_trace and ddl_command_end events
> - causes those events to be called not only for actual SQL commands
> but also for things that happen, at present, to go through the same
> code path
> - adds additional magic variables to PL/pgsql to expose new
> information not previously exposed
> - adds CONTEXT as an additional event-trigger-filter variable, along with TAG
>
> In my mind, that's four or five separate patches.  You don't have to
> agree, but that's what I think, and I'm not going to apologize for
> thinking it.  Moreover, I think you will struggle to find examples of
> patches committed in the last three years that made as many different,
> only loosely-related changes as you're proposing for this one.

So, say I do that. Now we have 5 patches to review. They are all against
the same set of files, so there's a single possible ordering in which to
apply them, and thus to review them. When we reach an agreement out of
order (we will), it will take a considerable amount of time to rebuild
the patch set in a different order. I don't think we have the time to do
that.

Now, if that's what it takes, I'll spend time on it. In which exact
order do you want to review and apply that series of patches?

> As for the patch not being well-thought-out, I'm sticking by that,
> too.  Alvaro's comments about lock escalations should be enough to
> tickle your alarm bells, but there are plenty of other problems here,
> too.

Here's the comment in the code where the lock problems are discussed:

/*
 * Fill in the EventTriggerTargetOid for a DROP command and a
 * "ddl_command_start" Event Trigger: the lookup didn't happen yet and we
 * still want to provide the user with that information when possible.
 *
 * We only do the lookup if the command contains a single element, which is
 * often forced to be the case as per the grammar (see the drop_type:
 * production for a list of cases where more than one object per command is
 * allowed).
 *
 * We take no exclusive lock here, the main command will have to do the
 * name lookup all over again and take appropriate locks down after the
 * User Defined Code runs anyway, and the commands executed by the User
 * Defined Code will take their own locks. We only lock the objects here so
 * that they don't disappear under us in another concurrent transaction,
 * hence using ShareUpdateExclusiveLock.  XXX this seems prone to lock
 * escalation troubles.
 */

My thinking is that the user code can do anything, even run a DROP
command against the very object that we are preparing to drop. I don't
think we can avoid doing the lookup twice.

The only way to avoid any lookup in our code is to not report the OID
of the object being dropped. Then the event trigger code will certainly
do that lookup as its first action. The lookup will happen either way.

    RAISE NOTICE 'drop table % (%)', tg_objectname::regclass,
                 tg_objectname::regclass::oid;

That is, at least, the angle I considered when crafting this patch. I'm
happy to change that angle if needed and adjust for things I missed,
which you will tell me about.

>> I think you might want to review the use case behind ddl_command_trace,
>> which has been asked for by users who wanted to see their use case
>> supported in some easier way than just what you're talking about here.
>
> […]
> Being able to do
> that in one command instead of two is not a sufficient reason to add
> another event type.

That feature was proposed last October (with a patch) and received no
comments, so I went ahead with it. The performance cost of this patch is
another couple of cache lookups when doing a DDL command, which I saw as
acceptable.

  http://www.postgresql.org/message-id/m28vbgcc30@2ndquadrant.fr

I'm not so attached to it that I will put the rest of the patch in
danger over it; I'm fine with removing that part if we have a good
reason to. And yes, you (politely) saying "I don't like it" is a good
enough reason.

Our process for a non-committer proposal seems to be that you have to
send a patch showing what you're talking about, so that a thread can
form and the idea can get some comments, either rejecting it or reaching
a consensus about it. So the opening of that thread was in October and
the discussion happens now: that's fine, and I could be happy and
thankful about this fact, given some subtle changes in the tone of the
messages.

> I'm very much opposed to that proposal, as I am to your proposal to
> expose internal and generated events to users.  Recursing into
> ProcessUtility is a nasty, messy hook that is responsible for subtle
> bugs and locking problems in our current DDL implementation.  We

We are publishing 2 pieces of information tied to the implementation in
the current patch:

 - the parse tree, only available to C code, and without any API
   stability promises

 - the fact that some commands run nested commands: serial does
   a create sequence, create 

Re: [HACKERS] Multiple --table options for other commands

2013-01-17 Thread Magnus Hagander
On Fri, Dec 14, 2012 at 4:14 PM, Karl O. Pinc  wrote:
> On 12/13/2012 11:02:56 PM, Josh Kupershmidt wrote:
>> On Thu, Dec 13, 2012 at 9:03 PM, Karl O. Pinc  wrote:
>
>> > Sorry to be so persnickety, and unhelpful until now.
>> > It seemed like it should be doable, but something
>> > was going wrong between keyboard and chair.  I guess
>> > I should be doing this when better rested.
>>
>> No problem, here is v5 with changed synopses.
>
> The clusterdb synopsis had tabs in it, which I understand
> is frowned upon in the docs.  I've fixed this.
> It looks good to me, passes check and so forth.
>
> Attached is a v6 patch, with no tabs in docs and based
> off the latest head.
>
> I'm marking it ready for committer.

Thanks. Applied, with only some small whitespace changes.

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/




Re: [HACKERS] Hot Standby conflict resolution handling

2013-01-17 Thread Andres Freund
On 2013-01-17 01:38:31 -0500, Tom Lane wrote:
> But having said that ... are we sure this code is not actually broken?
> ISTM that if we dare not interrupt for fear of confusing OpenSSL, we
> cannot safely attempt to send an error message to the client either;
> but ereport(FATAL) will try exactly that.

You're absolutely right.

ISTM that to fix it we would have to either provide a COMMERROR-like
facility for FATAL errors or just remove the requirement of raising an
error exactly there.
If I remember what I tried before correctly, the latter seems to involve
setting OpenSSL into nonblocking mode, checking for the saved error in
the EINTR loop in be-secure.c, and re-raising the error from there.

Do we want to backport either? There hasn't been any report that I could
link to this right now, but on the other hand it could possibly cause
rather ugly problems (data leakage, segfaults, and code execution aren't
all that improbable).

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




[HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

2013-01-17 Thread Andres Freund
On 2013-01-17 13:47:41 +0900, Michael Paquier wrote:
> Hi all,
> 
> There is a strange bug with the latest master head (commit 7fcbf6a).
> When the WAL stream with a master is cut on a slave, slave returns a FATAL
> (well normal...), but then enters in recovery process and automatically
> promotes.
> Here are more details about the logs on slave (I simply killed the master
> manually):
> FATAL:  could not receive data from WAL stream:
> cp: cannot stat
> ‘/home/michael/bin/pgsql/archive/master/00010004’: No such
> file or directory
> LOG:  record with zero length at 0/401E1B8
> LOG:  redo done at 0/401E178
> LOG:  last completed transaction was at log time 2013-01-17
> 20:27:53.180971+09
> cp: cannot stat ‘/home/michael/bin/pgsql/archive/master/0002.history’:
> No such file or directory
> LOG:  selected new timeline ID: 2
> cp: cannot stat ‘/home/michael/bin/pgsql/archive/master/0001.history’:
> No such file or directory
> LOG:  archive recovery complete
> DEBUG:  resetting unlogged relations: cleanup 0 init 1
> LOG:  database system is ready to accept connections
> LOG:  autovacuum launcher started
> DEBUG:  archived transaction log file "00010004"
> DEBUG:  archived transaction log file "0002.history"
> LOG:  statement: create table bn (a int);
> DEBUG:  autovacuum: processing database "postgres"
> 
> The slave no longer tries to reconnect to the master, with messages of the type:
> FATAL:  could not connect to the primary server
> 
> I also noticed that there is some delay until modifications on the master
> are visible on the slave.
> For example, run a simple CREATE TABLE and the new table is not
> 
> [some bisecting later...]
> 
> I think that bug has been introduced by commit 7fcbf6a.
> Before splitting xlog reading out as a separate facility, things worked
> correctly.
> There are also no delay problems before this commit.
> 
> Has someone else noticed this?

No, at least I haven't, but it got committed only a rather short time
ago ;)
Looking. I have an inkling of where the problem could be.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] review: pgbench - aggregation of info written into log

2013-01-17 Thread Tomas Vondra

Dne 17.01.2013 10:36, Magnus Hagander napsal:
On Thu, Jan 17, 2013 at 2:35 AM, Tatsuo Ishii  wrote:
This might be way more than we want to do, but there is an article
that describes some techniques for doing what seems to be missing
(AIUI):


Even if this would be doable, I'm afraid it may not fit in 9.3 if we
think about the current status of the CF. So our choices would be:

1) Postpone the patch to 9.4

2) Commit the patch in 9.3 without Windows support

I personally am ok with #2. We traditionally avoid platform-specific
features in PostgreSQL.  However I think the policy could be loosened
for contrib stuff. Also, pgbench is just a client program. We could
always use pgbench on UNIX/Linux if we truly need the feature.

What do you think?

Fair enough, I was just trying to point out alternatives. We have
committed platform-specific features before now. I hope it doesn't
just get left like this, though.

We have committed platform-specific features before, but generally
only when it's not *possible* to do them for all platforms. For
example the posix_fadvise stuff isn't available on Windows at all, so
there isn't much we can do there.


Maybe, although this platform-dependence already exists in pgbench to
some extent (the Windows branch is unable to log the timestamps of
transactions).

It would certainly be better if pgbench was able to handle Windows and
Linux equally, but that was not the aim of this patch.


Yeah, I hope someone picks this up and proposes it as a TODO item. In
the meantime, I'm going to commit the patch without Windows support
unless there's objection.

Perhaps we should actually hold off until someone commits to actually
getting it fixed in the next version? If we do have that, then we can
commit it as a partial feature, but if we just "hope someone picks it
up", that's leaving it very loose...


Well, given that I'm an author of that patch, that 'someone' would have
to be me. The problem is I have access to absolutely no Windows
machines, not to mention the development tools (and I have no clue about
them).


I vaguely remember there were people on this list doing Windows
development on a virtual machine or something. Any interest in working
on this / giving me some help?

Tomas




Re: [HACKERS] review: pgbench - aggregation of info written into log

2013-01-17 Thread Tomas Vondra

Dne 17.01.2013 11:16, Magnus Hagander napsal:
On Thu, Jan 17, 2013 at 11:09 AM, Dave Page  wrote:
On Thu, Jan 17, 2013 at 9:36 AM, Magnus Hagander  wrote:

We have committed platform-specific features before, but generally
only when it's not *possible* to do them for all platforms. For
example the posix_fadvise stuff isn't available on Windows at all, so
there isn't much we can do there.

Right - having platform-specific features for other reasons, like lack
of time, is a slippery slope in my opinion. We should not get into such
a habit or Windows will quickly become a second-class platform as far
as PostgreSQL features are concerned.

Especially since there is no lack of time - the functionality is
there, it just looks (significantly) different.


Really? Any link to relevant docs or something?

When doing some research in this field, the only option I was able to
come up with was combining gettimeofday() with the timing functionality,
doing something like this:

  1) call gettimeofday() at thread start, giving a common unix timestamp
  2) measure the time from the thread start using the counters (for each
     transaction)
  3) combine those values

This might of course give up to a second of difference compared to the
actual time (because of the gettimeofday precision), but IMHO that's
fine.

An even simpler option would omit (1), so the timestamps would start
at 0.
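
A minimal sketch of what I have in mind, assuming the Windows
QueryPerformanceCounter API with _ftime() standing in for gettimeofday()
(illustration only, not actual pgbench code):

#include <windows.h>
#include <sys/timeb.h>

typedef struct
{
    double        start_epoch;   /* wall clock at thread start (seconds) */
    LARGE_INTEGER start_ticks;   /* counter value at thread start */
    double        ticks_per_sec;
} ThreadClock;

static void
thread_clock_init(ThreadClock *tc)
{
    struct _timeb tb;
    LARGE_INTEGER freq;

    _ftime(&tb);                 /* step 1: one wall-clock reading */
    tc->start_epoch = (double) tb.time + tb.millitm / 1000.0;
    QueryPerformanceFrequency(&freq);
    tc->ticks_per_sec = (double) freq.QuadPart;
    QueryPerformanceCounter(&tc->start_ticks);
}

/* steps 2+3: per-transaction timestamp = start time + counter delta */
static double
thread_clock_now(const ThreadClock *tc)
{
    LARGE_INTEGER now;

    QueryPerformanceCounter(&now);
    return tc->start_epoch +
        (double) (now.QuadPart - tc->start_ticks.QuadPart) / tc->ticks_per_sec;
}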


Or is there a better way?

Tomas




Re: [HACKERS] Materialized views WIP patch

2013-01-17 Thread Robert Haas
On Wed, Jan 16, 2013 at 1:38 PM, Tom Lane  wrote:
>> Where I really need someone to hit me upside the head with a
>> clue-stick is the code I added to the bottom of RelationBuildDesc()
>> in relcache.c. The idea is that on first access to an unlogged MV,
>> to detect that the heap has been replaced by the init fork, set
>> relisvalid to false, and make the heap look normal again.
>
> Hmm.  I agree that relcache.c has absolutely no business doing that,
> but not sure what else to do instead.  Seems like it might be better
> done at first touch of the MV in the parser, rewriter, or planner ---
> but the fact that I can't immediately decide which of those is right
> makes me feel that it's still too squishy.

I think we shouldn't be doing that at all.  The whole business of
transferring the relation-is-invalid information from the relation to
a pg_class flag seems like a Rube Goldberg device to me.  I'm still
not convinced that we should have a relation-is-invalid flag at all,
but can we at least not have two?

It seems perfectly adequate to detect that the MV is invalid when we
actually try to execute a plan - that is, when we first access the
heap or one of its indexes.  So the bit can just live in the
file-on-disk, and there's no need to have a second copy of it in
pg_class.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] review: pgbench - aggregation of info written into log

2013-01-17 Thread Andrew Dunstan


On 01/17/2013 06:11 AM, Tomas Vondra wrote:

Dne 17.01.2013 11:16, Magnus Hagander napsal:

On Thu, Jan 17, 2013 at 11:09 AM, Dave Page  wrote:
On Thu, Jan 17, 2013 at 9:36 AM, Magnus Hagander  wrote:

We have committed platform-specific features before, but generally
only when it's not *possible* to do them for all platforms. For
example the posix_fadvise stuff isn't available on Windows at all, so
there isn't much we can do there.


Right - having platform specific features for other reasons like lack
of time is a slippery slope in my opinion. We should not get into such
a habit or Windows will quickly become a second class platform as far
as PostgreSQL features are concerned.


Especially since there is no lack of time - the functionality is
there, it just looks (significantly) different.


Really? Any link to relevant docs or something?

When doing some research in this field, the only option I was able to
come up with was combining gettimeofday() with the timing functionality,
doing something like this:

  1) call gettimeofday() at thread start, giving a common unix timestamp
  2) measure the time from the thread start using the counters (for each
     transaction)
  3) combine those values

This might of course give up to a second of difference compared to the
actual time (because of the gettimeofday precision), but IMHO that's
fine.


The link I posted showed a technique which essentially uses edge 
detection on gettimeofday() to get an accurate correspondence between 
the time of day and the performance counter, following which you can 
supposedly calculate the time of day with a high degree of accuracy just 
from the performance counter. You should be able to do this just once, 
at program start. If you don't care that much about the initial 
precision then your procedure should work fine, I think.
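
For illustration, a hedged sketch of that calibration idea (hypothetical
names, not the article's code): spin until the coarse clock ticks over,
record the counter at that edge, and derive wall-clock time from the
counter alone thereafter:

#include <windows.h>
#include <time.h>

static double epoch_at_zero;   /* Unix epoch time when the counter read 0 */
static double ticks_per_sec;

static void
calibrate(void)
{
    LARGE_INTEGER freq, ticks;
    time_t t0, t1;

    QueryPerformanceFrequency(&freq);
    ticks_per_sec = (double) freq.QuadPart;

    /* busy-wait for the one-second clock to tick over: an "edge" */
    t0 = time(NULL);
    do
    {
        t1 = time(NULL);
        QueryPerformanceCounter(&ticks);
    } while (t1 == t0);

    epoch_at_zero = (double) t1 - (double) ticks.QuadPart / ticks_per_sec;
}

/* wall-clock time derived from the counter alone, after calibration */
static double
current_time(void)
{
    LARGE_INTEGER ticks;

    QueryPerformanceCounter(&ticks);
    return epoch_at_zero + (double) ticks.QuadPart / ticks_per_sec;
}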


cheers

andrew




[HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

2013-01-17 Thread Andres Freund
On 2013-01-17 13:47:41 +0900, Michael Paquier wrote:
> Hi all,
> 
> There is a strange bug with the latest master head (commit 7fcbf6a).
> When the WAL stream with the master is cut on a slave, the slave returns
> a FATAL (well, normal...), but then enters the recovery process and
> automatically promotes.
> Here are more details about the logs on slave (I simply killed the master
> manually):
> FATAL:  could not receive data from WAL stream:
> cp: cannot stat
> ‘/home/michael/bin/pgsql/archive/master/00010004’: No such
> file or directory
> LOG:  record with zero length at 0/401E1B8
> LOG:  redo done at 0/401E178
> LOG:  last completed transaction was at log time 2013-01-17
> 20:27:53.180971+09
> cp: cannot stat ‘/home/michael/bin/pgsql/archive/master/0002.history’:
> No such file or directory
> LOG:  selected new timeline ID: 2
> cp: cannot stat ‘/home/michael/bin/pgsql/archive/master/0001.history’:
> No such file or directory
> LOG:  archive recovery complete
> DEBUG:  resetting unlogged relations: cleanup 0 init 1
> LOG:  database system is ready to accept connections
> LOG:  autovacuum launcher started
> DEBUG:  archived transaction log file "00010004"
> DEBUG:  archived transaction log file "0002.history"
> LOG:  statement: create table bn (a int);
> DEBUG:  autovacuum: processing database "postgres"
> 
> The slave no longer tries to reconnect to the master, with messages of the type:
> FATAL:  could not connect to the primary server
> 
> I also noticed that there is some delay until modifications on the master
> are visible on the slave.
> For example, run a simple CREATE TABLE and the new table is not
> 
> [some bisecting later...]
> 
> I think that bug has been introduced by commit 7fcbf6a.
> Before splitting xlog reading out as a separate facility, things worked
> correctly.
> There are also no delay problems before this commit.

Ok, my inkling proved to be correct; it's two related issues:

a) The error handling in XLogReadRecord is inconsistent; it doesn't
always reset the necessary things.

b) ReadRecord: We cannot just break out of the retry loop in ReadRecord
like that; simply removing the break seems correct.

c) ReadRecord (minor): We should log an error even if errormsg is not
set, otherwise we won't jump out far enough.

I think at least a) and b) are the result of merges between the
development of different people, sorry for that.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] CF3+4

2013-01-17 Thread Craig Ringer
On 01/17/2013 04:43 PM, Magnus Hagander wrote:
>
>
> On Jan 17, 2013 8:15 AM, "Abhijit Menon-Sen"  wrote:
> >
> > At 2013-01-17 16:05:05 +0900, michael.paqu...@gmail.com wrote:
> > >
> > > Is it really necessary to create a new commit fest just to move the
> > > items? The patches that are considered too late for 9.3 should just
> > > be returned with feedback.
> >
> > Opening 2013-03 is not so much to move existing patches, but to give
> > people a place to submit *new* post-9.3 patches without interrupting
> > the 2013-01 CF.
>
> Yeah, and +1 for doing that. The sooner the better. By whichever of
> the time frames for the CF that have been discussed, it should
> certainly no longer be open for accepting new patches for 9.3.
>
I've moved all pending patches from 2012-11 to 2013-01. I'll go through
and poke them for aliveness and start chasing things up; in the
meantime, any chance of closing 2012-11?

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] review: pgbench - aggregation of info written into log

2013-01-17 Thread Andrew Dunstan


On 01/17/2013 06:04 AM, Tomas Vondra wrote:

The problem is I have access to absolutely no Windows machines, not to
mention the development tools (and I have no clue about them).

I vaguely remember there were people on this list doing Windows
development on a virtual machine or something. Any interest in working
on this / giving me some help?

One of the items on my very long TODO list (see stuff elsewhere about
committers not doing enough reviewing and committing) is to create a
package that can easily be run to set up Windows Postgres development
environments for MSVC, MinGW and Cygwin on Amazon, GoGrid etc.

Once I get that done I'll be a lot less sympathetic to "I don't have
access to Windows" pleas.

Windows does run in a VM very well, but if you're going to set it up in
your own VM environment (e.g. VirtualBox or KVM/qemu), you need your own
legal copy to install there. If you don't already have one, that will
set you back about $140.00 (for W7 Pro) in the USA. Note that that's
significantly better than the situation with OS X, which you can't run
at all except on Apple hardware.


cheers

andrew





Re: [HACKERS] review: pgbench - aggregation of info written into log

2013-01-17 Thread Dave Page
On Thu, Jan 17, 2013 at 1:29 PM, Andrew Dunstan  wrote:
>
> On 01/17/2013 06:04 AM, Tomas Vondra wrote:
>>
>> The problem is I have access to absolutely no Windows machines, not to
>> mention the development tools (and I have no clue about them).
>>
>> I vaguely remember there were people on this list doing Windows
>> development
>> on a virtual machine or something. Any interest in working on this /
>> giving
>> me some help?
>>
>>
>
>
> One of the items on my very long TODO list (see stuff elsewhere about
> committers not doing enough reviewing and committing) is to create a
> package that can easily be run to set up Windows Postgres development
> environments for MSVC, MinGW and Cygwin on Amazon, GoGrid etc.

FYI, I have a similar item on my TODO list, though I was looking
primarily at VC++ on AWS. It did get close to the top last week, but
then I got busy with other things :-/. Anyway, I'd suggest we ping each
other if either of us actually gets started, to avoid duplication of
effort.


--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] review: pgbench - aggregation of info written into log

2013-01-17 Thread Magnus Hagander
On Thu, Jan 17, 2013 at 2:29 PM, Andrew Dunstan  wrote:
>
> On 01/17/2013 06:04 AM, Tomas Vondra wrote:
>>
>> The problem is I have access to absolutely no Windows machines, not to
>> mention the development tools (and I have no clue about them).
>>
>> I vaguely remember there were people on this list doing Windows
>> development
>> on a virtual machine or something. Any interest in working on this /
>> giving
>> me some help?
>>
>>
>
>
> One of the items on my very long TODO list (see stuff elsewhere about
> committers not doing enough reviewing and committing) is to create a
> package that can easily be run to set up Windows Postgres development
> environments for MSVC, MinGW and Cygwin on Amazon, GoGrid etc.
>
> Once I get that done I'll be a lot less sympathetic to "I don't have
> access to Windows" pleas.
>
> Windows does run in a VM very well, but if you're going to set it up in
> your own VM environment (e.g. VirtualBox or KVM/qemu), you need your own
> legal copy to install there. If you don't already have one, that will
> set you back about $140.00 (for W7 Pro) in the USA. Note that that's
> significantly better than the situation with OS X, which you can't run
> at all except on Apple hardware.

Yeah. I used to have an AMI with the VS environment preinstalled on
Amazon, but I managed to fat-finger things and delete it at some point,
and haven't really had time to rebuild it.

Having a script that would download and install all the prerequisites
on such a box would be *great*. Then you could get up and going pretty
quickly, and getting a Windows box up and running for a few hours there
is almost free, and you don't have to deal with licensing hassles.

(Of course, the AMI method doesn't work all the way, since you'd be
distributing Visual Studio, but if we can have a script that
auto-downloads-and-installs it as necessary, we can get around that)


--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/




Re: [HACKERS] small pg_basebackup display bug

2013-01-17 Thread Magnus Hagander
On Sun, Dec 16, 2012 at 7:20 PM, Tom Lane  wrote:
> Magnus Hagander  writes:
>> On Sat, Dec 15, 2012 at 2:24 PM, Erik Rijkers  wrote:
>>> That would make such a truncation less frequent, and after all a
>>> truncated display is not particularly useful.
>
>> Agreed - it's useful during testing, but not in typical production
>> use. It might actually be more useful if it's truncated at the other
>> end (keeping the last 30 instead of the first 30 chars)
>
> +1 for truncating from the left.  I think pg_upgrade already does that
> in its progress messages.

Fixed. I also fixed the output of the size parameter to be a fixed
length, so the whole row doesn't shift left and right depending on how
far along the process is.
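
For what it's worth, the left-truncation itself is just something like
this sketch (illustrative only, not the committed code):

#include <stdio.h>
#include <string.h>

/*
 * Print at most "width" characters of "path", cutting from the left
 * and prefixing "..." so it's obvious something was dropped.
 */
static void
print_left_truncated(const char *path, int width)
{
    int len = (int) strlen(path);

    if (len <= width)
        printf("%s", path);
    else
        printf("...%s", path + len - (width - 3));
}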

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/




Re: [HACKERS] default SSL compression (was: libpq compression)

2013-01-17 Thread Magnus Hagander
On Wed, Jan 2, 2013 at 3:17 PM, Magnus Hagander  wrote:
> On Wed, Jan 2, 2013 at 3:15 PM, Noah Misch  wrote:
>> On Wed, Jan 02, 2013 at 02:03:20PM +0100, Magnus Hagander wrote:
>>> On Wed, Jan 2, 2013 at 1:15 AM, Tom Lane  wrote:
>>> > So +1 for changing it to "DEFAULT" from me, too.  There's no reason to
>>> > think we know more about this than the OpenSSL authors.
>>>
>>> The DEFAULT value in OpenSSL 1.0 means "ALL:!aNULL:!eNULL".
>>>
>>> Researching some more, this might cause a problem actually, which
>>> would explain some of the things that are in our default. For example,
>>> an ADH algorithm doesn't use certificates - but it uses DH parameters,
>>> so it likely won't work anyway. EDH uses certs, but also requires DH
>>> parameters.
>>>
>>> Maybe what we nede is "DEFAULT:!ADH:@STRENGTH" as the default?
>>
>> I understand aNULL to include ADH.
>
> Hmm. Seems you're right when I run a test on it; I was reading it wrong.
>
>
>>> The other difference is that our current string denies 40- and 56-bit
>>> encryption (low and export strengths). Do we still want to do that?
>>
>> On the one hand, those seem bad to permit by default in 2013.  On the other
>> hand, if so, why hasn't OpenSSL removed them from DEFAULT?  Perhaps it has
>> backward-compatibility concerns that wouldn't apply to us by virtue of having
>> disabled them for some time.  Sounds reasonable to continue disabling them.
>
> So then the default would be "DEFAULT:!LOW:!EXP:@STRENGTH"
>
> (the @STRENGTH part is the sorting key for preference, which the
> default one seems not to have)
>
> The biggest difference being that we start from DEFAULT rather than ALL.

I've applied a change that does this, including still rejecting MD5.
Meaning that the difference is we start from DEFAULT instead of ALL
(and the ADH rule is removed, since !aNULL is already in the default).
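
For reference, the resulting setting would look something like the
following in postgresql.conf (my reconstruction from the discussion
above; the committed string may differ in detail):

  ssl_ciphers = 'DEFAULT:!LOW:!EXP:!MD5:@STRENGTH'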

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/




[HACKERS] could not create directory "...": File exists

2013-01-17 Thread Stephen Frost
Greetings,

  We've come across this rather annoying error happening during our
  builds:

  ERROR:  could not create directory "pg_tblspc/25120/PG_9.3_201212081/231253": 
File exists

  It turns out that this is coming from copydir() when called by
  createdb() during a CREATE DATABASE .. FROM TEMPLATE where the
  template has tables in tablespaces.

  It turns out that createdb() currently only takes an AccessShareLock
  on pg_tablespace when scanning it with SnapshotNow, making it possible
  for a concurrent process to make some uninteresting modification to a
  tablespace (such as an ACL change) which will cause the heap scan in
  createdb() to see a given tablespace multiple times.  copydir() will
  then, rightfully, complain that it's being asked to create a directory
  which already exists.

  Given that this is during createdb(), I'm guessing we don't have any
  option but to switch the scan to using ShareLock, since there isn't a
  snapshot available to do an MVCC scan with (I'm guessing that there
  could be other issues trying to do that anyway).

  Attached is a patch which does this and corrects the problem for us
  (of course, we're now going to go rework our build system to not
  modify tablespace ACLs, since with this patch concurrent builds end up
  blocking on each other, which is annoying).

Thanks,

Stephen
colordiff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
new file mode 100644
index 1f6e02d..60a5099
*** a/src/backend/commands/dbcommands.c
--- b/src/backend/commands/dbcommands.c
*** createdb(const CreatedbStmt *stmt)
*** 549,556 
  		/*
  		 * Iterate through all tablespaces of the template database, and copy
  		 * each one to the new database.
  		 */
! 		rel = heap_open(TableSpaceRelationId, AccessShareLock);
  		scan = heap_beginscan(rel, SnapshotNow, 0, NULL);
  		while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
  		{
--- 549,563 
  		/*
  		 * Iterate through all tablespaces of the template database, and copy
  		 * each one to the new database.
+ 		 *
+ 		 * We need to use ShareLock to prevent other processes from updating a
+ 		 * tablespace (such as changing an ACL, for example) or we will end up
+ 		 * running into the same tablespace multiple times during our heap scan
+ 		 * (the prior-to-update tuple and then the new tuple after the update)
+ 		 * and copydir() will, rightfully, complain that the directory already
+ 		 * exists.
  		 */
! 		rel = heap_open(TableSpaceRelationId, ShareLock);
  		scan = heap_beginscan(rel, SnapshotNow, 0, NULL);
  		while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
  		{
*** createdb(const CreatedbStmt *stmt)
*** 607,613 
  			}
  		}
  		heap_endscan(scan);
! 		heap_close(rel, AccessShareLock);
  
  		/*
  		 * We force a checkpoint before committing.  This effectively means
--- 614,620 
  			}
  		}
  		heap_endscan(scan);
! 		heap_close(rel, ShareLock);
  
  		/*
  		 * We force a checkpoint before committing.  This effectively means




Re: [HACKERS] Removing PD_ALL_VISIBLE

2013-01-17 Thread Robert Haas
On Thu, Jan 17, 2013 at 3:49 AM, Pavan Deolasee
 wrote:
> May be you've already addressed that concern with the proven
> performance numbers, but I'm not sure.

It would be nice to hear what Heikki's reasons were for adding
PD_ALL_VISIBLE in the first place.

Jeff's approach of holding the VM pins for longer certainly mitigates
some of the damage, in the sense that it reduces buffer lookups and
pin/unpin cycles - and might be worth doing independently of the rest
of the patch if we think it's a win.  Index-only scans already use a
similar optimization, so extending it to inserts, updates, and deletes
is surely worth considering.  The main question in my mind is whether
there are any negative consequences to holding a VM buffer pin for
that long without interruption.  The usual consideration - namely,
blocking vacuum - doesn't apply here because vacuum does not take a
cleanup lock on the visibility map page, only the heap page, but I'm
not sure if there are others.

The other part of the issue is cache pressure. I don't think I can say
it better than what Tom already wrote:

# I'd be worried about the case of a lot of sessions touching a lot of
# unrelated tables.  This change implies doubling the number of buffers
# that are held pinned by any given query, and the distributed overhead
# from that (eg, adding cycles to searches for free buffers) is what you
# ought to be afraid of.

I agree that we ought to be afraid of that.  A pgbench test isn't
going to find a problem in this area because there you have a bunch of
sessions all accessing the same small group of tables.  To find a
problem of the type above, you'd need lots of backends accessing lots
of different, small tables.  That's not the use case we usually
benchmark, but I think there are a fair number of people doing things
like that - for example, imagine a hosting provider or web application
with many databases or schemas on a single instance.  AFAICS, Jeff
hasn't tried to test this scenario.

Now, on the flip side, we should also be thinking about what we would
gain from this patch, and what other ways there might be to achieve
the same goals.  As far as I can see, the main gain is that if you
bulk-load a table, don't vacuum it right away, get all the hint bits
set by some other mechanism, and then vacuum the table, you'll only
read the whole table instead of rewriting the whole table.  So ISTM
that, for example, if we adopted the idea of making HOT prune set
visibility-map bits, most of the benefit of this change evaporates,
but whatever costs it may have will remain.  There are other possible
ways of getting the same benefit as well - for example, I've been
thinking for a while now that we should try to find some judicious way
of vacuuming insert-only tables, perhaps only in small chunks when
there's nothing else going on.  That wouldn't as clearly obsolete this
patch, but it would address a very similar use case and would also
preset hint bits, which would help a lot of people.  Some of the ideas
we've had about a HEAP_XMIN_FREEZE intersect with this idea, too - if
we can do an early freeze without losing forensic information, we can
roll setting the hint bit, setting PD_ALL_VISIBLE, and freezing the
page into a single write.

All of which is to say that I think this patch is premature.  If we
adopt something like this, we're likely never going to revert back to
the way we do it now, and whatever cache-pressure or other costs this
approach carries will be here to stay - so we had better think awfully
carefully before we do that.  And, FWIW, I don't believe that there is
sufficient time in this release cycle to carefully test this patch and
the other alternative designs that aim toward the same ends.  Even if
there were, this is exactly the sort of thing that should be committed
at the beginning of a release cycle, not the end, so as to allow
adequate time for discovery of unforeseen consequences before the code
ends up out in the wild.

Of course, there's another issue here too, which is that, as Pavan
points out, the patch throws crash-safety out the window, which breaks
index-only scans (if you have a crash).  HEAP_XLOG_VISIBLE is intended
principally to protect the VM bit, not PD_ALL_VISIBLE, but the patch
rips it out even though its purpose is to remove the latter, not the
former.  Removing PD_ALL_VISIBLE eliminates the need to keep the
visibility-map bit consistent with PD_ALL_VISIBLE, but it does not
eliminate the need to keep PD_ALL_VISIBLE consistent with the page
contents.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] CF3+4

2013-01-17 Thread Robert Haas
On Thu, Jan 17, 2013 at 8:17 AM, Craig Ringer  wrote:
> I've moved all pending patches from 2012-11 to 2013-01. I'll go through and
> poke them for aliveness and start chasing things up; in the mean time, any
> chance of closing 2012-11?

Done.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Teaching pg_receivexlog to follow timeline switches

2013-01-17 Thread Robert Haas
On Wed, Jan 16, 2013 at 11:08 AM, Heikki Linnakangas
 wrote:
> I'd prefer to leave the .partial suffix in place, as the segment really
> isn't complete. It doesn't make a difference when you recover to the latest
> timeline, but if you have a more complicated scenario with multiple
> timelines that are still "alive", ie. there's a server still actively
> generating WAL on that timeline, you'll easily get confused.
>
> As an example, imagine that you have a master server, and one standby. You
> maintain a WAL archive for backup purposes with pg_receivexlog, connected to
> the standby. Now, for some reason, you get a split-brain situation and the
> standby server is promoted with new timeline 2, while the real master is
> still running. The DBA notices the problem, and kills the standby and
> pg_receivexlog. He deletes the XLOG files belonging to timeline 2 in
> pg_receivexlog's target directory, and re-points pg_receivexlog to the master
> while he re-builds the standby server from backup. At that point,
> pg_receivexlog will start streaming from the end of the zero-padded segment,
> not knowing that it was partial, and you have a hole in the archived WAL
> stream. Oops.
>
> The DBA could avoid that by also removing the last WAL segment on timeline
> 1, the one that was partial. But it's really not obvious that there's
> anything wrong with that segment. Keeping the .partial suffix makes it
> clear.

I shudder at the idea that the DBA is manually involved in any of this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Teaching pg_receivexlog to follow timeline switches

2013-01-17 Thread Heikki Linnakangas

On 17.01.2013 16:56, Robert Haas wrote:

On Wed, Jan 16, 2013 at 11:08 AM, Heikki Linnakangas
  wrote:

I'd prefer to leave the .partial suffix in place, as the segment really
isn't complete. It doesn't make a difference when you recover to the latest
timeline, but if you have a more complicated scenario with multiple
timelines that are still "alive", ie. there's a server still actively
generating WAL on that timeline, you'll easily get confused.

As an example, imagine that you have a master server, and one standby. You
maintain a WAL archive for backup purposes with pg_receivexlog, connected to
the standby. Now, for some reason, you get a split-brain situation and the
standby server is promoted with new timeline 2, while the real master is
still running. The DBA notices the problem, and kills the standby and
pg_receivexlog. He deletes the XLOG files belonging to timeline 2 in
pg_receivexlog's target directory, and re-points pg_receivexlog to the master
while he re-builds the standby server from backup. At that point,
pg_receivexlog will start streaming from the end of the zero-padded segment,
not knowing that it was partial, and you have a hole in the archived WAL
stream. Oops.

The DBA could avoid that by also removing the last WAL segment on timeline
1, the one that was partial. But it's really not obvious that there's
anything wrong with that segment. Keeping the .partial suffix makes it
clear.


I shudder at the idea that the DBA is manually involved in any of this.


The scenario I described is that you screwed up your failover 
environment and ended up with a split-brain situation by accident. The 
DBA certainly needs to be involved to recover from that.


- Heikki




Re: [HACKERS] Removing PD_ALL_VISIBLE

2013-01-17 Thread Heikki Linnakangas

On 17.01.2013 16:53, Robert Haas wrote:

On Thu, Jan 17, 2013 at 3:49 AM, Pavan Deolasee
  wrote:

May be you've already addressed that concern with the proven
performance numbers, but I'm not sure.


It would be nice to hear what Heikki's reasons were for adding
PD_ALL_VISIBLE in the first place.


The idea was to avoid clearing the bit in the VM page on every update, 
when the bit is known to not be set, ie. when the PD_ALL_VISIBLE flag is 
not set. I assumed the traffic and contention on the VM page would be a 
killer otherwise. I don't remember if I ever actually tested that 
though. Maybe I was worrying about nothing and hitting the VM page on 
every update is ok.
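
For reference, the pattern in question looks roughly like this in the
update path (a simplified, illustrative sketch only - the names are
modeled on the heap code of that era, not quoted verbatim from it):

	/*
	 * Only touch the visibility map when the page-level hint says the
	 * VM bit might actually be set.  Without PD_ALL_VISIBLE, every
	 * insert/update/delete would have to pin and examine the VM page.
	 */
	if (PageIsAllVisible(BufferGetPage(buffer)))
	{
		all_visible_cleared = true;
		PageClearAllVisible(BufferGetPage(buffer));
		visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
							vmbuffer);
	}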


- Heikki




Re: [HACKERS] Teaching pg_receivexlog to follow timeline switches

2013-01-17 Thread Robert Haas
On Thu, Jan 17, 2013 at 9:59 AM, Heikki Linnakangas
 wrote:
> The scenario I described is that you screwed up your failover environment
> and ended up with a split-brain situation by accident. The DBA certainly needs
> to be involved to recover from that.

OK, I agree, but I still think a lot of DBAs would have no idea how to
handle that situation.  I agree with your proposal, don't get me wrong
- I just think there's still an awful lot of room for operator error
in these more complex replication scenarios.  I don't have a clue how
to fix that, and it's certainly not the purpose of this thread to fix
that; I'm just venting.

Actually, I'm really glad to see all the work you've done to improve
the way that some of these scenarios work and eliminate various bugs
and other surprising failure modes over the last couple of months.
It's great stuff.  Alas, I think we are still some distance from being
able to provide an "easy button".

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

2013-01-17 Thread Heikki Linnakangas

On 17.01.2013 15:05, Andres Freund wrote:

On 2013-01-17 13:47:41 +0900, Michael Paquier wrote:

I think that bug has been introduced by commit 7fcbf6a.
Before splitting xlog reading as a separate facility things worked
correctly.
There are also no delay problems before this commit.


Ok, my inkling proved to be correct; it's a few related issues:

a) The error handling in XLogReadRecord is inconsistent, it doesn't
always reset the necessary things.

b) ReadRecord: We must not break out of the retry loop in ReadRecord
like that; simply removing the break seems correct.

c) ReadRecord (minor): We should log an error even if errormsg is not
set, otherwise we won't jump out far enough.

I think at least a) and b) are the result of merges between the
development of different people, sorry for that.


Got a patch?

- Heikki




Re: [HACKERS] Hot Standby conflict resolution handling

2013-01-17 Thread Tom Lane
Pavan Deolasee  writes:
> On Thu, Jan 17, 2013 at 12:08 PM, Tom Lane  wrote:
>> ISTM that if we dare not interrupt for fear of confusing OpenSSL, we
>> cannot safely attempt to send an error message to the client either;
>> but ereport(FATAL) will try exactly that.

> I thought since FATAL will force the backend to exit, we don't care much
> about corrupted OpenSSL state. I even thought that's why we raise ERROR to
> FATAL so that the backend can start in a clean state. But clearly I'm
> missing a point here because you don't think that way.

If we were to simply exit(1), leaving the kernel to close the client
socket, it'd be safe enough because control would never have returned to
OpenSSL.  But this code doesn't do that.  What we're looking at is that
we've interrupted OpenSSL at some arbitrary point, and now we're going
to make fresh calls to it to try to pump the FATAL error message out to
the client.  It seems fairly unlikely that that's safe.  I'm not sure
I credit Andres' worry of arbitrary code execution, but I do fear that
OpenSSL could get confused to the point of freezing up, or even more
likely that it would transmit garbage to the client, which rather
defeats the purpose.

Don't see a nice fix.  The COMMERROR approach (ie, don't try to send
anything to the client, only the log) is not nice at all since the
client would get the impression that the server crashed.  On the other
hand, anything else requires waiting till we get control back from
OpenSSL, which might be a long time, and meanwhile we're still holding
locks that prevent WAL recovery from proceeding.

regards, tom lane




[HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

2013-01-17 Thread Andres Freund
On 2013-01-17 17:18:14 +0200, Heikki Linnakangas wrote:
> On 17.01.2013 15:05, Andres Freund wrote:
> >On 2013-01-17 13:47:41 +0900, Michael Paquier wrote:
> >>I think that bug has been introduced by commit 7fcbf6a.
> >>Before splitting xlog reading as a separate facility things worked
> >>correctly.
> >>There are also no delay problems before this commit.
> >
> >Ok, my inkling proved to be correct; it's a few related issues:
> >
> >a) The error handling in XLogReadRecord is inconsistent, it doesn't
> >always reset the necessary things.
> >
> >b) ReadRecord: We must not break out of the retry loop in ReadRecord
> >like that; simply removing the break seems correct.
> >
> >c) ReadRecord (minor): We should log an error even if errormsg is not
> >set, otherwise we won't jump out far enough.
> >
> >I think at least a) and b) are the result of merges between the
> >development of different people, sorry for that.
> 
> Got a patch?

Yes, I am just testing some scenarios with it, will send it after that.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

2013-01-17 Thread Andres Freund
On 2013-01-17 16:23:44 +0100, Andres Freund wrote:
> On 2013-01-17 17:18:14 +0200, Heikki Linnakangas wrote:
> > On 17.01.2013 15:05, Andres Freund wrote:
> > >On 2013-01-17 13:47:41 +0900, Michael Paquier wrote:
> > >>I think that bug has been introduced by commit 7fcbf6a.
> > >>Before splitting xlog reading as a separate facility things worked
> > >>correctly.
> > >>There are also no delay problems before this commit.
> > >
> > >Ok, my inkling proved to be correct; it's a few related issues:
> > >
> > >a) The error handling in XLogReadRecord is inconsistent, it doesn't
> > >always reset the necessary things.
> > >
> > >b) ReadRecord: We must not break out of the retry loop in ReadRecord
> > >like that; simply removing the break seems correct.
> > >
> > >c) ReadRecord (minor): We should log an error even if errormsg is not
> > >set, otherwise we won't jump out far enough.
> > >
> > >I think at least a) and b) are the result of merges between the
> > >development of different people, sorry for that.
> > 
> > Got a patch?
> 
> Yes, I am just testing some scenarios with it, will send it after that.

Ok, the attached patch seems to fix a) and b). c) above is bogus, as
explained in a comment in the patch.  I also noticed that the TLI check
didn't mark the last source as failed.

Not a real issue, and it's independent from this patch, but I found that
when promoting from streaming rep the code first fails over to archive
recovery and only then to recovering from pg_xlog.  Is that intended?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 70cfabc..e33bd7a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3209,12 +3209,6 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 		EndRecPtr = xlogreader->EndRecPtr;
 		if (record == NULL)
 		{
-			/* not all failures fill errormsg; report those that do */
-			if (errormsg && errormsg[0] != '\0')
-ereport(emode_for_corrupt_record(emode,
- RecPtr ? RecPtr : EndRecPtr),
-		(errmsg_internal("%s", errormsg) /* already translated */));
-
 			lastSourceFailed = true;
 
 			if (readFile >= 0)
@@ -3222,7 +3216,23 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 close(readFile);
 readFile = -1;
 			}
-			break;
+
+			/*
+			 * We only end up here without a message when XLogPageRead() failed
+			 * - in that case we already logged something.
+			 * In StandbyMode that only happens if we have been triggered, so
+			 * we shouldn't loop anymore in that case.
+			 */
+			if (errormsg == NULL)
+break;
+
+
+			ereport(emode_for_corrupt_record(emode,
+			 RecPtr ? RecPtr : EndRecPtr),
+	(errmsg_internal("%s", errormsg) /* already translated */));
+
+			/* don't make a timeline check */
+			continue;
 		}
 
 		/*
@@ -3234,6 +3244,8 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 			XLogSegNo segno;
 			int32 offset;
 
+			lastSourceFailed = true;
+
 			XLByteToSeg(xlogreader->latestPagePtr, segno);
 			offset = xlogreader->latestPagePtr % XLogSegSize;
 			XLogFileName(fname, xlogreader->readPageTLI, segno);
@@ -3243,7 +3255,7 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 			xlogreader->latestPageTLI,
 			fname,
 			offset)));
-			return false;
+			continue;
 		}
 	} while (StandbyMode && record == NULL);
 
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index ff871a3..75164f6 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -222,11 +222,7 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 	readOff = ReadPageInternal(state, targetPagePtr, SizeOfXLogRecord);
 
 	if (readOff < 0)
-	{
-		if (state->errormsg_buf[0] != '\0')
-			*errormsg = state->errormsg_buf;
-		return NULL;
-	}
+		goto err;
 
 	/*
 	 * ReadPageInternal always returns at least the page header, so we can
@@ -246,8 +242,7 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 	{
 		report_invalid_record(state, "invalid record offset at %X/%X",
 			  (uint32) (RecPtr >> 32), (uint32) RecPtr);
-		*errormsg = state->errormsg_buf;
-		return NULL;
+		goto err;
 	}
 
 	if ((((XLogPageHeader) state->readBuf)->xlp_info & XLP_FIRST_IS_CONTRECORD) &&
@@ -255,8 +250,7 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 	{
 		report_invalid_record(state, "contrecord is requested by %X/%X",
 			  (uint32) (RecPtr >> 32), (uint32) RecPtr);
-		*errormsg = state->errormsg_buf;
-		return NULL;
+		goto err;
 	}
 
 	/* ReadPageInternal has verified the page header */
@@ -270,11 +264,7 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 			   targetPagePtr,
 		

Re: [HACKERS] could not create directory "...": File exists

2013-01-17 Thread Tom Lane
Stephen Frost  writes:
>   It turns out that createdb() currently only takes an AccessShareLock
>   on pg_tablespace when scanning it with SnapshotNow, making it possible
>   for a concurrent process to make some uninteresting modification to a
>   tablespace (such as an ACL change) which will cause the heap scan in
>   createdb() to see a given tablespace multiple times.  copydir() will
>   then, rightfully, complain that it's being asked to create a directory
>   which already exists.

Ugh.  Still another problem with non-MVCC catalog scans.

>   Given that this is during createdb(), I'm guessing we don't have any
>   option but to switch the scan to using ShareLock, since there isn't a
>   snapshot available to do an MVCC scan with (I'm guessing that there
>   could be other issues trying to do that anyway).

It seems that the only thing we actually use from each tuple is the OID.
So there are other ways to fix it, of which probably the minimum-change
one is to keep a list of already-processed tablespace OIDs and skip a
tuple if we get a match in the list.  This would be O(N^2) in the number
of tablespaces, but I rather doubt that's a problem.
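
A minimal sketch of that minimum-change idea (hypothetical code using
the pg_list.h helpers; as noted just below, it still wouldn't cure the
missed-row case):

	List	   *done_oids = NIL;	/* tablespaces already copied */

	while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
	{
		Oid			srctablespace = HeapTupleGetOid(tuple);

		/* skip tuples the SnapshotNow scan hands us a second time */
		if (list_member_oid(done_oids, srctablespace))
			continue;
		done_oids = lappend_oid(done_oids, srctablespace);

		/* ... existing copydir() logic for this tablespace ... */
	}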

[ thinks for a bit... ]  Ugh, no, because the *other* risk you've got
here is not seeing a row at all, which would be really bad.

I'm not sure that transiently increasing the lock here is enough,
either.  The concurrent transactions you're worried about probably
aren't holding their locks till commit, so you could get this lock
and still see tuples with unstable committed-good states.

regards, tom lane




Re: [HACKERS] Hot Standby conflict resolution handling

2013-01-17 Thread Andres Freund
On 2013-01-17 10:19:23 -0500, Tom Lane wrote:
> Pavan Deolasee  writes:
> > On Thu, Jan 17, 2013 at 12:08 PM, Tom Lane  wrote:
> >> ISTM that if we dare not interrupt for fear of confusing OpenSSL, we
> >> cannot safely attempt to send an error message to the client either;
> >> but ereport(FATAL) will try exactly that.
> 
> > I thought since FATAL will force the backend to exit, we don't care much
> > about corrupted OpenSSL state. I even thought that's why we raise ERROR to
> > FATAL so that the backend can start in a clean state. But clearly I'm
> > missing a point here because you don't think that way.
> 
> If we were to simply exit(1), leaving the kernel to close the client
> socket, it'd be safe enough because control would never have returned to
> OpenSSL.  But this code doesn't do that.  What we're looking at is that
> we've interrupted OpenSSL at some arbitrary point, and now we're going
> to make fresh calls to it to try to pump the FATAL error message out to
> the client.  It seems fairly unlikely that that's safe.  I'm not sure
> I credit Andres' worry of arbitrary code execution, but I do fear that
> OpenSSL could get confused to the point of freezing up, or even more
> likely that it would transmit garbage to the client, which rather
> defeats the purpose.

I don't think it's likely either. I seem to remember it copying around
function pointers though, so it seems possible with some bad luck.

> Don't see a nice fix.  The COMMERROR approach (ie, don't try to send
> anything to the client, only the log) is not nice at all since the
> client would get the impression that the server crashed.  On the other
> hand, anything else requires waiting till we get control back from
> OpenSSL, which might be a long time, and meanwhile we're still holding
> locks that prevent WAL recovery from proceeding.

I think we can make openssl return pretty much immediately if we assume
recv() can reliably be interrupted by signals, possibly by setting the
socket to nonblocking in the signal handler.
We just need to tell openssl not to retry immediately and we should be
fine. Given that quite a few people use openssl with nonblocking sockets,
that code path should be reasonably safe.

That still requires ugliness around saving the error and re-raising it
after returning from openssl though...
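
A rough sketch of that idea (hypothetical; MyProcPort->sock stands in
for the client socket, and this leans on fcntl() being async-signal-safe,
which POSIX guarantees):

	#include <fcntl.h>
	#include <signal.h>

	static volatile sig_atomic_t conflict_pending = 0;

	static void
	recovery_conflict_handler(int signo)
	{
		int			flags;

		conflict_pending = 1;

		/*
		 * Flip the client socket to non-blocking so the recv() OpenSSL
		 * is sitting in returns EAGAIN instead of blocking forever.
		 * The mainline code then notices conflict_pending and raises
		 * the saved error once control is safely out of OpenSSL.
		 */
		flags = fcntl(MyProcPort->sock, F_GETFL, 0);
		if (flags >= 0)
			fcntl(MyProcPort->sock, F_SETFL, flags | O_NONBLOCK);
	}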

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Materialized views WIP patch

2013-01-17 Thread Thom Brown
On 16 January 2013 17:25, Thom Brown  wrote:

> On 16 January 2013 17:20, Kevin Grittner  wrote:
>
>> Thom Brown wrote:
>>
>> > Some weirdness:
>> >
>> > postgres=# CREATE VIEW v_test2 AS SELECT 1 moo;
>> > CREATE VIEW
>> > postgres=# CREATE MATERIALIZED VIEW mv_test2 AS SELECT moo, 2*moo FROM
>> > v_test2 UNION ALL SELECT moo, 3*moo FROM v_test2;
>> > SELECT 2
>> > postgres=# \d+ mv_test2
>> >            Materialized view "public.mv_test2"
>> >   Column  |  Type   | Modifiers | Storage | Stats target | Description
>> > ----------+---------+-----------+---------+--------------+-------------
>> >  moo      | integer |           | plain   |              |
>> >  ?column? | integer |           | plain   |              |
>> > View definition:
>> >  SELECT "*SELECT* 1".moo, "*SELECT* 1"."?column?";
>>
>> You are very good at coming up with these, Thom!
>>
>> Will investigate.
>>
>> Can you confirm that *selecting* from the MV works as you would
>> expect; it is just the presentation in \d+ that's a problem?
>>
>
> Yes, nothing wrong with using the MV, or refreshing it:
>
> postgres=# TABLE mv_test2;
>  moo | ?column?
> -----+----------
>    1 |        2
>    1 |        3
> (2 rows)
>
> postgres=# SELECT * FROM mv_test2;
>  moo | ?column?
> -----+----------
>    1 |        2
>    1 |        3
> (2 rows)
>
> postgres=# REFRESH MATERIALIZED VIEW mv_test2;
> REFRESH MATERIALIZED VIEW
>
> But a pg_dump of the MV has the same issue as the view definition:
>
> --
> -- Name: mv_test2; Type: MATERIALIZED VIEW; Schema: public; Owner: thom;
> Tablespace:
> --
>
> CREATE MATERIALIZED VIEW mv_test2 (
> moo,
> "?column?"
> ) AS
> SELECT "*SELECT* 1".moo, "*SELECT* 1"."?column?"
>   WITH NO DATA;
>

A separate issue is with psql tab-completion:

postgres=# COMMENT ON MATERIALIZED VIEW ^IIS

This should be offering MV names instead of prematurely providing the "IS"
keyword.

-- 
Thom


Re: [HACKERS] Event Triggers: adding information

2013-01-17 Thread Robert Haas
On Thu, Jan 17, 2013 at 5:18 AM, Dimitri Fontaine
 wrote:
> Now, if that's what it takes, I'll spend time on it. In which exact
> order do you want to be reviewing and applying that series of patches?

Let's agree on which things we even want to do first.  Here's my take:

- adds ddl_command_trace and ddl_command_end events

I think ddl_command_end is OK and I'm willing to commit that if
extracted as its own patch.  I think ddl_command_trace is unnecessary
syntactic sugar.

- causes those events to be called not only for actual SQL commands
but also for things that happen, at present, to go through the same
code path

I think this is a bad idea, not only because, as I said before, it
exposes internal implementation details, but also because it's
probably not safe.  As Noah (I believe, or maybe Tom) observed in a
previous post, we've gone to a lot of work to make sure that nothing
you do inside a trigger can crash the server, corrupt the catalogs,
trigger elog() messages, etc.  That applies in spades to DDL.

Despite Tom's skepticism, I'm pretty sure that it's safe for us to
execute trigger function calls either completely before or completely
after we execute the DDL command.  It should be no different than
executing them as separate commands.  But doing something in the
middle of a command is something else altogether.  Consider, for
example, that ALTER TABLE does a bunch of crap internally to itself,
and then recurses into ProcessUtility to do some more stuff.  This is
a terrible design, but it's what we've got.  Are you absolutely sure
that the intermediate state that exists after that first bunch of
stuff is done and before those other things are done is a safe place
to execute arbitrary user-provided code?  I'm not.  And the more
places we add that recurse into ProcessUtility, or indeed, add event
triggers in general, the more places we're going to add new failure
modes.

Fixing that is part of the price of adding a new event trigger and I'm
OK with that.  But I'm not OK with baking into the code the assumption
that any current or future callers of ProcessUtility() should be
prepared for arbitrary code execution.  The code wasn't written with
that in mind, and it will not end well.

- adds additional magic variables to PL/pgsql to expose new
information not previously exposed

I'm not sure that I agree with the implementation of this, but I think
I'm OK with the concept.  I will review it when it is submitted in a
form not intertwined with other things.

- adds CONTEXT as an additional event-trigger-filter variable, along with TAG

I'm OK with adding new event-trigger-filter variables, provided that
they are done without additional run-time overhead for people who are
not using event triggers; I think it's nice that we can support that.
But I think this particular proposal is DOA because nothing except
TOPLEVEL is actually safe.

>> As for the patch not being well-thought-out, I'm sticking by that,
>> too.  Alvaro's comments about lock escalations should be enough to
>> tickle your alarm bells, but there are plenty of other problems here,
>> too.
>
> Here's the comment in the code where the lock problems are discussed:
>
> /*
>  * Fill in the EventTriggerTargetOid for a DROP command and a
>  * "ddl_command_start" Event Trigger: the lookup didn't happen yet and we
>  * still want to provide the user with that information when possible.
>  *
>  * We only do the lookup if the command contains a single element, which is
>  * often forced to be the case as per the grammar (see the drop_type:
>  * production for a list of cases where more than one object per command is
>  * allowed).
>  *
>  * We take no exclusive lock here, the main command will have to do the
>  * name lookup all over again and take appropriate locks down after the
>  * User Defined Code runs anyway, and the commands executed by the User
>  * Defined Code will take their own locks. We only lock the objects here so
>  * that they don't disappear under us in another concurrent transaction,
>  * hence using ShareUpdateExclusiveLock.  XXX this seems prone to lock
>  * escalation troubles.
>  */
>
> My thinking is that the user code can do anything, even run a DROP
> command against the object that we are preparing ourselves to be
> dropping. I don't think we can bypass doing the lookup twice.

Sure, but there's also the issue of getting the right answer.  For
example, suppose you are planning to drop a function.  You do a name
lookup on the name and call the event trigger.  Meanwhile, someone
else renames the function and creates a new one with the same name.
After the event trigger returns, the actual drop code latches onto the
one with the new name.  Now, the event trigger ran with OID A and the
actual object dropped was one with OID B.  It's possible that KaiGai's
RENAME refactoring in 9.3 has strengthened the locking enough that
this particular failure mode is no longer possible in that specific
scenario, but I am fairly confident it woul

Re: [HACKERS] Event Triggers: adding information

2013-01-17 Thread Andres Freund
On 2013-01-17 11:15:24 -0500, Robert Haas wrote:
> As a further example, suppose that in 9.4 (or 9.17) we add a command
> "DROP TABLES IN SCHEMA fred WHERE name LIKE 'bob%'".  Well, the
> logging trigger still just works (because it's only writing the
> statement, without caring about its contents).  And the replication
> trigger still just works too (because it will still get a callback for
> every table that gets dropped, even though it knows nothing about the
> new syntax).  That's very powerful.  Of course, there will still be
> cases where things have to change, but you can minimize it by clean
> definitions.
> 
> It pains me that I've evidently failed to communicate this concept
> clearly despite a year or more of trying.  Does that make sense?  Is
> there some way I can make this more clear?  The difference seems very
> clear-cut to me, but evidently I'm not conveying it well.

Well, I didn't manage to follow the topic with the level of detail I
would have liked to/should have, so it's very well possible that I missed
a proposal of how to implement that concept in a way you like.
With some squinting you can actually see the proposed ddl_trace callsite
as exactly that kind of replication trigger, but with a worse name...

Without piggy-backing on the internally generated commands - which
currently seems to be achieved by passing everything through
ProcessUtility, but could work in a quite different way - and
reconstructing SQL from that, I don't immediately see how you want to do
this.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

2013-01-17 Thread Heikki Linnakangas

On 17.01.2013 17:42, Andres Freund wrote:

Ok, the attached patch seems to fix a) and b). c) above is bogus, as
explained in a comment in the patch.  I also noticed that the TLI check
didn't mark the last source as failed.


This looks fragile:


			/*
			 * We only end up here without a message when XLogPageRead() failed
			 * - in that case we already logged something.
			 * In StandbyMode that only happens if we have been triggered, so
			 * we shouldn't loop anymore in that case.
			 */
			if (errormsg == NULL)
				break;


I don't like relying on the presence of an error message to control 
logic like that. Should we throw in an explicit CheckForStandbyTrigger() 
check in the condition of that loop?


- Heikki




Re: [HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

2013-01-17 Thread Andres Freund
On 2013-01-17 18:33:42 +0200, Heikki Linnakangas wrote:
> On 17.01.2013 17:42, Andres Freund wrote:
> >Ok, the attached patch seems to fix a) and b). c) above is bogus, as
> >explained in a comment in the patch.  I also noticed that the TLI check
> >didn't mark the last source as failed.
> 
> This looks fragile:
> 
> > 			/*
> > 			 * We only end up here without a message when XLogPageRead() failed
> > 			 * - in that case we already logged something.
> > 			 * In StandbyMode that only happens if we have been triggered, so
> > 			 * we shouldn't loop anymore in that case.
> > 			 */
> > 			if (errormsg == NULL)
> > 				break;
> 
> I don't like relying on the presence of an error message to control logic
> like that. Should we throw in an explicit CheckForStandbyTrigger() check in
> the condition of that loop?

I agree, I wasn't too happy about that either. But in some sense it's
only propagating state from XLogPageRead(), which has already dealt with
the error and decided it's ok.
It's the solution closest to what happened in the old implementation;
that's why I thought it would be halfway to acceptable.

Adding the CheckForStandbyTrigger() in the condition would mean
promotion would happen before all the available records are processed,
and it would increase the number of stat() calls tremendously.
So I don't really like that either.

Don't really have a good idea :(

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Latex longtable format

2013-01-17 Thread Bruce Momjian
On Sat, Jan 12, 2013 at 07:09:31PM -0500, Bruce Momjian wrote:
> I have received several earnest requests over the years for LaTeX
> 'longtable' output, and I have just implemented it based on a sample
> LaTeX longtable output file.
> 
> I have called it 'latex-longtable' and implemented all the behaviors of
> ordinary latex mode.   One feature is that in latex-longtable mode,
> 'tableattr' allows control over the column widths --- that seemed to be
> very important to the users.  One requested change I made to the
> ordinary latex output was to suppress the line under the table title if
> border = 0 (default is border = 1).
> 
> Patch and sample output attached.  I would like to apply this for PG
> 9.3.

Modified patch applied.  I didn't need to modify our existing 'latex'
output format, except to add border=3 support.  It was tempting to
suggest renaming our 'latex' output format to 'latex-tabular', because
that is what it uses, but 'tabular' is a package included by default in
Latex, while 'longtable' requires two additional packages to be
specified, so I resisted suggesting it.  I did document that 'latex'
uses 'tabular'.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +




Re: [HACKERS] Materialized views WIP patch

2013-01-17 Thread Simon Riggs
On 16 January 2013 05:40, Kevin Grittner  wrote:

> Here is a new version of the patch, with most issues discussed in
> previous posts fixed.

Looks good.

The patch implements one kind of MV. In the future, we hope to have
other features or other kinds of MV alongside this:
* Snapshot MV - built once at start and then refreshed by explicit command only
* Snapshot MV with fast refresh
* Maintained MV (lazy) - changes trickle continuously into lazy MVs
* Maintained MV (transactional) - changes applied as part of write transactions
and or others

So I think we should agree now some aspects of those other options so
we can decide syntax. Otherwise we'll be left in the situation that
what we implement in 9.3 becomes the default for all time and/or we
have difficulties adding things later. e.g.
REFRESH ON COMMAND

Also, since there is no optimizer linkage between these MVs and the
tables they cover, I think we need to have that explicitly as a
command option, e.g.
DISABLE OPTIMIZATION

That way in the future we can implement "ENABLE OPTIMIZATION" mode and
REFRESH TRANSACTIONAL mode as separate items.

So all I am requesting is that we add additional syntax now, so that
future additional features are clear.

Please suggest syntax, not wedded to those... and we may want to use
more compatible syntax also.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

2013-01-17 Thread Heikki Linnakangas

On 17.01.2013 18:42, Andres Freund wrote:

On 2013-01-17 18:33:42 +0200, Heikki Linnakangas wrote:

On 17.01.2013 17:42, Andres Freund wrote:

Ok, the attached patch seems to fix a) and b). c) above is bogus, as
explained in a comment in the patch.  I also noticed that the TLI check
didn't mark the last source as failed.


This looks fragile:


			/*
			 * We only end up here without a message when XLogPageRead() failed
			 * - in that case we already logged something.
			 * In StandbyMode that only happens if we have been triggered, so
			 * we shouldn't loop anymore in that case.
			 */
			if (errormsg == NULL)
				break;


I don't like relying on the presence of an error message to control logic
like that. Should we throw in an explicit CheckForStandbyTrigger() check in
the condition of that loop?


I agree, I wasn't too happy about that either. But in some sense it's
only propagating state from XLogPageRead(), which has already dealt with
the error and decided it's ok.
It's the solution closest to what happened in the old implementation;
that's why I thought it would be halfway to acceptable.

Adding the CheckForStandbyTrigger() in the condition would mean
promotion would happen before all the available records are processed,
and it would increase the number of stat() calls tremendously.
So I don't really like that either.


I was thinking of the attached. As long as we check for 
CheckForStandbyTrigger() after the "record == NULL" check, we won't 
perform extra stat() calls on successful reads, only when we're polling 
after reaching the end of valid WAL. That seems acceptable. If we want 
to avoid even that, we could move the static 'triggered' variable from 
CheckForStandbyTrigger() out of that function, and check that in the loop.


- Heikki
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 70cfabc..ac2b26b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3180,7 +3180,8 @@ RestoreBackupBlock(XLogRecPtr lsn, XLogRecord *record, int block_index,
  * try to read a record just after the last one previously read.
  *
  * If no valid record is available, returns NULL, or fails if emode is PANIC.
- * (emode must be either PANIC, LOG)
+ * (emode must be either PANIC, LOG). In standby mode, retries until a valid
+ * record is available.
  *
  * The record is copied into readRecordBuf, so that on successful return,
  * the returned record pointer always points there.
@@ -3209,12 +3210,6 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 		EndRecPtr = xlogreader->EndRecPtr;
 		if (record == NULL)
 		{
-			/* not all failures fill errormsg; report those that do */
-			if (errormsg && errormsg[0] != '\0')
-ereport(emode_for_corrupt_record(emode,
- RecPtr ? RecPtr : EndRecPtr),
-		(errmsg_internal("%s", errormsg) /* already translated */));
-
 			lastSourceFailed = true;
 
 			if (readFile >= 0)
@@ -3222,7 +3217,20 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 close(readFile);
 readFile = -1;
 			}
-			break;
+
+			/*
+			 * We only end up here without a message when XLogPageRead() failed
+			 * - in that case we already logged something.
+			 * In StandbyMode that only happens if we have been triggered, so
+			 * we shouldn't loop anymore in that case.
+			 */
+			if (errormsg)
+ereport(emode_for_corrupt_record(emode,
+ RecPtr ? RecPtr : EndRecPtr),
+		(errmsg_internal("%s", errormsg) /* already translated */));
+
+			/* Give up, or retry if we're in standby mode. */
+			continue;
 		}
 
 		/*
@@ -3234,6 +3242,8 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 			XLogSegNo segno;
 			int32 offset;
 
+			lastSourceFailed = true;
+
 			XLByteToSeg(xlogreader->latestPagePtr, segno);
 			offset = xlogreader->latestPagePtr % XLogSegSize;
 			XLogFileName(fname, xlogreader->readPageTLI, segno);
@@ -3243,9 +3253,10 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 			xlogreader->latestPageTLI,
 			fname,
 			offset)));
-			return false;
+			record = NULL;
+			continue;
 		}
-	} while (StandbyMode && record == NULL);
+	} while (StandbyMode && record == NULL && !CheckForStandbyTrigger());
 
 	return record;
 }
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index ff871a3..75164f6 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -222,11 +222,7 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 	readOff = ReadPageInternal(state, targetPagePtr, SizeOfXLogRecord);
 
 	if (readOff < 0)
-	{
-		if (state->errormsg_buf[0] != '\0')
-			*errormsg = state->errormsg_buf;
-		return NULL;
-	}
+		goto err;
 
 	

[HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

2013-01-17 Thread Andres Freund
On 2013-01-17 18:50:35 +0200, Heikki Linnakangas wrote:
> On 17.01.2013 18:42, Andres Freund wrote:
> >On 2013-01-17 18:33:42 +0200, Heikki Linnakangas wrote:
> >>On 17.01.2013 17:42, Andres Freund wrote:
> >>>Ok, the attached patch seems to fix a) and b). c) above is bogus, as
> >>>explained in a comment in the patch.  I also noticed that the TLI check
> >>>didn't mark the last source as failed.
> >>
> >>This looks fragile:
> >>
> >>>			/*
> >>>			 * We only end up here without a message when XLogPageRead() failed
> >>>			 * - in that case we already logged something.
> >>>			 * In StandbyMode that only happens if we have been triggered, so
> >>>			 * we shouldn't loop anymore in that case.
> >>>			 */
> >>>			if (errormsg == NULL)
> >>>				break;
> >>
> >>I don't like relying on the presence of an error message to control logic
> >>like that. Should we throw in an explicit CheckForStandbyTrigger() check in
> >>the condition of that loop?
> >
> >I agree, I wasn't too happy about that either. But in some sense it's
> >only propagating state from XLogPageRead(), which has already dealt with
> >the error and decided it's ok.
> >It's the solution closest to what happened in the old implementation;
> >that's why I thought it would be halfway to acceptable.
> >
> >Adding the CheckForStandbyTrigger() in the condition would mean
> >promotion would happen before all the available records are processed,
> >and it would increase the number of stat() calls tremendously.
> >So I don't really like that either.
> 
> I was thinking of the attached. As long as we check for
> CheckForStandbyTrigger() after the "record == NULL" check, we won't perform
> extra stat() calls on successful reads, only when we're polling after
> reaching the end of valid WAL. That seems acceptable. 

Looks good to me.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Event Triggers: adding information

2013-01-17 Thread Dimitri Fontaine
Robert Haas  writes:
> - adds ddl_command_trace and ddl_command_end events
>
> I think ddl_command_end is OK and I'm willing to commit that if
> extracted as its own patch.  I think ddl_command_trace is unnecessary
> syntactic sugar.

Ok. Will prepare a non controversial patch for ddl_command_end. After
having played with the patch, I happen to quite like ddl_command_trace,
but I won't try to protect it from its death any more than that.

> - causes those events to be called not only for actual SQL commands
> but also for things that happen, at present, to go through the same
> code path
>
> I think this is a bad idea, not only because, as I said before, it
> exposes internal implementation details, but also because it's
> probably not safe.  As Noah (I believe, or maybe Tom), observed in a
> previous post, we've gone to a lot of work to make sure that nothing
> you do inside a trigger can crash the server, corrupt the catalogs,
> trigger elog() messages, etc.  That applies in spades to DDL.

I naively thought that doing CommandCounterIncrement() before running the
event trigger would go far enough towards making the operation
safe when the current code is calling back into ProcessUtility().

> Despite Tom's skepticism, I'm pretty sure that it's safe for us to
> execute trigger function calls either completely before or completely
> after we execute the DDL command.  It should be no different than
> executing them as separate commands.  But doing something in the
> middle of a command is something else altogether.  Consider, for
> example, that ALTER TABLE does a bunch of crap internally to itself,
> and then recurses into ProcessUtility to do some more stuff.  This is
> a terrible design, but it's what we've got.  Are you absolutely sure
> that the intermediate state that exists after that first bunch of
> stuff is done and before those other things are done is a safe place
> to execute arbitrary user-provided code?  I'm not.  And the more
> places we add that recurse into ProcessUtility, or indeed, add event
> triggers in general, the more places we're going to add new failure
> modes.

While I hear you, I completely fail to understand which magic is going
to make your other idea safer than this one. Namely, calling an event
trigger named sql_drop rather than ddl_command_start from the same place
in the code is certainly not helping at all.

If your answer to that is how to implement it (e.g. keeping a queue of
events for which to fire Event Triggers once we reach a safe point in
the code), then why couldn't we do exactly the same thing under the
other name?

> Fixing that is part of the price of adding a new event trigger and I'm
> OK with that.  But I'm not OK with baking into the code the assumption
> that any current or future callers of ProcessUtility() should be
> prepared for arbitrary code execution.  The code wasn't written with
> that in mind, and it will not end well.

Again, while I agree with the general principle, it happens that the
calls to the arbitrary code are explicit in ProcessUtility(), so that if
the place is known to be unsafe it's easy enough to refrain from calling
the user code. We do that already for CREATE INDEX CONCURRENTLY at
ddl_command_end.

> - adds additional magic variables to PL/pgsql to expose new
> information not previously exposed
>
> I'm not sure that I agree with the implementation of this, but I think
> I'm OK with the concept.  I will review it when it is submitted in a
> form not intertwined with other things.

Fair enough.

> - adds CONTEXT as an additional event-trigger-filter variable, along with TAG
>
> I'm OK with adding new event-trigger-filter variables, provided that
> they are done without additional run-time overhead for people who are
> not using event triggers; I think it's nice that we can support that.
> But I think this particular proposal is DOA because nothing except
> TOPLEVEL is actually safe.

Yeah, let's first see about that.

>> My thinking is that the user code can do anything, even run a DROP
>> command against the object that we are preparing ourselves to be
>> dropping. I don't think we can bypass doing the lookup twice.
>
> Sure, but there's also the issue of getting the right answer.  For
> example, suppose you are planning to drop a function.  You do a name
> lookup on the name and call the event trigger.  Meanwhile, someone
> else renames the function and creates a new one with the same name.
> After the event trigger returns, the actual drop code latches onto the
> one with the new name.  Now, the event trigger ran with OID A and the
> actual object dropped was one with OID B.  It's possible that KaiGai's

Is your scenario even possible when we take a ShareUpdateExclusiveLock
on the object before running the Event Trigger, in order to protect
against exactly that scenario, as said in the comment? Maybe another
lock should be taken. Which one?

> There's another issue here, too, which is that this change
> reintroduces a fai

[HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

2013-01-17 Thread Heikki Linnakangas

On 17.01.2013 18:55, Andres Freund wrote:

On 2013-01-17 18:50:35 +0200, Heikki Linnakangas wrote:

I was thinking of the attached. As long as we check for
CheckForStandbyTrigger() after the "record == NULL" check, we won't perform
extra stat() calls on successful reads, only when we're polling after
reaching the end of valid WAL. That seems acceptable.


Looks good to me.


Ok, committed.

- Heikki




[HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

2013-01-17 Thread Andres Freund
On 2013-01-17 13:47:41 +0900, Michael Paquier wrote:
> Slave does not try anymore to reconnect to master with messages of the type:
> FATAL:  could not connect to the primary server
>
> I also noticed that there is some delay until modifications on master are
> visible on slave.

> I think that bug has been introduced by commit 7fcbf6a.
> Before splitting xlog reading as a separate facility things worked
> correctly.
> There are also no delay problems before this commit.

Heikki committed a fix for at least the promotion issue, I didn't notice
any problem with an increased delay, could you check again if you still
see it?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] could not create directory "...": File exists

2013-01-17 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Ugh.  Still another problem with non-MVCC catalog scans.

Indeed.

> It seems that the only thing we actually use from each tuple is the OID.

Yes, that's true.

> So there are other ways to fix it, of which probably the minimum-change
> one is to keep a list of already-processed tablespace OIDs and skip a
> tuple if we get a match in the list.  This would be O(N^2) in the number
> of tablespaces, but I rather doubt that's a problem.

I did consider this option also but it felt like a lot of additional
code to write and I wasn't entirely convinced it'd be worth it.  It's
very frustrating that we can't simply get a list of *currently valid*
tablespaces with a guarantee that no one else is going to mess with that
list while we process it.  That's what MVCC would give us, but we can't
rely on that yet..

If we agree that keeping a list and skipping anything already seen is
the way to go, I can go ahead and code that up.

> [ thinks for a bit... ]  Ugh, no, because the *other* risk you've got
> here is not seeing a row at all, which would be really bad.

I'm not sure that I see how that could happen..?  I agree that it'd be
really bad if it did though.  Or are you thinking if we were to take a
snapshot and then walk the table?

> I'm not sure that transiently increasing the lock here is enough,
> either.  The concurrent transactions you're worried about probably
> aren't holding their locks till commit, so you could get this lock
> and still see tuples with unstable committed-good states.

If there are other transactions which have open locks against the table,
wouldn't that prevent my being able to acquire ShareLock on it?  Or put
another way, how would they not hold their locks till commit or
rollback?  We do only look at tablespaces which exist for the database
we're copying, as I recall, so any newly created tablespaces shouldn't
impact this.

Thanks,

Stephen




Re: [HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

2013-01-17 Thread Fujii Masao
On Fri, Jan 18, 2013 at 2:35 AM, Andres Freund  wrote:
> On 2013-01-17 13:47:41 +0900, Michael Paquier wrote:
>> Slave does not try anymore to reconnect to master with messages of the type:
>> FATAL:  could not connect to the primary server
>>
>> I also noticed that there is some delay until modifications on master are
>> visible on slave.
>
>> I think that bug has been introduced by commit 7fcbf6a.
>> Before splitting xlog reading as a separate facility things worked
>> correctly.
>> There are also no delay problems before this commit.
>
> Heikki committed a fix for at least the promotion issue, I didn't notice
> any problem with an increased delay, could you check again if you still
> see it?

I encountered a problem where the timeline switch is not performed as
expected. I set up one master, one standby and one cascade standby. All
the servers share the archive directory. restore_command is specified in
the recovery.conf of those two standbys.

I shut down the master, and then promoted the standby. In this case, the
cascade standby should switch to the new timeline and replication should
be successfully restarted. But the timeline was never changed, and the
following log messages kept repeating:

sby2 LOG:  restarted WAL streaming at 0/300 on timeline 1
sby2 LOG:  replication terminated by primary server
sby2 DETAIL:  End of WAL reached on timeline 1
sby2 LOG:  restarted WAL streaming at 0/300 on timeline 1
sby2 LOG:  replication terminated by primary server
sby2 DETAIL:  End of WAL reached on timeline 1
sby2 LOG:  restarted WAL streaming at 0/300 on timeline 1
sby2 LOG:  replication terminated by primary server
sby2 DETAIL:  End of WAL reached on timeline 1


Regards,

-- 
Fujii Masao




Re: [HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

2013-01-17 Thread Andres Freund
On 2013-01-18 03:05:47 +0900, Fujii Masao wrote:
> On Fri, Jan 18, 2013 at 2:35 AM, Andres Freund  wrote:
> > On 2013-01-17 13:47:41 +0900, Michael Paquier wrote:
> >> Slave does not try anymore to reconnect to master with messages of the 
> >> type:
> >> FATAL:  could not connect to the primary server
> >>
> >> I also noticed that there is some delay until modifications on master are
> >> visible on slave.
> >
> >> I think that bug has been introduced by commit 7fcbf6a.
> >> Before splitting xlog reading as a separate facility things worked
> >> correctly.
> >> There are also no delay problems before this commit.
> >
> > Heikki committed a fix for at least the promotion issue, I didn't notice
> > any problem with an increased delay, could you check again if you still
> > see it?
> 
> I encountered a problem where the timeline switch is not performed as
> expected. I set up one master, one standby and one cascade standby. All
> the servers share the archive directory. restore_command is specified in
> the recovery.conf of those two standbys.
> 
> I shut down the master, and then promoted the standby. In this case, the
> cascade standby should switch to the new timeline and replication should
> be successfully restarted. But the timeline was never changed, and the
> following log messages kept repeating:
> 
> sby2 LOG:  restarted WAL streaming at 0/300 on timeline 1
> sby2 LOG:  replication terminated by primary server
> sby2 DETAIL:  End of WAL reached on timeline 1
> sby2 LOG:  restarted WAL streaming at 0/300 on timeline 1
> sby2 LOG:  replication terminated by primary server
> sby2 DETAIL:  End of WAL reached on timeline 1
> sby2 LOG:  restarted WAL streaming at 0/300 on timeline 1
> sby2 LOG:  replication terminated by primary server
> sby2 DETAIL:  End of WAL reached on timeline 1
> 

Is that after the commit or before? Because in passing I think I
noticed/fixed a bug that could cause exactly that problem...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] tuplesort memory usage: grow_memtuples

2013-01-17 Thread Tom Lane
Peter Geoghegan  writes:
> On 8 December 2012 14:41, Andres Freund  wrote:
>> Is anybody planning to work on this? There hasn't been any activity
>> since the beginning of the CF and it doesn't look like there is much
>> work left?

> I took another look at this.

Applied with some changes:

* Use float8 arithmetic to prevent intermediate-result overflow, as I
suggested last night (sketched below).

* Rearrange the subsequent checks so that they provide bulletproof
clamping behavior on out-of-range newmemtupsize values; this allows
dropping the very ad-hoc range limiting used in the previous patch (and
in what I had last night).

* Fix the check against availMem; it was supposed to be testing that the
increment in the array size was within availMem.  (In the original
coding the increment was always the same as the original size, but not
so much anymore.)

* Allow the array size to grow to the MaxAllocSize limit, instead of
punting if we'd exceed that.  If we're attempting to use as much of
work_mem as we possibly can, I don't see why that should be a reject
case.

* Improve the comments, including the existing one about the availMem
check, since that evidently wasn't clear enough.

* Copy the whole mess into tuplestore.c too.
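
For the archives, the float8 growth-and-clamp logic from the first two
items comes out looking roughly like this (a loose paraphrase with
assumed variable names, not a verbatim quote of tuplesort.c):

	/* float8 math keeps the intermediate product from overflowing int */
	double		grow_ratio = (double) state->allowedMem /
							 (double) memNowUsed;
	int64		newmemtupsize = (int64) (memtupsize * grow_ratio);

	/* bulletproof clamp: stay within what palloc can allocate */
	if (newmemtupsize >= (int64) (MaxAllocSize / sizeof(SortTuple)))
		newmemtupsize = MaxAllocSize / sizeof(SortTuple);

	/* if we can't actually grow the array, just keep the current one */
	if (newmemtupsize <= memtupsize)
		return false;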

regards, tom lane




Re: [HACKERS] [PATCH]Tablesample Submission

2013-01-17 Thread Josh Berkus
On 11/04/2012 07:22 PM, Qi Huang wrote:
> Dear hackers, sorry for not replying to the patch review. I didn't see the
> review until recently, as my mail box is full of Postgres mail and I didn't
> notice the one for me - a mail box configuration problem. I am still kind
> of busy with my university final year project. I shall not have time to work
> on updating the patch until this semester finishes, which is in December. I
> will work on it then. Thanks.
> Best Regards, Huang Qi Victor, Computer Science, National University of
> Singapore

Did you ever do the update of the patch?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Removing PD_ALL_VISIBLE

2013-01-17 Thread Jeff Davis
On Thu, 2013-01-17 at 15:25 +0530, Pavan Deolasee wrote:
> Now that I look at the patch, I wonder if there is another fundamental
> issue with the patch. Since the patch removes WAL logging for the VM
> set operation, this can happen:
> 
Thank you. I think I was confused by this comment here:

"When we *set* a visibility map during VACUUM, we must write WAL.  This
may seem counterintuitive, since the bit is basically a hint: if it is
clear, it may still be the case that every tuple on the page is visible
to all transactions; we just don't know that for certain.  The
difficulty is that there are two bits which are typically set together:
the PD_ALL_VISIBLE bit on the page itself, and the visibility map bit.
If a crash occurs after the visibility map page makes it to disk and
before the updated heap page makes it to disk, redo must set the bit on
the heap page. Otherwise, the next insert, update, or delete on the heap
page will fail to realize that the visibility map bit must be cleared,
possibly causing index-only scans to return wrong answers."

Which led me to believe that I could just rip out the WAL-related code
if PD_ALL_VISIBLE goes away, which is incorrect. But the incorrectness
doesn't have to do with the WAL directly, it's because the VM page's LSN
is not bumped past the LSN of the related heap page cleanup, so it can
be written too early.

Of course, the way to bump the LSN is to write WAL for the
visibilitymap_set operation. But that would be a very simple WAL
routine, rather than the complex one that exists without the patch.

I suppose we could even try to bump the LSN without writing WAL somehow,
but it doesn't seem worth reasoning through that (setting a VM bit is
rare enough).
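
For anyone following along, here is a tiny stand-alone model of the
interlock at stake. It only illustrates the WAL-before-data rule; all
names are invented, and real bufmgr/visibilitymap code looks quite
different:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;

static XLogRecPtr wal_flushed = 100;   /* pretend WAL is durable up to 100 */

typedef struct { bool all_visible; XLogRecPtr lsn; } PageModel;

/* Setting the bit must stamp the page with the WAL record's LSN. */
static void set_vm_bit(PageModel *vm_page, XLogRecPtr rec_lsn)
{
    vm_page->all_visible = true;
    vm_page->lsn = rec_lsn;     /* without this, the bit could reach disk
                                 * before the WAL that justifies it */
}

/* bufmgr-style rule: write a dirty page only once its LSN is flushed. */
static bool may_write_page(const PageModel *p)
{
    return p->lsn <= wal_flushed;
}

int main(void)
{
    PageModel vm = { false, 0 };

    set_vm_bit(&vm, 150);       /* logged at LSN 150, not yet flushed */
    printf("may write VM page now: %s\n", may_write_page(&vm) ? "yes" : "no");
    return 0;
}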

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] could not create directory "...": File exists

2013-01-17 Thread Tom Lane
Stephen Frost  writes:
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
>> [ thinks for a bit... ]  Ugh, no, because the *other* risk you've got
>> here is not seeing a row at all, which would be really bad.

> I'm not sure that I see how that could happen..?  I agree that it'd be
> really bad if it did though.  Or are you thinking if we were to take a
> snapshot and then walk the table?

The case where you see a tuple twice is if an update drops a new version
of a row beyond your seqscan, and then commits before you get to the new
version.  But if it drops the new version of the row *behind* your
seqscan, and then commits before you get to the original tuple, you'll
not see either row version as committed.  Both of these cases are
inherent risks in SnapshotNow scans.
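
For the archives, both races are easy to see in a toy stand-alone
simulation; slot numbers stand in for heap positions, and everything
here is an invented model of SnapshotNow visibility, not backend code:

#include <stdbool.h>
#include <stdio.h>

#define NSLOTS 8

static int  slot_row[NSLOTS];       /* slot -> row id (0 = empty) */
static bool slot_visible[NSLOTS];   /* SnapshotNow: flips at commit */

/* Move the row's visible version from 'from' to 'to', committing at once. */
static void update_row(int from, int to)
{
    slot_row[to] = slot_row[from];
    slot_visible[to] = true;
    slot_visible[from] = false;     /* old version no longer committed-good */
}

/* Scan slots left to right; a concurrent update commits mid-scan. */
static int scan_count_row(int rowid, int update_when, int from, int to)
{
    int seen = 0;

    for (int s = 0; s < NSLOTS; s++)
    {
        if (s == update_when)
            update_row(from, to);
        if (slot_row[s] == rowid && slot_visible[s])
            seen++;
    }
    return seen;
}

int main(void)
{
    /* new version dropped beyond the scan: row seen twice */
    slot_row[1] = 7; slot_visible[1] = true;
    printf("moved ahead of scan:   seen %d time(s)\n",
           scan_count_row(7, 2, 1, 5));

    /* reset, then: new version dropped behind the scan: row never seen */
    for (int s = 0; s < NSLOTS; s++) { slot_row[s] = 0; slot_visible[s] = false; }
    slot_row[5] = 7; slot_visible[5] = true;
    printf("moved behind the scan: seen %d time(s)\n",
           scan_count_row(7, 2, 5, 1));
    return 0;
}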

>> I'm not sure that transiently increasing the lock here is enough,
>> either.  The concurrent transactions you're worried about probably
>> aren't holding their locks till commit, so you could get this lock
>> and still see tuples with unstable committed-good states.

> If there are other transactions which have open locks against the table,
> wouldn't that prevent my being able to acquire ShareLock on it?

What I'm worried about is that we very commonly release locks on
catalogs as soon as we're done with the immediate operation --- not only
read locks, but update locks as well.  So there are lots of cases where
locks are released before commit.  It's possible that that never happens
in operations applied to pg_tablespace, but I'd not want to bet on it.

I wonder whether it's practical to look at the on-disk directories
instead of relying on pg_tablespace for this particular purpose.

Or maybe we should just write this off as a case we can't realistically
fix before we have MVCC catalog scans.  It seems that any other fix is
going to be hopelessly ugly.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tuplesort memory usage: grow_memtuples

2013-01-17 Thread Peter Geoghegan
On 17 January 2013 18:22, Tom Lane  wrote:
> Applied with some changes:

Thank you. That feedback is useful.

-- 
Peter Geoghegan   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] could not create directory "...": File exists

2013-01-17 Thread Andres Freund
On 2013-01-17 13:46:44 -0500, Tom Lane wrote:
> Or maybe we should just write this off as a case we can't realistically
> fix before we have MVCC catalog scans.  It seems that any other fix is
> going to be hopelessly ugly.

ISTM we could just use an MVCC snapshot in this specific case?

Andres
--
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] could not create directory "...": File exists

2013-01-17 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
> The case where you see a tuple twice is if an update drops a new version
> of a row beyond your seqscan, and then commits before you get to the new
> version.  But if it drops the new version of the row *behind* your
> seqscan, and then commits before you get to the original tuple, you'll
> not see either row version as committed.  Both of these cases are
> inherent risks in SnapshotNow scans.

Ah, I see.

> > If there are other transactions which have open locks against the table,
> > wouldn't that prevent my being able to acquire ShareLock on it?
> 
> What I'm worried about is that we very commonly release locks on
> catalogs as soon as we're done with the immediate operation --- not only
> read locks, but update locks as well.  So there are lots of cases where
> locks are released before commit.  It's possible that that never happens
> in operations applied to pg_tablespace, but I'd not want to bet on it.

Fair enough.

> I wonder whether it's practical to look at the on-disk directories
> instead of relying on pg_tablespace for this particular purpose.

So we'd traverse all of the tablespace directories and copy any
directories we come across that have the OID of the template
database?  Sounds pretty reasonable, though I'm not sure that we'd
want to back-patch a large change like that.
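
Something like the following stand-alone sketch is what I have in mind.
The layout and names are just for illustration; a real version would
resolve the pg_tblspc symlinks and descend into the catalog-version
directory before looking for the database OID:

#include <dirent.h>
#include <stdio.h>

/* Walk pg_tblspc and report, for each tablespace, the path where a
 * directory named after the template database's OID would live. */
static void find_template_dirs(const char *datadir, const char *db_oid)
{
    char tblspc[4096];
    DIR *dir;
    struct dirent *de;

    snprintf(tblspc, sizeof(tblspc), "%s/pg_tblspc", datadir);
    if ((dir = opendir(tblspc)) == NULL)
        return;
    while ((de = readdir(dir)) != NULL)
    {
        char dbpath[8192];

        if (de->d_name[0] == '.')
            continue;           /* skip "." and ".." */
        /* each entry is a symlink to the tablespace location */
        snprintf(dbpath, sizeof(dbpath), "%s/%s/%s",
                 tblspc, de->d_name, db_oid);
        printf("would check and copy: %s\n", dbpath);
    }
    closedir(dir);
}

int main(void)
{
    find_template_dirs("/path/to/pgdata", "12345");   /* illustrative */
    return 0;
}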

> Or maybe we should just write this off as a case we can't realistically
> fix before we have MVCC catalog scans.  It seems that any other fix is
> going to be hopelessly ugly.

I feel like we should be able to do better than what we have now, at
least.  Using ShareLock prevented the specific case that we were
experiencing and is therefore MUCH better (for us, anyway) than
released versions where we ran into the error on a regular basis.

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] [PATCH]Tablesample Submission

2013-01-17 Thread Simon Riggs
On 17 January 2013 18:32, Josh Berkus  wrote:
> On 11/04/2012 07:22 PM, Qi Huang wrote:
>> Dear hackers Sorry for not replying the patch review. I didn't see the 
>> review until recently as my mail box is full of Postgres mails and I didn't 
>> notice the one for me, my mail box configuration problem. I am still 
>> kind of busy with my university final year project. I shall not have time to 
>> work on updating the patch until this semester finishes which is December. I 
>> will work on then.Thanks.
>> Best RegardsHuang Qi VictorComputer Science of National University of 
>> Singapore
>
> Did you ever do the update of the patch?

We aren't just waiting for a rebase, we're waiting for Hitoshi's
comments to be addressed.

I would add to them by saying I am very uncomfortable with the idea of
allowing a TABLESAMPLE clause on an UPDATE or a DELETE. If you really
want that you can use a sub-select.

Plus the patch contains zero documentation.

So I can't see this going anywhere for 9.3. I've moved it to CF1 of
9.4, marked Waiting on Author.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: optimized DROP of multiple tables within a transaction

2013-01-17 Thread Alvaro Herrera
Tomas Vondra wrote:
> Hi,
> 
> attached is a patch that improves performance when dropping multiple
> tables within a transaction. Instead of scanning the shared buffers for
> each table separately, the patch removes this and evicts all the tables
> in a single pass through shared buffers.

Made some tweaks and pushed (added comments to new functions, ensure
that we never try to palloc(0), renamed DropRelFileNodeAllBuffers to
plural, made the "use bsearch" logic a bit simpler).
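
For the archives, here is the shape of that single pass as a
stand-alone sketch; buffers are modeled as a bare array of relfilenode
OIDs, and the names are invented, not the committed bufmgr code:

#include <stdio.h>
#include <stdlib.h>

typedef unsigned int Oid;

static int oid_cmp(const void *a, const void *b)
{
    Oid oa = *(const Oid *) a;
    Oid ob = *(const Oid *) b;
    return (oa > ob) - (oa < ob);
}

/* Sort the dropped relfilenodes once, then make ONE scan of the
 * buffers, testing each against the whole drop list via bsearch(). */
static void drop_buffers_for_rels(Oid *bufrel, int nbuffers,
                                  Oid *dropped, int ndropped)
{
    qsort(dropped, ndropped, sizeof(Oid), oid_cmp);
    for (int i = 0; i < nbuffers; i++)   /* one pass, not one per relation */
    {
        if (bsearch(&bufrel[i], dropped, ndropped, sizeof(Oid), oid_cmp))
            bufrel[i] = 0;               /* "invalidate" the buffer */
    }
}

int main(void)
{
    Oid buffers[] = {10, 42, 7, 42, 99, 7};
    Oid dropped[] = {42, 7};

    drop_buffers_for_rels(buffers, 6, dropped, 2);
    for (int i = 0; i < 6; i++)
        printf("%u ", buffers[i]);      /* prints: 10 0 0 0 99 0 */
    printf("\n");
    return 0;
}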

> Our system creates a lot of "working tables" (even 100,000) and we need
> to perform garbage collection (dropping obsolete tables) regularly. This
> often took ~ 1 hour, because we're using big AWS instances with lots of
> RAM (which tends to be slower than RAM on bare hw). After applying this
> patch and dropping tables in groups of 100, the gc runs in less than 4
> minutes (i.e. a 15x speed-up).

I'm curious -- why would you drop tables in groups of 100 instead of
just doing the 100,000 in a single transaction?  Maybe that's faster
now, because you'd do a single scan of the buffer pool instead of 1000?
(I'm assuming that "in groups of" means you do each group in a separate
transaction)

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Removing PD_ALL_VISIBLE

2013-01-17 Thread Jeff Davis
On Thu, 2013-01-17 at 10:39 -0800, Jeff Davis wrote:
> On Thu, 2013-01-17 at 15:25 +0530, Pavan Deolasee wrote:
> > Now that I look at the patch, I wonder if there is another fundamental
> > issue with the patch. Since the patch removes WAL logging for the VM
> > set operation, this can happen:
> > 
> Thank you. 

New patch attached with simple WAL logging.

Regards,
Jeff Davis



rm-pd-all-visible-20130117.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] could not create directory "...": File exists

2013-01-17 Thread Tom Lane
Stephen Frost  writes:
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
>> Or maybe we should just write this off as a case we can't realistically
>> fix before we have MVCC catalog scans.  It seems that any other fix is
>> going to be hopelessly ugly.

> I feel like we should be able to do better than what we have now, at
> least.  Using ShareLock prevented the specific case that we were
> experiencing and is therefore MUCH better (for us, anyway) than
> released versions where we ran into the error on a regular basis.

If it actually *reliably* fixed the problem, rather than just reducing
the probabilities, that would mean that the updates your other sessions
were doing didn't release RowExclusiveLock early.  Have you dug into the
code to see if that's true?  (Or even more to the point, if it's still
true in HEAD?  I have no idea if all the recent refactoring has changed
this behavior for GRANT.)

My thought about a localized fix is similar to Andres' --- maybe we
could take a snapshot and use a real MVCC scan.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: optimized DROP of multiple tables within a transaction

2013-01-17 Thread Tom Lane
Alvaro Herrera  writes:
> Made some tweaks and pushed (added comments to new functions, ensure
> that we never try to palloc(0), renamed DropRelFileNodeAllBuffers to
> plural, made the "use bsearch" logic a bit simpler).

FWIW, there's nothing particularly wrong with palloc(0) ...

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Event Triggers: adding information

2013-01-17 Thread Simon Riggs
On 17 January 2013 16:15, Robert Haas  wrote:

> As a further example, suppose that in 9.4 (or 9.17) we add a command
> "DROP TABLES IN SCHEMA fred WHERE name LIKE 'bob%'".  Well, the
> logging trigger still just works (because it's only writing the
> statement, without caring about its contents).  And the replication
> trigger still just works too (because it will still get a callback for
> every table that gets dropped, even though it knows nothing about the
> new syntax).  That's very powerful.  Of course, there will still be
> cases where things have to change, but you can minimize it by clean
> definitions.
>
> It pains me that I've evidently failed to communicate this concept
> clearly despite a year or more of trying.  Does that make sense?  Is
> there some way I can make this more clear?  The difference seems very
> clear-cut to me, but evidently I'm not conveying it well.
>
>> That problem is going to happen in some way or another. It's not about
>> avoiding the problem completely, but choosing a set of compromises. I'm
>> yet to understand what your set of compromises is in detailed enough
>> practical implementation terms (not just "it's another set of events")
>> and how it's a better compromise over all (I trust your judgement on
>> that, I still want to understand).
>>
>> After all, the whole point of the Event Trigger patch is to expose some
>> level of implementation details.
>
> I don't really agree with that.  I think the point is to expose what
> the system is doing to the DBA.  I'm OK with exposing the fact that
> creating a table with a serial column also creates a sequence.  There
> is no problem with that - it's all good.  What we DON'T want to do is
> set up a system where the fact that it creates a sequence is exposed
> because that happens to go through ProcessUtility() and the fact that
> it creates a constraint is not because that happens not to go through
> ProcessUtility().  That is not the sort of quality that our users have
> come to expect from PostgreSQL.


"The point", i.e. the main use case, is to replicate the DDL in a useful form.

Discussions and reasoning need to focus on the main use case, not on
weird futures or qualitative points. Yes, the discussion has gone on
for years and it really should come to a useful conclusion sometime
soon.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Teaching pg_receivexlog to follow timeline switches

2013-01-17 Thread Alvaro Herrera
Robert Haas wrote:

> Actually, I'm really glad to see all the work you've done to improve
> the way that some of these scenarios work and eliminate various bugs
> and other surprising failure modes over the last couple of months.
> It's great stuff.

+1

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] could not create directory "...": File exists

2013-01-17 Thread Tom Lane
I wrote:
> Stephen Frost  writes:
>> I feel like we should be able to do better than what we have now, at
>> least.  Using ShareLock prevented the specific case that we were
>> experiencing and is therefore MUCH better (for us, anyway) than
>> released versions where we ran into the error on a regular basis.

> My thought about a localized fix is similar to Andres' --- maybe we
> could take a snapshot and use a real MVCC scan.

BTW, it strikes me that the reason this particular SnapshotNow scan is a
big problem for you is that, because we stop and copy a subdirectory
after each tuple, the elapsed time for the seqscan is several orders of
magnitude more than it is for most (perhaps all) other cases where
SnapshotNow is used.  So the window for trouble with concurrent updates
is that much bigger.  This seems to provide a reasonably principled
argument why we might want to fix this case with a localized use of an
MVCC scan before we have such a fix globally.

Not that I wouldn't still want to mark it with a REVERT_ME_SOMEDAY
kind of annotation.  We know we need the SnapshotNow scan fix.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

2013-01-17 Thread Heikki Linnakangas

On 17.01.2013 20:08, Andres Freund wrote:

On 2013-01-18 03:05:47 +0900, Fujii Masao wrote:

I encountered the problem that the timeline switch is not performed as expected.
I set up one master, one standby and one cascade standby. All the servers
share the archive directory. restore_command is specified in the recovery.conf
in those two standbys.

I shut down the master, and then promoted the standby. In this case, the
cascade standby should switch to the new timeline and replication should be
successfully restarted. But the timeline was never changed, and the following
log messages kept being output.

sby2 LOG:  restarted WAL streaming at 0/300 on timeline 1
sby2 LOG:  replication terminated by primary server
sby2 DETAIL:  End of WAL reached on timeline 1
sby2 LOG:  restarted WAL streaming at 0/300 on timeline 1
sby2 LOG:  replication terminated by primary server
sby2 DETAIL:  End of WAL reached on timeline 1
sby2 LOG:  restarted WAL streaming at 0/300 on timeline 1
sby2 LOG:  replication terminated by primary server
sby2 DETAIL:  End of WAL reached on timeline 1



That's after the commit or before? Because in passing I think I
noticed/fixed a bug that could cause exactly that problem...


I think I broke that with the "teach pg_receivexlog to cross timelines" 
patch. Will take a look...


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Removing PD_ALL_VISIBLE

2013-01-17 Thread Simon Riggs
On 17 January 2013 15:14, Heikki Linnakangas  wrote:
> On 17.01.2013 16:53, Robert Haas wrote:
>>
>> On Thu, Jan 17, 2013 at 3:49 AM, Pavan Deolasee
>>   wrote:
>>>
>>> May be you've already addressed that concern with the proven
>>> performance numbers, but I'm not sure.
>>
>>
>> It would be nice to hear what Heikki's reasons were for adding
>> PD_ALL_VISIBLE in the first place.
>
>
> The idea was to avoid clearing the bit in the VM page on every update, when
> the bit is known not to be set, i.e. when the PD_ALL_VISIBLE flag is not set.
> I assumed the traffic and contention on the VM page would be a killer
> otherwise. I don't remember if I ever actually tested that though. Maybe I
> was worrying about nothing and hitting the VM page on every update is ok.

Presumably we remember the state of the VM so we can skip the re-visit
after every write?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Removing PD_ALL_VISIBLE

2013-01-17 Thread Jeff Davis
On Thu, 2013-01-17 at 17:14 +0200, Heikki Linnakangas wrote:
> I don't remember if I ever actually tested that 
> though. Maybe I was worrying about nothing and hitting the VM page on 
> every update is ok.

I tried, but was unable to show really anything at all, even without
keeping the VM page pinned. I think the bottleneck is elsewhere,
although I am keeping the page pinned in this patch to prevent it from
becoming a bottleneck.
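
To make the trade-off concrete, here is a toy stand-alone model of the
write-side fast path that PD_ALL_VISIBLE provides; the names are
invented, and the real code of course works on buffers and locks:

#include <stdbool.h>
#include <stdio.h>

typedef struct { bool pd_all_visible; } HeapPageModel;

static int vm_page_accesses = 0;   /* stands in for VM-page contention */

static void on_update(HeapPageModel *page, bool *vm_bit)
{
    if (!page->pd_all_visible)
        return;                    /* bit already clear: skip the VM page */
    page->pd_all_visible = false;
    vm_page_accesses++;            /* only the first writer pays this */
    *vm_bit = false;
}

int main(void)
{
    HeapPageModel page = { true };
    bool vm_bit = true;

    for (int i = 0; i < 1000; i++) /* 1000 updates hitting the same page */
        on_update(&page, &vm_bit);
    printf("VM page touched %d time(s)\n", vm_page_accesses);
    return 0;
}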

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH]Tablesample Submission

2013-01-17 Thread Josh Berkus

> So I can't see this going anywhere for 9.3. I've moved it to CF1 of
> 9.4, marked Waiting on Author.

Agreed.  I wish I'd noticed that it got lost earlier.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.3 Pre-proposal: Range Merge Join

2013-01-17 Thread Stefan Keller
Hi Jeff

2012/4/19 Jeff Davis :
> On Wed, 2012-04-18 at 01:21 -0400, Tom Lane wrote:
(...)
>> This is just handwaving of course.  I think some digging in the
>> spatial-join literature would likely find ideas better than any of
>> these.
>
> I will look in some more detail. The merge-like approach did seem to be
> represented in the paper referenced by Alexander (the external plane
> sweep), but it also refers to several methods based on partitioning.
>
> I'm beginning to think that more than one of these ideas has merit.
>
> Regards,
> Jeff Davis

I'm perhaps really late to this discussion, but I was just made aware
of it via Josh Berkus's tweet about "PostgreSQL 9.3: Current
Feature Status".

What is the reason to dig into spatial joins when PostGIS already
provides a bullet-proof and fast implementation?

Yours, Stefan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Event Triggers: adding information

2013-01-17 Thread Tom Lane
Simon Riggs  writes:
> On 17 January 2013 16:15, Robert Haas  wrote:
>> It pains me that I've evidently failed to communicate this concept
>> clearly despite a year or more of trying.  Does that make sense?  Is
>> there some way I can make this more clear?  The difference seems very
>> clear-cut to me, but evidently I'm not conveying it well.

> "The point", i.e. the main use case, is to replicate the DDL in a useful form.

> Discussions and reasoning need to focus on the main use case, not on
> weird futures or qualitative points.

I have to completely disagree with that.  If we don't want PostgreSQL
to soon subside into an unfixable morass, as I think Brooks puts it,
we must *not* simply patch things in a way that expediently provides
an approximation of some desired feature.  We have to consider, and put
substantial weight on, having a clean and maintainable system design.

This is particularly the case if we're talking about a feature that
is going to expose backend internals to users to any degree.  We're
either going to be constrained to not change those internals any more,
or expect to break users' applications when we do change them, and
neither result is very appealing.  Especially not when we're talking
about the structure of Postgres' DDL code, which can most charitably
be described as "it just grew", not as something that has any clear
architecture to it.

Now if we could quantify these things, that would be great, but
AFAICS "qualitative points" is all we've got to go on.

I think that we're not realistically going to be able to introduce
event triggers in very many of the places we'd like to have them
without first doing a lot of fundamental refactoring.  And that is
hard, dangerous, slow work that tends to break things in itself.
As we've been repeatedly reminded in connection with KaiGai-san's
refactoring patches.

So my opinion is that most of what we'd like to have here is material
for 9.4 or 9.5 or even further out.  If we can get the event trigger
catalog infrastructure and some *very* basic events, like
end-of-toplevel-command, into place for 9.3, we'll have done well.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Event Triggers: adding information

2013-01-17 Thread Robert Haas
On Thu, Jan 17, 2013 at 12:06 PM, Dimitri Fontaine
 wrote:
> Ok. Will prepare a non controversial patch for ddl_command_end.

Thanks.  I will make a forceful effort to review that in a timely
fashion when it's posted.

>> I think this is a bad idea, not only because, as I said before, it
>> exposes internal implementation details, but also because it's
>> probably not safe.  As Noah (I believe, or maybe Tom), observed in a
>> previous post, we've gone to a lot of work to make sure that nothing
>> you do inside a trigger can crash the server, corrupt the catalogs,
>> trigger elog() messages, etc.  That applies in spades to DDL.
>
> I naively thought that doing CommandCounterIncrement() before running the
> event trigger would go a long enough way towards making the operation
> safe when the current code is calling back into ProcessUtility().

It's something, but I'd be shocked if it covers all the bases.

Let me try to give a concrete example of how I think another firing
point could be made to work along the lines I'm suggesting.  I'm not
going to use DROP as an example because I think upon reflection that
will be a particularly challenging case to make work correctly.
Instead, let me use the example of ALTER.

Goal: Every time an ALTER command is used on an object *that actually
exists*, we will call some user-defined function and pass the object
type, the OID of the object, and some details about what sort of
alteration the user has requested.

Where shall we put the code to do this?  Clearly, ProcessUtility is
too early, because at that point we have not done the name lookup,
locked the object, or checked permissions.  So the OID of the target
object is not and cannot be known at that point.  We cannot do an
extra name lookup at that point without taking a lock, because the
answer might change, or, due to non-MVCC catalog access, the lookup
might fail to find an object altogether.  And we can't take a lock
without checking permissions first.  And the permissions-checking
logic can't be done inside ProcessUtility(), because it's object-type
specific and we don't want to duplicate it.

So, instead, what we need to do is go into each function that
implements ALTER, and modify it so that after it does the dance where
we check permissions, take locks, and look up names, we call the
trigger function.  That's a bit of a nuisance, because we probably
have a bunch of call sites to go fix, but in theory it should be
doable.  The problem of course, is that it's not intrinsically safe to
call user-defined code there.  If it drops the object and creates a
new one, say, hilarity will probably ensue.

But that should be a fixable problem.  The only things we've done at
this point are check permissions, lock, and look up names.  I think we
can assume that if the event trigger changes permissions, it's safe
for the event trigger to ignore that.  The trigger cannot have
released the lock, so we're OK there.  The only thing it can really
have done that's problematic is change the results of the name lookup,
say by dropping or renaming the object.  But that's quite simple to
protect against: after firing the trigger, we do a
CommandCounterIncrement() and repeat the name lookup, and if we get a
different answer, then we bail out with an error and tell the user
they're getting far too cute.  Then we're safe.
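
In toy form, the recheck looks something like this; it is a stand-alone
model with invented names, where the real thing would use catalog
lookups, locks, and CommandCounterIncrement():

#include <stdio.h>

typedef int Oid;

/* toy catalog: one object whose OID the "trigger" may change */
static Oid catalog_oid = 42;

static Oid lookup(const char *name)
{
    (void) name;                 /* name lookup against the toy catalog */
    return catalog_oid;
}

static void fire_event_trigger(void)
{
    catalog_oid = 99;            /* simulate the trigger dropping and
                                  * recreating the object */
}

static int alter_object(const char *name)
{
    Oid before = lookup(name);   /* after permissions + lock, as above */

    fire_event_trigger();        /* arbitrary user code runs here */

    if (lookup(name) != before)  /* redo the lookup; compare answers */
    {
        fprintf(stderr, "object changed by event trigger: aborting ALTER\n");
        return -1;
    }
    /* ... proceed with the actual ALTER ... */
    return 0;
}

int main(void)
{
    if (alter_object("foo") != 0)
        printf("ALTER bailed out safely\n");
    return 0;
}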

Now, there is a further problem: all of the information about the
ALTER statement that we want to provide to the trigger may not be
available at that point.  Simple cases like ALTER .. RENAME won't
require anything more than that, but things like ALTER TABLE .. ADD
FOREIGN KEY get tricky, because while at this point we have a solid
handle on the identity of the relation to which we're adding the
constraint, we do NOT yet have knowledge of or a lock on the TARGET
relation, whose OID could still change on us up to much later in the
computation.  To get reliable information about *that*, we'll need to
refactor the sequence of events inside ALTER TABLE.

Hopefully you can see what I'm driving toward here.  In a design like
this, we can pretty much prove - after returning from the event
trigger - that we're in a state no different from what would have been
created by executing a series of commands - lock table, then SELECT
event_trigger_func(), then the actual ALTER in sequence.  Maybe
there's a flaw in that analysis - that's what patch review is for -
but it sounds pretty darn tight to me.

I could write more, but I have to take the kids to dance class, and
this email may be too long already.  Does that help at all to clarify
the lines along which I am thinking?  Note that the above is clearly
not ddl_command_start but something else, and the different firing
point and additional safety cross-checks are what enable us to pass
additional information to the trigger without fear of breaking either
the trigger or the world.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [sepgsql 1/3] add name qualified creation label

2013-01-17 Thread Kohei KaiGai
2013/1/16 Robert Haas :
> On Tue, Jan 15, 2013 at 3:02 PM, Kohei KaiGai  wrote:
>> This patch adds to sepgsql the feature of name-qualified creation labels.
>>
>> Background: on creation of a database object, sepgsql assigns a default
>> security label according to the security policy, which has a set of rules
>> for determining the label of a new object.
>> Usually, a new object inherits its parent object's label (e.g., a table is
>> the parent of a column), unless the policy has a particular type_transition
>> rule.
>> A type_transition rule allows a particular security label to be designated
>> as the default label of a new object for a given pair of client and parent
>> object.
>> For example, the rule below says that columns created under a table labeled
>> "sepgsql_table_t" by a client labeled "staff_t" will get "staff_column_t"
>> instead of the table's label:
>>   TYPE_TRANSITION staff_t sepgsql_table_t:db_column staff_column_t;
>>
>> Recently, this rule was enhanced to take a 5th argument for the object
>> name; that enables special-case handling by name.
>> It was originally designed to describe default security labels for files in
>> the /etc directory, because many applications put their own configuration
>> files there, and the traditional type_transition rule was too weak to
>> describe all the needed defaults.
>> On the other hand, we can port this concept to database systems as well.
>> One example is temporary objects constructed under the pg_temp schema.
>> If we could assign a special default label to these, it would allow
>> unprivileged users (who cannot create persistent tables) to create
>> temporary tables with no risk of information leaking to other users.
>> We might also be able to assign a special security label to system
>> columns and so on.
>>
>> From the perspective of the implementation on the sepgsql side, all we
>> need to do is replace the old security_compute_create_raw() interface
>> with the new security_compute_create_name_raw().
>> If there are no name-qualified type_transition rules, it behaves just like
>> the existing API, so there is no backward-compatibility issue.
>>
>> This patch can be applied on the latest master branch.
>
> This looks OK on a quick once-over, but should it update the
> documentation somehow?
>
The documentation does not say much about type_transition rules, so I
just modified the relevant description a bit to mention that a
type_transition rule may optionally take the new object's name as an
argument. In addition, I had forgotten to update the minimum required
version of libselinux (which also requires a change in the configure
script). Those are the two points updated in the documentation.

Thanks,
-- 
KaiGai Kohei 


sepgsql-v9.3-creation-label-with-name.v2.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Event Triggers: adding information

2013-01-17 Thread Dimitri Fontaine
Robert Haas  writes:
> Goal: Every time an ALTER command is used on an object *that actually
> exists*, we will call some user-defined function and pass the object
> type, the OID of the object, and some details about what sort of
> alteration the user has requested.

Ok, in current terms of the proposed patch, it's a ddl_command_end event
trigger, and you can choose when to fire it in utility.c. The current
place is choosen to be just after the AlterTable() call, because that
sounded logical if you believe in CommandCounterIncrement.

> So, instead, what we need to do is go into each function that
> implements ALTER, and modify it so that after it does the dance where
> we check permissions, take locks, and look up names, we call the
> trigger function.  That's a bit of a nuisance, because we probably
> have a bunch of call sites to go fix, but in theory it should be
> doable.  The problem of course, is that it's not intrinsically safe to
> call user-defined code there.  If it drops the object and creates a
> new one, say, hilarity will probably ensue.

You're trying to find a dangerous point at which to fire the trigger here,
rather than solving the problem you're describing. For the
problem you're describing, either you want the Object Specific
Information and the trigger fires at ddl_command_end, or you don't and
you can register your trigger at ddl_command_start.

If you want something else, well, it won't ship in 9.3.

> Now, there is a further problem: all of the information about the
> ALTER statement that we want to provide to the trigger may not be
> available at that point.  Simple cases like ALTER .. RENAME won't
> require anything more than that, but things like ALTER TABLE .. ADD
> FOREIGN KEY get tricky, because while at this point we have a solid
> handle on the identity of the relation to which we're adding the
> constraint, we do NOT yet have knowledge of or a lock on the TARGET
> relation, whose OID could still change on us up to much later in the
> computation.  To get reliable information about *that*, we'll need to
> refactor the sequence of events inside ALTER TABLE.

The only valid answer here has already been given by Tom. You can only
provide the information if it's available in the catalogs. So again,
it's a ddl_command_end event. It cannot happen before.

> Hopefully you can see what I'm driving toward here.  In a design like
> this, we can pretty much prove - after returning from the event
> trigger - that we're in a state no different from what would have been
> created by executing a series of commands - lock table, then SELECT
> event_trigger_func(), then the actual ALTER in sequence.  Maybe
> there's a flaw in that analysis - that's what patch review is for -
> but it sounds pretty darn tight to me.

The only case when we need to do a lookup BEFORE actually running the
command is when that command is a DROP, because that's the only time
when the information we want is still in the catalogs.

So that's the only case where we do a double object lookup, and we take
a ShareUpdateExclusiveLock lock on the object when doing so, so that we
can't lose it from another concurrent activity.

> Note that the above is clearly
> not ddl_command_start but something else, and the different firing
> point and additional safety cross-checks are what enable us to pass
> additional information to the trigger without fear of breaking either
> the trigger or the world.

That's the reason why we are only going to have ddl_command_start and
ddl_command_end in 9.3, and also the reason why the extra information is
only provided when this very information still exists in the catalogs at
the moment when we want to expose it.

And that's the reason why ddl_command_trace is attractive too: you won't
ever get the object OID, name, and schema from the event where the object
does not exist, and maybe you want an easy way to tell the system
you're interested in an event *with* the information, and be done
without having to think about it.

Again, I don't care much about keeping ddl_command_trace because it's
not of much interest for implementing DDL support in logical
replication and my other use cases, but still, I thought I would explain
it now that we're talking about it.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] HS locking broken in HEAD

2013-01-17 Thread Andres Freund
Hi,


While checking whether I could reproduce the replication delay reported
by Michael Paquier I found this very nice tidbit:

In a pretty trivial replication setup of only streaming replication I
can currently easily reproduce this:

standby# BEGIN;SELECT * FROM foo;
BEGIN
 id | data 
----+------


standby=# SELECT relation::regclass, locktype, mode FROM pg_locks;
 relation |  locktype  |  mode   
----------+------------+-----------------
 pg_locks | relation   | AccessShareLock
 foo_pkey | relation   | AccessShareLock
 foo  | relation   | AccessShareLock
  | virtualxid | ExclusiveLock
  | virtualxid | ExclusiveLock
(5 rows)

primary# DROP TABLE foo;
DROP TABLE


standby# SELECT relation::regclass, pid, locktype, mode FROM pg_locks;
 relation |  pid  |  locktype  |        mode         
----------+-------+------------+---------------------
 pg_locks | 28068 | relation   | AccessShareLock
 foo_pkey | 28068 | relation   | AccessShareLock
 foo  | 28068 | relation   | AccessShareLock
  | 28068 | virtualxid | ExclusiveLock
  | 28048 | virtualxid | ExclusiveLock
 foo  | 28048 | relation   | AccessExclusiveLock
(6 rows)

standby# SELECT * FROM foo;
ERROR:  relation "foo" does not exist
LINE 1: select * from foo;
  ^
Note the conflicting locks held on relation foo by 28048 and 28068.

I don't immediately know which patch to blame here? Looks a bit like
broken fastpath locking, but I don't immediately see anything in
c00dc337b87 that should cause this?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

2013-01-17 Thread Heikki Linnakangas

On 17.01.2013 21:57, Heikki Linnakangas wrote:

On 17.01.2013 20:08, Andres Freund wrote:

On 2013-01-18 03:05:47 +0900, Fujii Masao wrote:

I encountered the problem that the timeline switch is not performed
as expected.
I set up one master, one standby and one cascade standby. All the
servers
share the archive directory. restore_command is specified in the
recovery.conf
in those two standbys.

I shut down the master, and then promoted the standby. In this case, the
cascade standby should switch to the new timeline and replication should be
successfully restarted. But the timeline was never changed, and the
following log messages kept being output.

sby2 LOG: restarted WAL streaming at 0/300 on timeline 1
sby2 LOG: replication terminated by primary server
sby2 DETAIL: End of WAL reached on timeline 1
sby2 LOG: restarted WAL streaming at 0/300 on timeline 1
sby2 LOG: replication terminated by primary server
sby2 DETAIL: End of WAL reached on timeline 1
sby2 LOG: restarted WAL streaming at 0/300 on timeline 1
sby2 LOG: replication terminated by primary server
sby2 DETAIL: End of WAL reached on timeline 1



That's after the commit or before? Because in passing I think I
noticed/fixed a bug that could cause exactly that problem...


I think I broke that with the "teach pg_receivexlog to cross timelines"
patch. Will take a look...


Ok, there were a couple of issues. First, as I suspected above, I added a
new result set after the COPY has ended in the START_STREAMING command, but
forgot to teach walreceiver about it. Fixed that.


Secondly, there's an interesting interaction between the new xlogreader 
code and streaming replication and timeline switches:


The first call to the page_read callback in xlogreader is always made 
with size SizeOfXLogRecord, counting from the beginning of the page. 
That's bogus; there can be no WAL record in the very beginning of page, 
because of the page header, so I think that was supposed to be 
SizeOfXLogShortPHD. But that won't fix the issue.


The problem is that XLogPageRead callback uses the page address and 
requested length to decide what timeline to stream from. For example, 
imagine that there's a timeline switch at offset 1000 in a page, and we 
have already read all the WAL up to that point. When xlogreader first 
asks to fetch the first 32 bytes from the page (the bogus 
SizeOfXLogRecord), XLogPageRead deduces that that byte range is still on 
the old timeline, and starts streaming from the old timeline. Next, 
xlogreader needs the rest of the page, up to 1000 + SizeOfPageHeader, to 
read the first record it's actually interested in, XLogPageRead will 
return an error because that range is not on the timeline that's 
currently streamed. And we loop back to retry, and run into the same 
problem again.


This interaction is a bit too subtle for my taste, but the 
straightforward fix is to just modify xlogreader so that the first 
read_page call requests all the bytes up to the record we're actually 
interested in. That seems like a smart thing to do anyway.
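
To spell out the request-sizing change in stand-alone form (the
constants and names here are illustrative stand-ins, not the committed
xlogreader code):

#include <stdint.h>
#include <stdio.h>

#define XLOG_BLCKSZ         8192
#define SizeOfXLogShortPHD  24      /* illustrative page-header size */

static void read_page(uint64_t pageptr, int reqlen)
{
    /* In the real code this callback picks a timeline based on the
     * byte range requested, so asking for too little can start
     * streaming from the wrong (old) timeline. */
    printf("request %d bytes of the page at %llu\n",
           reqlen, (unsigned long long) pageptr);
}

static void start_at(uint64_t RecPtr)
{
    uint64_t pageptr      = RecPtr - (RecPtr % XLOG_BLCKSZ);
    int      targetRecOff = (int) (RecPtr % XLOG_BLCKSZ);

    /* old behaviour: a fixed 32-byte request regardless of the record's
     * position; new behaviour: ask for everything up to the record. */
    int reqlen = targetRecOff > SizeOfXLogShortPHD
                 ? targetRecOff        /* everything up to the record */
                 : SizeOfXLogShortPHD; /* at least the page header */

    read_page(pageptr, reqlen);
}

int main(void)
{
    start_at(3 * XLOG_BLCKSZ + 1000);  /* record at offset 1000 in a page */
    return 0;
}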


I committed a patch for that second issue too, but please take a look in 
case you come up with a better idea.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Event Triggers: adding information

2013-01-17 Thread Dimitri Fontaine
Tom Lane  writes:
> I have to completely disagree with that.  If we don't want PostgreSQL
> to soon subside into an unfixable morass, as I think Brooks puts it,
> we must *not* simply patch things in a way that expediently provides
> an approximation of some desired feature.  We have to consider, and put
> substantial weight on, having a clean and maintainable system design.

When in Ottawa next May, I will have to buy you a glass of your favorite
wine and explain to you my main use case and vision. You will then
realize that all this time that I've been spending on Event Triggers is
a side path to building something else entirely, because I know that it's
the only way to get the feature I want into core PostgreSQL.

So yes, I know about the clean and maintainable system design
constraint. I've a lot to learn still, and I appreciate your help very
much. Rest assured that the path you describe is the one I'm taking.

> I think that we're not realistically going to be able to introduce
> event triggers in very many of the places we'd like to have them
> without first doing a lot of fundamental refactoring.

We're only talking about ddl_command_start and ddl_command_end now. The
table_rewrite event is still on the horizon, but is not realistic to get
in 9.3 anymore, I think :(

So we're talking about adding some call points only in utility.c, only
before or after a ddl command is run, including nested sub-commands.

> So my opinion is that most of what we'd like to have here is material
> for 9.4 or 9.5 or even further out.  If we can get the event trigger
> catalog infrastructure and some *very* basic events, like
> end-of-toplevel-command, into place for 9.3, we'll have done well.

My feeling is that I'm sending patches that only implement the *very*
basic events here, and nothing more. Most of the heat of this thread
came from a discussion about a feature that's very hard to design and
that is not in my patch series to review.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ALTER command reworks

2013-01-17 Thread Alvaro Herrera
Kohei KaiGai wrote:

> This attached patch is the rebased one towards the latest master branch.

Great, thanks.  I played with it a bit and it looks almost done to me.
The only issue I can find is that it lets you rename an aggregate by
using ALTER FUNCTION, which is supposed to be forbidden.  (Funnily
enough, renaming a non-agg function with ALTER AGGREGATE does raise an
error).  Didn't immediately spot the right place to add a check.

I think these two error cases ought to have regression tests of their
own.

I attach a version with my changes.

> I noticed that your proposed design also allows unifying the ALTER code
> of OPERATOR CLASS / FAMILY; those take an index access method in
> addition to name and namespace.

Yeah, I had noticed that and was planning on implementing it later.
Thanks for saving me the work.

> So, AlterObjectRename_internal and AlterObjectNamespace_internal have
> four special cases to check for name / namespace conflicts.

Right.  (I wonder if it would make sense to encapsulate all those checks
in a single function for both to call.)

> The latest master looks up the syscache within the special-case
> handling block as follows:
> [code]
> 
> But we already pull the relevant tuple from the syscache at the top of
> AlterObjectNamespace_internal. So, I removed the cache lookup code here.

Silly me.  Thanks.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


pgsql-v9.3-alter-reworks.3-rename.v10.patch.gz
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

2013-01-17 Thread Andres Freund
On 2013-01-17 23:49:22 +0200, Heikki Linnakangas wrote:
> On 17.01.2013 21:57, Heikki Linnakangas wrote:
> >On 17.01.2013 20:08, Andres Freund wrote:
> >>On 2013-01-18 03:05:47 +0900, Fujii Masao wrote:
> >>>I encountered the problem that the timeline switch is not performed
> >>>as expected.
> >>>I set up one master, one standby and one cascade standby. All the
> >>>servers
> >>>share the archive directory. restore_command is specified in the
> >>>recovery.conf
> >>>in those two standbys.
> >>>
> >>>I shut down the master, and then promoted the standby. In this case, the
> >>>cascade standby should switch to the new timeline and replication should be
> >>>successfully restarted. But the timeline was never changed, and the
> >>>following log messages kept being output.
> >>>
> >>>sby2 LOG: restarted WAL streaming at 0/300 on timeline 1
> >>>sby2 LOG: replication terminated by primary server
> >>>sby2 DETAIL: End of WAL reached on timeline 1
> >>>sby2 LOG: restarted WAL streaming at 0/300 on timeline 1
> >>>sby2 LOG: replication terminated by primary server
> >>>sby2 DETAIL: End of WAL reached on timeline 1
> >>>sby2 LOG: restarted WAL streaming at 0/300 on timeline 1
> >>>sby2 LOG: replication terminated by primary server
> >>>sby2 DETAIL: End of WAL reached on timeline 1
> >>>
> >>
> >>That's after the commit or before? Because in passing I think I
> >>noticed/fixed a bug that could cause exactly that problem...
> >
> >I think I broke that with the "teach pg_receivexlog to cross timelines"
> >patch. Will take a look...
>
> Ok, there were a couple of issues. First, as I suspected above, I added a new
> result set after the COPY has ended in the START_STREAMING command, but forgot to
> teach walreceiver about it. Fixed that.
>
> Secondly, there's an interesting interaction between the new xlogreader code
> and streaming replication and timeline switches:
>
> The first call to the page_read callback in xlogreader is always made with
> size SizeOfXLogRecord, counting from the beginning of the page. That's
> bogus; there can be no WAL record in the very beginning of page, because of
> the page header, so I think that was supposed to be SizeOfXLogShortPHD. But
> that won't fix the issue.

Yea, that's clearly a typo.

> The problem is that XLogPageRead callback uses the page address and
> requested length to decide what timeline to stream from. For example,
> imagine that there's a timeline switch at offset 1000 in a page, and we have
> already read all the WAL up to that point. When xlogreader first asks to
> fetch the first 32 bytes from the page (the bogus SizeOfXLogRecord),
> XLogPageRead deduces that that byte range is still on the old timeline, and
> starts streaming from the old timeline. Next, xlogreader needs the rest of
> the page, up to 1000 + SizeOfPageHeader, to read the first record it's
> actually interested in, XLogPageRead will return an error because that range
> is not on the timeline that's currently streamed. And we loop back to retry,
> and run into the same problem again.

> This interaction is a bit too subtle for my taste, but the straightforward
> fix is to just modify xlogreader so that the first read_page call requests
> all the bytes up to the record we're actually interested in. That seems like a
> smart thing to do anyway.

Yuck. Reading from targetRecOff onwards seems like a good idea, yes. But
ISTM that the code isn't really safe now: assume targetRecOff is 0
(which is valid); then we read up to SizeOfXLogRecord but assume that we
can read both the xlog record and the page header.
You also need to factor in that the page header can be of different
sizes.

And yes, it's too subtle :(

Do you want to fix that or shall I?

Greetings,

Andres Freund

--
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Event Triggers: adding information

2013-01-17 Thread Simon Riggs
On 17 January 2013 20:24, Tom Lane  wrote:
> Simon Riggs  writes:
>> On 17 January 2013 16:15, Robert Haas  wrote:
>>> It pains me that I've evidently failed to communicate this concept
>>> clearly despite a year or more of trying.  Does that make sense?  Is
>>> there some way I can make this more clear?  The difference seems very
>>> clear-cut to me, but evidently I'm not conveying it well.
>
>> "The point", i.e. the main use case, is to replicate the DDL in a useful 
>> form.
>
>> Discussions and reasoning need to focus on the main use case, not on
>> weird futures or qualitative points.
>
> I have to completely disagree with that.  If we don't want PostgreSQL
> to soon subside into an unfixable morass, as I think Brooks puts it,
> we must *not* simply patch things in a way that expediently provides
> an approximation of some desired feature.  We have to consider, and put
> substantial weight on, having a clean and maintainable system design.

So why does focusing on a use case make this turn into an unfixable
morass? The two things are completely unrelated.

If the patch is anywhere close to an unfixable morass, let's just
reject it now. But Robert's arguments were much more tenuous than
that.

My comments were in response to this

> I don't really agree with that.  I think the point is to expose what
> the system is doing to the DBA.  I'm OK with exposing the fact that
> creating a table with a serial column also creates a sequence.  There
> is no problem with that - it's all good.  What we DON'T want to do is
> set up a system where the fact that it creates a sequence is exposed
> because that happens to go through ProcessUtility() and the fact that
> it creates a constraint is not because that happens not to go through
> ProcessUtility().  That is not the sort of quality that our users have
> come to expect from PostgreSQL.

The above functionality is sufficient to allow DDL replication. What
else do we want to do that requires some other capability?

Discussing things in terms of what we will actually use a feature for
is how we know we've got it right. And if it does that, we don't need
anything else.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Event Triggers: adding information

2013-01-17 Thread Tom Lane
Robert Haas  writes:
> Let me try to give a concrete example of how I think another firing
> point could be made to work along the lines I'm suggesting.
> [ snip description of how an event trigger might safely be fired just
> after identification and locking of the target object for an ALTER ]

Reading that and reflecting on my gripe about the lack of any clear
architecture for our DDL code, I had the beginnings of an idea.

Perhaps it would improve matters if we refactored DDL processing so that
there were separate "parse analysis" and "execution" phases, where parse
analysis is (perhaps among other responsibilities) responsible for
identifying and locking all objects to be used in the command.  I note
that locking the referenced tables is the responsibility of the parse
analysis step in DML processing, so there's solid precedent for this.
Also, we have some of this approach already for certain commands such
as CREATE TABLE, cf parse_utilcmd.c.

If we did that, then it'd be feasible to fire event triggers after the
parse analysis step, and the rechecking that Robert describes could be
encapsulated as "redo the parse analysis and see if the result changed".

It's not clear to me just how this ought to extend to the cascaded-DROP
or inherited-table-ALTER cases, but hey, it's only the beginnings of
an idea.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Event Triggers: adding information

2013-01-17 Thread Tom Lane
Dimitri Fontaine  writes:
> Tom Lane  writes:
>> I think that we're not realistically going to be able to introduce
>> event triggers in very many of the places we'd like to have them
>> without first doing a lot of fundamental refactoring.

> We're only talking about ddl_command_start and ddl_command_end now. The
> table_rewrite event is still on the horizon, but is not realistic to get
> in 9.3 anymore, I think :(

> So we're talking about adding some call points only in utility.c, only
> before or after a ddl command is run, including nested sub-commands.

Well, that's already a problem, because as Robert keeps saying, what
goes through utility.c and what doesn't is pretty random right at the
moment, and we shouldn't expose that behavior to users for fear of not
being able to change it later.

I think we could possibly ship event triggers defined as start and end
of a *top level* command and have them working reliably in 9.3.  If you
want users to be looking at subcommands, there is a lot more work to do
there than we have any chance of getting done for 9.3 (if it is to ship
in 2013, that is).

Alternatively, if you want to get something into 9.3 that has not
necessarily got a long-term-stable API, I'd be inclined to suggest that
we forget about a SQL-level event trigger facility for now, and just put
in some hook call points.  It's pretty well established that we're
willing to change the API seen by hook functions across major releases.

TBH this might be the best thing anyway if your long-term goals have to
do with replication, because it'd be a lot cheaper to get to the point
where you could write the replication code and see if it all actually
works.  I'm fairly concerned that we might spend many man-months getting
event triggers with definitions A,B,C into the system, only to find out
later that what is really needed for logical replication is definitions
D,E,F.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Event Triggers: adding information

2013-01-17 Thread Tom Lane
Simon Riggs  writes:
> On 17 January 2013 20:24, Tom Lane  wrote:
> My comments were in response to this

>> I don't really agree with that.  I think the point is to expose what
>> the system is doing to the DBA.  I'm OK with exposing the fact that
>> creating a table with a serial column also creates a sequence.  There
>> is no problem with that - it's all good.  What we DON'T want to do is
>> set up a system where the fact that it creates a sequence is exposed
>> because that happens to go through ProcessUtility() and the fact that
>> it creates a constraint is not because that happens not to go through
>> ProcessUtility().  That is not the sort of quality that our users have
>> come to expect from PostgreSQL.

> The above functionality is sufficient to allow DDL replication. What
> else do we want to do that requires some other capability?

I think you and Robert are talking past each other.  Whether it is or
is not sufficient for DDL replication is not what he is concerned about;
what he's worried about is whether we can refactor the backend in future
without causing a user-visible change in the events that event triggers
see.  I think that that is an entirely valid concern, and it's not going
to go away on the strength of "but this is enough to let us do
replication".

The other point I'd make here is what I just said to Dimitri: it's far
from proven, AFAIK, that this actually *is* sufficient to allow DDL
replication.  I'd be a lot happier if we had a working proof of that
before we lock down a SQL-level facility that we're going to have a hard
time changing the details of.  Maybe we should just introduce some
C-level hooks and let you guys go off and code replication using the
hooks, before we put in a lot of expensive infrastructure that might
turn out to be Not Quite The Right Thing.  After we *know* the hooks
are in the right places and supply the right information, we can look at
replacing them with some more formalized facility.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Event Triggers: adding information

2013-01-17 Thread Dimitri Fontaine
Tom Lane  writes:
> Well, that's already a problem, because as Robert keeps saying, what
> goes through utility.c and what doesn't is pretty random right at the
> moment, and we shouldn't expose that behavior to users for fear of not
> being able to change it later.

It didn't feel that random to me. In most cases, the rule is that if the
system wants to execute a command that you could have typed as a user at
the prompt, then it hacks together a parsenode and calls ProcessUtility.

The main exception to that rule is the CASCADE implementation, for
reasons that you detailed in that thread earlier.

> I think we could possibly ship event triggers defined as start and end
> of a *top level* command and have them working reliably in 9.3.  If you
> want users to be looking at subcommands, there is a lot more work to do
> there than we have any chance of getting done for 9.3 (if it is to ship
> in 2013, that is).

By default, you only get top level in what I've been cooking. I think
the other contexts are needed for logical replication, and maybe it
would be good enough if they are restricted to either internal code
only or event triggers coded in C. What do you think?

> Alternatively, if you want to get something into 9.3 that has not
> necessarily got a long-term-stable API, I'd be inclined to suggest that
> we forget about a SQL-level event trigger facility for now, and just put
> in some hook call points.  It's pretty well established that we're
> willing to change the API seen by hook functions across major releases.

Or just a hook. That would want to have about the exact same amount of
information as the Event Trigger anyway, so I'm thinking we could maybe
do that the same way as the parsetree exposure?

Another idea would be yet another GUC, à la allow_system_table_mods, so
that only intrepid users can have at it. Well, as long as the logical
replication use case is covered, I'm done here, so I want to hear from
Simon and Andres on that (and maybe the Slony and Londiste guys, etc),
and from you to triage what is possible and what is crazy.

> TBH this might be the best thing anyway if your long-term goals have to
> do with replication, because it'd be a lot cheaper to get to the point
> where you could write the replication code and see if it all actually
> works.  I'm fairly concerned that we might spend many man-months getting
> event triggers with definitions A,B,C into the system, only to find out
> later that what is really needed for logical replication is definitions
> D,E,F.

I'm worried about that too, and that's why I'm trying to remain general
and quite transparent about the way the system actually works.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Removing PD_ALL_VISIBLE

2013-01-17 Thread Jeff Davis
On Thu, 2013-01-17 at 19:58 +, Simon Riggs wrote:
> Presumably we remember the state of the VM so we can skip the re-visit
> after every write?

That was not a part of my patch, although I remember that you mentioned
that previously and I thought it could be a good way to mitigate a
problem if it ever came up.

However, the tests I did didn't show any problem there. The tests were
somewhat noisy, so perhaps I was doing something wrong, but it didn't
appear that looking at the VM page for every update was a bottleneck.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Removing PD_ALL_VISIBLE

2013-01-17 Thread Jeff Davis
On Thu, 2013-01-17 at 09:53 -0500, Robert Haas wrote:
> The main question in my mind is whether
> there are any negative consequences to holding a VM buffer pin for
> that long without interruption.  The usual consideration - namely,
> blocking vacuum - doesn't apply here because vacuum does not take a
> cleanup lock on the visibility map page, only the heap page, but I'm
> not sure if there are others.

If the "without interruption" part becomes a practical problem, it seems
fairly easy to fix: drop the pin and pick it up again once every K
pages. Unless I'm missing something, this is a minor concern.
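If it ever did, the fix might look about like this inside the scan loop
(a sketch only; the interval constant and the counter are made up):

    #define VM_PIN_REFRESH_INTERVAL 1024   /* heap pages between refreshes */

    /* periodically drop the VM pin so the page can be evicted if needed */
    if (BufferIsValid(vmbuffer) &&
        ++pages_since_pin >= VM_PIN_REFRESH_INTERVAL)
    {
        ReleaseBuffer(vmbuffer);
        vmbuffer = InvalidBuffer;
        visibilitymap_pin(scan->rs_rd, blkno, &vmbuffer);
        pages_since_pin = 0;
    }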

> The other part of the issue is cache pressure. I don't think I can say
> it better than what Tom already wrote:
> 
> # I'd be worried about the case of a lot of sessions touching a lot of
> # unrelated tables.  This change implies doubling the number of buffers
> # that are held pinned by any given query, and the distributed overhead
> # from that (eg, adding cycles to searches for free buffers) is what you
> # ought to be afraid of.
> 
> I agree that we ought to be afraid of that.

It's a legitimate concern, but I think being "afraid" goes too far (more
below).

> A pgbench test isn't
> going to find a problem in this area because there you have a bunch of
> sessions all accessing the same small group of tables.  To find a
> problem of the type above, you'd need lots of backends accessing lots
> of different, small tables.  That's not the use case we usually
> benchmark, but I think there are a fair number of people doing things
> like that - for example, imagine a hosting provider or web application
> with many databases or schemas on a single instance.  AFAICS, Jeff
> hasn't tried to test this scenario.

The concern here is over a lot of different, small tables (e.g.
multi-tenancy or something similar) as you say. If we're talking about
nearly empty tables, that's easy to fix: just don't use the VM on tables
less than N pages.
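That could be as little as the following at the point where we would
otherwise consult the VM (the cutoff value is made up for illustration):

    /* bypass the VM for tiny relations; just read the heap pages */
    #define VM_MIN_REL_PAGES 32

    if (RelationGetNumberOfBlocks(rel) < VM_MIN_REL_PAGES)
        return false;           /* treat the page as not all-visible */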

You could say that "small" tables are really 1-10MB each, and you could
have a zillion of those. I will try to create a worst-case here and see
what numbers come out. Perhaps the extra time to look for a free buffer
does add up.

Test plan:

  1. Take current patch (without "skip VM check for small tables"
optimization mentioned above).
  2. Create 500 tables each about 1MB.
  3. VACUUM them all.
  4. Start 500 connections (one for each table)
  5. Time the running of a loop that executes a COUNT(*) on that
connection's table 100 times.

I think shared_buffers=64MB is probably appropriate. We want some memory
pressure so that it has to find and evict pages to satisfy the queries.
But we don't want it to be totally exhausted and unable to even pin a
new page; that really doesn't tell us much except that max_connections
is too high.

Sound reasonable?
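For concreteness, a rough driver for the above (database name, row
count and padding are illustrative; ~12000 md5-padded rows comes out to
roughly 1MB per table):

    createdb vmtest
    for i in $(seq 1 500); do
        psql -qd vmtest -c "CREATE TABLE t$i AS
            SELECT g AS id, md5(g::text) AS pad
            FROM generate_series(1, 12000) g;"
    done
    psql -qd vmtest -c "VACUUM;"

    # one persistent connection per table, 100 COUNT(*) queries each
    for i in $(seq 1 500); do
        yes "SELECT count(*) FROM t$i;" | head -n 100 \
            | psql -qd vmtest >/dev/null &
    done
    time wait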

> Now, on the flip side, we should also be thinking about what we would
> gain from this patch, and what other ways there might be to achieve
> the same goals.

One thing to keep in mind is that the current code to maintain a
crash-safe PD_ALL_VISIBLE and VM bit is quite complex and doesn't play
by the normal rules. If you want to talk about distributed costs, that
has some very real distributed costs in terms of development effort. For
instance, my checksums patch took me extra time to write (despite the
patch being the simplest checksums design on the table) and will take
others extra time to review.

So, if the only things keeping that code in place are theoretical fears,
let's take them one by one and see if they are real problems or not.

> All of which is to say that I think this patch is premature.  If we
> adopt something like this, we're likely never going to revert back to
> the way we do it now, and whatever cache-pressure or other costs this
> approach carries will be here to stay - so we had better think awfully
> carefully before we do that.

What about this patch makes it hard to undo/rework in the future?

>  Even if
> there were, this is exactly the sort of thing that should be committed
> at the beginning of a release cycle, not the end, so as to allow
> adequate time for discovery of unforeseen consequences before the code
> ends up out in the wild.

I'm concerned that such a comment at this stage will cut review short,
which could also prevent it from being committed early in 9.4.

> Of course, there's another issue here too, which is that as Pavan
> points out, the patch throws crash-safety out the window

My mistake. I believe that is already fixed, and certainly not a
fundamental issue.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] HS locking broken in HEAD

2013-01-17 Thread Andres Freund
On 2013-01-17 22:46:21 +0100, Andres Freund wrote:
> Hi,
> 
> 
> While checking whether I could reproduce the replication delay reported
> by Michael Paquier I found this very nice tidbit:
> 
> In a pretty trivial replication setup of only streaming replication I
> can currently easily reproduce this:
> 
> standby# BEGIN;SELECT * FROM foo;
> BEGIN
>  id | data 
> ----+------
> (0 rows)
> 
> standby=# SELECT relation::regclass, locktype, mode FROM pg_locks;
>  relation |  locktype  |  mode   
> ----------+------------+-----------------
>  pg_locks | relation   | AccessShareLock
>  foo_pkey | relation   | AccessShareLock
>  foo  | relation   | AccessShareLock
>   | virtualxid | ExclusiveLock
>   | virtualxid | ExclusiveLock
> (5 rows)
> 
> primary# DROP TABLE foo;
> DROP TABLE
> 
> 
> standby# SELECT relation::regclass, pid, locktype, mode FROM pg_locks;
>  relation |  pid  |  locktype  |mode 
> ----------+-------+------------+---------------------
>  pg_locks | 28068 | relation   | AccessShareLock
>  foo_pkey | 28068 | relation   | AccessShareLock
>  foo  | 28068 | relation   | AccessShareLock
>   | 28068 | virtualxid | ExclusiveLock
>   | 28048 | virtualxid | ExclusiveLock
>  foo  | 28048 | relation   | AccessExclusiveLock
> (6 rows)
> 
> standby# SELECT * FROM foo;
> ERROR:  relation "foo" does not exist
> LINE 1: select * from foo;
>   ^
> Note the conflicting locks held on relation foo by 28048 and 28068.
> 
> I don't immediately know which patch to blame here. Looks a bit like
> broken fastpath locking, but I don't immediately see anything in
> c00dc337b87 that should cause this?

Rather scarily this got broken in
96cc18eef6489cccefe351baa193f32f12018551. Yes, nearly half a year ago,
including in 9.2.1+. How the heck could this go unnoticed so long?

Not sure yet what the cause of this is.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Event Triggers: adding information

2013-01-17 Thread Tom Lane
Dimitri Fontaine  writes:
> Tom Lane  writes:
>> Well, that's already a problem, because as Robert keeps saying, what
>> goes through utility.c and what doesn't is pretty random right at the
>> moment, and we shouldn't expose that behavior to users for fear of not
>> being able to change it later.

> It didn't feel that random to me. In most cases, the rule is that if the
> system wants to execute a command that you could have typed as a user at
> the prompt, then it hacks together a parsenode and calls ProcessUtility.

Uh, no, there's no such "rule".  In some places it was convenient to do
that so people did it --- but often it's easier to just call the
appropriate function directly, especially if you have any special
locking or permissions requirements.  I don't think there's any great
consistency to it.

To be a little more concrete, I looked through backend/catalog and
backend/commands, which ought to pretty much cover all the places where
commands do things that could be considered subcommands.  I find three
uses of ProcessUtility:

extension.c: uses ProcessUtility to handle non-DML commands read from an
extension script.

schemacmds.c: CreateSchemaCommand uses ProcessUtility for any subcommand
of a CREATE SCHEMA statement.  This is really just direct execution of
something the user typed; it's not the system "creating" a subcommand.

trigger.c: ConvertTriggerToFK uses ProcessUtility to execute a consed-up
ALTER TABLE ADD FOREIGN KEY command in place of a series of legacy CREATE
CONSTRAINT TRIGGER commands.  Now this is the very definition of ugly,
and shouldn't drive our design decisions --- but I think that any user
event trigger would be darn surprised to find this being emitted,
especially when the two preceding CREATE CONSTRAINT TRIGGERs hadn't done
any such thing, and indeed hadn't created any constraint trigger.

AFAICT, everything else in those directories is using direct calls and
not going through utility.c.  So the only other cases where a trigger in
ProcessUtility would trap subcommands are where those subcommands are
things that parse_utilcmd.c generated during expansion of a CREATE TABLE
or such.  And that whole area is something that I feel strongly is
implementation detail we don't particularly want to expose to users.
For instance, the fact that a UNIQUE clause in a CREATE TABLE works that
way and not at some lower level is nothing but implementation artifact.
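For example (loosely -- internally the expansion is an IndexStmt, not
literally a second SQL command):

    -- what the user types:
    CREATE TABLE t (id int UNIQUE);

    -- roughly what parse_utilcmd.c expands that into, as if the user
    -- had also written:
    ALTER TABLE t ADD CONSTRAINT t_id_key UNIQUE (id);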

[ pokes around a bit more... ]  Having now actually looked at every call
point of ProcessUtility in the current code, I find myself agreeing very
much with Robert: we do *not* want to expose that exact set of events to
users.  Except for the original top-level commands, it's almost entirely
implementation artifacts that can and will change from release to
release.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] HS locking broken in HEAD

2013-01-17 Thread Andres Freund
On 2013-01-17 23:56:16 +0100, Andres Freund wrote:
> On 2013-01-17 22:46:21 +0100, Andres Freund wrote:
> > Note the conflicting locks held on relation foo by 28048 and 28068.
> > 
> > I don't immediately know which patch to blame here. Looks a bit like
> > broken fastpath locking, but I don't immediately see anything in
> > c00dc337b87 that should cause this?
> 
> Rather scarily this got broken in
> 96cc18eef6489cccefe351baa193f32f12018551. Yes, nearly half a year ago,
> including in 9.2.1+. How the heck could this go unnoticed so long?

That only made the bug more visible - the real bug is somewhere
else. Which makes it even scarier: the bug was in the fast path
locking patch from the start...

It assumes conflicting fast path locks can only ever be in the same
database as the backend transferring the locks to itself. But that's
obviously not true for the startup process, which is not connected to
any specific database.
I think it might also be a dangerous assumption for shared objects?

A patch minimally addressing this is attached, but it only addresses part
of the problem.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index f2cf5c6..b5240da 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -2458,8 +2458,13 @@ FastPathTransferRelationLocks(LockMethod lockMethodTable, const LOCKTAG *locktag
 		 * less clear that our backend is certain to have performed a memory
 		 * fencing operation since the other backend set proc->databaseId.	So
 		 * for now, we test it after acquiring the LWLock just to be safe.
+		 *
+		 * But if we are the startup process we don't belong to a database but
+		 * still need to lock objects in other databases, so we can do this
+		 * optimization only in case we belong to a database.
+		 * XXX: explain why this is safe for shared tables.
 		 */
-		if (proc->databaseId != MyDatabaseId)
+		if (MyDatabaseId != InvalidOid && proc->databaseId != MyDatabaseId)
 		{
 			LWLockRelease(proc->backendLock);
 			continue;

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

2013-01-17 Thread Michael Paquier
On Fri, Jan 18, 2013 at 3:05 AM, Fujii Masao  wrote:

>  I encountered the problem that the timeline switch is not performed
> as expected.
> I set up one master, one standby and one cascade standby. All the servers
> share the archive directory. restore_command is specified in the
> recovery.conf
> in those two standbys.
>
> I shut down the master, and then promoted the standby. In this case, the
> cascade standby should switch to new timeline and replication should be
> successfully restarted. But the timeline was never changed, and the
> following
> log messages kept appearing.
>
> sby2 LOG:  restarted WAL streaming at 0/300 on timeline 1
> sby2 LOG:  replication terminated by primary server
> sby2 DETAIL:  End of WAL reached on timeline 1
> sby2 LOG:  restarted WAL streaming at 0/300 on timeline 1
> sby2 LOG:  replication terminated by primary server
> sby2 DETAIL:  End of WAL reached on timeline 1
> sby2 LOG:  restarted WAL streaming at 0/300 on timeline 1
> sby2 LOG:  replication terminated by primary server
> sby2 DETAIL:  End of WAL reached on timeline 1
>
I am seeing similar issues with master at 88228e6.
This is easily reproducible by setting up 2 slaves under a master, then
killing the master. Promote slave 1 and reconnect slave 2 to slave 1;
you will notice that the timeline jump is not done.

I don't know if Masao tried to put the slave that reconnects to the
promoted slave in sync, but in this case slave2 gets stuck in
"potential" state. That is due to the timeline not having changed on
slave2, but better to let you know...

The replication delays are still here.
-- 
Michael Paquier
http://michael.otacoo.com


Re: [HACKERS] could not create directory "...": File exists

2013-01-17 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
> This seems to provide a reasonably principled
> argument why we might want to fix this case with a localized use of an
> MVCC scan before we have such a fix globally.

I had discussed that idea a bit with Andres on IRC and my only concern
was if there's some reason that acquiring a snapshot during createdb()
would be problematic.  It doesn't appear to be currently, and I wasn't sure
if there'd be any issues.

I'll start working on a patch for that.

> Not that I wouldn't still want to mark it with a REVERT_ME_SOMEDAY
> kind of annotation.  We know we need the SnapshotNow scan fix.

Agreed.

Thanks,

Stephen




[HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

2013-01-17 Thread Andres Freund
On 2013-01-18 08:24:31 +0900, Michael Paquier wrote:
> On Fri, Jan 18, 2013 at 3:05 AM, Fujii Masao  wrote:
> 
> >  I encountered the problem that the timeline switch is not performed
> > as expected.
> > I set up one master, one standby and one cascade standby. All the servers
> > share the archive directory. restore_command is specified in the
> > recovery.conf
> > in those two standbys.
> >
> > I shut down the master, and then promoted the standby. In this case, the
> > cascade standby should switch to new timeline and replication should be
> > successfully restarted. But the timeline was never changed, and the
> > following
> > log messages kept appearing.
> >
> > sby2 LOG:  restarted WAL streaming at 0/300 on timeline 1
> > sby2 LOG:  replication terminated by primary server
> > sby2 DETAIL:  End of WAL reached on timeline 1
> > sby2 LOG:  restarted WAL streaming at 0/300 on timeline 1
> > sby2 LOG:  replication terminated by primary server
> > sby2 DETAIL:  End of WAL reached on timeline 1
> > sby2 LOG:  restarted WAL streaming at 0/300 on timeline 1
> > sby2 LOG:  replication terminated by primary server
> > sby2 DETAIL:  End of WAL reached on timeline 1
> >
> I am seeing similar issues with master at 88228e6.
> This is easily reproducible by setting up 2 slaves under a master, then
> killing the master. Promote slave 1 and reconnect slave 2 to slave 1;
> you will notice that the timeline jump is not done.

Can you reproduce that one with 7fcbf6a^ (i.e before xlogreader got
split off?).

> The replication delays are still here.

That one is caused by this nice bug, courtesy of yours truly:
diff --git a/src/backend/access/transam/xlog.c 
b/src/backend/access/transam/xlog.c
index 90ba32e..1174493 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8874,7 +8874,7 @@ retry:
/* See if we need to retrieve more data */
if (readFile < 0 ||
(readSource == XLOG_FROM_STREAM &&
-receivedUpto <= targetPagePtr + reqLen))
+receivedUpto < targetPagePtr + reqLen))
{
if (StandbyMode)
{

I didn't notice because I had a test script inserting stuff continuously,
and it caused at most a lag of one record...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: optimized DROP of multiple tables within a transaction

2013-01-17 Thread Tomas Vondra
On 17.1.2013 20:19, Alvaro Herrera wrote:
> 
> I'm curious -- why would you drop tables in groups of 100 instead of
> just doing the 100,000 in a single transaction?  Maybe that's faster
> now, because you'd do a single scan of the buffer pool instead of 1000?
> (I'm assuming that "in groups of" means you do each group in a separate
> transaction)

There's a limited number of lock slots in shared memory, and each DROP
acquires a lock (possibly more than one), all held until the transaction
commits.
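A minimal illustration (table names are hypothetical; the slot-count
formula is the documented one):

    -- Each DROP TABLE takes an AccessExclusiveLock held until commit.
    -- The shared lock table has about max_locks_per_transaction *
    -- (max_connections + max_prepared_transactions) slots, so one huge
    -- transaction can exhaust it:
    BEGIN;
    DROP TABLE t1;
    DROP TABLE t2;
    -- ... enough of these and you hit "out of shared memory" ...
    COMMIT;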

Tomas



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] could not create directory "...": File exists

2013-01-17 Thread Tom Lane
Stephen Frost  writes:
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
>> This seems to provide a reasonably principled
>> argument why we might want to fix this case with a localized use of an
>> MVCC scan before we have such a fix globally.

> I had discussed that idea a bit with Andres on IRC and my only concern
> was if there's some reason that acquiring a snapshot during createdb()
> would be problematic.  It doesn't appear to be currently, and I wasn't sure
> if there'd be any issues.

Don't see what.  The main reason we've not yet attempted a global fix is
that the most straightforward way (take a new snapshot each time we
start a new SnapshotNow scan) seems too expensive.  But CREATE DATABASE
is so expensive that the cost of an extra snapshot there ain't gonna
matter.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

2013-01-17 Thread Michael Paquier
On Fri, Jan 18, 2013 at 8:48 AM, Andres Freund wrote:

> Can you reproduce that one with 7fcbf6a^ (i.e before xlogreader got
> split off?).
>
Yes, it is reproducible before the xlog reader split.
Just as an additional report: the master jumps correctly to the new timeline.


> > The replication delays are still here.
>
> That one is caused by this nice bug, courtesy of yours truly:
> diff --git a/src/backend/access/transam/xlog.c
> b/src/backend/access/transam/xlog.c
> index 90ba32e..1174493 100644
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -8874,7 +8874,7 @@ retry:
> /* See if we need to retrieve more data */
> if (readFile < 0 ||
> (readSource == XLOG_FROM_STREAM &&
> -receivedUpto <= targetPagePtr + reqLen))
> +receivedUpto < targetPagePtr + reqLen))
> {
> if (StandbyMode)
> {
>
> I didn't notice because I had a test script inserting stuff continuously,
> and it caused at most a lag of one record...
>
This fix is indeed working.
-- 
Michael Paquier
http://michael.otacoo.com


[HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

2013-01-17 Thread Andres Freund
On 2013-01-18 08:24:31 +0900, Michael Paquier wrote:
> On Fri, Jan 18, 2013 at 3:05 AM, Fujii Masao  wrote:
> 
> >  I encountered the problem that the timeline switch is not performed
> > as expected.
> > I set up one master, one standby and one cascade standby. All the servers
> > share the archive directory. restore_command is specified in the
> > recovery.conf
> > in those two standbys.
> >
> > I shut down the master, and then promoted the standby. In this case, the
> > cascade standby should switch to new timeline and replication should be
> > successfully restarted. But the timeline was never changed, and the
> > following
> > log messages kept appearing.
> >
> > sby2 LOG:  restarted WAL streaming at 0/300 on timeline 1
> > sby2 LOG:  replication terminated by primary server
> > sby2 DETAIL:  End of WAL reached on timeline 1
> > sby2 LOG:  restarted WAL streaming at 0/300 on timeline 1
> > sby2 LOG:  replication terminated by primary server
> > sby2 DETAIL:  End of WAL reached on timeline 1
> > sby2 LOG:  restarted WAL streaming at 0/300 on timeline 1
> > sby2 LOG:  replication terminated by primary server
> > sby2 DETAIL:  End of WAL reached on timeline 1
> >
> I am seeing similar issues with master at 88228e6.
> This is easily reproducible by setting up 2 slaves under a master, then
> killing the master. Promote slave 1 and reconnect slave 2 to slave 1;
> you will notice that the timeline jump is not done.
> 
> I don't know if Masao tried to put the slave that reconnects to the
> promoted slave in sync, but in this case slave2 gets stuck in
> "potential" state. That is due to the timeline not having changed on
> slave2, but better to let you know...

OK, I know what's causing this now. Rather ugly.

Whenever we access a page in a segment we haven't accessed before, we
read the first page of that segment to do an extra bit of validation,
as the first page in a segment contains additional header information.

Suppose timeline 1 ends at 0/6087088. xlog.c notices that WAL ends
there, wants to read the new timeline, and requests record
0/06087088. xlogreader wants to do its validation and goes back to the
first page in the segment, which triggers xlog.c to re-request
timeline 1 to be transferred...

Heikki, any ideas?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] review: pgbench - aggregation of info written into log

2013-01-17 Thread Tatsuo Ishii
> On Thu, Jan 17, 2013 at 2:29 PM, Andrew Dunstan  wrote:
>>
>> On 01/17/2013 06:04 AM, Tomas Vondra wrote:
>>>
>>> The problem is I have access to absolutely no Windows machines,
>>> not to mention the development tools (and that I have no clue about them).
>>>
>>> I vaguely remember there were people on this list doing Windows
>>> development
>>> on a virtual machine or something. Any interest in working on this /
>>> giving
>>> me some help?
>>>
>>>
>>
>>
>> One of the items on my very long TODO list (see stuff elsewhere about
>> committers not doing enough reviewing and committing) is to create a package
>> that can easily be run to set up Windows Postgres development environments for
>> MSVC, Mingw and Cygwin on Amazon, GoGrid etc.
>>
>> Once I get that done I'll be a lot less sympathetic to "I don't have access
>> to Windows" pleas.
>>
>> Windows does run in a VM very well, but if you're going to set it up on your
>> own VM environment, (e.h. VirtualBox or KVM/qemu) you need your own legal
>> copy to install there. If you don't already have one, that will set you back
>> about $140.00 (for w7 Pro) in the USA. Note that that's significantly better
>> than the situation with OSX, which you can't run at all except on Apple
>> hardware.
> 
> Yeah. I used to have an AMI with the VS environment preinstalled on
> Amazon, but I managed to fat finger things and delete it at some point
> and haven't really had time to rebuild it.
> 
> Having a script that would download and install all the prerequisites
> on such a box would be *great*. Then you could get up and going pretty
> quickly, and getting a Windows box up running for a few hours there is
> almost free, and you don't have to deal with licensing hassles.
> 
> (Of course, the AMI method doesn't work all the way since you'd be
> distributing Visual Studio, but if we can have a script that
> auto-downloads-and-installs it as necessary we can get around that)

So if my understanding is correct, 1) Tomas Vondra commits to working
on Windows support for 9.4, 2) on the assumption that one of Andrew
Dunstan, Dave Page or Magnus Hagander will help him with Windows
development.

Ok? If so, I can commit the patch for 9.3 without Windows support. If
not, I will move the patch to the next CF (for 9.4).

Please correct me if I am wrong.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] patch to add \watch to psql

2013-01-17 Thread Daniel Farina
I have adjusted this patch a little bit to take care of the review
issues, along with just doing a bit of review myself.

On Thu, Oct 25, 2012 at 2:25 AM, Will Leinweber  wrote:
> Thanks for the reviews and comments. Responses inline:
> On Sat, Oct 20, 2012 at 9:19 AM, Abhijit Menon-Sen  
> wrote:
>> Maybe you should call it \repeat or something. I'm sure people would get
>> around to using \watch that way eventually. :-)
>
>
> While I agree that clearing the screen would be nicer, I feel that
> watching the bottom of the screen instead of the top gets you 95% of
> the value of unix watch(1), and having the same name will greatly
> enhance discoverability. Perhaps later on if ncurses is linked for
> some other reason, this could take advantage of it then.
>
> That said, I'm not that strongly attached to the name one way or the other.

The name \repeat has grown on me, but I haven't bothered renaming it
for the time being.  I think sameness with the familiar 'watch'
program may not be as big a deal as I thought originally, but
'repeat' sounds a lot more like a kind of flow control for scripts,
whereas \watch is more clearly for humans, which is the idea.

> On Wed, Oct 24, 2012 at 2:55 PM, Peter Eisentraut  wrote:
>> This doesn't handle multiline queries:
>>
>> => \watch select 1 +
>> ERROR:  42601: syntax error at end of input
>> LINE 1:  select +
>>  ^
>>
>> I think to make it cooperate better with psql syntax, put the \watch at
>> the end, as a replacement for \g, like

I have implemented some kind of multi-line support.  The rough part is
in this part of the patch:

+   if (query_buf && query_buf->len > 0)
+   {
+       /*
+        * Check that the query in query_buf has been terminated.  This
+        * is mostly consistent with behavior from commands like \g.
+        * The reason this is here is to prevent terrible things from
+        * occurring from submitting incomplete input of statements
+        * like:
+        *
+        *     DELETE FROM foo
+        *     \watch
+        *     WHERE
+        *
+        * Wherein \watch will go ahead and send whatever has been
+        * submitted so far.  So instead, insist that the user
+        * terminate the query with a semicolon to be safe.
+        */
+       if (query_buf->data[query_buf->len - 1] == ';')

What I found myself reaching for when giving up and writing this hack
was a way to thread through the last lexer state of query_buf, which
seems like it could stand to accrue a bit more information than being
just a byte buffer.  But this is the simplest possible thing, so I'll
let others comment...

--
fdr


psql-watch-v2.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

