Re: [HACKERS] Changeset Extraction v7.1

2014-01-27 Thread Robert Haas
On Sat, Jan 25, 2014 at 5:25 PM, Andres Freund  wrote:
>> I'm also wondering about
>> whether we've got the right naming here.  AFAICT, it's not the case
>> that we're going to use the "catalog xmin" for catalogs and the "data
>> xmin" for non-catalogs.  Rather, the "catalog xmin" is going to always
>> be included in globalxmin calculations, so IOW it should just be
>> called "xmin".
>
> Well, not really. That's true for GetSnapshotData(), but not for
> GetOldestXmin() where we calculate an xmin *not* including the catalog
> xmin. And the data_xmin is always used, regardless of
> catalog/non_catalog, we peg the xmin further for catalog tables, based
> on that value.
> The reason for doing things this way is that it makes all current usages
> of RecentGlobalXmin safe, since that is the more conservative
> value. Only in inspected locations can we use RecentGlobalDataXmin which
> *does* include data_xmin but *not* catalog_xmin.

Well, OK, so I guess I'm turned around.  But I guess my point is - if
one of data_xmin and catalog_xmin is really just xmin, then I think it
would be more clear to call that one "xmin".

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Changeset Extraction v7.1

2014-01-27 Thread Robert Haas
On Sat, Jan 25, 2014 at 5:25 PM, Andres Freund  wrote:
>> Looking over patch 0002, I see that there's code to allow a walsender
>> to create or drop a physical replication slot.  Also, if we've
>> acquired a replication slot, there's code to update it, and code to
>> make sure we disconnect from it, but there's no code to acquire it. I
>> think maybe the hunk in StartReplication() is supposed to be calling
>> ReplicationSlotAcquire() instead of ReplicationSlotRelease(),
>
> Uh. You had me worried here for a minute or two, a hunk or two earlier
> than the ReplicationSlotRelease() you mention. What probably confused
> you is that StartReplication only returns once all streaming is
> finished. Not my idea...

No, what confuses me is that there's no call to
ReplicationSlotAcquire() in patch 0001 or patch 0002; the function
is added but not called.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Changeset Extraction v7.1

2014-01-25 Thread Andres Freund
Hi Robert, all,

On 2014-01-24 20:38:11 -0500, Robert Haas wrote:
> On Fri, Jan 24, 2014 at 12:49 PM, Andres Freund  
> wrote:
> >> But this code is riddled with places where you track a catalog xmin
> >> and a data xmin separately.  The only point of doing it that way is to
> >> support a division that hasn't been made yet.
> >
> > If you think it will make stuff more manageable I can try separating all
> > lines dealing with catalog_xmin into another patch as long as data_xmin
> > doesn't have to be renamed.
> > That said, I don't really think it's a big problem that the division
> > hasn't been made, essentially the meaning is different, even if we don't
> > take advantage of it yet. data_xmin is there for streaming replication
> > where we need to prevent all removal, catalog_xmin is there for
> > changeset extraction.
> 
> I spent some more time studying the 0001 and 0002 patches this
> afternoon, with a side dish of 0004.  I'm leaning toward thinking we
> should go ahead and make that division.

Ok.

> I'm also wondering about
> whether we've got the right naming here.  AFAICT, it's not the case
> that we're going to use the "catalog xmin" for catalogs and the "data
> xmin" for non-catalogs.  Rather, the "catalog xmin" is going to always
> be included in globalxmin calculations, so IOW it should just be
> called "xmin".

Well, not really. That's true for GetSnapshotData(), but not for
GetOldestXmin() where we calculate an xmin *not* including the catalog
xmin. And the data_xmin is always used, regardless of
catalog/non_catalog, we peg the xmin further for catalog tables, based
on that value.
The reason for doing things this way is that it makes all current usages
of RecentGlobalXmin safe, since that is the more conservative
value. Only in inspected locations can we use RecentGlobalDataXmin which
*does* include data_xmin but *not* catalog_xmin.

> It's interesting (though not immediately relevant) to speculate about
> how we might extend this to fine-grained xmin tracking more generally.
> [musings for another time]

Yea, I have wondered about that as well. I think the easiest thing would
be to compute RecentGlobalDataXmin in a database specific manner
since by definition it will *not* include shared tables. We do that
already for GetOldestXmin() but that's not used for heap pruning. I'd
quickly tested that some months back and it gave significant speedups
for two pgbenches in two databases.

> >> I have zero confidence that it's OK to treat fsync() as an operation
> >> that won't fail.  Linux documents EIO as a plausible error return, for
> >> example.  (And really, how could it not?)
> >
> > But quite fundamentally having the most basic persistency building
> > block fail is something you can't really handle safely. Note that
> > issue_xlog_fsync() has always (and I wager, will always) treated that as
> > a PANIC.
> > I don't recall many complaints about that for WAL. All of the ones I
> > found in a quick search were like "oh, the fs invalidated my fd because
> > of corruption". And few.
> 
> Well, you have a point.  And certainly this version looks much better
> than the previous version in terms of the likelihood of PANIC
> occurring in practice.  But I wonder if we couldn't cut it down even
> further without too much effort.  Suppose we define a slot to exist
> if, and only if, the state file exists.  A directory without a state
> file is subject to arbitrary removal.  Then we can proceed as follows:
> 
> - mkdir() the directory.
> - open() state.tmp
> - write() state.tmp
> - close() state.tmp
> - fsync() parent directory, directory and state.tmp
> - rename() state.tmp to state
> - fsync() state
> 
> If any step except the last one fails, no problem.  The next startup
> can nuke the leftovers; any future attempt to create a slot with the
> name can ignore an EEXIST failure from mkdir() and proceed just as
> above.  Only a failure of the very last fsync is a PANIC.   In some
> ways I think this'd be simpler than what you've got now, because we'd
> eliminate the dance with renaming the directory as well as the state
> file; only the state file matters.

Hm. I think this is pretty much exactly what happens in the current patch,
right? There's an additional fsync() of the parent directory at the end,
but that's it.

> To drop a slot, just unlink the state file and fsync() the directory.
> If the unlink fails, it's just an error.  If the fsync() fails, it's a
> PANIC.  Once the state file is gone, removing everything else is only
> an ERROR, and you don't even need to fsync() it again.

Well, the patch as is renames the directory first and fsyncs that. Only
a failure in fsyncing is punishable by PANIC; if rmtree() on the temp
directory fails it generates WARNINGs, that's it.

> To update a slot, open, write, close, and fsync state.tmp, then rename
> it to state and fsync() again.  None of these steps need PANIC; hold
> off on updating the values in memory until they're all done.  If any
> step 

Re: [HACKERS] Changeset Extraction v7.1

2014-01-24 Thread Robert Haas
On Fri, Jan 24, 2014 at 12:49 PM, Andres Freund  wrote:
>> But this code is riddled with places where you track a catalog xmin
>> and a data xmin separately.  The only point of doing it that way is to
>> support a division that hasn't been made yet.
>
> If you think it will make stuff more manageable I can try separating all
> lines dealing with catalog_xmin into another patch as long as data_xmin
> doesn't have to be renamed.
> That said, I don't really think it's a big problem that the division
> hasn't been made, essentially the meaning is different, even if we don't
> take advantage of it yet. data_xmin is there for streaming replication
> where we need to prevent all removal, catalog_xmin is there for
> changeset extraction.

I spent some more time studying the 0001 and 0002 patches this
afternoon, with a side dish of 0004.  I'm leaning toward thinking we
should go ahead and make that division.  I'm also wondering about
whether we've got the right naming here.  AFAICT, it's not the case
that we're going to use the "catalog xmin" for catalogs and the "data
xmin" for non-catalogs.  Rather, the "catalog xmin" is going to always
be included in globalxmin calculations, so IOW it should just be
called "xmin".  The "data xmin" is going to be included only for
non-catalog tables.  I guess "data" is a reasonable antonym for
catalog, but I'm slightly tempted to propose
RecentGlobalNonCatalogXmin and similar.  Maybe that's too ugly to
live, but I can see someone failing to guess what the exact
distinction is between "xmin" and "data xmin", and I bet they'd be a
lot less likely to misguess if we wrote "non catalog".

It's interesting (though not immediately relevant) to speculate about
how we might extend this to fine-grained xmin tracking more generally.
 The design sketch that comes to mind (and I think parts of this have
been proposed before) is to have a means by which backends can promise
not to lock any more tables except under a new snapshot.  At the read
committed isolation level, or in any single-statement transaction,
backends can so promise whenever (a) all tables mentioned in the query
have been locked and (b) all functions to be invoked during the query
via the fmgr interface promise (e.g. via function labeling) that they
won't directly or indirectly do such a thing.  If they break their
promise, we detect it and ereport(ERROR).  Backends that have made
such a guarantee can be ignored for global-xmin calculations that
don't involve the tables they have locked.  One idea is to keep a hash
table keyed by  with some limited number of entries in
shared memory; it caches the table-specific xmin, a usage counter, and
a flag indicating whether the cached xmin might be stale.  In order to
promise not to lock any new tables, backends must make or update
entries for all the tables they already have locked in this hash
table; if there aren't enough entries, they're not allowed to promise.
 Thus, backends wishing to prune can use the cached xmin value if it's
present (optionally updating it if it's stale) and the minimum of the
xmins of the backends that haven't made a promise if it isn't.  This
is a bit hairy though; access to the shared hash table had better be
*really* fast, and we'd better not need to recompute the cached value
too often.

Anyway, whether we end up pursuing something like that or not, I think
I'm persuaded that this particular optimization won't really be a
problem for hypothetical future work in this area; and also that it's
a good idea to do this much now specifically for logical decoding.

>> I have zero confidence that it's OK to treat fsync() as an operation
>> that won't fail.  Linux documents EIO as a plausible error return, for
>> example.  (And really, how could it not?)
>
> But quite fundamentally having the most basic persistency building
> block fail is something you can't really handle safely. Note that
> issue_xlog_fsync() has always (and I wager, will always) treated that as
> a PANIC.
> I don't recall many complaints about that for WAL. All of the ones I
> found in a quick search were like "oh, the fs invalidated my fd because
> of corruption". And few.

Well, you have a point.  And certainly this version looks much better
than the previous version in terms of the likelihood of PANIC
occurring in practice.  But I wonder if we couldn't cut it down even
further without too much effort.  Suppose we define a slot to exist
if, and only if, the state file exists.  A directory without a state
file is subject to arbitrary removal.  Then we can proceed as follows:

- mkdir() the directory.
- open() state.tmp
- write() state.tmp
- close() state.tmp
- fsync() parent directory, directory and state.tmp
- rename() state.tmp to state
- fsync() state

If any step except the last one fails, no problem.  The next startup
can nuke the leftovers; any future attempt to create a slot with the
name can ignore an EEXIST failure from mkdir() and proceed just as
above.  Only a failure of the very 

Re: [HACKERS] Changeset Extraction v7.1

2014-01-24 Thread Andres Freund
On 2014-01-24 12:10:50 -0500, Robert Haas wrote:
> > Unfortunately not. Inside the walsender there's currently no
> > LWLockReleaseAll() for ERRORs since commands aren't run inside a
> > transaction command...
> >
> > But maybe I should have fixed this by adding the release to
> > WalSndErrorCleanup() instead? That'd still leave the problematic case
> > that currently we try to delete a replication slot inside a CATCH when
> > we fail while initializing the rest of logical replication... But I
> > guess adding it would be a good idea independent of that.
> 
> +1.  I think that if we can't rely on error handling to clean up the
> same things everywhere, it's gonna be a mess.  People won't be able to
> keep track of which error cleanup is engaged in which code paths, and
> screw-ups will result when old code paths are called from new call
> sites.

Ok, I'll additionally add it there.

> > We could also do a StartTransactionCommand() but I'd rather not, that
> > currently prevents code in that vicinity from doing anything it
> > shouldn't via various Assert()s in existing code.
> 
> Like what?  I mean, I'm OK with having a partial error-handling
> environment if that's all we need, but I think it's a bad plan to the
> extent that the code here needs to be aware of error-handling
> differences versus expectations elsewhere in our code.

Catalog lookups, building a snapshot, xid assignment, whatnot. All that
shouldn't happen in the locations creating/dropping a slot.
I think we should at some point separate parts of the error handling out
of xact.c. Currently it's repeated slightly differently over lots of
places (check e.g. the callsites for LWLockReleaseAll), that's not
robust. But that's a project for another day.

Note that the actual decoding *does* happen inside a TransactionCommand,
as it'd be failure prone to copy all the cleanup logic. And we need to
have most of the normal cleanup code.

> I don't really see why it's simpler that way.  It's clearer what the
> point of the lock is if you only hold it for the operations that need
> to be protected by that lock.

> > In all other cases where we modify *_xmin we're only increasing it which
> > doesn't need a lock (HS feedback never has taken one, and
> > GetSnapshotData() modifies ->xmin while holding a shared lock), the only
> > potential danger is a slight delay in increasing the overall value.

> Right.  We might want to comment such places.

> > Don't we already have cases of that? I seem to remember so. If you
> > prefer having them there, I am certainly fine with doing that. This way
> > they aren't allocated if slots are disabled but it's just two
> > TransactionIds.
> 
> Let's go for it, unless we think of a reason not to.

ok on those counts.

> >> It's pretty evident that what's currently patch #4 (only peg the xmin
> >> horizon for catalog tables during logical decoding) needs to become
> >> patch #1, because it doesn't make any sense to apply this before we do
> >> that.
> >
> > Well, the slot code and the slot support for streaming rep are
> > independent from and don't use it. So they easily can come before it.
> 
> But this code is riddled with places where you track a catalog xmin
> and a data xmin separately.  The only point of doing it that way is to
> support a division that hasn't been made yet.

If you think it will make stuff more manageable I can try separating all
lines dealing with catalog_xmin into another patch as long as data_xmin
doesn't have to be renamed.
That said, I don't really think it's a big problem that the division
hasn't been made, essentially the meaning is different, even if we don't
take advantage of it yet. data_xmin is there for streaming replication
where we need to prevent all removal, catalog_xmin is there for
changeset extraction.

> I have zero confidence that it's OK to treat fsync() as an operation
> that won't fail.  Linux documents EIO as a plausible error return, for
> example.  (And really, how could it not?)

But quite fundamentally having the most basic persistency building
block fail is something you can't really handle safely. Note that
issue_xlog_fsync() has always (and I wager, will always) treated that as
a PANIC.
I don't recall many complaints about that for WAL. All of the ones I
found in a quick search were like "oh, the fs invalidated my fd because
of corruption". And few.

> >> Calling a slot "old" or "new" looks liable to cause problems.  Maybe
> >> change those names to contain a character not allowed in a slot name,
> >> if we're going to keep doing it that way.
> > I wondered about making them plain files as well but given the need for
> > a directory independent from this I don't really see the advantage,
> > we'll need to handle them anyway during cleanup.
> 
> Yeah, sure, but if it makes for fewer in-between states, it might be worth it.

I don't think it'd make anything simpler with the new version of the
code. Agreed?

Greetings,

Andres Freund

-- 
 Andres Freund   

Re: [HACKERS] Changeset Extraction v7.1

2014-01-24 Thread Robert Haas
On Thu, Jan 23, 2014 at 6:32 PM, Andres Freund  wrote:
>>  I also wonder if we should use the
>> terminology "attach" instead of "acquire"; that pairs more naturally
>> with "release".  Then the message, if we want more than an assert,
>> might be "this backend is already attached to a replication slot".
>
> I went with Acquire/Release because our locking code does so, and it
> seemed sensible to be consistent. I don't have strong feelings about it.

Yeah, but I think a slot is not really the same thing as a lock.
Acquire/release might be OK.  In some of my recent code I used
attach/detach, which feels a little more natural to me for something
like this, so I lean that direction.

> Unfortunately not. Inside the walsender there's currently no
> LWLockReleaseAll() for ERRORs since commands aren't run inside a
> transaction command...
>
> But maybe I should have fixed this by adding the release to
> WalSndErrorCleanup() instead? That'd still leave the problematic case
> that currently we try to delete a replication slot inside a CATCH when
> we fail while initializing the rest of logical replication... But I
> guess adding it would be a good idea independent of that.

+1.  I think that if we can't rely on error handling to clean up the
same things everywhere, it's gonna be a mess.  People won't be able to
keep track of which error cleanup is engaged in which code paths, and
screw-ups will result when old code paths are called from new call
sites.

> We could also do a StartTransactionCommand() but I'd rather not, that
> currently prevents code in that vicinity from doing anything it
> shouldn't via various Assert()s in existing code.

Like what?  I mean, I'm OK with having a partial error-handling
environment if that's all we need, but I think it's a bad plan to the
extent that the code here needs to be aware of error-handling
differences versus expectations elsewhere in our code.

>> This doesn't need the LWLockRelease either.  It does need the
>> SpinLockRelease, but as I think I noted previously, the right way to
>> write this is probably: SpinLockAcquire(&slot->mutex); was_active =
>> slot->active; slot->active = true; SpinLockRelease(&slot->mutex); if
>> (was_active) ereport(); MyReplicationSlot = slot.
>
> That's not really simpler tho? But if you prefer I can go that way.

It avoids a branch while holding the lock, and it puts the
SpinLockAcquire/Release pair much closer together, so it's easier to
visually verify that the lock is released in all cases.

>> ReplicationSlotsComputeRequiredXmin still acquires ProcArrayLock, and
>> the comment "Provide interlock against concurrent recomputations"
>> doesn't seem adequate to me.  I guess the idea here is that we regard
>> ProcArrayLock as protecting ReplicationSlotCtl->catalog_xmin and
>> ReplicationSlotCtl->data_xmin, but if that's the idea then we only
>> need to hold the lock during the time when we actually update those
>> values, not the loop where we compute them.
>
> There's a comment someplace else to that end, but yes, that's
> essentially the idea. I decided to take it during the whole
> recomputation because we also take ProcArrayLock when creating a new
> decoding slot and initially setting ->catalog_xmin. That's not strictly 
> required
> but seemed simpler that way, and the code shouldn't be very hot.
> The code that initially computes the starting value for catalog_xmin
> when creating a new decoding slot has to take ProcArrayLock to be safe,
> that's why I thought it'd be convenient to always use it for those
> values.

I don't really see why it's simpler that way.  It's clearer what the
point of the lock is if you only hold it for the operations that need
to be protected by that lock.

> In all other cases where we modify *_xmin we're only increasing it which
> doesn't need a lock (HS feedback never has taken one, and
> GetSnapshotData() modifies ->xmin while holding a shared lock), the only
> potential danger is a slight delay in increasing the overall value.

Right.  We might want to comment such places.

>> Also, if that's the
>> design, maybe they should be part of PROC_HDR *ProcGlobal rather than
>> here.  It seems weird to have some of the values protected by
>> ProcArrayLock live in a completely different data structure managed
>> almost entirely by some other part of the system.
>
> Don't we already have cases of that? I seem to remember so. If you
> prefer having them there, I am certainly fine with doing that. This way
> they aren't allocated if slots are disabled but it's just two
> TransactionIds.

Let's go for it, unless we think of a reason not to.

>> It's pretty evident that what's currently patch #4 (only peg the xmin
>> horizon for catalog tables during logical decoding) needs to become
>> patch #1, because it doesn't make any sense to apply this before we do
>> that.
>
> Well, the slot code and the slot support for streaming rep are
> independent from and don't use it. So they easily can come before it.

But this code

Re: [HACKERS] Changeset Extraction v7.1

2014-01-23 Thread Andres Freund
Hi,

On 2014-01-23 16:04:10 -0500, Robert Haas wrote:
> Patch 0001:
> 
> +errmsg("could not find free replication 
> slot"),
> 
> Suggest: all replication slots are in use

That sounds better indeed.

> +   elog(ERROR, "cannot aquire a slot while another slot
> has been acquired");
> 
> Suggest: this backend has already acquired a replication slot
> 
> Or demote it to Assert().  I'm not really sure why this needs to be
> checked in non-assert builds.

Hm. Fine with me, not sure why I went with an elog(). Maybe because I
thought output plugin authors could have the idea of using another slot
while inside one?

>  I also wonder if we should use the
> terminology "attach" instead of "acquire"; that pairs more naturally
> with "release".  Then the message, if we want more than an assert,
> might be "this backend is already attached to a replication slot".

I went with Acquire/Release because our locking code does so, and it
seemed sensible to be consistent. I don't have strong feelings about it.

> +   if (slot == NULL)
> +   {
> +   LWLockRelease(ReplicationSlotCtlLock);
> +   ereport(ERROR,
> +   (errcode(ERRCODE_UNDEFINED_OBJECT),
> +errmsg("could not find replication
> slot \"%s\"", name)));
> +   }
> 
> The error will release the LWLock anyway; I'd get rid of the manual
> LWLockRelease, and the braces.  Similarly in ReplicationSlotDrop.

Unfortunately not. Inside the walsender there's currently no
LWLockReleaseAll() for ERRORs since commands aren't run inside a
transaction command...

But maybe I should have fixed this by adding the release to
WalSndErrorCleanup() instead? That'd still leave the problematic case
that currently we try to delete a replication slot inside a CATCH when
we fail while initializing the rest of logical replication... But I
guess adding it would be a good idea independent of that.

We could also do a StartTransactionCommand() but I'd rather not, that
currently prevents code in that vicinity from doing anything it
shouldn't via various Assert()s in existing code.

> +   /* acquire spinlock so we can test and set ->active safely */
> +   SpinLockAcquire(&slot->mutex);
> +
> +   if (slot->active)
> +   {
> +   SpinLockRelease(&slot->mutex);
> +   LWLockRelease(ReplicationSlotCtlLock);
> +   ereport(ERROR,
> +   (errcode(ERRCODE_OBJECT_IN_USE),
> +errmsg("slot \"%s\" already active", name)));
> +   }
> +
> +   /* we definitely have the slot, no errors possible anymore */
> +   slot->active = true;
> +   MyReplicationSlot = slot;
> +   SpinLockRelease(&slot->mutex);
> 
> This doesn't need the LWLockRelease either.  It does need the
> SpinLockRelease, but as I think I noted previously, the right way to
> write this is probably: SpinLockAcquire(&slot->mutex); was_active =
> slot->active; slot->active = true; SpinLockRelease(&slot->mutex); if
> (was_active) ereport(); MyReplicationSlot = slot.

That's not really simpler tho? But if you prefer I can go that way.

> ReplicationSlotsComputeRequiredXmin still acquires ProcArrayLock, and
> the comment "Provide interlock against concurrent recomputations"
> doesn't seem adequate to me.  I guess the idea here is that we regard
> ProcArrayLock as protecting ReplicationSlotCtl->catalog_xmin and
> ReplicationSlotCtl->data_xmin, but if that's the idea then we only
> need to hold the lock during the time when we actually update those
> values, not the loop where we compute them.

There's a comment someplace else to that end, but yes, that's
essentially the idea. I decided to take it during the whole
recomputation because we also take ProcArrayLock when creating a new
decoding slot and initially setting ->catalog_xmin. That's not strictly required
but seemed simpler that way, and the code shouldn't be very hot.
The code that initially computes the starting value for catalog_xmin
when creating a new decoding slot has to take ProcArrayLock to be safe,
that's why I thought it'd be convenient to always use it for those
values.

In all other cases where we modify *_xmin we're only increasing it which
doesn't need a lock (HS feedback never has taken one, and
GetSnapshotData() modifies ->xmin while holding a shared lock), the only
potential danger is a slight delay in increasing the overall value.

> Also, if that's the
> design, maybe they should be part of PROC_HDR *ProcGlobal rather than
> here.  It seems weird to have some of the values protected by
> ProcArrayLock live in a completely different data structure managed
> almost entirely by some other part of the system.

Don't we already have cases of that? I seem to remember so. If you
prefer having them there, I am certainly fine with doing that. This way
they aren't allocated if slots are disabled but it's just two
TransactionIds.

> It'

Re: [HACKERS] Changeset Extraction v7.1

2014-01-23 Thread Alvaro Herrera
> I wonder if it wouldn't be better to get rid of the subdirectories for
> the individual slots, and just have a file pg_replslot/$SLOTNAME, or
> not.  I know there are later patches that need subdirectories for
> their own private data, but they could just create
> pg_replslot/$SLOTNAME.dir and put whatever in it they like, without
> really implicating this code that much.  The advantage of that is that
> there would be fewer intermediate states.  The slot exists if the file
> exists, and not if it doesn't.  You still need half-alive and
> half-dead until the fsync finishes, but you don't need to worry about
> tracking both the state of the directory and the state of the file.

Why do we need directories at all?  I know there might be subsidiary
files to store stuff in separate files, but maybe we can just name files
using the slot name (or a transformation thereof) as a prefix.  It
shouldn't be difficult to remove the right files whenever there's a
need, and not having to worry about a directory that might need a
separate fsync might make things easier.

On the other hand, there might still be a need to fsync the parent
directory, so maybe there is not that much gain.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Changeset Extraction v7.1

2014-01-23 Thread Robert Haas
Patch 0001:

+errmsg("could not find free replication slot"),

Suggest: all replication slots are in use

+   elog(ERROR, "cannot aquire a slot while another slot
has been acquired");

Suggest: this backend has already acquired a replication slot

Or demote it to Assert().  I'm not really sure why this needs to be
checked in non-assert builds.  I also wonder if we should use the
terminology "attach" instead of "acquire"; that pairs more naturally
with "release".  Then the message, if we want more than an assert,
might be "this backend is already attached to a replication slot".

+   if (slot == NULL)
+   {
+   LWLockRelease(ReplicationSlotCtlLock);
+   ereport(ERROR,
+   (errcode(ERRCODE_UNDEFINED_OBJECT),
+errmsg("could not find replication
slot \"%s\"", name)));
+   }

The error will release the LWLock anyway; I'd get rid of the manual
LWLockRelease, and the braces.  Similarly in ReplicationSlotDrop.

+   /* acquire spinlock so we can test and set ->active safely */
+   SpinLockAcquire(&slot->mutex);
+
+   if (slot->active)
+   {
+   SpinLockRelease(&slot->mutex);
+   LWLockRelease(ReplicationSlotCtlLock);
+   ereport(ERROR,
+   (errcode(ERRCODE_OBJECT_IN_USE),
+errmsg("slot \"%s\" already active", name)));
+   }
+
+   /* we definitely have the slot, no errors possible anymore */
+   slot->active = true;
+   MyReplicationSlot = slot;
+   SpinLockRelease(&slot->mutex);

This doesn't need the LWLockRelease either.  It does need the
SpinLockRelease, but as I think I noted previously, the right way to
write this is probably: SpinLockAcquire(&slot->mutex); was_active =
slot->active; slot->active = true; SpinLockRelease(&slot->mutex); if
(was_active) ereport(); MyReplicationSlot = slot.

ReplicationSlotsComputeRequiredXmin still acquires ProcArrayLock, and
the comment "Provide interlock against concurrent recomputations"
doesn't seem adequate to me.  I guess the idea here is that we regard
ProcArrayLock as protecting ReplicationSlotCtl->catalog_xmin and
ReplicationSlotCtl->data_xmin, but if that's the idea then we only
need to hold the lock during the time when we actually update those
values, not the loop where we compute them.  Also, if that's the
design, maybe they should be part of PROC_HDR *ProcGlobal rather than
here.  It seems weird to have some of the values protected by
ProcArrayLock live in a completely different data structure managed
almost entirely by some other part of the system.

It's pretty evident that what's currently patch #4 (only peg the xmin
horizon for catalog tables during logical decoding) needs to become
patch #1, because it doesn't make any sense to apply this before we do
that.  I'm still not 100% confident in that approach, but maybe I'd
better try to look at it RSN and get confident, because too much of
the rest of what you've got here hangs on that to proceed without it.
Or to put all that another way, if for any reason we decide that the
separate catalog xmin stuff is not viable, the rest of this is going
to need a lot of rework, so we'd better sort that now rather than
later.

With respect to the synchronize-slots-to-disk stuff we're arguing
about on the other thread, I think the basic design problem here is
that you assume that you can change stuff in memory and then change
stuff on disk, without either set of changes being atomic.  What I
think you need to do is making atomic actions on disk correspond to
atomic state changes in memory.  IOW, suppose that creating a slot
consists of two steps: mkdir() + fsync().  Then I think what you need
to do is - do the mkdir().  If it errors out, fine.  If it succeeds,
then mark the slot half-created.  This is just an assignment so it can be
done immediately after you learn that mkdir() worked with no risk of
an intervening failure.  Then, try to fsync().  If it errors out, the
slot will get left in the half-created state.  If it works, then
immediately mark the slot as fully created.  Now, when the next guy
comes along and looks at the slot, he can tell what he needs to do.
Specifically, if the slot is half-created, and he wants to do anything
other than remove it, he's got to fsync() it first, and if that errors
out, so be it.  The next access to the slot will merely find it still
half-created and simply try the fsync() yet again.

Alternatively, since nearly everything we're trying to do here is a
two-step operation - do something and then fsync - maybe we have a
more generic fsync-pending flag, and each slot operation checks that
and retries the fsync() if it's set.  But it might be situation
dependent which thing we need to fsync, since there are multiple files
involved.

Broadly, what you're trying to accomplish here is to have something
that is crash-safe but without