Re: [HACKERS] Changeset Extraction v7.0 (was logical changeset generation)

2014-02-13 Thread Peter Eisentraut
On Sun, 2014-01-19 at 15:31 +0100, Stefan Kaltenbrunner wrote:
> > /* followings are for client encoding only */
> > PG_SJIS,/* Shift JIS (Winindows-932) */
> 
> while you have that file open: s/Winindows-932/Windows-932 maybe?

done





Re: [HACKERS] Changeset Extraction v7.0 (was logical changeset generation)

2014-01-23 Thread Andres Freund
On 2014-01-23 11:50:57 -0500, Robert Haas wrote:
> On Thu, Jan 23, 2014 at 7:05 AM, Andres Freund  wrote:
> > I don't think shared buffers fsyncs are the apt comparison. It's more
> > something like UpdateControlFile(). Which PANICs.
> >
> > I really don't get why you fight PANICs in general that much. There are
> > some nasty PANICs in postgres which can happen in legitimate situations,
> > which should be made to fail more gracefully, but this surely isn't one
> > of them. We're doing rename(), unlink() and rmdir(). That's it.
> > We should concentrate on the ones that legitimately can happen, not the
> > ones created by an admin running a chmod -R 000 . ; rm -rf $PGDATA or
> > mount -o remount,ro /. We don't increase reliability one bit by adding
> > codepaths that will never get tested.
> 
> Sorry, I don't buy it.  Lots of people I know have stories that go
> like this "$HORRIBLE happened, and PostgreSQL kept on running, and it
> didn't even lose my data!", where $HORRIBLE may be variously that the
> disk filled up, that disk writes started failing with I/O errors, that
> somebody changed the permissions on the data directory inadvertently,
> that the entire data directory got removed, and so on.

Especially the "not loosing data" imo is because postgres is
conservative with continuing in situations it doesn't know anything
about. Most prominently the cluster wide restart after a segfault.

> I've been
> through some of those scenarios myself, and the care and effort that's
> been put into failure modes has saved my bacon more than a few times,
> too.  We *do* increase reliability by worrying about what will happen
> even in code paths that very rarely get exercised.

A part of thinking about them *is* restricting what happens in those
cases by keeping the possible states to worry about to a minimum.

Just slapping on an ERROR instead of a PANIC can make things much
worse. Not releasing space until a restart, without a chance to do
anything about it because we failed to properly release the in-memory
slot, will just make the problem bigger because now the cleanup might
take a week (VACUUM FULLing the entire cluster?).

> I think it's completely unacceptable for the failure of routine
> filesystem operations to result in a PANIC.  I grant you that we have
> some existing cases where that can happen (like UpdateControlFile),
> but that doesn't mean we should add more.  Right this very minute
> there is massive bellyaching on a nearby thread caused by the fact
> that a full disk condition while writing WAL can PANIC the server,
> while on this thread at the very same time you're arguing that adding
> more ways for a full disk to cause PANICs won't inconvenience anyone.

A full disk won't cause any of the problems for the case we're
discussing, will it? We're just doing rename(), unlink(), and rmdir()
here, all of which should succeed while the FS is full (AFAIR rename()
does on all common FSs because inodes are kept separately).

> The other thread is right, and your argument here is wrong.  We have
> been able to - and have taken the time to - fix comparable problems in
> other cases, and we should do the same thing here.

I don't think the WAL case is comparable at all. ENOSPC is something
expected that can happen during normal operation, doesn't involve a
malicious operator, and is reasonably easy to test. unlink() or fsync()
randomly failing is not.
In fact, isn't the consequence of that thread that we need a
significant amount of extra complexity to handle the case? We shouldn't
spend that effort for cases that don't deserve it because they are too
unlikely in practice.

And yes, there's not too many other places PANICing - because most can
rely on WAL handling those tricky cases for them...

> Second, I have
> encountered a few situations where customers had production servers
> that repeatedly PANICked due to some bug or other.  If I've ever
> encountered angrier customers, I can't remember when.  A PANIC is no
> big deal when it happens on your development box, but when it happens
> on a machine with 100 users connected to it, it's a big deal,
> especially if a single underlying cause makes it happen over and over
> again.

Sure. But blindly continuing and then, possibly quite a bit later,
losing data or causing an outage that takes a long while to recover
from isn't any better.

> I think we should be devoting our time to figuring how to improve
> this, not whether to improve it.

If you'd argue that creating a new slot should fail gracefully, ok, I can
relatively easily be convinced of that. But trying to handle failures in
the midst of deletion in cases that won't happen in reality is just
inviting trouble imo.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Changeset Extraction v7.0 (was logical changeset generation)

2014-01-23 Thread Robert Haas
On Thu, Jan 23, 2014 at 7:05 AM, Andres Freund  wrote:
> I don't think shared buffers fsyncs are the apt comparison. It's more
> something like UpdateControlFile(). Which PANICs.
>
> I really don't get why you fight PANICs in general that much. There are
> some nasty PANICs in postgres which can happen in legitimate situations,
> which should be made to fail more gracefully, but this surely isn't one
> of them. We're doing rename(), unlink() and rmdir(). That's it.
> We should concentrate on the ones that legitimately can happen, not the
> ones created by an admin running a chmod -R 000 . ; rm -rf $PGDATA or
> mount -o remount,ro /. We don't increase reliability one bit by adding
> codepaths that will never get tested.

Sorry, I don't buy it.  Lots of people I know have stories that go
like this "$HORRIBLE happened, and PostgreSQL kept on running, and it
didn't even lose my data!", where $HORRIBLE may be variously that the
disk filled up, that disk writes started failing with I/O errors, that
somebody changed the permissions on the data directory inadvertently,
that the entire data directory got removed, and so on.  I've been
through some of those scenarios myself, and the care and effort that's
been put into failure modes has saved my bacon more than a few times,
too.  We *do* increase reliability by worrying about what will happen
even in code paths that very rarely get exercised.  It's certainly
true that our bug count there is higher than for the parts of
our code that get exercised more regularly, but it's also lower than
it would be if we didn't make the effort, and the dividend that we get
from that effort is that we have a well-deserved reputation for
reliability.

I think it's completely unacceptable for the failure of routine
filesystem operations to result in a PANIC.  I grant you that we have
some existing cases where that can happen (like UpdateControlFile),
but that doesn't mean we should add more.  Right this very minute
there is massive bellyaching on a nearby thread caused by the fact
that a full disk condition while writing WAL can PANIC the server,
while on this thread at the very same time you're arguing that adding
more ways for a full disk to cause PANICs won't inconvenience anyone.
The other thread is right, and your argument here is wrong.  We have
been able to - and have taken the time to - fix comparable problems in
other cases, and we should do the same thing here.

As for why I fight PANICs so much in general, there are two reasons.
First, I believe that to be project policy.  I welcome correction if I
have misinterpreted our stance in that area.  Second, I have
encountered a few situations where customers had production servers
that repeatedly PANICked due to some bug or other.  If I've ever
encountered angrier customers, I can't remember when.  A PANIC is no
big deal when it happens on your development box, but when it happens
on a machine with 100 users connected to it, it's a big deal,
especially if a single underlying cause makes it happen over and over
again.

I think we should be devoting our time to figuring how to improve
this, not whether to improve it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Changeset Extraction v7.0 (was logical changeset generation)

2014-01-23 Thread Andres Freund
Hi,

On 2014-01-22 13:00:44 -0500, Robert Haas wrote:
> Well, apparently, one is going to PANIC and reinitialize the system.
> I presume that upon reinitialization we'll decide that the slot is
> gone, and thus won't recreate it in shared memory.

Yea, and if it's half-gone we'll continue deletion. And since yesterday
evening we'll even fsync things during startup to handle scenarios
similar to 20140122162115.gl21...@alap3.anarazel.de .

> Of course, if the entire system suffers a hard power failure after that
> and before the directory is successfully fsync'd, then the slot could
> reappear on the next startup. Which is also exactly what would happen
> if we removed
> the slot from shared memory after doing the unlink, and then the
> system suffered a hard power failure before the directory contents
> made it to disk.  Except that we also panicked.

Yes, but that could only happen as long as no relevant data has been
lost, since we hold the relevant locks during this.

> In the case of shared buffers, the way we handle fsync failures is by
> not allowing the system to checkpoint until all of the fsyncs succeed.

I don't think shared buffers fsyncs are the apt comparison. It's more
something like UpdateControlFile(). Which PANICs.

I really don't get why you fight PANICs in general that much. There are
some nasty PANICs in postgres which can happen in legitimate situations,
which should be made to fail more gracefully, but this surely isn't one
of them. We're doing rename(), unlink() and rmdir(). That's it.
We should concentrate on the ones that legitimately can happen, not the
ones created by an admin running a chmod -R 000 . ; rm -rf $PGDATA or
mount -o remount,ro /. We don't increase reliability one bit by adding
codepaths that will never get tested.

> If there's an OS-level reset before that happens, WAL replay will
> perform the same buffer modifications over again and the next
> checkpoint will again try to flush them to disk and will not complete
> unless it does.  That forms a closed system where we never advance the
> redo pointer over the covering WAL record until the changes it covers
> are on the disk.  But I don't think this code has any similar
> interlock; if it does, I missed it.

No, it doesn't (until the first rename() at least), but the number of
failure scenarios is far smaller.

Greetings,

Andres Freund

--
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Changeset Extraction v7.0 (was logical changeset generation)

2014-01-22 Thread Robert Haas
On Wed, Jan 22, 2014 at 10:48 AM, Andres Freund  wrote:
> Yes, individual operations should be, but you cannot be sure whether a
> rename()/unlink() will survive a crash until the directory is
> fsync()ed. So, what is one going to do if the unlink succeeded, but the
> fsync didn't?

Well, apparently, one is going to PANIC and reinitialize the system.
I presume that upon reinitialization we'll decide that the slot is
gone, and thus won't recreate it in shared memory.  Of course, if the
entire system suffers a hard power failure after that and before the
directory is successfully fsync'd, then the slot could reappear on the
next startup.  Which is also exactly what would happen if we removed
the slot from shared memory after doing the unlink, and then the
system suffered a hard power failure before the directory contents
made it to disk.  Except that we also panicked.

In the case of shared buffers, the way we handle fsync failures is by
not allowing the system to checkpoint until all of the fsyncs succeed.
 If there's an OS-level reset before that happens, WAL replay will
perform the same buffer modifications over again and the next
checkpoint will again try to flush them to disk and will not complete
unless it does.  That forms a closed system where we never advance the
redo pointer over the covering WAL record until the changes it covers
are on the disk.  But I don't think this code has any similar
interlock; if it does, I missed it.
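
To illustrate, a grossly simplified sketch of that interlock (the
function names here are purely illustrative, not the actual checkpointer
code):

    /*
     * Sketch of the checkpoint/fsync interlock described above.
     * Illustrative names only; the real logic lives in the
     * checkpointer's fsync-request machinery.
     */
    static void
    CreateCheckPoint_sketch(void)
    {
        FlushDirtySharedBuffers();      /* queues fsync requests */

        /*
         * Process every pending fsync request; any failure errors out
         * here and the checkpoint does not complete.
         */
        ProcessPendingFsyncRequests();

        /*
         * Only now advance the redo pointer past the covering WAL
         * record, so after an OS-level reset replay redoes the buffer
         * changes and the next checkpoint retries the flush.
         */
        AdvanceRedoPointer();
    }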

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Changeset Extraction v7.0 (was logical changeset generation)

2014-01-22 Thread Andres Freund
On 2014-01-22 10:14:27 -0500, Robert Haas wrote:
> On Wed, Jan 22, 2014 at 9:48 AM, Andres Freund  wrote:
> > On 2014-01-18 08:35:47 -0500, Robert Haas wrote:
> >> > I am not sure I understand that point. We can either update the
> >> > in-memory bit before performing the on-disk operations or
> >> > afterwards. Either way, there's a way to be inconsistent if the disk
> >> > operation fails somewhere in between (it might fail but still have
> >> > deleted the file/directory!). The normal way to handle that in other
> >> > places is PANICing when we don't know so we recover from the on-disk
> >> > state.
> >> > I really don't see the problem here? Code doesn't get more robust by
> >> > doing s/PANIC/ERROR/, rather the contrary. It takes extra smarts to only
> >> > ERROR, often that's not warranted.
> >>
> >> People get cranky when the database PANICs because of a filesystem
> >> failure.  We should avoid that, especially when it's trivial to do so.
> >>  The update to shared memory should be done second and should be set
> >> up to be no-fail.
> >
> > I don't see how that would help. If we fail during unlink/rmdir, we
> > don't really know at which point we failed.
> 
> This doesn't make sense to me.  unlink/rmdir are atomic operations.

Yes, individual operations should be, but you cannot be sure whether a
rename()/unlink() will survive a crash until the directory is
fsync()ed. So, what is one going to do if the unlink succeeded, but the
fsync didn't?

Deletion currently works like:

    if (rename(path, tmppath) != 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not rename \"%s\" to \"%s\": %m",
                        path, tmppath)));

    /* make sure no partial state is visible after a crash */
    fsync_fname(tmppath, false);
    fsync_fname("pg_replslot", true);

    if (!rmtree(tmppath, true))
    {
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not remove directory \"%s\": %m",
                        tmppath)));
    }

If we fail between the rename() and the fsync_fname() we don't really
know which state we are in. We'd also have to add code to handle
incomplete slot directories (which currently exists only for startup) to
other places.
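
A sketch of the kind of startup-side handling this would take (the
".tmp" suffix for in-progress directories and the helper name are
assumptions for illustration):

    /*
     * Hypothetical sketch: at startup, remove slot directories whose
     * creation or deletion did not complete before a crash, assuming
     * in-progress directories carry a ".tmp" suffix as in the rename()
     * scheme above.
     */
    static void
    CleanupIncompleteSlotDirs(void)
    {
        DIR        *dir;
        struct dirent *de;

        dir = AllocateDir("pg_replslot");
        while ((de = ReadDir(dir, "pg_replslot")) != NULL)
        {
            char        path[MAXPGPATH];
            size_t      len = strlen(de->d_name);

            /* only directories ending in ".tmp" are incomplete */
            if (len < 4 || strcmp(de->d_name + len - 4, ".tmp") != 0)
                continue;

            snprintf(path, sizeof(path), "pg_replslot/%s", de->d_name);
            if (!rmtree(path, true))
                ereport(PANIC,
                        (errmsg("could not remove directory \"%s\"",
                                path)));
        }
        FreeDir(dir);

        /* make the removals durable before relying on the clean state */
        fsync_fname("pg_replslot", true);
    }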

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Changeset Extraction v7.0 (was logical changeset generation)

2014-01-22 Thread Robert Haas
On Wed, Jan 22, 2014 at 9:48 AM, Andres Freund  wrote:
> On 2014-01-18 08:35:47 -0500, Robert Haas wrote:
>> > I am not sure I understand that point. We can either update the
>> > in-memory bit before performing the on-disk operations or
>> > afterwards. Either way, there's a way to be inconsistent if the disk
>> > operation fails somewhere in between (it might fail but still have
>> > deleted the file/directory!). The normal way to handle that in other
>> > places is PANICing when we don't know so we recover from the on-disk
>> > state.
>> > I really don't see the problem here? Code doesn't get more robust by
>> > doing s/PANIC/ERROR/, rather the contrary. It takes extra smarts to only
>> > ERROR, often that's not warranted.
>>
>> People get cranky when the database PANICs because of a filesystem
>> failure.  We should avoid that, especially when it's trivial to do so.
>>  The update to shared memory should be done second and should be set
>> up to be no-fail.
>
> I don't see how that would help. If we fail during unlink/rmdir, we
> don't really know at which point we failed.

This doesn't make sense to me.  unlink/rmdir are atomic operations.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Changeset Extraction v7.0 (was logical changeset generation)

2014-01-22 Thread Andres Freund
On 2014-01-18 08:35:47 -0500, Robert Haas wrote:
> > I am not sure I understand that point. We can either update the
> > in-memory bit before performing the on-disk operations or
> > afterwards. Either way, there's a way to be inconsistent if the disk
> > operation fails somewhere in between (it might fail but still have
> > deleted the file/directory!). The normal way to handle that in other
> > places is PANICing when we don't know so we recover from the on-disk
> > state.
> > I really don't see the problem here? Code doesn't get more robust by
> > doing s/PANIC/ERROR/, rather the contrary. It takes extra smarts to only
> > ERROR, often that's not warranted.
> 
> People get cranky when the database PANICs because of a filesystem
> failure.  We should avoid that, especially when it's trivial to do so.
>  The update to shared memory should be done second and should be set
> up to be no-fail.

I don't see how that would help. If we fail during unlink/rmdir, we
don't really know at which point we failed. Just keeping the slot in
memory, won't help us in any way - we'll continue to reserve resources
while the slot is half-gone.
I don't think trying to handle errors we don't understand and we don't
routinely expect actually improves robustness. It just leads to harder
to diagnose errors. It's not like the cases here are likely to be caused
by anything but severe admin failure, like removing the write
permissions of the postgres directory while the server is running. Or do
you see more valid causes?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Changeset Extraction v7.0 (was logical changeset generation)

2014-01-19 Thread Stefan Kaltenbrunner
On 01/18/2014 02:31 PM, Robert Haas wrote:
> On Thu, Jan 16, 2014 at 10:15 PM, Craig Ringer  wrote:
>> Anybody who actually uses SHIFT_JIS as an operational encoding, rather
>> than as an input/output encoding, is into pain and suffering. Personally
>> I'd be quite happy to see it supported as client_encoding, but forbidden
>> as a server-side encoding. That's not the case right now - so since we
>> support it, we'd better guard against its quirks.
> 
> I think that *is* the case right now.  pg_wchar.h sayeth:
> 
> /* followings are for client encoding only */
> PG_SJIS,/* Shift JIS (Winindows-932) */

while you have that file open: s/Winindows-932/Windows-932 maybe?



Stefan




Re: [HACKERS] Changeset Extraction v7.0 (was logical changeset generation)

2014-01-18 Thread Tom Lane
Robert Haas  writes:
> On Thu, Jan 16, 2014 at 10:15 PM, Craig Ringer  wrote:
>> Anybody who actually uses SHIFT_JIS as an operational encoding, rather
>> than as an input/output encoding, is into pain and suffering. Personally
>> I'd be quite happy to see it supported as client_encoding, but forbidden
>> as a server-side encoding. That's not the case right now - so since we
>> support it, we'd better guard against its quirks.

> I think that *is* the case right now.

SHIFT_JIS is not and never will be allowed as a server encoding,
precisely because it has multi-byte characters of which some bytes could
be taken for ASCII.  The same is true of our other client-only encodings.
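
The hazard is easy to demonstrate; in Shift JIS the second byte of a
two-byte character can itself be an ASCII code point (a small,
self-contained C illustration):

    #include <stdio.h>

    int
    main(void)
    {
        /*
         * U+8868 is encoded in Shift JIS as 0x95 0x5C; the trailing
         * byte 0x5C is ASCII '\'.  A byte-wise scan of such a string
         * "finds" a backslash in the middle of a single character.
         */
        const unsigned char sjis[] = {0x95, 0x5C, 0x00};
        int     i;

        for (i = 0; sjis[i] != 0; i++)
            if (sjis[i] == '\\')
                printf("byte %d looks like a backslash\n", i);
        return 0;
    }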

regards, tom lane




Re: [HACKERS] Changeset Extraction v7.0 (was logical changeset generation)

2014-01-18 Thread Craig Ringer
On 01/18/2014 09:31 PM, Robert Haas wrote:
> On Thu, Jan 16, 2014 at 10:15 PM, Craig Ringer  wrote:
>> Anybody who actually uses SHIFT_JIS as an operational encoding, rather
>> than as an input/output encoding, is into pain and suffering. Personally
>> I'd be quite happy to see it supported as client_encoding, but forbidden
>> as a server-side encoding. That's not the case right now - so since we
>> support it, we'd better guard against its quirks.
> 
> I think that *is* the case right now.  pg_wchar.h sayeth:
> 
> /* followings are for client encoding only */
> PG_SJIS,/* Shift JIS (Winindows-932) */
> PG_BIG5,/* Big5 (Windows-950) */
> PG_GBK, /* GBK (Windows-936) */

Perfect - that makes ASCII-only just fine, IMO.


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Changeset Extraction v7.0 (was logical changeset generation)

2014-01-18 Thread Robert Haas
On Thu, Jan 16, 2014 at 9:54 AM, Andres Freund  wrote:
>> Maybe it would be better to get rid of active/in_use and have
>> three states: REPLSLOT_CONNECTED, REPLSLOT_NOT_CONNECTED,
>> REPLSLOT_FREE.  Or something like that.
>
> Hm. Color me unenthusiastic. If you feel strongly I can change it, but
> otherwise not.

I found the active/in_use distinction confusing; I thought one
three-state flag rather than two Booleans might be clearer.  But I
might be able to just suck it up.

>> >> - If there's a coding rule that slot->database can't be changed while
>> >> the slot is active, then the check to make sure that the user isn't
>> >> trying to bind to a slot with a mis-matching database could be done
>> >> before the code described in the previous point, avoiding the need to
>> >> go back and release the resource.
>> >
>> > I don't think slot->database should be allowed to change at all...
>>
>> Well, it can if the slot is dropped and a new one created.
>
> Well. That obviously requires the lwlock to be acquired...

Right, so the point of this comment originally was that you had some
logic that could be moved sooner to avoid having to undo so much on a
failure.

>> >> - I think the critical section in ReplicationSlotDrop is bogus.  If
>> >> DeleteSlot() fails, we scarcely need to PANIC.  The slot just isn't
>> >> gone.
>> >
>> > Well, if delete slot fails, we don't really know at which point it
>> > failed which means that the on-disk state might not correspond to the
>> > in-memory state. I don't see a point in adding code trying to handle
>> > that case correctly...
>>
>> Deleting the slot should be an atomic operation.  There's some
>> critical point before which the slot will be picked up by recovery and
>> after which it won't.  You either did that operation, or not, and can
>> adjust the in-memory state accordingly.
>
> I am not sure I understand that point. We can either update the
> in-memory bit before performing the on-disk operations or
> afterwards. Either way, there's a way to be inconsistent if the disk
> > operation fails somewhere in between (it might fail but still have
> deleted the file/directory!). The normal way to handle that in other
> places is PANICing when we don't know so we recover from the on-disk
> state.
> I really don't see the problem here? Code doesn't get more robust by
> doing s/PANIC/ERROR/, rather the contrary. It takes extra smarts to only
> ERROR, often that's not warranted.

People get cranky when the database PANICs because of a filesystem
failure.  We should avoid that, especially when it's trivial to do so.
 The update to shared memory should be done second and should be set
up to be no-fail.

>> > The slot details get updated by the respective replication code. For
>> > streaming rep, that should happen via reply and feedback
>> > messages. For changeset extraction it happens when
>> > LogicalConfirmReceivedLocation() is called; the walsender interface
>> > does that using reply messages, the SQL interface calls it when
>> > finished (unless you use the _peek_ functions).
>>
>> Right, but where is this code?  I don't see this updating the reply
>> and feedback message processing code to touch slots.  Did I miss that?
>
> It's in "wal_decoding: logical changeset extraction walsender interface"
> currently :(. Splitting the streaming replication part of that patch off
> isn't easy...

Ack.  I was hoping to work through these patches one at a time, but
that's not going to work if they are interdependent to that degree.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Changeset Extraction v7.0 (was logical changeset generation)

2014-01-18 Thread Robert Haas
On Thu, Jan 16, 2014 at 10:15 PM, Craig Ringer  wrote:
> Anybody who actually uses SHIFT_JIS as an operational encoding, rather
> than as an input/output encoding, is into pain and suffering. Personally
> I'd be quite happy to see it supported as client_encoding, but forbidden
> as a server-side encoding. That's not the case right now - so since we
> support it, we'd better guard against its quirks.

I think that *is* the case right now.  pg_wchar.h sayeth:

/* followings are for client encoding only */
PG_SJIS,/* Shift JIS (Winindows-932) */
PG_BIG5,/* Big5 (Windows-950) */
PG_GBK, /* GBK (Windows-936) */

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Changeset Extraction v7.0 (was logical changeset generation)

2014-01-16 Thread Craig Ringer
On 01/16/2014 02:28 AM, Robert Haas wrote:
> - If you address /* FIXME: apply sanity checking to slot name */, then
> I think that also addresses /* XXX: do we want to use truncate
> identifier instead? */. In other words, let's just error out if the
> name is too long.  I'm not sure what other sanity checking is needed
> here; maybe just restrict it to 7-bit characters to avoid problems
> with encodings for individual databases varying.

It's a common misunderstanding that restricting to 7-bit solves encoding
issues.

Thanks to the joy that is SHIFT_JIS, we must also disallow the backslash
and tilde characters.

Anybody who actually uses SHIFT_JIS as an operational encoding, rather
than as an input/output encoding, is into pain and suffering. Personally
I'd be quite happy to see it supported as client_encoding, but forbidden
as a server-side encoding. That's not the case right now - so since we
support it, we'd better guard against its quirks.

Slot names can't be regular identifiers, because they might contain
chars not valid in another DB's encoding. So let's just restrict them to
[a-zA-Z0-9_ -] and be done with it.
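
For illustration, a validation routine along those lines could be as
small as the following sketch (the helper name and the use of
NAMEDATALEN as the length limit are assumptions, not the committed API):

    /*
     * Sketch: reject over-long slot names and allow only [a-zA-Z0-9_ -],
     * so the name is safe both as a filename and in any server-side
     * encoding.  Hypothetical helper, not the final API.
     */
    static void
    CheckSlotName(const char *name)
    {
        const char *cp;

        if (strlen(name) >= NAMEDATALEN)
            ereport(ERROR,
                    (errcode(ERRCODE_NAME_TOO_LONG),
                     errmsg("replication slot name \"%s\" is too long",
                            name)));

        for (cp = name; *cp; cp++)
        {
            if (!((*cp >= 'a' && *cp <= 'z') ||
                  (*cp >= 'A' && *cp <= 'Z') ||
                  (*cp >= '0' && *cp <= '9') ||
                  *cp == '_' || *cp == ' ' || *cp == '-'))
                ereport(ERROR,
                        (errcode(ERRCODE_INVALID_NAME),
                         errmsg("replication slot name \"%s\" contains an invalid character",
                                name)));
        }
    }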

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Changeset Extraction v7.0 (was logical changeset generation)

2014-01-16 Thread Andres Freund
Hi,

On 2014-01-16 09:34:51 -0500, Robert Haas wrote:
> >> - ReplicationSlotAcquire probably needs to ignore slots that are not 
> >> active.
> > Not sure what you mean? If the slot isn't in_use we'll skip it in the loop.
> 
> active != in_use.
> 
> I suppose your point is that the slot can't be active if it's not also
> in_use.

Yes. There's asserts to that end...

> Maybe it would be better to get rid of active/in_use and have
> three states: REPLSLOT_CONNECTED, REPLSLOT_NOT_CONNECTED,
> REPLSLOT_FREE.  Or something like that.

Hm. Color me unenthusiastic. If you feel strongly I can change it, but
otherwise not.

> >> - If there's a coding rule that slot->database can't be changed while
> >> the slot is active, then the check to make sure that the user isn't
> >> trying to bind to a slot with a mis-matching database could be done
> >> before the code described in the previous point, avoiding the need to
> >> go back and release the resource.
> >
> > I don't think slot->database should be allowed to change at all...
> 
> Well, it can if the slot is dropped and a new one created.

Well. That obviously requires the lwlock to be acquired...

> >> - I think the critical section in ReplicationSlotDrop is bogus.  If
> >> DeleteSlot() fails, we scarcely need to PANIC.  The slot just isn't
> >> gone.
> >
> > Well, if delete slot fails, we don't really know at which point it
> > failed which means that the on-disk state might not correspond to the
> > in-memory state. I don't see a point in adding code trying to handle
> > that case correctly...
> 
> Deleting the slot should be an atomic operation.  There's some
> critical point before which the slot will be picked up by recovery and
> after which it won't.  You either did that operation, or not, and can
> adjust the in-memory state accordingly.

I am not sure I understand that point. We can either update the
in-memory bit before performing the on-disk operations or
afterwards. Either way, there's a way to be inconsistent if the disk
operation fails somewhere in between (it might fail but still have
deleted the file/directory!). The normal way to handle that in other
places is PANICing when we don't know so we recover from the on-disk
state.
I really don't see the problem here? Code doesn't get more robust by
doing s/PANIC/ERROR/, rather the contrary. It takes extra smarts to only
ERROR, often that's not warranted.

> >> - A comment in KillSlot wonders whether locking is required.  I say
> >> yes.  It's safe to take lwlocks and spinlocks during shmem exit, and
> >> failing to do so seems like a recipe for subtle corner-case bugs.
> >
> > I agree that it's safe to use spinlocks, but lwlocks? What if we are
> > erroring out while holding an lwlock? Code that runs inside a
> > TransactionCommand is protected against that, but if not, ProcKill(),
> > which invokes LWLockReleaseAll(), runs pretty late in the teardown
> > process...
> 
> Hmm.  I guess it'd be fine to decide that a connected slot can be
> marked not-connected without the lock.

I now acquire the spinlock since that has to work, or we have much worse
problems... That guarantees that other backends see the value as well.
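
Something like this minimal sketch is all the teardown path should need
(struct and field names follow the patch; the details are assumptions):

    /*
     * Sketch: mark a slot inactive during shmem exit using only the
     * spinlock, which still works this late in teardown, unlike lwlocks.
     */
    static void
    KillSlot_sketch(ReplicationSlot *slot)
    {
        SpinLockAcquire(&slot->mutex);
        slot->active = false;
        SpinLockRelease(&slot->mutex);
    }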

> I think you'd want a rule that
> a slot can't be freed except when it's not-connected; otherwise, you
> might end up marking the slot not-connected after someone else had
> already recycled it for an unrelated purpose (drop slot, create new
> slot).

Yea, that rule is there. Otherwise we'd get in great trouble.

> >> - There seems to be no interface to acquire or release slots from
> >> either SQL or the replication protocol, nor any way for a client of
> >> this code to update its slot details.
> >
> > I don't think either ever needs to do that - START_REPLICATION SLOT slot
> > ...; and decoding_slot_*changes will acquire/release for them while
> > active. What would the use case be to allow them to be acquired from SQL?
> 
> My point isn't so much about SQL as that with just this patch I don't
> see any way for anyone to ever acquire a slot for anything, ever.  So
> I think there's a piece missing, or three.

The slot is acquired by code using the slot. So when START_REPLICATION
SLOT ... (in contrast to a START_REPLICATION without SLOT) is sent,
walsender.c does a ReplicationSlotAcquire(cmd->slotname) in
StartReplication() and releases it after it has finished.

> > The slot details get updated by the respective replication code. For
> > streaming rep, that should happen via reply and feedback
> > messages. For changeset extraction it happens when
> > LogicalConfirmReceivedLocation() is called; the walsender interface
> > does that using reply messages, the SQL interface calls it when
> > finished (unless you use the _peek_ functions).
> 
> Right, but where is this code?  I don't see this updating the reply
> and feedback message processing code to touch slots.  Did I miss that?

It's in "wal_decoding: logical changeset extraction walsender interface"
currently :(. Splitting the streaming

Re: [HACKERS] Changeset Extraction v7.0 (was logical changeset generation)

2014-01-16 Thread Robert Haas
On Wed, Jan 15, 2014 at 3:39 PM, Andres Freund  wrote:
> On 2014-01-15 13:28:25 -0500, Robert Haas wrote:
>> - I think you should just regard ReplicationSlotCtlLock as protecting
>> the "name" and "active" flags of every slot.  ReplicationSlotCreate()
>> would then not need to mess with the spinlocks at all, and
>> ReplicationSlotAcquire and ReplicationSlotDrop get a bit simpler too I
>> think.  Functions like ReplicationSlotsComputeRequiredXmin() can
>> acquire this lock in shared mode and only lock the slots that are
>> actually active, instead of all of them.
>
> I first thought you meant that we should get rid of the spinlock, but
> after rereading I think you just mean that ->name, ->active, ->in_use
> are only allowed to change while holding the lwlock exclusively so we
> don't need to spinlock in those cases? If so, yes, that works for me.

Yeah, that's about what I had in mind.

>> - If you address /* FIXME: apply sanity checking to slot name */, then
>> I think that also addresses /* XXX: do we want to use truncate
>> identifier instead? */. In other words, let's just error out if the
>> name is too long.  I'm not sure what other sanity checking is needed
>> here; maybe just restrict it to 7-bit characters to avoid problems
>> with encodings for individual databases varying.
>
> Yea, erroring out seems like a good idea. But I think we need to
> restrict slot names a bit more than that, given they are used as
> filenames... We could instead name the files using the slot's offset,
> but I'd prefer to not go that route.

OK.  Well, add some code, then.  :-)

>> - ReplicationSlotAcquire probably needs to ignore slots that are not active.
> Not sure what you mean? If the slot isn't in_use we'll skip it in the loop.

active != in_use.

I suppose your point is that the slot can't be active if it's not also
in_use.  Maybe it would be better to get rid of active/in_use and have
three states: REPLSLOT_CONNECTED, REPLSLOT_NOT_CONNECTED,
REPLSLOT_FREE.  Or something like that.

>> - If there's a coding rule that slot->database can't be changed while
>> the slot is active, then the check to make sure that the user isn't
>> trying to bind to a slot with a mis-matching database could be done
>> before the code described in the previous point, avoiding the need to
>> go back and release the resource.
>
> I don't think slot->database should be allowed to change at all...

Well, it can if the slot is dropped and a new one created.

>> - I think the critical section in ReplicationSlotDrop is bogus.  If
>> DeleteSlot() fails, we scarcely need to PANIC.  The slot just isn't
>> gone.
>
> Well, if delete slot fails, we don't really know at which point it
> failed which means that the on-disk state might not correspond to the
> in-memory state. I don't see a point in adding code trying to handle
> that case correctly...

Deleting the slot should be an atomic operation.  There's some
critical point before which the slot will be picked up by recovery and
after which it won't.  You either did that operation, or not, and can
adjust the in-memory state accordingly.

>> - cancel_before_shmem_exit is only guaranteed to remove the
>> most-recently-added callback.
>
> Yea :(. I think that's safe for the current usages but seems mighty
> fragile... Not sure what to do about it. Just register KillSlot once and
> keep it registered?

Yep.  Use a module-private flag to decide whether it needs to do anything.
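
A sketch of that register-once approach (the flag, and MyReplicationSlot
as the module-private state, are illustrative assumptions):

    /* registered exactly once; a no-op whenever no slot is held */
    static bool KillSlot_registered = false;

    static void
    ReplicationSlotAcquire_sketch(const char *name)
    {
        if (!KillSlot_registered)
        {
            before_shmem_exit(KillSlot, (Datum) 0);
            KillSlot_registered = true;
        }
        /* ... look up the slot and mark it active ... */
    }

    static void
    KillSlot(int code, Datum arg)
    {
        /* module-private state decides whether there is anything to do */
        if (MyReplicationSlot != NULL)
            ReplicationSlotRelease();
    }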

>> - A comment in KillSlot wonders whether locking is required.  I say
>> yes.  It's safe to take lwlocks and spinlocks during shmem exit, and
>> failing to do so seems like a recipe for subtle corner-case bugs.
>
> I agree that it's safe to use spinlocks, but lwlocks? What if we are
> erroring out while holding an lwlock? Code that runs inside a
> TransactionCommand is protected against that, but if not, ProcKill(),
> which invokes LWLockReleaseAll(), runs pretty late in the teardown
> process...

Hmm.  I guess it'd be fine to decide that a connected slot can be
marked not-connected without the lock.  I think you'd want a rule that
a slot can't be freed except when it's not-connected; otherwise, you
might end up marking the slot not-connected after someone else had
already recycled it for an unrelated purpose (drop slot, create new
slot).

>> - There seems to be no interface to acquire or release slots from
>> either SQL or the replication protocol, nor any way for a client of
>> this code to update its slot details.
>
> I don't think either ever needs to do that - START_REPLICATION SLOT slot
> ...; and decoding_slot_*changes will acquire/release for them while
> active. What would the use case be to allow them to be acquired from SQL?

My point isn't so much about SQL as that with just this patch I don't
see any way for anyone to ever acquire a slot for anything, ever.  So
I think there's a piece missing, or three.

> The slot details get updated by the respective replication code. For
> streaming rep, that should happen via reply and feedback
> messages. For changeset extraction it happens when
> LogicalConfirmReceivedLocation() is called; the walsender interface
> does that using reply messages, the SQL interface calls it when
> finished (unless you use the _peek_ functions).

Right, but where is this code?  I don't see this updating the reply
and feedback message processing code to touch slots.  Did I miss that?

Re: [HACKERS] Changeset Extraction v7.0 (was logical changeset generation)

2014-01-15 Thread Andres Freund
Hi,

On 2014-01-15 13:28:25 -0500, Robert Haas wrote:
> - I think you should just regard ReplicationSlotCtlLock as protecting
> the "name" and "active" flags of every slot.  ReplicationSlotCreate()
> would then not need to mess with the spinlocks at all, and
> ReplicationSlotAcquire and ReplicationSlotDrop get a bit simpler too I
> think.  Functions like ReplicationSlotsComputeRequiredXmin() can
> acquire this lock in shared mode and only lock the slots that are
> actually active, instead of all of them.

I first thought you meant that we should get rid of the spinlock, but
after rereading I think you just mean that ->name, ->active, ->in_use
are only allowed to change while holding the lwlock exclusively so we
don't need to spinlock in those cases? If so, yes, that works for me.

> - If you address /* FIXME: apply sanity checking to slot name */, then
> I think that also addresses /* XXX: do we want to use truncate
> identifier instead? */. In other words, let's just error out if the
> name is too long.  I'm not sure what other sanity checking is needed
> here; maybe just restrict it to 7-bit characters to avoid problems
> with encodings for individual databases varying.

Yea, erroring out seems like a good idea. But I think we need to
restrict slot names a bit more than that, given they are used as
filenames... We could instead name the files using the slot's offset,
but I'd prefer to not go that route.

> - ReplicationSlotAcquire probably needs to ignore slots that are not active.

Not sure what you mean? If the slot isn't in_use we'll skip it in the loop.

> - If there's a coding rule that slot->database can't be changed while
> the slot is active, then the check to make sure that the user isn't
> trying to bind to a slot with a mis-matching database could be done
> before the code described in the previous point, avoiding the need to
> go back and release the resource.

I don't think slot->database should be allowed to change at all...

> - I think the critical section in ReplicationSlotDrop is bogus.  If
> DeleteSlot() fails, we scarcely need to PANIC.  The slot just isn't
> gone.

Well, if delete slot fails, we don't really know at which point it
failed which means that the on-disk state might not correspond to the
in-memory state. I don't see a point in adding code trying to handle
that case correctly...

> - cancel_before_shmem_exit is only guaranteed to remove the
> most-recently-added callback.

Yea :(. I think that's safe for the current usages but seems mighty
fragile... Not sure what to do about it. Just register KillSlot once and
keep it registered?

> - Why does ReplicationSlotsComputeRequiredXmin() need to hold
> ProcArrayLock at all?

There's reasoning, but I just noticed that its basis might be flawed
anyway :(.
When starting changeset extraction in a new slot we need to guarantee
that we only start decoding records we know the catalog tuples haven't
been removed for.
So, when creating the slot I've so far done a GetOldestXmin() and used
that to check against xl_running_xact->oldestRunningXid. But
GetOldestXmin() can go backwards...

I'll think a bit and try to simplify this.

> - ReplicationSlotsComputeRequiredXmin scarcely needs to hold the
> spinlock while it does all of those gyrations.  It can just acquire
> the spinlock, copy the three fields needed into local variables, and
> release the spinlock.  The rest can be worked out afterwards.
> Similarly in ReplicationSlotsComputeRequiredXmin.

Yea, will change.

> - A comment in KillSlot wonders whether locking is required.  I say
> yes.  It's safe to take lwlocks and spinlocks during shmem exit, and
> failing to do so seems like a recipe for subtle corner-case bugs.

I agree that it's safe to use spinlocks, but lwlocks? What if we are
erroring out while holding an lwlock? Code that runs inside a
TransactionCommand is protected against that, but if not, ProcKill(),
which invokes LWLockReleaseAll(), runs pretty late in the teardown
process...

> - pg_get_replication_slots() wonders what permissions we require.  I
> don't know that any special permissions are needed here; the data
> we're exposing doesn't appear to be sensitive.  Unless I'm missing
> something?

I don't see a problem either, but it seems others have -
pg_stat_replication only displays minor amounts of information if one
doesn't have the replication privilege... Not sure what the reasoning
there is, and whether it applies here as well.

> - There seems to be no interface to acquire or release slots from
> either SQL or the replication protocol, nor any way for a client of
> this code to update its slot details.

I don't think either ever needs to do that - START_REPLICATION SLOT slot
...; and decoding_slot_*changes will acquire/release for them while
active. What would the use case be to allow them to be acquired from SQL?

The slot details get updated by the respective replication code. For
streaming rep, that should happen via reply and feedback messages. For
changeset extraction it happens when LogicalConfirmReceivedLocation()
is called; the walsender interface does that using reply messages, the
SQL interface calls it when finished (unless you use the _peek_
functions).

Re: [HACKERS] Changeset Extraction v7.0 (was logical changeset generation)

2014-01-15 Thread Robert Haas
Review of patch 0002:

- I think you should just regard ReplicationSlotCtlLock as protecting
the "name" and "active" flags of every slot.  ReplicationSlotCreate()
would then not need to mess with the spinlocks at all, and
ReplicationSlotAcquire and ReplicationSlotDrop get a bit simpler too I
think.  Functions like ReplicationSlotsComputeRequiredXmin() can
acquire this lock in shared mode and only lock the slots that are
actually active, instead of all of them.

- If you address /* FIXME: apply sanity checking to slot name */, then
I think that also addresses /* XXX: do we want to use truncate
identifier instead? */. In other words, let's just error out if the
name is too long.  I'm not sure what other sanity checking is needed
here; maybe just restrict it to 7-bit characters to avoid problems
with encodings for individual databases varying.

- ReplicationSlotAcquire probably needs to ignore slots that are not active.

- ReplicationSlotAcquire should be tweaked so that the code that holds
the spinlock is more self-contained.  If you adopt the above-proposed
recasting of ReplicationSlotCtlLock, then the part that holds the
spinlock can probably look like this: SpinLockAcquire(&slot->mutex);
was_active = slot->active; slot->active = true;
SpinLockRelease(&slot->mutex), which looks quite a bit safer (see the
sketch after this list).

- If there's a coding rule that slot->database can't be changed while
the slot is active, then the check to make sure that the user isn't
trying to bind to a slot with a mis-matching database could be done
before the code described in the previous point, avoiding the need to
go back and release the resource.

- I think the critical section in ReplicationSlotDrop is bogus.  If
DeleteSlot() fails, we scarcely need to PANIC.  The slot just isn't
gone.

- cancel_before_shmem_exit is only guaranteed to remove the
most-recently-added callback.

- Why does ReplicationSlotsComputeRequiredXmin() need to hold
ProcArrayLock at all?

- ReplicationSlotsComputeRequiredXmin scarcely needs to hold the
spinlock while it does all of those gyrations.  It can just acquire
the spinlock, copy the three fields needed into local variables, and
release the spinlock.  The rest can be worked out afterwards.
Similarly in ReplicationSlotsComputeRequiredXmin.

- A comment in KillSlot wonders whether locking is required.  I say
yes.  It's safe to take lwlocks and spinlocks during shmem exit, and
failing to do so seems like a recipe for subtle corner-case bugs.

- pg_get_replication_slots() wonders what permissions we require.  I
don't know that any special permissions are needed here; the data
we're exposing doesn't appear to be sensitive.  Unless I'm missing
something?

- PG_STAT_GET_LOGICAL_DECODING_SLOTS_COLS has a leftover "logical" in its name.

- There seems to be no interface to acquire or release slots from
either SQL or the replication protocol, nor any way for a client of
this code to update its slot details.  The value of
catalog_xmin/data_xmin vs. effective_catalog_xmin/effective_data_xmin
is left to the imagination.
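
For the ReplicationSlotAcquire item above, the self-contained spinlock
section might look like this sketch (the slot lookup, the
ReplicationSlotCtlLock handling, and the error code are elided or
assumed):

    bool        was_active;

    /* claim the slot; the spinlock alone protects these two fields */
    SpinLockAcquire(&slot->mutex);
    was_active = slot->active;
    slot->active = true;
    SpinLockRelease(&slot->mutex);

    if (was_active)
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_IN_USE),
                 errmsg("replication slot \"%s\" is already active",
                        name)));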

...Robert




Re: [HACKERS] Changeset Extraction v7.0 (was logical changeset generation)

2014-01-15 Thread Robert Haas
This 0001 patch, to log running transactions more frequently, has been
pending for a long time now, and I haven't heard any objections, so
I've gone ahead and committed that part.

...Robert

