Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-07-07 Thread Alvaro Herrera
On 2020-Jun-24, Andres Freund wrote:

> As I said before, I've utilized being able to do both over a single
> connection (among others to initialize a logical replica using a base
> backup). And I've seen at least one other codebase (developed without my
> input) doing so. I really don't understand how you just dismiss this
> without any sort of actual argument.  Yes, those uses can be fixed to
> reconnect with a different replication parameter, but that's code that
> needs to be adjusted and it requires adjustments to pg_hba.conf etc.
> 
> And obviously you'd lock out older versions of jdbc, and possibly other
> drivers.

Well, I had understood that you were talking from a hypothetical
position, not that you were already using the thing that way.  After
these arguments, I agree to leave things alone, and nobody else seems to
be arguing in that direction, so I'll mark the open item as closed.

Thanks,

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-06-25 Thread Robert Haas
On Wed, Jun 24, 2020 at 3:41 PM Alvaro Herrera  wrote:
> People (specifically the jdbc driver) *are* using this feature in this
> way, but they didn't realize they were doing it.  It was an accident and
> they didn't notice.

But you don't know that that's true of everyone using this feature,
and even if it were, so what? Breaking a feature that someone didn't
know they were using is just as much of a break as breaking a feature
someone DID know they were using.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-06-24 Thread Andres Freund
Hi,

On 2020-06-24 15:41:14 -0400, Alvaro Herrera wrote:
> On 2020-Jun-24, Robert Haas wrote:
> 
> > So really I think this turns on #1: is it plausible
> > that people are using this feature, however inadvertent it may be, and
> > is it potentially useful? I don't see that anybody's made an argument
> > against either of those things. Unless someone can do so, I think we
> > shouldn't disable this.
> 
> People (specifically the jdbc driver) *are* using this feature in this
> way, but they didn't realize they were doing it.  It was an accident and
> they didn't notice.

As I said before, I've utilized being able to do both over a single
connection (among others to initialize a logical replica using a base
backup). And I've seen at least one other codebase (developed without my
input) doing so. I really don't understand how you just dismiss this
without any sort of actual argument.  Yes, those uses can be fixed to
reconnect with a different replication parameter, but that's code that
needs to be adjusted and it requires adjustments to pg_hba.conf etc.

And obviously you'd lock out older versions of jdbc, and possibly other
drivers.

Obviously we should allow more granular permissions here, I don't think
anybody is arguing against that.

Greetings,

Andres Freund




Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-06-24 Thread Dave Cramer
On Wed, 24 Jun 2020 at 15:41, Alvaro Herrera 
wrote:

> On 2020-Jun-24, Robert Haas wrote:
>
> > So really I think this turns on #1: is it plausible
> > that people are using this feature, however inadvertent it may be, and
> > is it potentially useful? I don't see that anybody's made an argument
> > against either of those things. Unless someone can do so, I think we
> > shouldn't disable this.
>
> People (specifically the jdbc driver) *are* using this feature in this
> way, but they didn't realize they were doing it.  It was an accident and
> they didn't notice.
>
>
Not sure we are using it as much as we accidently did it that way. It would
be trivial to fix.

That said I think we should fix the security hole this opens and leave the
functionality.

Dave


Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-06-24 Thread Alvaro Herrera
On 2020-Jun-24, Robert Haas wrote:

> So really I think this turns on #1: is it plausible
> that people are using this feature, however inadvertent it may be, and
> is it potentially useful? I don't see that anybody's made an argument
> against either of those things. Unless someone can do so, I think we
> shouldn't disable this.

People (specifically the jdbc driver) *are* using this feature in this
way, but they didn't realize they were doing it.  It was an accident and
they didn't notice.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-06-24 Thread Robert Haas
On Wed, Jun 24, 2020 at 1:06 PM Alvaro Herrera  wrote:
> On 2020-Jun-24, Stephen Frost wrote:
> > Doesn't mean it makes sense or that we should be supporting that.  What
> > we should have is a way to allow administrators to configure a system
> > for exactly what they want to allow, and it doesn't seem like we're
> > doing that today and therefore we should fix it.  This isn't the only
> > area we have that issue in.
>
> The way to do that, for the case under discussion, is to reject using a
> logical replication connection for physical replication commands.

Reading over this discussion, I see basically three arguments:

1. Andres argues that being able to execute physical replication
commands from the same connection as SQL queries is useful, and that
people may be relying on it, and that we shouldn't break it without
need.

2. Fujii Masao argues that the current situation makes it impossible
to write a pg_hba.conf rule that disallows all physical replication
connections, because people could get around it by using a logical
replication connection for physical replication.

3. Various people argue that it's only accidental that physical
replication on a replication=database connection ever worked at all,
and therefore we ought to block it.

I find argument #1 most convincing, #2 less convincing, and #3 least
convincing. In my view, the problem with argument #3 is that just
because some feature combination was unintentional doesn't mean it's
unuseful or unused. As for #2, suppose someone were to propose a
design for logical replication that allowed it to take place without a
database connection, so that it could be done with just a regular
replication connection. Such a feature would create the same problem
Fujii Masao mentions here, but it seems inconceivable that we would
for that reason reject it; we make decisions about features based on
their usefulness, not their knock-on effects on pg_hba.conf rules. We
can always add new kinds of access control restrictions if they are
needed; that is a better approach than removing features so that the
existing pg_hba.conf facilities can be used to accomplish some
particular goal. So really I think this turns on #1: is it plausible
that people are using this feature, however inadvertent it may be, and
is it potentially useful? I don't see that anybody's made an argument
against either of those things. Unless someone can do so, I think we
shouldn't disable this.

That having been said, I think that the fact that you can execute SQL
queries in replication=database connections is horrifying. I really
hate that feature. I think it's a bad design, and a bad
implementation, and a recipe for tons of bugs. But, blocking physical
replication commands on such connections isn't going to solve any of
that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-06-24 Thread Alvaro Herrera
On 2020-Jun-24, Stephen Frost wrote:

> Doesn't mean it makes sense or that we should be supporting that.  What
> we should have is a way to allow administrators to configure a system
> for exactly what they want to allow, and it doesn't seem like we're
> doing that today and therefore we should fix it.  This isn't the only
> area we have that issue in.

The way to do that, for the case under discussion, is to reject using a
logical replication connection for physical replication commands.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-06-24 Thread Stephen Frost
Greetings,

* Alvaro Herrera (alvhe...@2ndquadrant.com) wrote:
> On 2020-Jun-24, Kyotaro Horiguchi wrote:
> 
> > In logical replication, a replication role is intended to be
> > accessible only to the GRANTed databases.  On the other hand the same
> > role can create a dead copy of the whole cluster, including
> > non-granted databases.
> 
> In other words -- essentially, if you grant replication access to a role
> only to a specific database, they can steal the whole cluster.
> 
> I don't see what's so great about that, but apparently people like it.

Sure, people who aren't in charge of security I'm sure like the ease of
use.

Doesn't mean it makes sense or that we should be supporting that.  What
we should have is a way to allow administrators to configure a system
for exactly what they want to allow, and it doesn't seem like we're
doing that today and therefore we should fix it.  This isn't the only
area we have that issue in.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-06-24 Thread Alvaro Herrera
On 2020-Jun-24, Kyotaro Horiguchi wrote:

> In logical replication, a replication role is intended to be
> accessible only to the GRANTed databases.  On the other hand the same
> role can create a dead copy of the whole cluster, including
> non-granted databases.

In other words -- essentially, if you grant replication access to a role
only to a specific database, they can steal the whole cluster.

I don't see what's so great about that, but apparently people like it.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-06-24 Thread Fujii Masao




On 2020/06/24 11:56, Kyotaro Horiguchi wrote:

At Tue, 23 Jun 2020 10:51:40 +0900, Michael Paquier  wrote 
in

On Sun, Jun 21, 2020 at 01:02:34PM -0700, Andres Freund wrote:

I still maintain that adding restrictions here is a bad idea. Even
disregarding the discussion of running normal queries interspersed, it's
useful to be able to both request WAL and receive logical changes over
the same connection. E.g. for creating a logical replica by first doing
a physical base backup (vastly faster), or fetching WAL for decoding
large transactions onto a standby.

And I just don't see any reasons to disallow it. There's basically no
reduction in complexity by doing so.


Yeah, I still stand by the same opinion here to do nothing.  I suspect
that we have good chances to annoy people and some cases we are
overlooking here, that used to work.


In logical replication, a replication role is intended to be
accessible only to the GRANTed databases.  On the other hand the same
role can create a dead copy of the whole cluster, including
non-granted databases.  It seems like a sieve missing a mesh screen.


Personally I'd like to disallow physical replication commands
when I explicitly reject physical replication connection
(i.e., set "host replication user x.x.x.x/x reject") in pg_hba.conf,
whether on physical or logical replication connection.



I agree that that doesn't harm as far as roles are strictly managed so
I don't insist so strongly on inhibiting the behavior. However, the
documentation at least needs amendment.


+1

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-06-23 Thread Kyotaro Horiguchi
At Tue, 23 Jun 2020 10:51:40 +0900, Michael Paquier  wrote 
in 
> On Sun, Jun 21, 2020 at 01:02:34PM -0700, Andres Freund wrote:
> > I still maintain that adding restrictions here is a bad idea. Even
> > disregarding the discussion of running normal queries interspersed, it's
> > useful to be able to both request WAL and receive logical changes over
> > the same connection. E.g. for creating a logical replica by first doing
> > a physical base backup (vastly faster), or fetching WAL for decoding
> > large transactions onto a standby.
> > 
> > And I just don't see any reasons to disallow it. There's basically no
> > reduction in complexity by doing so.
> 
> Yeah, I still stand by the same opinion here to do nothing.  I suspect
> that we have good chances to annoy people and some cases we are
> overlooking here, that used to work.

In logical replication, a replication role is intended to be
accessible only to the GRANTed databases.  On the other hand the same
role can create a dead copy of the whole cluster, including
non-granted databases.  It seems like a sieve missing a mesh screen.

I agree that that doesn't harm as far as roles are strictly managed so
I don't insist so strongly on inhibiting the behavior. However, the
documentation at least needs amendment.

https://www.postgresql.org/docs/13/protocol-replication.html


To initiate streaming replication, the frontend sends the replication
parameter in the startup message. A Boolean value of true (or on, yes,
1) tells the backend to go into physical replication walsender mode,
wherein a small set of replication commands, shown below, can be
issued instead of SQL statements.

Passing database as the value for the replication parameter instructs
the backend to go into logical replication walsender mode, connecting
to the database specified in the dbname parameter. In logical
replication walsender mode, the replication commands shown below as
well as normal SQL commands can be issued.


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-06-22 Thread Michael Paquier
On Sun, Jun 21, 2020 at 01:02:34PM -0700, Andres Freund wrote:
> I still maintain that adding restrictions here is a bad idea. Even
> disregarding the discussion of running normal queries interspersed, it's
> useful to be able to both request WAL and receive logical changes over
> the same connection. E.g. for creating a logical replica by first doing
> a physical base backup (vastly faster), or fetching WAL for decoding
> large transactions onto a standby.
> 
> And I just don't see any reasons to disallow it. There's basically no
> reduction in complexity by doing so.

Yeah, I still stand by the same opinion here to do nothing.  I suspect
that we have good chances to annoy people and some cases we are
overlooking here, that used to work.
--
Michael


signature.asc
Description: PGP signature


Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-06-21 Thread Andres Freund
Hi,

On 2020-06-21 13:45:36 -0400, Jonathan S. Katz wrote:
> The PG13 RMT had a discussion about this thread, and while the initial
> crash has been fixed, we decided to re-open the Open Item around whether
> we should allow physical replication to be initiated in a logical
> replication session.

Since this is a long-time issue, this doesn't quite seem like an issue
for the RMT?


> We anticipate a resolution for PG13, whether it is explicitly
> disallowing physical replication from occurring on a logical replication
> slot, maintaining the status quo, or something else such that there is
> consensus on the approach.

s/logical replication slot/logical replication connection/?


I still maintain that adding restrictions here is a bad idea. Even
disregarding the discussion of running normal queries interspersed, it's
useful to be able to both request WAL and receive logical changes over
the same connection. E.g. for creating a logical replica by first doing
a physical base backup (vastly faster), or fetching WAL for decoding
large transactions onto a standby.

And I just don't see any reasons to disallow it. There's basically no
reduction in complexity by doing so.

Greetings,

Andres Freund




Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-06-21 Thread Jonathan S. Katz
Hi,

On 6/5/20 11:51 AM, Alvaro Herrera wrote:
> On 2020-Jun-05, Dave Cramer wrote:
> 
>> On Thu, 4 Jun 2020 at 19:46, Alvaro Herrera 
>> wrote:
> 
>>> Ouch ... so they made IDENT in the replication grammar be a trigger to
>>> enter the regular grammar.  Crazy.  No way to put those worms back in
>>> the tin now, I guess.
>>
>> Is that documented ?
> 
> I don't think it is.
> 
>>> It is still my opinion that we should prohibit a logical replication
>>> connection from being used to do physical replication.  Horiguchi-san,
>>> Sawada-san and Masao-san are all of the same opinion.  Dave Cramer (of
>>> the JDBC team) is not opposed to the change -- he says they're just
>>> using it because they didn't realize they should be doing differently.
>>
>> I think my exact words were
>>
>> "I don't see this is a valid reason to keep doing something. If it is
>> broken then fix it.
>> Clients can deal with the change."
>>
>> in response to:
>>
>>> Well, I don't really think that we can just break a behavior that
>>> exists since 9.4 as you could break applications relying on the
>>> existing behavior, and that's also the point of Vladimir upthread.
>>
>> Which is different than not being opposed to the change. I don't see this
>> as broken, and it's quite possible that some of our users are using
>> it.
> 
> Apologies for misinterpreting.
> 
>> It certainly needs to be documented
> 
> I'd rather not.

The PG13 RMT had a discussion about this thread, and while the initial
crash has been fixed, we decided to re-open the Open Item around whether
we should allow physical replication to be initiated in a logical
replication session.

We anticipate a resolution for PG13, whether it is explicitly
disallowing physical replication from occurring on a logical replication
slot, maintaining the status quo, or something else such that there is
consensus on the approach.

Thanks,

Jonathan



signature.asc
Description: OpenPGP digital signature


Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-06-07 Thread Michael Paquier
On Thu, Jun 04, 2020 at 11:07:29AM +0900, Michael Paquier wrote:
> Are there any objections in fixing the issue first then?  As far as I
> can see there is no objection to this part, like here:
> https://www.postgresql.org/message-id/20200603214448.GA901@alvherre.pgsql

Hearing nothing, I have applied this part and fixed the crash to take
care of the open item.
--
Michael


signature.asc
Description: PGP signature


Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-06-05 Thread Alvaro Herrera
On 2020-Jun-05, Dave Cramer wrote:

> On Thu, 4 Jun 2020 at 19:46, Alvaro Herrera 
> wrote:

> > Ouch ... so they made IDENT in the replication grammar be a trigger to
> > enter the regular grammar.  Crazy.  No way to put those worms back in
> > the tin now, I guess.
> 
> Is that documented ?

I don't think it is.

> > It is still my opinion that we should prohibit a logical replication
> > connection from being used to do physical replication.  Horiguchi-san,
> > Sawada-san and Masao-san are all of the same opinion.  Dave Cramer (of
> > the JDBC team) is not opposed to the change -- he says they're just
> > using it because they didn't realize they should be doing differently.
> 
> I think my exact words were
> 
> "I don't see this is a valid reason to keep doing something. If it is
> broken then fix it.
> Clients can deal with the change."
> 
> in response to:
> 
> > Well, I don't really think that we can just break a behavior that
> > exists since 9.4 as you could break applications relying on the
> > existing behavior, and that's also the point of Vladimir upthread.
> 
> Which is different than not being opposed to the change. I don't see this
> as broken, and it's quite possible that some of our users are using
> it.

Apologies for misinterpreting.

> It certainly needs to be documented

I'd rather not.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-06-05 Thread Dave Cramer
On Thu, 4 Jun 2020 at 19:46, Alvaro Herrera 
wrote:

> On 2020-Jun-04, Andres Freund wrote:
>
> > postgres[52656][1]=# SELECT 1;
> > ┌──┐
> > │ ?column? │
> > ├──┤
> > │1 │
> > └──┘
> > (1 row)
> >
> >
> > I am very much not in love with the way that was implemented, but it's
> > there, and it's used as far as I know (cf tablesync.c).
>
> Ouch ... so they made IDENT in the replication grammar be a trigger to
> enter the regular grammar.  Crazy.  No way to put those worms back in
> the tin now, I guess.
>

Is that documented ?

>
> It is still my opinion that we should prohibit a logical replication
> connection from being used to do physical replication.  Horiguchi-san,
> Sawada-san and Masao-san are all of the same opinion.  Dave Cramer (of
> the JDBC team) is not opposed to the change -- he says they're just
> using it because they didn't realize they should be doing differently.


I think my exact words were

"I don't see this is a valid reason to keep doing something. If it is
broken then fix it.
Clients can deal with the change."

in response to:

Well, I don't really think that we can just break a behavior that
> exists since 9.4 as you could break applications relying on the
> existing behavior, and that's also the point of Vladimir upthread.
>

Which is different than not being opposed to the change. I don't see this
as broken,
and it's quite possible that some of our users are using it. It certainly
needs to be documented

Dave


Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-06-04 Thread Alvaro Herrera
On 2020-Jun-04, Andres Freund wrote:

> postgres[52656][1]=# SELECT 1;
> ┌──┐
> │ ?column? │
> ├──┤
> │1 │
> └──┘
> (1 row)
> 
> 
> I am very much not in love with the way that was implemented, but it's
> there, and it's used as far as I know (cf tablesync.c).

Ouch ... so they made IDENT in the replication grammar be a trigger to
enter the regular grammar.  Crazy.  No way to put those worms back in
the tin now, I guess.

It is still my opinion that we should prohibit a logical replication
connection from being used to do physical replication.  Horiguchi-san,
Sawada-san and Masao-san are all of the same opinion.  Dave Cramer (of
the JDBC team) is not opposed to the change -- he says they're just
using it because they didn't realize they should be doing differently.

Both Michael P. and you are saying we shouldn't break it because it
works today, but there isn't a real use-case for it.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-06-04 Thread Andres Freund
Hi,

On 2020-06-04 16:44:53 -0400, Alvaro Herrera wrote:
> A logical replication connection cannot run SQL anyway, can it?

You can:

andres@awork3:~/src/postgresql$ psql 'replication=database'

postgres[52656][1]=# IDENTIFY_SYSTEM;
┌─┬──┬┬──┐
│  systemid   │ timeline │  xlogpos   │  dbname  │
├─┼──┼┼──┤
│ 6821634567571961151 │1 │ 1/D256EC40 │ postgres │
└─┴──┴┴──┘
(1 row)

postgres[52656][1]=# SELECT 1;
┌──┐
│ ?column? │
├──┤
│1 │
└──┘
(1 row)


I am very much not in love with the way that was implemented, but it's
there, and it's used as far as I know (cf tablesync.c).

Greetings,

Andres Freund




Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-06-04 Thread Alvaro Herrera
On 2020-Jun-04, Michael Paquier wrote:

> On Wed, Jun 03, 2020 at 06:33:11PM -0700, Andres Freund wrote:

> >> I don't think having a physical replication connection access catalog
> >> data directly is a great idea.  We already have gadgets like
> >> IDENTIFY_SYSTEM for physical replication that can do that, and if you
> >> need particular settings you can use SHOW (commit d1ecd539477).  If
> >> there was a strong need for even more than that, we can add something to
> >> the grammar.
> > 
> > Those special case things are a bad idea, and we shouldn't introduce
> > more. It's unrealistic that we can ever make that support everything,
> > and since we already have to support the database connected thing, I
> > don't see the point.
> 
> Let's continue discussing this part as well.

A logical replication connection cannot run SQL anyway, can it?  it's
limited to the replication grammar.  So it's not like you can run
arbitrary queries to access catalog data.  So even if we do need to
access the catalogs, we'd have to add stuff to the replication grammar
in order to support that.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-06-03 Thread Alvaro Herrera
On 2020-Jun-03, Andres Freund wrote:

> On 2020-06-03 18:27:12 -0400, Alvaro Herrera wrote:
> > On 2020-Jun-03, Andres Freund wrote:
> > > I don't think we should prohibit this. For one, it'd probably break some
> > > clients, without a meaningful need.
> > 
> > There *is* a need, namely to keep complexity down.  This is quite
> > convoluted, it's got a lot of historical baggage because of the way it
> > was implemented, and it's very difficult to understand.  The greatest
> > motive I see is to make this easier to understand, so that it is easier
> > to modify and improve in the future.
> 
> That seems like a possibly convincing argument for not introducing the
> capability, but doesn't seem strong enough to remove it.

This "capability" has never been introduced.  The fact that it's there
is just an accident.  In fact, it's not a capability, since the feature
(physical replication) is invoked differently -- namely, using a
physical replication connection.  JDBC uses a logical replication
connection for it only because they never realized that they were
supposed to do differently, because we failed to throw the correct
error message in the first place.

> > I don't think having a physical replication connection access catalog
> > data directly is a great idea.  We already have gadgets like
> > IDENTIFY_SYSTEM for physical replication that can do that, and if you
> > need particular settings you can use SHOW (commit d1ecd539477).  If
> > there was a strong need for even more than that, we can add something to
> > the grammar.
> 
> Those special case things are a bad idea, and we shouldn't introduce
> more.

What special case things?  The replication connection has never been
supposed to run SQL.  That's why we have SHOW in the replication
grammar.

> It's unrealistic that we can ever make that support everything,
> and since we already have to support the database connected thing, I
> don't see the point.

A logical replication connection is not supposed to be used for physical
replication.  That's just going to make more bugs appear.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-06-03 Thread Michael Paquier
On Wed, Jun 03, 2020 at 06:33:11PM -0700, Andres Freund wrote:
> On 2020-06-03 18:27:12 -0400, Alvaro Herrera wrote:
>> There *is* a need, namely to keep complexity down.  This is quite
>> convoluted, it's got a lot of historical baggage because of the way it
>> was implemented, and it's very difficult to understand.  The greatest
>> motive I see is to make this easier to understand, so that it is easier
>> to modify and improve in the future.
> 
> That seems like a possibly convincing argument for not introducing the
> capability, but doesn't seem strong enough to remove it. Especially not
> if it was just broken as part of effectively a refactoring, as far as I
> understand?

Are there any objections in fixing the issue first then?  As far as I
can see there is no objection to this part, like here:
https://www.postgresql.org/message-id/20200603214448.GA901@alvherre.pgsql

>> I don't think having a physical replication connection access catalog
>> data directly is a great idea.  We already have gadgets like
>> IDENTIFY_SYSTEM for physical replication that can do that, and if you
>> need particular settings you can use SHOW (commit d1ecd539477).  If
>> there was a strong need for even more than that, we can add something to
>> the grammar.
> 
> Those special case things are a bad idea, and we shouldn't introduce
> more. It's unrealistic that we can ever make that support everything,
> and since we already have to support the database connected thing, I
> don't see the point.

Let's continue discussing this part as well.
--
Michael


signature.asc
Description: PGP signature


Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-06-03 Thread Andres Freund
Hi,

On 2020-06-03 18:27:12 -0400, Alvaro Herrera wrote:
> On 2020-Jun-03, Andres Freund wrote:
> > I don't think we should prohibit this. For one, it'd probably break some
> > clients, without a meaningful need.
> 
> There *is* a need, namely to keep complexity down.  This is quite
> convoluted, it's got a lot of historical baggage because of the way it
> was implemented, and it's very difficult to understand.  The greatest
> motive I see is to make this easier to understand, so that it is easier
> to modify and improve in the future.

That seems like a possibly convincing argument for not introducing the
capability, but doesn't seem strong enough to remove it. Especially not
if it was just broken as part of effectively a refactoring, as far as I
understand?


> > But I think it's also actually quite useful to be able to access
> > catalogs before streaming data. You e.g. can look up configuration of
> > the primary before streaming WAL. With a second connection that's
> > actually harder to do reliably in some cases, because you need to be
> > sure that you actually reached the right server (consider a pooler,
> > automatic failover etc).
> 
> I don't think having a physical replication connection access catalog
> data directly is a great idea.  We already have gadgets like
> IDENTIFY_SYSTEM for physical replication that can do that, and if you
> need particular settings you can use SHOW (commit d1ecd539477).  If
> there was a strong need for even more than that, we can add something to
> the grammar.

Those special case things are a bad idea, and we shouldn't introduce
more. It's unrealistic that we can ever make that support everything,
and since we already have to support the database connected thing, I
don't see the point.

Greetings,

Andres Freund




Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-06-03 Thread Alvaro Herrera
On 2020-Jun-02, Michael Paquier wrote:

> I can note as well that StartLogicalReplication() moves in this sense
> by setting xlogreader to be the one from logical_decoding_ctx once the
> decoding context has been created.
> 
> This results in the attached.  The extra test from upthread to check
> that logical decoding is not allowed in a non-database WAL sender is a
> good idea, so I have kept it.

I don't particularly disagree with your proposed patch -- in fact, it
seems to make things cleaner.  It is a little wasteful, but I don't
really mind that.  It's just some memory, and it's not a significant
amount.

That said, I would *also* apply Kyotaro's proposed patch to prohibit a
physical standby running with a logical slot, if only because that
reduces the number of combinations that we need to test and keep our
collective heads straight about.  Just reject the weird case and have
one type of slot for each type of replication.  I didn't even think this
was at all possible.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-06-03 Thread Alvaro Herrera
On 2020-Jun-03, Andres Freund wrote:

> I don't think we should prohibit this. For one, it'd probably break some
> clients, without a meaningful need.

There *is* a need, namely to keep complexity down.  This is quite
convoluted, it's got a lot of historical baggage because of the way it
was implemented, and it's very difficult to understand.  The greatest
motive I see is to make this easier to understand, so that it is easier
to modify and improve in the future.

> But I think it's also actually quite useful to be able to access
> catalogs before streaming data. You e.g. can look up configuration of
> the primary before streaming WAL. With a second connection that's
> actually harder to do reliably in some cases, because you need to be
> sure that you actually reached the right server (consider a pooler,
> automatic failover etc).

I don't think having a physical replication connection access catalog
data directly is a great idea.  We already have gadgets like
IDENTIFY_SYSTEM for physical replication that can do that, and if you
need particular settings you can use SHOW (commit d1ecd539477).  If
there was a strong need for even more than that, we can add something to
the grammar.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-06-03 Thread Andres Freund
Hi,

On 2020-06-02 14:23:50 +0900, Fujii Masao wrote:
> On 2020/06/02 13:24, Michael Paquier wrote:
> > On Fri, May 29, 2020 at 06:09:06PM +0900, Masahiko Sawada wrote:
> > > Yes. Conversely, if we start logical replication in a physical
> > > replication connection (i.g. replication=true) we got an error before
> > > staring replication:
> > > 
> > > ERROR:  logical decoding requires a database connection
> > > 
> > > I think we can prevent that SEGV in a similar way.
> > 
> > Still unconvinced as this restriction stands for logical decoding
> > requiring a database connection but it is not necessarily true now as
> > physical replication has less restrictions than a logical one.
> 
> Could you tell me what the benefit for supporting physical replication on
> logical rep connection is? If it's only for "undocumented"
> backward-compatibility, IMO it's better to reject such "tricky" set up.
> But if there are some use cases for that, I'm ok to support that.

I don't think we should prohibit this. For one, it'd probably break some
clients, without a meaningful need.

But I think it's also actually quite useful to be able to access
catalogs before streaming data. You e.g. can look up configuration of
the primary before streaming WAL. With a second connection that's
actually harder to do reliably in some cases, because you need to be
sure that you actually reached the right server (consider a pooler,
automatic failover etc).

Greetings,

Andres Freund




Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-06-03 Thread Fujii Masao




On 2020/06/03 20:33, Dave Cramer wrote:




On Wed, 3 Jun 2020 at 01:19, Michael Paquier mailto:mich...@paquier.xyz>> wrote:

On Tue, Jun 02, 2020 at 02:23:50PM +0900, Fujii Masao wrote:
 > On 2020/06/02 13:24, Michael Paquier wrote:
 >> Still unconvinced as this restriction stands for logical decoding
 >> requiring a database connection but it is not necessarily true now as
 >> physical replication has less restrictions than a logical one.
 >
 > Could you tell me what the benefit for supporting physical replication on
 > logical rep connection is? If it's only for "undocumented"
 > backward-compatibility, IMO it's better to reject such "tricky" set up.
 > But if there are some use cases for that, I'm ok to support that.

Well, I don't really think that we can just break a behavior that
exists since 9.4 as you could break applications relying on the
existing behavior, and that's also the point of Vladimir upthread.


For the back branches, I agree with you. Even if it's undocumented behavior,
basically we should not get rid of it from the back branches unless there is
very special reason.

For v13, if it has no functional merit, I don't think it's so bad to get rid of
that undocumented (and maybe not-fully tested) behavior. If there are
applications depending it, I think that they can be updated.


I don't see this is a valid reason to keep doing something. If it is broken 
then fix it.
Clients can deal with the change.


+1

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-06-03 Thread Dave Cramer
On Wed, 3 Jun 2020 at 01:19, Michael Paquier  wrote:

> On Tue, Jun 02, 2020 at 02:23:50PM +0900, Fujii Masao wrote:
> > On 2020/06/02 13:24, Michael Paquier wrote:
> >> Still unconvinced as this restriction stands for logical decoding
> >> requiring a database connection but it is not necessarily true now as
> >> physical replication has less restrictions than a logical one.
> >
> > Could you tell me what the benefit for supporting physical replication on
> > logical rep connection is? If it's only for "undocumented"
> > backward-compatibility, IMO it's better to reject such "tricky" set up.
> > But if there are some use cases for that, I'm ok to support that.
>
> Well, I don't really think that we can just break a behavior that
> exists since 9.4 as you could break applications relying on the
> existing behavior, and that's also the point of Vladimir upthread.
>

I don't see this is a valid reason to keep doing something. If it is broken
then fix it.
Clients can deal with the change.

Dave Cramer
https://www.postgres.rocks


Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-06-02 Thread Michael Paquier
On Tue, Jun 02, 2020 at 02:23:50PM +0900, Fujii Masao wrote:
> On 2020/06/02 13:24, Michael Paquier wrote:
>> Still unconvinced as this restriction stands for logical decoding
>> requiring a database connection but it is not necessarily true now as
>> physical replication has less restrictions than a logical one.
> 
> Could you tell me what the benefit for supporting physical replication on
> logical rep connection is? If it's only for "undocumented"
> backward-compatibility, IMO it's better to reject such "tricky" set up.
> But if there are some use cases for that, I'm ok to support that.

Well, I don't really think that we can just break a behavior that
exists since 9.4 as you could break applications relying on the
existing behavior, and that's also the point of Vladimir upthread.

On top of it, the issue is actually unrelated to if we want to
restrict things more or not when starting replication in a WAL sender
because the xlogreader creation just needs to happen when starting
replication.  Now we have a static "fake" one created when a WAL
sender process starts, something that it would not need in most cases
like answering to a BASE_BACKUP command for example.

>> I can note as well that StartLogicalReplication() moves in this sense
>> by setting xlogreader to be the one from logical_decoding_ctx once the
>> decoding context has been created.
>> 
>> This results in the attached.  The extra test from upthread to check
>> that logical decoding is not allowed in a non-database WAL sender is a
>> good idea, so I have kept it.
> 
> Yes. Also we should add the test to check if physical replication can work
> fine even on logical rep connection?

I found confusing the use of psql to confirm that it actually works,
because we'd just return a protocol-level error in this case with psql
bumping on COPY_BOTH and it is not reliable to do just an error
message match.  Note as well that GetConnection() discards
automatically the database name for pg_basebackup and pg_receivewal as
well as libpqrcv_connect() for standbys so we cannot use that.
Perhaps using psql is better than nothing, but that makes me
uncomfortable.
--
Michael


signature.asc
Description: PGP signature


Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-06-02 Thread Kyotaro Horiguchi
At Tue, 2 Jun 2020 13:24:56 +0900, Michael Paquier  wrote 
in 
> On Fri, May 29, 2020 at 06:09:06PM +0900, Masahiko Sawada wrote:
> > Yes. Conversely, if we start logical replication in a physical
> > replication connection (i.g. replication=true) we got an error before
> > staring replication:
> > 
> > ERROR:  logical decoding requires a database connection
> > 
> > I think we can prevent that SEGV in a similar way.
> 
> Still unconvinced as this restriction stands for logical decoding
> requiring a database connection but it is not necessarily true now as
> physical replication has less restrictions than a logical one.

If we deliberately allow physical replication on a
database-replication connection, we should revise the documentation
that way. On the other hand physical replication has wider access to a
database cluster than logical replication.  Thus allowing to start
physical replication on a logical replication connection could
introduce a problem related to privileges.  So I think it might be
better that physical and logical replication have separate pg_hba
lines.

Once we explicitly allow physical replication on a logical replication
connection in documentation, it would be far harder to change the
behavior than now.

If we are sure that that cannot be a problem, I don't object the
change in documented behavior.

> Looking at the code, I think that there is some confusion with the
> fake WAL reader used as base reference in InitWalSender() where we
> assume that it could only be used in the context of a non-database WAL
> sender.  However, this initialization happens when the WAL sender
> connection is initialized, and what I think this misses is that we 
> should try to initialize a WAL reader when actually going through a
> START_REPLICATION command.

At first fake_xlogreader was really a fake one that only provides
callback routines, but it should have been changed to a real
xlogreader at the time it began to store segment information. In that
sense moving to real xlogreader makes sense to me separately from
whether we allow physicalrep on logicalrep connections.

> I can note as well that StartLogicalReplication() moves in this sense
> by setting xlogreader to be the one from logical_decoding_ctx once the
> decoding context has been created.
> 
> This results in the attached.  The extra test from upthread to check
> that logical decoding is not allowed in a non-database WAL sender is a
> good idea, so I have kept it.

+   ereport(ERROR,
+   (errcode(ERRCODE_OUT_OF_MEMORY),
+errmsg("out of memory")));

The same error message is accompanied by a DETAILS in some other
places. Don't we need one for this?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-06-01 Thread Fujii Masao




On 2020/06/02 13:24, Michael Paquier wrote:

On Fri, May 29, 2020 at 06:09:06PM +0900, Masahiko Sawada wrote:

Yes. Conversely, if we start logical replication in a physical
replication connection (i.g. replication=true) we got an error before
staring replication:

ERROR:  logical decoding requires a database connection

I think we can prevent that SEGV in a similar way.


Still unconvinced as this restriction stands for logical decoding
requiring a database connection but it is not necessarily true now as
physical replication has less restrictions than a logical one.


Could you tell me what the benefit for supporting physical replication on
logical rep connection is? If it's only for "undocumented"
backward-compatibility, IMO it's better to reject such "tricky" set up.
But if there are some use cases for that, I'm ok to support that.


Looking at the code, I think that there is some confusion with the
fake WAL reader used as base reference in InitWalSender() where we
assume that it could only be used in the context of a non-database WAL
sender.  However, this initialization happens when the WAL sender
connection is initialized, and what I think this misses is that we
should try to initialize a WAL reader when actually going through a
START_REPLICATION command.

I can note as well that StartLogicalReplication() moves in this sense
by setting xlogreader to be the one from logical_decoding_ctx once the
decoding context has been created.

This results in the attached.  The extra test from upthread to check
that logical decoding is not allowed in a non-database WAL sender is a
good idea, so I have kept it.


Yes. Also we should add the test to check if physical replication can work
fine even on logical rep connection?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-06-01 Thread Michael Paquier
On Fri, May 29, 2020 at 06:09:06PM +0900, Masahiko Sawada wrote:
> Yes. Conversely, if we start logical replication in a physical
> replication connection (i.g. replication=true) we got an error before
> staring replication:
> 
> ERROR:  logical decoding requires a database connection
> 
> I think we can prevent that SEGV in a similar way.

Still unconvinced as this restriction stands for logical decoding
requiring a database connection but it is not necessarily true now as
physical replication has less restrictions than a logical one.

Looking at the code, I think that there is some confusion with the
fake WAL reader used as base reference in InitWalSender() where we
assume that it could only be used in the context of a non-database WAL
sender.  However, this initialization happens when the WAL sender
connection is initialized, and what I think this misses is that we 
should try to initialize a WAL reader when actually going through a
START_REPLICATION command.

I can note as well that StartLogicalReplication() moves in this sense
by setting xlogreader to be the one from logical_decoding_ctx once the
decoding context has been created.

This results in the attached.  The extra test from upthread to check
that logical decoding is not allowed in a non-database WAL sender is a
good idea, so I have kept it.
--
Michael
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index d930fe957d..b0f2a6ed43 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -262,10 +262,6 @@ extern XLogReaderRoutine *LocalXLogReaderRoutine(void);
 /* Free an XLogReader */
 extern void XLogReaderFree(XLogReaderState *state);
 
-/* Initialize supporting structures */
-extern void WALOpenSegmentInit(WALOpenSegment *seg, WALSegmentContext *segcxt,
-			   int segsize, const char *waldir);
-
 /* Position the XLogReader to given record */
 extern void XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr);
 #ifdef FRONTEND
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 5995798b58..cb76be4f46 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -44,6 +44,8 @@ static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record,
 			XLogRecPtr recptr);
 static void ResetDecoder(XLogReaderState *state);
+static void WALOpenSegmentInit(WALOpenSegment *seg, WALSegmentContext *segcxt,
+			   int segsize, const char *waldir);
 
 /* size of the buffer allocated for error message. */
 #define MAX_ERRORMSG_LEN 1000
@@ -210,7 +212,7 @@ allocate_recordbuf(XLogReaderState *state, uint32 reclength)
 /*
  * Initialize the passed segment structs.
  */
-void
+static void
 WALOpenSegmentInit(WALOpenSegment *seg, WALSegmentContext *segcxt,
    int segsize, const char *waldir)
 {
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 86847cbb54..aeda307c6b 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -130,13 +130,11 @@ bool		log_replication_commands = false;
 bool		wake_wal_senders = false;
 
 /*
- * Physical walsender does not use xlogreader to read WAL, but it does use a
- * fake one to keep state.  Logical walsender uses a proper xlogreader.  Both
- * keep the 'xlogreader' pointer to the right one, for the sake of common
- * routines.
+ * xlogreader used for replication.  Note that a WAL sender doing physical
+ * replication does not need xlogreader to read WAL, but it needs one to
+ * keep a state of its work.
  */
-static XLogReaderState fake_xlogreader;
-static XLogReaderState *xlogreader;
+static XLogReaderState *xlogreader = NULL;
 
 /*
  * These variables keep track of the state of the timeline we're currently
@@ -285,20 +283,6 @@ InitWalSender(void)
 
 	/* Initialize empty timestamp buffer for lag tracking. */
 	lag_tracker = MemoryContextAllocZero(TopMemoryContext, sizeof(LagTracker));
-
-	/*
-	 * Prepare physical walsender's fake xlogreader struct.  Logical walsender
-	 * does this later.
-	 */
-	if (!am_db_walsender)
-	{
-		xlogreader = _xlogreader;
-		xlogreader->routine =
-			*XL_ROUTINE(.segment_open = WalSndSegmentOpen,
-		.segment_close = wal_segment_close);
-		WALOpenSegmentInit(>seg, >segcxt,
-		   wal_segment_size, NULL);
-	}
 }
 
 /*
@@ -594,6 +578,18 @@ StartReplication(StartReplicationCmd *cmd)
 (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  errmsg("IDENTIFY_SYSTEM has not been run before START_REPLICATION")));
 
+	/* create xlogreader for physical replication */
+	xlogreader =
+		XLogReaderAllocate(wal_segment_size, NULL,
+		   XL_ROUTINE(.segment_open = WalSndSegmentOpen,
+	  .segment_close = wal_segment_close),
+		   NULL);
+
+	if (!xlogreader)
+		ereport(ERROR,
+(errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory")));
+
 	/*
 	 * We assume 

Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-05-29 Thread Masahiko Sawada
On Fri, 29 May 2020 at 17:57, Kyotaro Horiguchi  wrote:
>
> At Fri, 29 May 2020 16:21:38 +0900, Michael Paquier  
> wrote in
> > On Thu, May 28, 2020 at 06:11:39PM +0900, Kyotaro Horiguchi wrote:
> > > Mmm. It is not the proper way to use physical replication and it's
> > > totally accidental that that worked (or even it might be a bug). The
> > > documentation is saying as the follows, as more-or-less the same for
> > > all versions since 9.4.
> > >
> > > https://www.postgresql.org/docs/13/protocol-replication.html
> >
> > +   if (am_db_walsender)
> > +   ereport(ERROR,
> > +   (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> > +errmsg("cannot initiate physical
> > replication on a logical replication connection")));
> >
> > I don't agree with this change.  The only restriction that we have in
> > place now in walsender.c regarding MyDatabaseId not being set is to
> > prevent the execution of SQL commands.  Note that it is possible to
> > start physical replication even if MyDatabaseId is set in a
> > replication connection, so you could break cases that have been valid
> > until now.
>
> It donesn't check MyDatabase, but whether the connection parameter
> "repliation" is "true" or "database".  The documentation is telling
> that "replication" should be "true" for a connection that is to be
> used for physical replication, and "replication" should literally be
> "database" for a connection that is for logical replication.  We need
> to revise the documentation if we are going to allow physical
> replication on a conection with "replication = database".
>

Yes. Conversely, if we start logical replication in a physical
replication connection (i.g. replication=true) we got an error before
staring replication:

ERROR:  logical decoding requires a database connection

I think we can prevent that SEGV in a similar way.

Regards,

-- 
Masahiko Sawadahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-05-29 Thread Kyotaro Horiguchi
At Fri, 29 May 2020 16:21:38 +0900, Michael Paquier  wrote 
in 
> On Thu, May 28, 2020 at 06:11:39PM +0900, Kyotaro Horiguchi wrote:
> > Mmm. It is not the proper way to use physical replication and it's
> > totally accidental that that worked (or even it might be a bug). The
> > documentation is saying as the follows, as more-or-less the same for
> > all versions since 9.4.
> > 
> > https://www.postgresql.org/docs/13/protocol-replication.html
> 
> +   if (am_db_walsender)
> +   ereport(ERROR,
> +   (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> +errmsg("cannot initiate physical
> replication on a logical replication connection")));
> 
> I don't agree with this change.  The only restriction that we have in
> place now in walsender.c regarding MyDatabaseId not being set is to
> prevent the execution of SQL commands.  Note that it is possible to
> start physical replication even if MyDatabaseId is set in a
> replication connection, so you could break cases that have been valid
> until now.

It donesn't check MyDatabase, but whether the connection parameter
"repliation" is "true" or "database".  The documentation is telling
that "replication" should be "true" for a connection that is to be
used for physical replication, and "replication" should literally be
"database" for a connection that is for logical replication.  We need
to revise the documentation if we are going to allow physical
replication on a conection with "replication = database".

> I think that we actually should be much more careful with the
> initialization of the WAL reader used in the context of a WAL sender
> before calling WALRead() and attempting to read a new WAL page.

I agree that the initialization can be improved, but the current code
is no problem if we don't allow to run both logical and physical
replication on a single session.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-05-29 Thread Michael Paquier
On Thu, May 28, 2020 at 06:11:39PM +0900, Kyotaro Horiguchi wrote:
> Mmm. It is not the proper way to use physical replication and it's
> totally accidental that that worked (or even it might be a bug). The
> documentation is saying as the follows, as more-or-less the same for
> all versions since 9.4.
> 
> https://www.postgresql.org/docs/13/protocol-replication.html

+   if (am_db_walsender)
+   ereport(ERROR,
+   (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+errmsg("cannot initiate physical
replication on a logical replication connection")));

I don't agree with this change.  The only restriction that we have in
place now in walsender.c regarding MyDatabaseId not being set is to
prevent the execution of SQL commands.  Note that it is possible to
start physical replication even if MyDatabaseId is set in a
replication connection, so you could break cases that have been valid
until now.

I think that we actually should be much more careful with the
initialization of the WAL reader used in the context of a WAL sender
before calling WALRead() and attempting to read a new WAL page.
--
Michael


signature.asc
Description: PGP signature


Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-05-28 Thread Kyotaro Horiguchi
At Thu, 28 May 2020 09:08:19 -0400, Dave Cramer  
wrote in 
> On Thu, 28 May 2020 at 05:11, Kyotaro Horiguchi 
> wrote:
> > Mmm. It is not the proper way to use physical replication and it's
> > totally accidental that that worked (or even it might be a bug). The
> > documentation is saying as the follows, as more-or-less the same for
> > all versions since 9.4.
> >
> > https://www.postgresql.org/docs/13/protocol-replication.html
...
> >
> While the documentation does indeed say that there is quite a bit of
> additional confusion added by:
> 
> and
> START_REPLICATION [ SLOT *slot_name* ] [ PHYSICAL ] *XXX/XXX* [ TIMELINE
> *tli* ]
> 
> If we already have a physical replication slot according to the startup
> message why do we need to specify it in the START REPLICATION message ?

I don't know, but physical replication has worked that way since
before the replication slots was introduced so we haven't needed to do
so.  Physical replication slots are not assumed as more than
memorandum for the oldest required WAL segment (and oldest xmin).

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-05-28 Thread Dave Cramer
On Thu, 28 May 2020 at 05:11, Kyotaro Horiguchi 
wrote:

> Hello, Vladimir.
>
> At Thu, 28 May 2020 11:57:23 +0300, Vladimir Sitnikov <
> sitnikov.vladi...@gmail.com> wrote in
> > Kyotaro>It seems to me that that crash means Pgjdbc is initiating a
> logical
> > Kyotaro>replication connection to start physical replication.
> >
> > Well, it used to work previously, so it might be a breaking change from
> the
> > client/application point of view.
>
> Mmm. It is not the proper way to use physical replication and it's
> totally accidental that that worked (or even it might be a bug). The
> documentation is saying as the follows, as more-or-less the same for
> all versions since 9.4.
>
> https://www.postgresql.org/docs/13/protocol-replication.html
>
> > To initiate streaming replication, the frontend sends the replication
> > parameter in the startup message. A Boolean value of true (or on, yes,
> > 1) tells the backend to go into physical replication walsender mode,
> > wherein a small set of replication commands, shown below, can be
> > issued instead of SQL statements.
> >
> > Passing database as the value for the replication parameter instructs
> > the backend to go into logical replication walsender mode, connecting
> > to the database specified in the dbname parameter. In logical
> > replication walsender mode, the replication commands shown below as
> > well as normal SQL commands can be issued.
>
> regards.
>
> While the documentation does indeed say that there is quite a bit of
additional confusion added by:

and
START_REPLICATION [ SLOT *slot_name* ] [ PHYSICAL ] *XXX/XXX* [ TIMELINE
*tli* ]

If we already have a physical replication slot according to the startup
message why do we need to specify it in the START REPLICATION message ?

Dave


Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-05-28 Thread Kyotaro Horiguchi
Hello, Vladimir.

At Thu, 28 May 2020 11:57:23 +0300, Vladimir Sitnikov 
 wrote in 
> Kyotaro>It seems to me that that crash means Pgjdbc is initiating a logical
> Kyotaro>replication connection to start physical replication.
> 
> Well, it used to work previously, so it might be a breaking change from the
> client/application point of view.

Mmm. It is not the proper way to use physical replication and it's
totally accidental that that worked (or even it might be a bug). The
documentation is saying as the follows, as more-or-less the same for
all versions since 9.4.

https://www.postgresql.org/docs/13/protocol-replication.html

> To initiate streaming replication, the frontend sends the replication
> parameter in the startup message. A Boolean value of true (or on, yes,
> 1) tells the backend to go into physical replication walsender mode,
> wherein a small set of replication commands, shown below, can be
> issued instead of SQL statements.
> 
> Passing database as the value for the replication parameter instructs
> the backend to go into logical replication walsender mode, connecting
> to the database specified in the dbname parameter. In logical
> replication walsender mode, the replication commands shown below as
> well as normal SQL commands can be issued.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-05-28 Thread Vladimir Sitnikov
Kyotaro>It seems to me that that crash means Pgjdbc is initiating a logical
Kyotaro>replication connection to start physical replication.

Well, it used to work previously, so it might be a breaking change from the
client/application point of view.

Vladimir


Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-05-28 Thread Kyotaro Horiguchi
At Thu, 28 May 2020 09:07:04 +0300, Vladimir Sitnikov 
 wrote in 
> Pgjdbc test suite identified a SIGSEGV in the recent HEAD builds of
> PostgreSQL, Ubuntu 14.04.5 LTS
> 
> Here's a call stack:
> https://travis-ci.org/github/pgjdbc/pgjdbc/jobs/691794110#L7484
> The crash is consistent, and it reproduces 100% of the cases so far.
> 
> The CI history shows that HEAD was good at 11 May 13:27 UTC, and it became
> bad by 19 May 14:00 UTC,
> so the regression was introduced somewhere in-between.
> 
> Does that ring any bells?

Thanks for the report.  It is surely a bug since the server crashes,
on the other hand Pgjdbc seems doing bad, too.

It seems to me that that crash means Pgjdbc is initiating a logical
replication connection to start physical replication.

> In case you wonder:
> 
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  XLogSendPhysical () at walsender.c:2762
> 2762 if (!WALRead(xlogreader,
> (gdb) #0  XLogSendPhysical () at walsender.c:2762
> SendRqstPtr = 133473640
> startptr = 133473240
> endptr = 133473640
> nbytes = 400
> segno = 1
> errinfo = {wre_errno = 988942240, wre_off = 2, wre_req = -1,
>   wre_read = -1, wre_seg = {ws_file = 4714224,
> ws_segno = 140729887364688, ws_tli = 0}}
> __func__ = "XLogSendPhysical"

I see the probably the same symptom by the following steps with the
current HEAD.

psql 'host=/tmp replication=database'
=# START_REPLICATION 0/1;


Physical replication is not assumed to be started on a logical
replication connection. The attached would fix that.  The patch adds
two tests.  One for this case and another for the reverse.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 1f94c98f7459ca8a4942246325815a3e0a91caa4 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 28 May 2020 15:55:30 +0900
Subject: [PATCH v1] Fix crash when starting physical replication on logical
 connection

It is an illegal operation to try starting physical replication on a
logical replication session.  We should properly warn the client
instead of crashing.
---
 src/backend/replication/walsender.c |  5 +
 src/test/recovery/t/001_stream_rep.pl   | 14 +++---
 src/test/recovery/t/006_logical_decoding.pl | 10 +-
 3 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 86847cbb54..7b79c75311 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -589,6 +589,11 @@ StartReplication(StartReplicationCmd *cmd)
 	StringInfoData buf;
 	XLogRecPtr	FlushPtr;
 
+	if (am_db_walsender)
+		ereport(ERROR,
+(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot initiate physical replication on a logical replication connection")));
+
 	if (ThisTimeLineID == 0)
 		ereport(ERROR,
 (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 0c316c1808..0b69b7d8d1 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -3,7 +3,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 35;
+use Test::More tests => 36;
 
 # Initialize master node
 my $node_master = get_new_node('master');
@@ -18,6 +18,14 @@ my $backup_name = 'my_backup';
 # Take backup
 $node_master->backup($backup_name);
 
+# Check if logical-rep session properly refuses to start physical-rep
+my ($ret, $stdout, $stderr) =
+  $node_master->psql('template1',
+	 qq[START_REPLICATION PHYSICAL 0/1],
+	 replication=>'database');
+ok($stderr =~ /ERROR:  cannot initiate physical replication on a logical replication connection/,
+   "check if physical replication is rejected on logical-rep session");
+
 # Create streaming standby linking to master
 my $node_standby_1 = get_new_node('standby_1');
 $node_standby_1->init_from_backup($node_master, $backup_name,
@@ -94,7 +102,7 @@ sub test_target_session_attrs
 
 	# The client used for the connection does not matter, only the backend
 	# point does.
-	my ($ret, $stdout, $stderr) =
+	($ret, $stdout, $stderr) =
 	  $node1->psql('postgres', 'SHOW port;',
 		extra_params => [ '-d', $connstr ]);
 	is( $status == $ret && $stdout eq $target_node->port,
@@ -136,7 +144,7 @@ my $connstr_rep= "$connstr_common replication=1";
 my $connstr_db = "$connstr_common replication=database dbname=postgres";
 
 # Test SHOW ALL
-my ($ret, $stdout, $stderr) = $node_master->psql(
+($ret, $stdout, $stderr) = $node_master->psql(
 	'postgres', 'SHOW ALL;',
 	on_error_die => 1,
 	extra_params => [ '-d', $connstr_rep ]);
diff --git a/src/test/recovery/t/006_logical_decoding.pl b/src/test/recovery/t/006_logical_decoding.pl
index ee05535b1c..eefde8c3f1 100644
--- a/src/test/recovery/t/006_logical_decoding.pl
+++ b/src/test/recovery/t/006_logical_decoding.pl
@@ -7,7 

Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-05-28 Thread Michael Paquier
On Thu, May 28, 2020 at 04:32:22PM +0900, Kyotaro Horiguchi wrote:
> I think that's not the case.  I think I cause this crash with the
> HEAD.  I'll post a fix soon.
> 
> Pgjdbc seems initiating physical replication on a logical replication
> session.

Let me see...  Indeed:
$ psql "replication=database dbname=postgres"
=# START_REPLICATION SLOT test_slot PHYSICAL 0/0;
[one  later]

(gdb) bt
#0  XLogSendPhysical () at walsender.c:2762
#1  0x55d5f7803318 in WalSndLoop (send_data=0x55d5f78039c7
) at walsender.c:2300
#2  0x55d5f7800d70 in StartReplication (cmd=0x55d5f919bc60) at
walsender.c:750
#3  0x55d5f78025ad in exec_replication_command
(cmd_string=0x55d5f9119a80 "START_REPLICATION SLOT test_slot PHYSICAL
0/0;") at walsender.c:1643
#4  0x55d5f786a7ea in PostgresMain (argc=1, argv=0x55d5f91472c8,
dbname=0x55d5f9147210 "ioltas", username=0x55d5f91471f0 "ioltas") at
postgres.c:4311
#5  0x55d5f77b4e9c in BackendRun (port=0x55d5f913dcd0) at
postmaster.c:4523
#6  0x55d5f77b4606 in BackendStartup (port=0x55d5f913dcd0) at
postmaster.c:4215
#7  0x55d5f77b08ad in ServerLoop () at postmaster.c:1727
#8  0x55d5f77b00fc in PostmasterMain (argc=3, argv=0x55d5f9113260)
at postmaster.c:1400
#9  0x55d5f76b3736 in main (argc=3, argv=0x55d5f9113260) at
main.c:210

No need for the JDBC test suite then.
--
Michael


signature.asc
Description: PGP signature


Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-05-28 Thread Kyotaro Horiguchi
At Thu, 28 May 2020 16:22:33 +0900, Michael Paquier  wrote 
in 
> On Thu, May 28, 2020 at 09:07:04AM +0300, Vladimir Sitnikov wrote:
> > The CI history shows that HEAD was good at 11 May 13:27 UTC, and it became
> > bad by 19 May 14:00 UTC,
> > so the regression was introduced somewhere in-between.
> > 
> > Does that ring any bells?
> 
> It does, thanks!  This would map with 1d374302 or 850196b6 that
> reworked this area of the code, so it seems like we are not quite done
> with this work yet.  Do you still see the problem as of 55ca50d
> (today's latest HEAD)?

I think that's not the case.  I think I cause this crash with the
HEAD.  I'll post a fix soon.

> Also, just wondering..  If I use more or less the same commands as
> your travis job I should be able to reproduce the problem with a fresh
> JDBC repository, right?  Or do you a sub-portion of your regression
> tests to run that easily?

Pgjdbc seems initiating physical replication on a logical replication
session.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-05-28 Thread Michael Paquier
On Thu, May 28, 2020 at 09:07:04AM +0300, Vladimir Sitnikov wrote:
> The CI history shows that HEAD was good at 11 May 13:27 UTC, and it became
> bad by 19 May 14:00 UTC,
> so the regression was introduced somewhere in-between.
> 
> Does that ring any bells?

It does, thanks!  This would map with 1d374302 or 850196b6 that
reworked this area of the code, so it seems like we are not quite done
with this work yet.  Do you still see the problem as of 55ca50d
(today's latest HEAD)?

Also, just wondering..  If I use more or less the same commands as
your travis job I should be able to reproduce the problem with a fresh
JDBC repository, right?  Or do you a sub-portion of your regression
tests to run that easily?
--
Michael


signature.asc
Description: PGP signature