Re: Aw: Re: Minor documentation error regarding streaming replication protocol

2020-11-12 Thread Bruce Momjian
On Thu, Oct 15, 2020 at 12:43:22PM -0400, Bruce Momjian wrote:
> On Thu, Oct 15, 2020 at 12:01:21PM -0400, Tom Lane wrote:
> > Bruce Momjian  writes:
> > > We would want the timeline history file contents label changed from
> > > BYTEA to TEXT in the docs changed for all supported versions, add a C
> > > comment to all backbranches that BYTEA is the same as TEXT protocol
> > > fields, and change the C code to return TEXT IN PG 14.  Is that what
> > > people want?
> > 
> > I still think there is no need to touch back branches.  What you
> > propose here is more likely to confuse people than help them.
> > Having the documentation disagree with the code about how the
> > field is labeled is not good either.
> 
> Understood.
> 
> > Furthermore, it absolutely does not make sense to say (or imply)
> > that the unknown-encoding business applies to all text fields.
> > There are a very small number of fields where we should say that.
> 
> Yes, I am seeing now that even IDENTIFY_SYSTEM above it properly does
> encoding.  I guess TIMELINE_HISTORY works this way because it is pulling
> from the file system, not from system tables.  I ended up with just a
> new doc sentence and C comment in back branches, and a relabeling of the
> timeline history 'content' field as TEXT in the C code and docs,
> attached.

I have applied the doc-only patch to back branches, and the data type
change to master;  the master change should cause no behavioral change.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Aw: Re: Minor documentation error regarding streaming replication protocol

2020-10-15 Thread Bruce Momjian
On Thu, Oct 15, 2020 at 12:01:21PM -0400, Tom Lane wrote:
> Bruce Momjian  writes:
> > We would want the timeline history file contents label changed from
> > BYTEA to TEXT in the docs changed for all supported versions, add a C
> > comment to all backbranches that BYTEA is the same as TEXT protocol
> > fields, and change the C code to return TEXT IN PG 14.  Is that what
> > people want?
> 
> I still think there is no need to touch back branches.  What you
> propose here is more likely to confuse people than help them.
> Having the documentation disagree with the code about how the
> field is labeled is not good either.

Understood.

> Furthermore, it absolutely does not make sense to say (or imply)
> that the unknown-encoding business applies to all text fields.
> There are a very small number of fields where we should say that.

Yes, I am seeing now that even IDENTIFY_SYSTEM above it properly does
encoding.  I guess TIMELINE_HISTORY works this way because it is pulling
from the file system, not from system tables.  I ended up with just a
new doc sentence and C comment in back branches, and a relabeling of the
timeline history 'content' field as TEXT in the C code and docs,
attached.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 5e06c7523d..4bb1ca7e26 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1859,7 +1859,10 @@ The commands accepted in replication mode are:
  
   Requests the server to send over the timeline history file for timeline
   tli.  Server replies with a
-  result set of a single row, containing two fields:
+  result set of a single row, containing two fields.  While the
+  fields are labeled as text and bytea,
+  they effectively return raw bytes, with no escaping or encoding
+  conversion:
  
 
  
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index df27e84761..7c939897dd 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -496,6 +496,10 @@ SendTimeLineHistory(TimeLineHistoryCmd *cmd)
 	pq_sendstring(, "content"); /* col name */
 	pq_sendint32(, 0);		/* table oid */
 	pq_sendint16(, 0);		/* attnum */
+	/*
+	 * While this is labeled as BYTEAOID, it is the same output format
+	 * as TEXTOID above.
+	 */
 	pq_sendint32(, BYTEAOID);	/* type oid */
 	pq_sendint16(, -1);		/* typlen */
 	pq_sendint32(, 0);		/* typmod */
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 5e06c7523d..4a5c5f9458 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1859,7 +1859,9 @@ The commands accepted in replication mode are:
  
   Requests the server to send over the timeline history file for timeline
   tli.  Server replies with a
-  result set of a single row, containing two fields:
+  result set of a single row, containing two fields.  While the fields
+  are labeled as text, they effectively return raw bytes,
+  with no encoding conversion:
  
 
  
@@ -1877,7 +1879,7 @@ The commands accepted in replication mode are:
 
   
   
-   content (bytea)
+   content (text)
   
   
   
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index df27e84761..e6f1503f92 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -496,7 +496,7 @@ SendTimeLineHistory(TimeLineHistoryCmd *cmd)
 	pq_sendstring(, "content"); /* col name */
 	pq_sendint32(, 0);		/* table oid */
 	pq_sendint16(, 0);		/* attnum */
-	pq_sendint32(, BYTEAOID);	/* type oid */
+	pq_sendint32(, TEXTOID);	/* type oid */
 	pq_sendint16(, -1);		/* typlen */
 	pq_sendint32(, 0);		/* typmod */
 	pq_sendint16(, 0);		/* format code */


Re: Aw: Re: Minor documentation error regarding streaming replication protocol

2020-10-15 Thread Tom Lane
Bruce Momjian  writes:
> We would want the timeline history file contents label changed from
> BYTEA to TEXT in the docs changed for all supported versions, add a C
> comment to all backbranches that BYTEA is the same as TEXT protocol
> fields, and change the C code to return TEXT IN PG 14.  Is that what
> people want?

I still think there is no need to touch back branches.  What you
propose here is more likely to confuse people than help them.
Having the documentation disagree with the code about how the
field is labeled is not good either.

Furthermore, it absolutely does not make sense to say (or imply)
that the unknown-encoding business applies to all text fields.
There are a very small number of fields where we should say that.

regards, tom lane




Re: Aw: Re: Minor documentation error regarding streaming replication protocol

2020-10-15 Thread Bruce Momjian
On Thu, Oct 15, 2020 at 11:18:33AM -0400, Tom Lane wrote:
> Bruce Momjian  writes:
> > On Thu, Oct 15, 2020 at 10:03:46AM -0400, Tom Lane wrote:
> >> I don't really see why our only documentation choices are "text" or
> >> "bytea".  Can't we just write "(raw file contents)" or the like?
> 
> > Agreed.  The reason they are those labels is that the C code has to
> > label the output fields, so it can only use SQL types.  There are many
> > protocol fields mislabeled in this way, not just timeline history.
> 
> Ah, right.  Well, let's label it as text and then in the description
> of the message, note that the field contains the raw file contents,
> whose encoding is unknown to the server.

OK, but this not would need to be in the protocol docs at the top, not
for the timeline history file, since it affects all columns labeled as
TEXT.

We would want the timeline history file contents label changed from
BYTEA to TEXT in the docs changed for all supported versions, add a C
comment to all backbranches that BYTEA is the same as TEXT protocol
fields, and change the C code to return TEXT IN PG 14.  Is that what
people want?

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Aw: Re: Minor documentation error regarding streaming replication protocol

2020-10-15 Thread Tom Lane
Bruce Momjian  writes:
> On Thu, Oct 15, 2020 at 10:03:46AM -0400, Tom Lane wrote:
>> I don't really see why our only documentation choices are "text" or
>> "bytea".  Can't we just write "(raw file contents)" or the like?

> Agreed.  The reason they are those labels is that the C code has to
> label the output fields, so it can only use SQL types.  There are many
> protocol fields mislabeled in this way, not just timeline history.

Ah, right.  Well, let's label it as text and then in the description
of the message, note that the field contains the raw file contents,
whose encoding is unknown to the server.

regards, tom lane




Re: Aw: Re: Minor documentation error regarding streaming replication protocol

2020-10-15 Thread Bruce Momjian
On Thu, Oct 15, 2020 at 10:03:46AM -0400, Tom Lane wrote:
> Brar Piening  writes:
> > Bruce Momjian  wrote:
> >> I did some more research on this. It turns out timeline 'content' is
> >> the only BYTEA listed in the protocol docs, even though it just passes C
> >> strings to pq_sendbytes(), just like many other fields like the field
> >> above it, the timeline history filename. The proper fix is to change
> >> the code to return the timeline history file contents as TEXT instead of
> >> BYTEA.
> 
> > If the timeline history file can contain strings which "may not be made 
> > just of ASCII characters" this would probably make the client side assume 
> > that the content is being sent as TEXT encoded in client_encoding which 
> > again isn't true.
> 
> I think this is overthinking the problem, TBH.  Yeah, perhaps somebody
> put non-ASCII comments into that file, but they'd probably be expecting
> the exact same encoding they used there to come out the other end.
> 
> We should certainly *not* apply byteaout, as that would break cases that
> work fine today; and if we do not do that, it is quite misleading to
> suggest that the data is given in bytea format.
> 
> I don't really see why our only documentation choices are "text" or
> "bytea".  Can't we just write "(raw file contents)" or the like?

Agreed.  The reason they are those labels is that the C code has to
label the output fields, so it can only use SQL types.  There are many
protocol fields mislabeled in this way, not just timeline history.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Aw: Re: Minor documentation error regarding streaming replication protocol

2020-10-15 Thread Fujii Masao




On 2020/10/15 23:03, Tom Lane wrote:

Brar Piening  writes:

Bruce Momjian  wrote:

I did some more research on this. It turns out timeline 'content' is
the only BYTEA listed in the protocol docs, even though it just passes C
strings to pq_sendbytes(), just like many other fields like the field
above it, the timeline history filename. The proper fix is to change
the code to return the timeline history file contents as TEXT instead of
BYTEA.



If the timeline history file can contain strings which "may not be made just of 
ASCII characters" this would probably make the client side assume that the content 
is being sent as TEXT encoded in client_encoding which again isn't true.


I think this is overthinking the problem, TBH.  Yeah, perhaps somebody
put non-ASCII comments into that file, but they'd probably be expecting
the exact same encoding they used there to come out the other end.

We should certainly *not* apply byteaout, as that would break cases that
work fine today;


+1


and if we do not do that, it is quite misleading to
suggest that the data is given in bytea format.

I don't really see why our only documentation choices are "text" or
"bytea".  Can't we just write "(raw file contents)" or the like?


+1

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Aw: Re: Minor documentation error regarding streaming replication protocol

2020-10-15 Thread Tom Lane
Brar Piening  writes:
> Bruce Momjian  wrote:
>> I did some more research on this. It turns out timeline 'content' is
>> the only BYTEA listed in the protocol docs, even though it just passes C
>> strings to pq_sendbytes(), just like many other fields like the field
>> above it, the timeline history filename. The proper fix is to change
>> the code to return the timeline history file contents as TEXT instead of
>> BYTEA.

> If the timeline history file can contain strings which "may not be made just 
> of ASCII characters" this would probably make the client side assume that the 
> content is being sent as TEXT encoded in client_encoding which again isn't 
> true.

I think this is overthinking the problem, TBH.  Yeah, perhaps somebody
put non-ASCII comments into that file, but they'd probably be expecting
the exact same encoding they used there to come out the other end.

We should certainly *not* apply byteaout, as that would break cases that
work fine today; and if we do not do that, it is quite misleading to
suggest that the data is given in bytea format.

I don't really see why our only documentation choices are "text" or
"bytea".  Can't we just write "(raw file contents)" or the like?

regards, tom lane




Aw: Re: Minor documentation error regarding streaming replication protocol

2020-10-15 Thread Brar Piening
Bruce Momjian  wrote:
>> Good point. The reporter was assuming the data would come to the client
>> in the bytea output format specified by the GUC, e.g. \x..., so that
>> doesn't happen either. As I said before, it is more raw bytes, but we
>> don't have an SQL data type for that.

> I did some more research on this. It turns out timeline 'content' is
> the only BYTEA listed in the protocol docs, even though it just passes C
> strings to pq_sendbytes(), just like many other fields like the field
> above it, the timeline history filename. The proper fix is to change
> the code to return the timeline history file contents as TEXT instead of
> BYTEA.

In the light of what Michael wrote above, I don't think that this is really 
enough.

If the timeline history file can contain strings which "may not be made just of 
ASCII characters" this would probably make the client side assume that the 
content is being sent as TEXT encoded in client_encoding which again isn't true.
In the worst case this could lead to nasty decoding bugs on the client side 
which could even result in security issues.

Since you probably can't tell in which encoding the aforementioned "recovery 
target name" was written to the timeline history file, I agree with Michael 
that BYTEA is probably the sanest way to send this file.

IMO the best way out of this is to either really encode the content as BYTEA by 
passing it through byteaout() and by that escaping characters <0x20 and >0x7e, 
or to document that the file is being sent "as raw bytes that can be read as 
'bytea Escape Format' by parsers compatible with byteain()" (this works because 
byteain() doesn't check whether bytes <0x20 or >0x7e are actually escaped).

Again, reading the raw bytes, either via byteain() or just as raw bytes, isn't 
really a problem and I don't want to bring you into a situation where the cure 
is worse than the disease.