Re: [HACKERS] Making TEXT NUL-transparent

2011-12-23 Thread Florian Weimer
* Florian Pflug:

 On Nov24, 2011, at 10:54 , Florian Weimer wrote:
 Or is it not only about being able to *store* NULs in a text field?
 
 No, the entire core should be NUL-transparent.

 That's unlikely to happen.

Yes, with the type input/output functions tied to NUL-terminated
strings, that seems indeed unlikely to happen.

 A more realistic approach would be to solve this only for UTF-8
 encoded strings by encoding the NUL character not as a single 0 byte,
 but as sequence of non-0 bytes.

0xFF cannot occur in valid UTF-8, so that's one possibility.

 Java, for example, seems to use it to serialize Strings (which may contain
 NUL characters) to UTF-8.

Only internally in the VM.  UTF-8 produced by the I/O encoder/decoders
produces and consumes NUL bytes.

 Should you try to add a new encoding which supports that, you might also
 want to allow CESU-8-style encoding of UTF-16 surrogate pairs. This means
 that code points representable by UTF-16 surrogate pairs may be encoded by
 separately encoding the two surrogate characters in UTF-8.

I'm not sure if this is a good idea.  The motivation behind CESU-8 is
that it sorts byte-encoded strings in the same order as UTF-16, which is
a completely separate concern.

-- 
Florian Weimerfwei...@bfk.de
BFK edv-consulting GmbH   http://www.bfk.de/
Kriegsstraße 100  tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Making TEXT NUL-transparent

2011-11-24 Thread Pavel Stehule
Hello

2011/11/24 Florian Weimer fwei...@bfk.de:
 Occasionally, we get bitten by embedded NUL bytes in TEXT values.  We
 take care of generating proper UTF-8, but this additional restriction
 sometimes slips by.  It would be really helpful if PostgreSQL could
 store such TEXT fields as-is (at least if they are computed internally,
 or come from query parameters).


and why you don't use bytea ? Text should be correct literal.

Regards

Pavel Stehule

 I haven't even looked how difficult it would be to implement this.  Do
 you think it's worth the trouble?

 --
 Florian Weimer                fwei...@bfk.de
 BFK edv-consulting GmbH       http://www.bfk.de/
 Kriegsstraße 100              tel: +49-721-96201-1
 D-76133 Karlsruhe             fax: +49-721-96201-99

 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Making TEXT NUL-transparent

2011-11-24 Thread Florian Weimer
* Pavel Stehule:

 Hello

 2011/11/24 Florian Weimer fwei...@bfk.de:
 Occasionally, we get bitten by embedded NUL bytes in TEXT values.  We
 take care of generating proper UTF-8, but this additional restriction
 sometimes slips by.  It would be really helpful if PostgreSQL could
 store such TEXT fields as-is (at least if they are computed internally,
 or come from query parameters).


 and why you don't use bytea ? Text should be correct literal.

It's actually UTF-8 text, and some PostgreSQL functions are only
available for TEXT, but not BYTEA, e.g.:

bfk_int= SELECT '\x006500'::bytea ~ 'A';
ERROR:  operator does not exist: bytea ~ unknown
LINE 1: SELECT '\x006500'::bytea ~ 'A';
 ^
HINT:  No operator matches the given name and argument type(s). You might need 
to add explicit type casts.

-- 
Florian Weimerfwei...@bfk.de
BFK edv-consulting GmbH   http://www.bfk.de/
Kriegsstraße 100  tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Making TEXT NUL-transparent

2011-11-24 Thread Alexander Shulgin

Excerpts from Florian Weimer's message of Thu Nov 24 11:27:51 +0200 2011:
 
  and why you don't use bytea ? Text should be correct literal.
 
 It's actually UTF-8 text, and some PostgreSQL functions are only
 available for TEXT, but not BYTEA, e.g.:
 
 bfk_int= SELECT '\x006500'::bytea ~ 'A';
 ERROR:  operator does not exist: bytea ~ unknown

And how will those TEXT functions behave on a value with an embedded NUL?  Or 
is it not only about being able to *store* NULs in a text field?

--
Alex

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Making TEXT NUL-transparent

2011-11-24 Thread Pavel Stehule
2011/11/24 Florian Weimer fwei...@bfk.de:
 * Pavel Stehule:

 Hello

 2011/11/24 Florian Weimer fwei...@bfk.de:
 Occasionally, we get bitten by embedded NUL bytes in TEXT values.  We
 take care of generating proper UTF-8, but this additional restriction
 sometimes slips by.  It would be really helpful if PostgreSQL could
 store such TEXT fields as-is (at least if they are computed internally,
 or come from query parameters).


 and why you don't use bytea ? Text should be correct literal.

 It's actually UTF-8 text, and some PostgreSQL functions are only
 available for TEXT, but not BYTEA, e.g.:

 bfk_int= SELECT '\x006500'::bytea ~ 'A';
 ERROR:  operator does not exist: bytea ~ unknown
 LINE 1: SELECT '\x006500'::bytea ~ 'A';
                                 ^
 HINT:  No operator matches the given name and argument type(s). You might 
 need to add explicit type casts.


then you can prepare some custom function that can filter zeroes and
prepare valid text functions - some postgresql's functions (based on
external libs) cannot to work with string containing zeroes too.
Probably you can do it in plperl.

I don't think so this functionality is generally requested. If you
have broken strings, then you have to clean it before storing to
database.

Regards

Pavel

 --
 Florian Weimer                fwei...@bfk.de
 BFK edv-consulting GmbH       http://www.bfk.de/
 Kriegsstraße 100              tel: +49-721-96201-1
 D-76133 Karlsruhe             fax: +49-721-96201-99


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Making TEXT NUL-transparent

2011-11-24 Thread Florian Weimer
* Alexander Shulgin:

 It's actually UTF-8 text, and some PostgreSQL functions are only
 available for TEXT, but not BYTEA, e.g.:
 
 bfk_int= SELECT '\x006500'::bytea ~ 'A';
 ERROR:  operator does not exist: bytea ~ unknown

 And how will those TEXT functions behave on a value with an embedded
 NUL?

They need to be audited and fixed if necessary.  I'm not saying that
this would be a trivial change.

 Or is it not only about being able to *store* NULs in a text field?

No, the entire core should be NUL-transparent.

By the way, I refuse the notion that UTF-8 strings with embedded NULs
are broken.  I can't recall any other system which enforces UTF-8
well-formedness, but does not permit embedded NULs.

-- 
Florian Weimerfwei...@bfk.de
BFK edv-consulting GmbH   http://www.bfk.de/
Kriegsstraße 100  tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Making TEXT NUL-transparent

2011-11-24 Thread Pavel Stehule
2011/11/24 Florian Weimer fwei...@bfk.de:
 * Alexander Shulgin:

 It's actually UTF-8 text, and some PostgreSQL functions are only
 available for TEXT, but not BYTEA, e.g.:

 bfk_int= SELECT '\x006500'::bytea ~ 'A';
 ERROR:  operator does not exist: bytea ~ unknown

 And how will those TEXT functions behave on a value with an embedded
 NUL?

 They need to be audited and fixed if necessary.  I'm not saying that
 this would be a trivial change.

 Or is it not only about being able to *store* NULs in a text field?

 No, the entire core should be NUL-transparent.

 By the way, I refuse the notion that UTF-8 strings with embedded NULs
 are broken.  I can't recall any other system which enforces UTF-8
 well-formedness, but does not permit embedded NULs.


I have a different question. What is reason for embedded NULs inside strings?

Regards

Pavel Stehule

 --
 Florian Weimer                fwei...@bfk.de
 BFK edv-consulting GmbH       http://www.bfk.de/
 Kriegsstraße 100              tel: +49-721-96201-1
 D-76133 Karlsruhe             fax: +49-721-96201-99


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Making TEXT NUL-transparent

2011-11-24 Thread Florian Weimer
* Pavel Stehule:

 By the way, I refuse the notion that UTF-8 strings with embedded NULs
 are broken.  I can't recall any other system which enforces UTF-8
 well-formedness, but does not permit embedded NULs.

 I have a different question. What is reason for embedded NULs inside
 strings?

The source system does not enforce that constraint, so from time to
time, such data slips through.  I don't know why it's there in the first
place, and I have no control over the original data source.  Usually,
it's okay to silently strip NUL bytes; it's just a bit awkward.

-- 
Florian Weimerfwei...@bfk.de
BFK edv-consulting GmbH   http://www.bfk.de/
Kriegsstraße 100  tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Making TEXT NUL-transparent

2011-11-24 Thread Alexander Shulgin

Excerpts from Florian Weimer's message of Thu Nov 24 12:59:09 +0200 2011:
 
  I have a different question. What is reason for embedded NULs inside
  strings?
 
 The source system does not enforce that constraint, so from time to
 time, such data slips through.  I don't know why it's there in the first
 place, and I have no control over the original data source.  Usually,
 it's okay to silently strip NUL bytes; it's just a bit awkward.

Strip?  Like this: loss\0less = lossless?  That would be awkward for sure.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Making TEXT NUL-transparent

2011-11-24 Thread Pavel Stehule
2011/11/24 Alexander Shulgin a...@commandprompt.com:

 Excerpts from Florian Weimer's message of Thu Nov 24 12:59:09 +0200 2011:

  I have a different question. What is reason for embedded NULs inside
  strings?

 The source system does not enforce that constraint, so from time to
 time, such data slips through.  I don't know why it's there in the first
 place, and I have no control over the original data source.  Usually,
 it's okay to silently strip NUL bytes; it's just a bit awkward.


the source should be buggy application that has broken replace or
concat of strings.

Pavel

 Strip?  Like this: loss\0less = lossless?  That would be awkward for 
 sure.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Making TEXT NUL-transparent

2011-11-24 Thread Robert Haas
On Thu, Nov 24, 2011 at 4:54 AM, Florian Weimer fwei...@bfk.de wrote:
 By the way, I refuse the notion that UTF-8 strings with embedded NULs
 are broken.  I can't recall any other system which enforces UTF-8
 well-formedness, but does not permit embedded NULs.

This seems like a key point.  If UTF-8 allows embedded NULs, then a
text field supposedly encoded as UTF-8 ought to allow them, too.

However, our propensity to translate between text and cstring might
make this difficult to implement in practice.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Making TEXT NUL-transparent

2011-11-24 Thread Florian Pflug
On Nov24, 2011, at 10:54 , Florian Weimer wrote:
 Or is it not only about being able to *store* NULs in a text field?
 
 No, the entire core should be NUL-transparent.

That's unlikely to happen. A more realistic approach would be to solve
this only for UTF-8 encoded strings by encoding the NUL character not as
a single 0 byte, but as sequence of non-0 bytes.

Such a thing is possible in UTF-8 because there are multiple ways to
encode the same character once you drop the requirement that characters
be encoded in the *shortest* possible way.

Since we very probably won't loosen up UTF-8's integrity checks to allow
that, it'd have to be done as a new encoding, say 'utf8-loose'.

That new encoding could, for example, use 0xC0 0x80 to represent NUL
characters. This byte sequence is invalid in standard-conforming UTF-8
because it's a non-normalized (i.e. overly long) representation a code
point (the code point NUL, incidentally). A bit of googling suggests that
quite a few piece of software use this kind of modified UTF-8 encoding.

Java, for example, seems to use it to serialize Strings (which may contain
NUL characters) to UTF-8.

Should you try to add a new encoding which supports that, you might also
want to allow CESU-8-style encoding of UTF-16 surrogate pairs. This means
that code points representable by UTF-16 surrogate pairs may be encoded by
separately encoding the two surrogate characters in UTF-8.

best regards,
Florian Pflug


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Making TEXT NUL-transparent

2011-11-24 Thread Andrew Dunstan
On Thu, Nov 24, 2011 at 4:54 AM, Florian Weimer fwei...@bfk.de wrote:


 By the way, I refuse the notion that UTF-8 strings with embedded NULs
 are broken.  I can't recall any other system which enforces UTF-8
 well-formedness, but does not permit embedded NULs.



Refuse away, but I don't think you know what you're asking for. Given the
number of places that treat text values as being nul-terminated in the
code, we'd be many years fixing bugs from a change like this, IMNSHO.

cheers

andrew


Re: [HACKERS] Making TEXT NUL-transparent

2011-11-24 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes:
 On Thu, Nov 24, 2011 at 4:54 AM, Florian Weimer fwei...@bfk.de wrote:
 By the way, I refuse the notion that UTF-8 strings with embedded NULs
 are broken.  I can't recall any other system which enforces UTF-8
 well-formedness, but does not permit embedded NULs.

 Refuse away, but I don't think you know what you're asking for. Given the
 number of places that treat text values as being nul-terminated in the
 code, we'd be many years fixing bugs from a change like this, IMNSHO.

Yeah.  This has been discussed before (see the archives) and every time
we've concluded that the amount of pain and developer effort would
outweigh the benefit by many orders of magnitude.  Just to start with,
we'd have to change the convention that datatype I/O functions take and
return cstring, thus breaking every core and add-on datatype.

If you really need to store nulls, use bytea.  If there are specific
operations you need that aren't available on bytea, we'd probably
entertain proposals to add them.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers