What is the state of play with this item? I think this is a must-fix bug
for 8.3. There was a flurry of messages back in April but since then I
don't recall seeing anything.
cheers
andrew
Mark Dilger wrote:
Mark Dilger wrote:
Bruce Momjian wrote:
Added to TODO:
* Fix cases where
Martijn van Oosterhout wrote:
So your implementation is simply:
1. Take number and make UTF-8 string
2. Convert it to database encoding.
Aah, now I can spot where the misunderstanding is.
That's not what I mean.
I mean that chr() should simply 'typecast' to char.
So when the database encoding
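For reference, a minimal sketch of step 1 of the two-step approach quoted above (Unicode code point to UTF-8 bytes), assuming the standard RFC 3629 bit layout. This is an illustration only, not the backend's actual code; a real chr() along these lines would follow it with step 2, a conversion from UTF-8 into the database encoding.

    #include <stdio.h>

    /* Encode a Unicode code point as UTF-8 bytes (RFC 3629 layout).
     * A production version would also reject the surrogate range
     * U+D800..U+DFFF. */
    static int
    codepoint_to_utf8(unsigned int cp, unsigned char *buf)
    {
        if (cp < 0x80)
        {
            buf[0] = (unsigned char) cp;
            return 1;
        }
        if (cp < 0x800)
        {
            buf[0] = 0xC0 | (cp >> 6);
            buf[1] = 0x80 | (cp & 0x3F);
            return 2;
        }
        if (cp < 0x10000)
        {
            buf[0] = 0xE0 | (cp >> 12);
            buf[1] = 0x80 | ((cp >> 6) & 0x3F);
            buf[2] = 0x80 | (cp & 0x3F);
            return 3;
        }
        if (cp <= 0x10FFFF)
        {
            buf[0] = 0xF0 | (cp >> 18);
            buf[1] = 0x80 | ((cp >> 12) & 0x3F);
            buf[2] = 0x80 | ((cp >> 6) & 0x3F);
            buf[3] = 0x80 | (cp & 0x3F);
            return 4;
        }
        return -1;                              /* beyond Unicode range */
    }

    int
    main(void)
    {
        unsigned char buf[4];
        int           len = codepoint_to_utf8(0x20AC, buf);    /* Euro */

        for (int i = 0; i < len; i++)
            printf("%02X", buf[i]);             /* prints E282AC */
        printf("\n");
        return 0;
    }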
Tatsuo Ishii wrote:
I think we need to continue design discussion, probably
targeting 8.4, not 8.3.
But isn't a simple fix for chr() and ascii(), which does not
require a redesign, a Good Thing for 8.3 if possible? Something
that maintains as much upward and/or Oracle compatibility as
On Thu, Apr 05, 2007 at 09:34:25AM +0900, Tatsuo Ishii wrote:
I'm not sure what kind of use case for unicode_char() you are thinking
about. Anyway if you want a code point from a character, we could
easily add such functions to all backend encodings currently we
support. Probably it would look
On Thu, Apr 05, 2007 at 11:52:14AM +0200, Albe Laurenz wrote:
But isn't a simple fix for chr() and ascii(), which does not
require a redesign, a Good Thing for 8.3 if possible? Something
that maintains as much upward and/or Oracle compatibility as
possible while doing away with ascii('€')
Martijn van Oosterhout kleptog@svana.org writes:
I think the problem is that most encodings do not have the concept of a
code point anyway, so implementing it for them is fairly useless.
Yeah. I'm beginning to think that the right thing to do is
(a) make chr/ascii do the same thing as Oracle
On Tue, Apr 03, 2007 at 01:06:38PM -0400, Tom Lane wrote:
I think it's probably defensible for non-Unicode encodings. To do
otherwise would require (a) figuring out what the equivalent concept to
code point is for each encoding, and (b) having a separate code path
for each encoding to perform
Mark Dilger wrote:
What I suggest (and what Oracle implements, and isn't CHR() and ASCII()
partly for Oracle compatibility?) is that CHR() and ASCII()
convert between a character (in database encoding) and
that database encoding in numeric form.
Looking at Oracle documentation, it appears
What do others think? Should the argument to CHR() be a
Unicode code point or the numeric representation of the
database encoding?
When the database uses a single byte encoding, the chr function takes
the binary byte representation as an integer number between 0 and 255
(e.g. ascii code).
When the database encoding is one of the unicode encodings it takes a
unicode code point.
This is also what Oracle does.
Sorry, but
When the database uses a single byte encoding, the chr function takes
the binary byte representation as an integer number between 0 and 255
(e.g. ascii code).
When the database encoding is one of the unicode encodings it takes a
unicode code point.
This is also what Oracle does.
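For concreteness, the rule stated above fits in a few lines. A hypothetical sketch (not the committed patch), reusing the codepoint_to_utf8() helper from the earlier sketch:

    /* chr(n): in a single-byte database encoding, n is a raw byte and
     * must lie in 0..255; in a UTF-8 database, n is a Unicode code
     * point.  A return of -1 stands in for the backend's
     * ereport(ERROR). */
    static int
    chr_bytes(int db_is_utf8, unsigned int n, unsigned char *buf)
    {
        if (!db_is_utf8)
        {
            if (n > 255)
                return -1;
            buf[0] = (unsigned char) n;
            return 1;
        }
        return codepoint_to_utf8(n, buf);
    }

Under this rule chr(8364) yields the Euro sign in a UTF-8 database, while in LATIN9 the same character is chr(164).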
Martijn van Oosterhout wrote:
On Tue, Apr 03, 2007 at 01:06:38PM -0400, Tom Lane wrote:
I think it's probably defensible for non-Unicode encodings. To do
otherwise would require (a) figuring out what the equivalent concept to
code point is for each encoding, and (b) having a separate code
On 2007-04-04, Alvaro Herrera [EMAIL PROTECTED] wrote:
Right -- IMHO what we should be doing is reject any input to chr() which
is beyond plain ASCII (or maybe 255), and create a separate function
(unicode_char() sounds good) to get a Unicode character from a code
point, converted to the
Alvaro Herrera [EMAIL PROTECTED] writes:
Right -- IMHO what we should be doing is reject any input to chr() which
is beyond plain ASCII (or maybe 255), and create a separate function
(unicode_char() sounds good) to get a Unicode character from a code
point, converted to the local
Andrew - Supernews [EMAIL PROTECTED] writes:
Thinking about this made me realize that there's another, ahem, elephant
in the room here: convert().
By definition convert() returns text strings which are not valid in the
server encoding. How can this be addressed?
Remove convert(). Or at least
On Wed, Apr 04, 2007 at 10:22:28AM -0400, Tom Lane wrote:
Alvaro Herrera [EMAIL PROTECTED] writes:
Right -- IMHO what we should be doing is reject any input to chr() which
is beyond plain ASCII (or maybe 255), and create a separate function
(unicode_char() sounds good) to get a Unicode
Alvaro Herrera [EMAIL PROTECTED] writes:
Right -- IMHO what we should be doing is reject any input to chr() which
is beyond plain ASCII (or maybe 255), and create a separate function
(unicode_char() sounds good) to get a Unicode character from a code
point, converted to the local
Tatsuo Ishii wrote:
BTW, every encoding has its own charset. However the relationship
between encoding and charset is not as simple as for Unicode. For
example, encoding EUC_JP corresponds to multiple charsets, namely
ASCII, JIS X 0201, JIS X 0208 and JIS X 0212. So a function which
returns a
On Wednesday, 4 April 2007 16:22, Tom Lane wrote:
Alvaro Herrera [EMAIL PROTECTED] writes:
Right -- IMHO what we should be doing is reject any input to chr() which
is beyond plain ASCII (or maybe 255), and create a separate function
(unicode_char() sounds good) to get a Unicode character
Albe Laurenz wrote:
There's one thing that strikes me as weird in your implementation:
pgsql=# select chr(0);
ERROR: character 0x00 of encoding SQL_ASCII has no equivalent in UTF8
0x00 is a valid UNICODE code point and also a valid UTF-8 character!
It's not my code that rejects this. I'm
Tatsuo Ishii wrote:
SNIP. I think we need to continue design discussion, probably
targeting 8.4, not 8.3.
The discussion came about because Andrew - Supernews noticed that chr()
returns invalid utf8, and we're trying to fix all the bugs with invalid
utf8 in the system. Something
Mark Dilger [EMAIL PROTECTED] writes:
Albe Laurenz wrote:
0x00 is a valid UNICODE code point and also a valid UTF-8 character!
It's not my code that rejects this. I'm passing the resultant string to
the pg_verify_mbstr(...) function and it is rejecting a null. I could
change that, of
Tatsuo Ishii wrote:
SNIP. I think we need to continue design discussion, probably
targeting 8.4, not 8.3.
The discussion came about because Andrew - Supernews noticed that chr()
returns invalid utf8, and we're trying to fix all the bugs with invalid
utf8 in the system.
Tatsuo Ishii wrote:
BTW, every encoding has its own charset. However the relationship
between encoding and charset is not as simple as for Unicode. For
example, encoding EUC_JP corresponds to multiple charsets, namely
ASCII, JIS X 0201, JIS X 0208 and JIS X 0212. So a function which
Andrew - Supernews [EMAIL PROTECTED] writes:
Thinking about this made me realize that there's another, ahem, elephant
in the room here: convert().
By definition convert() returns text strings which are not valid in the
server encoding. How can this be addressed?
Remove convert(). Or
On 2007-04-05, Tatsuo Ishii [EMAIL PROTECTED] wrote:
Andrew - Supernews [EMAIL PROTECTED] writes:
Thinking about this made me realize that there's another, ahem, elephant
in the room here: convert().
By definition convert() returns text strings which are not valid in the
server encoding.
Mark Dilger wrote:
In particular, in UTF8 land I'd have expected the argument of chr()
to be interpreted as a Unicode code point, not as actual UTF8 bytes
with a randomly-chosen endianness.
Not sure what to do in other multibyte encodings.
On 2007-04-03, Albe Laurenz [EMAIL PROTECTED] wrote:
According to RFC 2279, the Euro,
Unicode code point 0x20AC = 0010 0000 1010 1100,
will be encoded to 1110 0010 1000 0010 1010 1100 = 0xE282AC.
IMHO this is the only good and intuitive way for CHR() and ASCII().
It is beyond ludicrous for
On Tue, Apr 03, 2007 at 11:43:21AM +0200, Albe Laurenz wrote:
IMHO this is the only good and intuitive way for CHR() and ASCII().
Hardly. The comment earlier about mbtowc was much closer to the mark.
And wide characters are defined as Unicode points.
Basically, CHR() takes a unicode point and
Andrew wrote:
According to RFC 2279, the Euro,
Unicode code point 0x20AC = 0010 0000 1010 1100,
will be encoded to 1110 0010 1000 0010 1010 1100 = 0xE282AC.
IMHO this is the only good and intuitive way for CHR() and ASCII().
It is beyond ludicrous for functions like chr() or ascii() to
Martijn van Oosterhout wrote:
On Tue, Apr 03, 2007 at 11:43:21AM +0200, Albe Laurenz wrote:
IMHO this is the only good and intuitive way for CHR() and ASCII().
Hardly. The comment earlier about mbtowc was much closer to the mark.
And wide characters are defined as Unicode points.
Basically,
Mark Dilger [EMAIL PROTECTED] writes:
Martijn van Oosterhout wrote:
Just about every multibyte encoding other than Unicode has the problem
of not distinguishing between the code point and the encoding of it.
Thanks for the feedback. Would you say that the way I implemented things in
the
Albe Laurenz wrote:
What I suggest (and what Oracle implements, and isn't CHR() and ASCII()
partly for Oracle compatibility?) is that CHR() and ASCII()
convert between a character (in database encoding) and
that database encoding in numeric form.
Looking at Oracle documentation, it appears
Andrew - Supernews wrote:
On 2007-04-01, Mark Dilger [EMAIL PROTECTED] wrote:
Do any of the string functions (see
http://www.postgresql.org/docs/8.2/interactive/functions-string.html) run the
risk of generating invalid utf8 encoded strings? Do I need to add checks?
Are there known bugs with
Mark Dilger wrote:
Andrew - Supernews wrote:
On 2007-04-01, Mark Dilger [EMAIL PROTECTED] wrote:
Do any of the string functions (see
http://www.postgresql.org/docs/8.2/interactive/functions-string.html)
run the risk of generating invalid utf8 encoded strings? Do I need
to add checks?
Are
Mark Dilger [EMAIL PROTECTED] writes:
pgsql=# select chr(14989485);
 chr
-----
 中
(1 row)
Is there a principled rationale for this particular behavior as
opposed to any other?
In particular, in UTF8 land I'd have expected the argument of chr()
to be interpreted as a Unicode code point, not
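A short aside on why that example prints a CJK character: the old chr() simply wrote the argument's bytes out, so 14989485 = 0xE4B8AD became the byte sequence E4 B8 AD, which happens to be the UTF-8 encoding of U+4E2D. Read as a code point instead, 14989485 would simply be out of range (greater than 0x10FFFF).

    #include <stdio.h>

    /* Decompose chr()'s example argument into the three bytes the old
     * implementation emitted. */
    int
    main(void)
    {
        unsigned int n = 14989485;              /* == 0xE4B8AD */

        printf("%02X %02X %02X\n",              /* prints E4 B8 AD */
               (n >> 16) & 0xFF, (n >> 8) & 0xFF, n & 0xFF);
        return 0;
    }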
Tom Lane wrote:
Mark Dilger [EMAIL PROTECTED] writes:
pgsql=# select chr(14989485);
 chr
-----
 中
(1 row)
Is there a principled rationale for this particular behavior as
opposed to any other?
In particular, in UTF8 land I'd have expected the argument of chr()
to be interpreted as a Unicode
Mark Dilger wrote:
Tom Lane wrote:
Mark Dilger [EMAIL PROTECTED] writes:
pgsql=# select chr(14989485);
 chr
-----
 中
(1 row)
Is there a principled rationale for this particular behavior as
opposed to any other?
In particular, in UTF8 land I'd have expected the argument of chr()
to be
Mark Dilger wrote:
Tom Lane wrote:
Mark Dilger [EMAIL PROTECTED] writes:
pgsql=# select chr(14989485);
 chr
-----
 中
(1 row)
Is there a principled rationale for this particular behavior as
opposed to any other?
In particular, in UTF8 land I'd have expected the argument of chr()
to be
Mark Dilger wrote:
Since chr() is defined in oracle_compat.c, I decided to look at what
Oracle might do. See
http://download-west.oracle.com/docs/cd/B10501_01/server.920/a96540/functions18a.htm
It looks to me like they are doing the same thing that I did, though I
don't have Oracle
On 2007-04-02, Mark Dilger [EMAIL PROTECTED] wrote:
Here's the code for the new chr() function:
if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c())
Clearly wrong - this allows returning invalid UTF8 data in locale C, which
is not an uncommon setting to use.
Treating the parameter
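The shape of the objection in code: the multibyte path must be chosen from the database encoding alone, never from the locale, since a UTF-8 database running in locale C still must not emit invalid bytes. A sketch only (the function names match the backend's, but this is not the committed fix):

    if (pg_database_encoding_max_length() > 1)
    {
        /* multibyte encoding: interpret the argument as a code point
         * and build a validated multibyte sequence */
    }
    else
    {
        /* single-byte encoding: the argument is a plain byte, 0..255 */
    }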
On Sat, Mar 31, 2007 at 07:47:21PM -0700, Mark Dilger wrote:
OK, I can take a stab at fixing this. I'd like to state some assumptions
so people can comment and reply:
I assume that I need to fix *all* cases where invalid byte encodings get
into the database through functions shipped in
On 2007-04-01, Mark Dilger [EMAIL PROTECTED] wrote:
Do any of the string functions (see
http://www.postgresql.org/docs/8.2/interactive/functions-string.html) run the
risk of generating invalid utf8 encoded strings? Do I need to add checks?
Are there known bugs with these functions in this
Martijn van Oosterhout wrote:
There's also the performance angle. The current mbverify is very
inefficient for encodings like UTF-8. You might need to refactor a bit
there...
There appears to be a lot of function call overhead in the current
implementation. In pg_verify_mbstr, the function
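As a point of comparison, here is a minimal standalone sketch of the kind of tight, single-loop check being discussed. It tests sequence structure only (lead byte plus continuation bytes); a production validator, like the backend's, must also reject overlong forms and surrogates, and the 8.2-era pg_verify_mbstr calls a per-encoding verifier for each character, which is the overhead being described:

    #include <stdbool.h>
    #include <stddef.h>

    /* Validate a whole buffer in one loop, with no per-character
     * function call.  Structure check only: lead bytes 0xC0, 0xC1 and
     * 0xF5..0xFF are rejected outright; every lead byte must be
     * followed by the right number of 10xxxxxx continuation bytes. */
    static bool
    utf8_structure_ok(const unsigned char *s, size_t len)
    {
        size_t i = 0;

        while (i < len)
        {
            unsigned char c = s[i];
            size_t        n;                    /* expected length */

            if (c < 0x80)        n = 1;
            else if (c < 0xC2)   return false;  /* bare continuation
                                                 * or overlong lead */
            else if (c < 0xE0)   n = 2;
            else if (c < 0xF0)   n = 3;
            else if (c < 0xF5)   n = 4;
            else                 return false;

            if (i + n > len)
                return false;                   /* truncated sequence */
            for (size_t j = 1; j < n; j++)
                if ((s[i + j] & 0xC0) != 0x80)
                    return false;               /* bad continuation */
            i += n;
        }
        return true;
    }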
Mark Dilger [EMAIL PROTECTED] writes:
Refactoring the way these table driven functions work would impact
lots of other code. Just grep for all files #including mb/pg_wchar.h
for the list of them. The list includes interfaces/libpq, and I'm
wondering if software that links against postgres
Mark Dilger [EMAIL PROTECTED] writes:
Refactoring the way these table driven functions work would impact
lots of other code. Just grep for all files #including mb/pg_wchar.h
for the list of them. The list includes interfaces/libpq, and I'm
wondering if software that links against
Tatsuo Ishii [EMAIL PROTECTED] writes:
No, we've never exported those with the intent that client code should
use 'em.
I thought PQescapeString() of 8.3 uses mbverify functions to make sure
that user-supplied multibyte strings are valid.
Certainly --- but we can change PQescapeString to match
Bruce Momjian wrote:
Added to TODO:
* Fix cases where invalid byte encodings are accepted by the database,
but throw an error on SELECT
http://archives.postgresql.org/pgsql-hackers/2007-03/msg00767.php
Is anyone working on fixing this bug?
Hi, has anyone
Mark Dilger wrote:
Bruce Momjian wrote:
Added to TODO:
* Fix cases where invalid byte encodings are accepted by the database,
but throw an error on SELECT
http://archives.postgresql.org/pgsql-hackers/2007-03/msg00767.php
Is anyone working on fixing this bug?
Hi, has
Added to TODO:
* Fix cases where invalid byte encodings are accepted by the database,
but throw an error on SELECT
http://archives.postgresql.org/pgsql-hackers/2007-03/msg00767.php
Is anyone working on fixing this bug?
On Sunday, 18 March 2007 12:36, Martijn van Oosterhout wrote:
It seems to me that the easiest solution would be to forbid \x?? escape
sequences where it's greater than \x7F for UTF-8 server encodings.
Instead introduce a \u escape for specifying the unicode character
directly. Under the
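A tiny sketch of what such an escape might look like (entirely hypothetical; no \u escape existed at the time, and the syntax PostgreSQL later adopted differs). The lexer would read four hex digits and hand the resulting code point to a UTF-8 encoder such as the one sketched earlier, instead of writing raw bytes into the literal:

    #include <ctype.h>
    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical \uXXXX parser: on success stores the code point in
     * *cp and returns the number of input characters consumed; on
     * failure returns -1. */
    static int
    parse_u_escape(const char *p, unsigned int *cp)
    {
        char hex[5];

        if (p[0] != '\\' || p[1] != 'u')
            return -1;
        for (int i = 0; i < 4; i++)
            if (!isxdigit((unsigned char) p[2 + i]))
                return -1;
        memcpy(hex, p + 2, 4);
        hex[4] = '\0';
        *cp = (unsigned int) strtoul(hex, NULL, 16);
        return 6;
    }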
<evil mode 1>
Maybe we should add a resource-intensive check to ascii encoding(s),
that would even the score ;p
</evil mode 1>
<evil mode 2>
let's test mysql on this, and see how much worse it performs.
</evil mode 2>
--
Grzegorz 'the evil' Jaskiewicz
evil C/evil C++ developer for hire
On Sat, Mar 17, 2007 at 11:46:01AM -0400, Andrew Dunstan wrote:
How can we fix this? Frankly, the statement in the docs warning about
making sure that escaped sequences are valid in the server encoding is a
cop-out. We don't accept invalid data elsewhere, and this should be no
different
Martijn van Oosterhout wrote:
On Sat, Mar 17, 2007 at 11:46:01AM -0400, Andrew Dunstan wrote:
How can we fix this? Frankly, the statement in the docs warning about
making sure that escaped sequences are valid in the server encoding is a
cop-out. We don't accept invalid data elsewhere, and
On Sun, Mar 18, 2007 at 08:25:56AM -0400, Andrew Dunstan wrote:
It does also seem from my test results that transcoding to MB charsets
(or at least to utf-8) is surprisingly expensive, and that this would be
a good place to look at optimisation possibilities. The validity tests
can also be
I wrote:
The escape processing is actually done in the lexer in the case of
literals. We have to allow for bytea literals there too, regardless of
encoding. The lexer naturally has no notion of the intended
destination of the literal, so we need to defer the validity check to
the *in
Andrew Dunstan [EMAIL PROTECTED] writes:
Below is a list of the input routines in the adt directory, courtesy of grep.
Grep isn't a good way to get these, your list missed a bunch.
postgres=# select distinct prosrc from pg_proc where oid in (select typinput
from pg_type);
prosrc
Gregory Stark wrote:
Andrew Dunstan [EMAIL PROTECTED] writes:
Below is a list of the input routines in the adt directory, courtesy of grep.
Grep isn't a good way to get these, your list missed a bunch.
postgres=# select distinct prosrc from pg_proc where oid in (select typinput
Andrew Dunstan [EMAIL PROTECTED] writes:
Ok, good point. Now, which of those need to have a check for valid encoding?
The vast majority will barf on any non-ASCII character anyway ... only
the ones that don't will need a check.
regards, tom lane
Jeff Davis [EMAIL PROTECTED] wrote:
Some people think it's a bug, some people don't. It is technically
documented behavior, but I don't think the documentation is clear
enough. I think it is a bug that should be fixed, and here's another
message in the thread that expresses my opinion:
Jeff Davis wrote:
On Wed, 2007-03-14 at 01:29 -0600, Michael Fuhr wrote:
On Tue, Mar 13, 2007 at 04:42:35PM +0100, Mario Weilguni wrote:
On Tuesday, 13 March 2007 16:38, Joshua D. Drake wrote:
Is this any different than the issues of moving 8.0.x to 8.1 UTF8? Where
we had
Andrew Dunstan [EMAIL PROTECTED] writes:
Last year Jeff suggested adding something like:
pg_verifymbstr(string,strlen(string),0);
to each relevant input routine. Would that be an acceptable solution?
The problem with that is that it duplicates effort: in many cases
(especially COPY IN) the
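For concreteness, the suggestion above amounts to one extra call at the top of each affected input routine. A sketch in the shape of an 8.2-era text input function (illustrative only, not the committed change); with false as its third argument, pg_verifymbstr raises an error on bytes that are invalid in the server encoding:

    Datum
    textin(PG_FUNCTION_ARGS)
    {
        char   *inputText = PG_GETARG_CSTRING(0);

        /* reject input not valid in the server encoding */
        pg_verifymbstr(inputText, strlen(inputText), false);

        /* ... build and return the text datum as before ... */
    }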
Tom Lane wrote:
Andrew Dunstan [EMAIL PROTECTED] writes:
Last year Jeff suggested adding something like:
pg_verifymbstr(string,strlen(string),0);
to each relevant input routine. Would that be an acceptable solution?
The problem with that is that it duplicates effort: in many
Andrew Dunstan [EMAIL PROTECTED] writes:
Tom Lane wrote:
The problem with that is that it duplicates effort: in many cases
(especially COPY IN) the data's already been validated.
One thought I had was that it might make sense to have a flag that would
inhibit the check, that could be set
I wrote:
Actually, I have to take back that objection: on closer look, COPY
validates the data only once and does so before applying its own
backslash-escaping rules. So there is a risk in that path too.
It's still pretty annoying to be validating the data twice in the
common case where no
Tom Lane wrote:
I wrote:
Actually, I have to take back that objection: on closer look, COPY
validates the data only once and does so before applying its own
backslash-escaping rules. So there is a risk in that path too.
It's still pretty annoying to be validating the data twice
Andrew Dunstan [EMAIL PROTECTED] writes:
Here are some timing tests on 1m rows of random utf8 encoded 100 char
data. It doesn't look to me like the saving you're suggesting is worth
the trouble.
Hmm ... not sure I believe your numbers. Using a test file of 1m lines
of 100 random latin1
Tom Lane wrote:
Andrew Dunstan [EMAIL PROTECTED] writes:
Here are some timing tests on 1m rows of random utf8 encoded 100 char
data. It doesn't look to me like the saving you're suggesting is worth
the trouble.
Hmm ... not sure I believe your numbers. Using a test file of 1m lines
On Wednesday, 14 March 2007 08:01, Michael Paesold wrote:
Andrew Dunstan wrote:
This strikes me as essential. If the db has a certain encoding ISTM we
are promising that all the text data is valid for that encoding.
The question in my mind is how we help people to recover from the fact
Mario Weilguni wrote:
Is there anything I can do to help with this problem? Maybe implementing
a new GUC variable that turns off accepting wrongly encoded sequences (so
DBAs can still turn it on if they really depend on it)?
I think that this should be done away with unconditionally.
Or does
Albe Laurenz wrote:
Mario Weilguni wrote:
Is there anything I can do to help with this problem? Maybe implementing
a new GUC variable that turns off accepting wrongly encoded sequences (so
DBAs can still turn it on if they really depend on it)?
I think that this
On Tue, 2007-03-13 at 12:00 +0100, Mario Weilguni wrote:
Hi,
I've a problem with a database, I can dump the database to a file, but
restoration fails, happens with 8.1.4.
I reported the same problem a while back:
http://archives.postgresql.org/pgsql-bugs/2006-10/msg00246.php
Some people
On Wed, 2007-03-14 at 01:29 -0600, Michael Fuhr wrote:
On Tue, Mar 13, 2007 at 04:42:35PM +0100, Mario Weilguni wrote:
On Tuesday, 13 March 2007 16:38, Joshua D. Drake wrote:
Is this any different than the issues of moving 8.0.x to 8.1 UTF8? Where
we had to use iconv?
What issues?
Andrew Dunstan wrote:
Albe Laurenz wrote:
A fix could be either that the server checks escape sequences for
validity
This strikes me as essential. If the db has a certain encoding ISTM we
are promising that all the text data is valid for that encoding.
The question in my mind is how we
On Tue, Mar 13, 2007 at 04:42:35PM +0100, Mario Weilguni wrote:
On Tuesday, 13 March 2007 16:38, Joshua D. Drake wrote:
Is this any different than the issues of moving 8.0.x to 8.1 UTF8? Where
we had to use iconv?
What issues? I've upgraded several 8.0 databases to 8.1 without having to
On Wednesday, 14 March 2007 08:01, Michael Paesold wrote:
Is there anything in the SQL spec that asks for such a behaviour? I guess
not.
I think that the octal escapes are a holdover from the single-byte days where
they were simply a way to enter characters that are difficult to find on a
Hi,
I've a problem with a database, I can dump the database to a file, but
restoration fails, happens with 8.1.4.
Steps to reproduce:
create database testdb with encoding='UTF8';
\c testdb
create table test(x text);
insert into test values ('\244'); <== Is accepted, even if not UTF8.
pg_dump
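A note on why '\244' is a problem: octal 244 is byte 0xA4, binary 10100100, i.e. a UTF-8 continuation byte appearing with no lead byte, so no validator can accept it. Checking it against the structure test sketched earlier in the thread:

    #include <assert.h>

    /* Octal \244 = 0xA4 = 10100100: a bare continuation byte, invalid
     * as the start of any UTF-8 sequence.  utf8_structure_ok() is the
     * validator sketched earlier. */
    int
    main(void)
    {
        unsigned char bad[] = { 0xA4 };

        assert(!utf8_structure_ok(bad, sizeof bad));
        return 0;
    }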
Mario Weilguni wrote:
Steps to reproduce:
create database testdb with encoding='UTF8';
\c testdb
create table test(x text);
insert into test values ('\244'); <== Is accepted, even if not UTF8.
This is working as expected, see the remark in
On Tuesday, 13 March 2007 14:46, Albe Laurenz wrote:
Mario Weilguni wrote:
Steps to reproduce:
create database testdb with encoding='UTF8';
\c testdb
create table test(x text);
insert into test values ('\244'); <== Is accepted, even if not UTF8.
This is working as expected, see the
Mario Weilguni wrote:
On Tuesday, 13 March 2007 14:46, Albe Laurenz wrote:
Mario Weilguni wrote:
Steps to reproduce:
create database testdb with encoding='UTF8';
\c testdb
create table test(x text);
insert into test values ('\244'); <== Is accepted, even if not UTF8.
This is
On Tuesday, 13 March 2007 15:12, Andrew Dunstan wrote:
The sentence quoted from the docs is perhaps less than a model of
clarity. I would take it to mean that no client-encoding ->
server-encoding translation will take place. Does it really mean that
the server will happily accept any escaped
Mario Weilguni wrote:
Steps to reproduce:
create database testdb with encoding='UTF8';
\c testdb
create table test(x text);
insert into test values ('\244'); <== Is accepted, even if not UTF8.
This is working as expected, see the remark in
Albe Laurenz wrote:
A fix could be either that the server checks escape sequences for validity
This strikes me as essential. If the db has a certain encoding ISTM we
are promising that all the text data is valid for that encoding.
The question in my mind is how we help people to recover
Andrew Dunstan wrote:
Albe Laurenz wrote:
A fix could be either that the server checks escape sequences for
validity
This strikes me as essential. If the db has a certain encoding ISTM we
are promising that all the text data is valid for that encoding.
The question in my mind is how
On Tuesday, 13 March 2007 16:38, Joshua D. Drake wrote:
Andrew Dunstan wrote:
Albe Laurenz wrote:
A fix could be either that the server checks escape sequences for
validity
This strikes me as essential. If the db has a certain encoding ISTM we
are promising that all the text data is