Re: utf8::valid and \x14_000 - \x1F_0000

2008-03-12 Thread Juerd Waalboer
Chris Hall skribis 2008-03-12 13:20 (+):
  OK.  In the meantime IMHO chr(n) should be handling utf8 and has no
  business worrying about things which UTF-8 or UCS think aren't
  characters.
 It should do Unicode, not any specific byte encoding, like UTF-?8.
 IMHO chr(n) should do characters, which may be interpreted as per
 Unicode, but may not.
 When I said utf8 I was following the (sloppy) convention that utf8 means
 how Perl handles characters in strings...

I'm working hard to break this convention. I've changed a lot of Perl
documentation, and the result was released with Perl 5.10.

If in any place in Perl's official documentation, it still reads UTF-8
or UTF8 for *characters in text strings*, it's wrong. Let me know and I
will fix it :)

   b. in a Perl string, characters are held in a UTF-8 like form.

I'd say *inside* a Perl string. This is the C implementation, but a Perl
programmer should not have to know the specific *internal* encoding of a
Perl string.

Likewise, in Perl you don't have to know whether your number is
internally encoded as a long integer or a double.

  Where UTF-8 (upper case, with hyphen) means the RFC 3629 
  Unicode Consortium defined byte-wise encoding.

That's the theory, but in practice it so often doesn't entirely follow
the spec.

  This form is referred to as utf8 (lower case, no hyphen).

Yes, but note that encoding names in Perl are case insensitive. I tend
to call it UTF8 sometimes.

  There is really no need to discuss this, except in the context of
  messing around in guts of Perl.

Exactly.

  String literals are represented by UCS code points.  Which
  reinforces the feeling that characters in Perl are Unicode.

Yes!

  'C' uses 'wide' to refer to characters that may have values
   > 255.  IMHO it's a shame that Perl did not follow this.

It does in some places, most notably warnings about wide characters.

   d. when exchanging character data with other systems one needs to
  deal with character set and encoding issues.

Not just other systems. All I/O is done in bytes, even with yourself,
for example if you forked.

 Isolated surrogate code units have no interpretation on
  their own.
 (...)
Clearly these are illegal in UTF-8.

They have no interpretation, but this also doesn't say it's illegal.

Compare it with the undefined behavior of multiple ++ in a single
expression. There's no specification of what should happen, but it's not
illegal to do it.

 Applications are free to use any of these noncharacter code
  points internally but should never attempt to exchange
  them.

I think it's not Perl's job to prevent exchange. Simply because the
exchange could be internal, but between processes of the same program.

 I'm puzzled as to why 'UTF-8' (strict) doesn't treat U+FFFE (and
 friends) in the same way as U+FFFF (and friends).

My gut says it's out of ignorance of the rules, and certainly not an
intentional deviation.

 The result is Unicode.
 IMHO the result of chr(n) should just be a character.

We call that a unicode character in Perl. It is true that Perl allows
ordinal values outside the currently existing range, but it is still
called unicode by Perl's documentation.

 OK, sure.  I was using utf8 to mean any character value you like, and
 UTF-8 to imply a value which is recognised in UCS -- rather than the
 encoding.

Please use utf8 only for naming the byte encoding that allows any
character value you like, not for the ordinal values themselves.

 FWIW I note that printf %vX is suggested as a means to render IPv6
 addresses.  This implies the use of a string containing eight characters
 0..0xFFFF as the packed form of IPv6.  Building one of those using
 chr(n) will generate spurious warnings about 0xFFFE and 0xFFFF !

Interesting point.
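The %vX point is language-independent, so here is a sketch of it in
Python (the address and its final group are made up for illustration):
eight 16-bit groups become eight "characters", and the last group
happens to be the noncharacter ordinal 0xFFFE, which Perl's chr()
would warn about even though nothing Unicode-ish is intended.

```python
# Hypothetical illustration of the %vX complaint: an IPv6 address packed
# as eight "characters" whose ordinals are the 16-bit groups.
groups = [0x2001, 0x0DB8, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0xFFFE]
packed = "".join(map(chr, groups))   # ordinal 0xFFFE is a Unicode noncharacter

# The %vX-style rendering: each ordinal printed in hex, joined with '.'.
rendered = ".".join("%X" % ord(c) for c in packed)
```

Python's chr() builds this silently; the complaint above is that Perl's
chr() warns on the 0xFFFE group even though the string is being used as
a plain vector of integers.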
-- 
Met vriendelijke groet,  Kind regards,  Korajn salutojn,

  Juerd Waalboer:  Perl hacker  [EMAIL PROTECTED]  http://juerd.nl/sig
  Convolution: ICT solutions and consultancy [EMAIL PROTECTED]


Re: utf8::valid and \x14_000 - \x1F_0000

2008-03-12 Thread Chris Hall

On Wed, 12 Mar 2008 Juerd Waalboer wrote

Chris Hall skribis 2008-03-12 13:20 (+):



 String literals are represented by UCS code points.  Which
 reinforces the feeling that characters in Perl are Unicode.



Yes!


OK.  For the avoidance of doubt:

  a. are you saying that characters in Perl are Unicode ?

  b. or are you agreeing that characters in Perl take values
 0..0x7FFF_FFFF (or beyond), which are generally interpreted as
 UCS, where required and possible ?

If (a) then characters with ordinals beyond 0x10_FFFF should throw 
warnings (at least) since they clearly are not Unicode !


[in the context of U+D800..U+DFFF]

Isolated surrogate code units have no interpretation on
 their own.
(...)
   Clearly these are illegal in UTF-8.



They have no interpretation, but this also doesn't say it's illegal.


The Unicode Standard defines the set of 'Unicode scalar values' which 
consists of U+0000..U+D7FF and U+E000..U+10_FFFF.  All Unicode 
encodings, including UTF-8, encode only the 'Unicode scalar values'.


The code points U+D800..U+DFFF exist, but do not contain any character 
assignments.  Given that no Unicode encoding exists that allows these 
code points, it's unclear how one would ever end up with one of these 
things on one's hands !
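The claim that no Unicode encoding allows a lone surrogate is easy to
check outside Perl; a quick Python sketch (illustrative only, not Perl
behaviour):

```python
# A lone surrogate has no UTF-8 representation: encoding it is an error.
lone = "\ud800"                    # isolated surrogate code point U+D800
try:
    lone.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False              # this branch is taken
```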


[in the context of U+FFFE, U+FFFF etc.]

Applications are free to use any of these noncharacter code
 points internally but should never attempt to exchange
 them.



I think it's not Perl's job to prevent exchange. Simply because the
exchange could be internal, but between processes of the same program.


Well, UTF-8 is jumping all over U+FFFF (at least).  The warnings thrown 
by chr() and \x{h...h} suggest that Perl feels that exchanging these 
values ain't kosher.



I'm puzzled as to why 'UTF-8' (strict) doesn't treat U+FFFE (and
friends) in the same way as U+FFFF (and friends).



My gut says it's out of ignorance of the rules, and certainly not an
intentional deviation.


Well... I'm running some more tests on UTF-8 to see what it thinks is 
illegal.


(...)

The result is Unicode.
IMHO the result of chr(n) should just be a character.



We call that a unicode character in Perl. It is true that Perl allows
ordinal values outside the currently existing range, but it is still
called unicode by Perl's documentation.


OK.  This is the hair which I am splitting.

IMHO the things in strings and the things that chr() and ord() return or 
process should be plain characters (ordinal U_INT) -- so that these are 
general purpose.  Only when it's necessary to attach meaning to the 
characters in a string, should Perl treat them as Unicode code points -- 
I accept that this is most of the time (but not *all* the time).



FWIW I note that printf %vX is suggested as a means to render IPv6
addresses.  This implies the use of a string containing eight characters
0..0xFFFF as the packed form of IPv6.  Building one of those using
chr(n) will generate spurious warnings about 0xFFFE and 0xFFFF !



Interesting point.


What's more, the Unicode standard suggests various *internal* uses for 
U+FFFE and U+FFFF (and friends), including, but not limited to, 
terminators and separators.  These will also generate spurious warnings 
from chr() or \x{...} !


Chris
--
Chris Hall   highwayman.com




UTF-8 (strict) appears borken

2008-03-12 Thread Chris Hall

1. 'Ill-formed' UTF-8
=====================

The Unicode Standard specifies that any UTF-8 sequence that does not
correspond to this table is 'ill-formed':

    Code Points          | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte |
  ---------------------+----------+----------+----------+----------+
   U+0000..U+007F       |  00..7F  |    --    |    --    |    --    |
   U+0080..U+07FF       |  C2..DF  |  80..BF  |    --    |    --    |
   U+0800..U+0FFF       |    E0    |  A0..BF  |  80..BF  |    --    |
   U+1000..U+CFFF       |  E1..EC  |  80..BF  |  80..BF  |    --    |
   U+D000..U+D7FF       |    ED    |  80..9F  |  80..BF  |    --    |
   U+E000..U+FFFF       |  EE..EF  |  80..BF  |  80..BF  |    --    |
   U+1_0000..U+3_FFFF   |    F0    |  90..BF  |  80..BF  |  80..BF  |
   U+4_0000..U+F_FFFF   |  F1..F3  |  80..BF  |  80..BF  |  80..BF  |
  U+10_0000..U+10_FFFF  |    F4    |  80..8F  |  80..BF  |  80..BF  |

Note in particular that:

  - anything beyond U+10_FFFF is ill-formed.

  - anything U+D800..U+DFFF is ill-formed.

  - only one encoding for each Code Point is well-formed.

We'd expect UTF-8 decode to spot ill-formed sequences, though some
special handling of incomplete sequences at the end of a buffer would be
handy.

We'd expect UTF-8 encode to only generate well-formed sequences.
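The table can be turned directly into a small validity check.  A sketch
in Python (a hypothetical helper, not Perl's actual decoder):

```python
def utf8_well_formed(data: bytes) -> bool:
    """Return True iff `data` is well-formed UTF-8 per the table above."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b <= 0x7F:                               # U+0000..U+007F
            i += 1
        elif 0xC2 <= b <= 0xDF:                     # U+0080..U+07FF
            if i + 1 >= n or not (0x80 <= data[i + 1] <= 0xBF):
                return False
            i += 2
        elif 0xE0 <= b <= 0xEF:                     # three-byte forms
            if i + 2 >= n:
                return False
            # E0 and ED restrict the 2nd byte (no overlongs, no surrogates)
            lo, hi = {0xE0: (0xA0, 0xBF), 0xED: (0x80, 0x9F)}.get(b, (0x80, 0xBF))
            if not (lo <= data[i + 1] <= hi and 0x80 <= data[i + 2] <= 0xBF):
                return False
            i += 3
        elif 0xF0 <= b <= 0xF4:                     # four-byte forms
            if i + 3 >= n:
                return False
            # F0 and F4 restrict the 2nd byte (no overlongs, cap at U+10_FFFF)
            lo, hi = {0xF0: (0x90, 0xBF), 0xF4: (0x80, 0x8F)}.get(b, (0x80, 0xBF))
            if not (lo <= data[i + 1] <= hi
                    and 0x80 <= data[i + 2] <= 0xBF
                    and 0x80 <= data[i + 3] <= 0xBF):
                return False
            i += 4
        else:                                       # C0, C1, F5..FF: ill-formed
            return False
    return True
```

Note how the table's E0/ED/F0/F4 rows do all the work of rejecting
overlong encodings, surrogates, and anything beyond U+10_FFFF.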

2. Extended Sequences
=====================

Unicode and ISO/IEC 10646:2003 define meanings for UTF-8 compatible
sequences up to 6 bytes, which allows for characters up to 0x7FFF_FFFF.

The Unicode reference code for reading UTF-8 recognises these extended
sequences as being single entities (though ill-formed).

Perl has its own further 7 and 13 byte forms, allowing for characters up
to 0xF_FFFF_FFFF and 2^72-1, respectively.  These are beyond UTF-8.
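For reference, the original 1..6-byte scheme can be sketched as follows
(Python, illustrative only; Perl's own 7 and 13 byte forms are not
covered):

```python
def utf8_extended_encode(cp: int) -> bytes:
    """Encode a code point using the original 1..6-byte UTF-8 scheme
    (ISO/IEC 10646), which reaches 0x7FFF_FFFF.  Sequences for values
    above U+10_FFFF are ill-formed in UTF-8 proper."""
    if cp < 0x80:
        return bytes([cp])
    for nbytes, limit, prefix in ((2, 0x800, 0xC0), (3, 0x10000, 0xE0),
                                  (4, 0x200000, 0xF0), (5, 0x4000000, 0xF8),
                                  (6, 0x80000000, 0xFC)):
        if cp < limit:
            out = []
            for _ in range(nbytes - 1):            # continuation bytes, low first
                out.append(0x80 | (cp & 0x3F))
                cp >>= 6
            out.append(prefix | cp)                # leading byte
            return bytes(reversed(out))
    raise ValueError("beyond the 6-byte range; 7/13-byte forms needed")
```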

3. Non-Characters
=================

The only other cause for concern is non-characters.  These are:

  * U+FFFE and U+FFFF, and the last two code points in every other
    Unicode plane.

Unicode code space is divided into 17 'planes' of 65,536 characters,
each.  So characters U+01_FFFE, U+01_FFFF, U+02_FFFE, U+02_FFFF, ...
U+10_FFFE and U+10_FFFF are all non-characters.

  * U+FDD0..U+FDEF
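The full set is easy to enumerate from the two bullets above (a quick
Python sketch): 32 code points in U+FDD0..U+FDEF plus two per plane,
66 in all.

```python
# The 66 noncharacters: U+FDD0..U+FDEF plus the last two code points
# of each of the 17 planes.
noncharacters = set(range(0xFDD0, 0xFDF0))
for plane in range(17):
    base = plane * 0x10000
    noncharacters.update({base + 0xFFFE, base + 0xFFFF})
```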

Now, Unicode 5.0.0 says:

  Applications are free to use any of these noncharacter code points
   internally but should never attempt to exchange them. If a
   noncharacter is received in open interchange, an application is not
   required to interpret it in any way. It is good practice, however,
   to recognize it as a noncharacter and to take appropriate action,
   such as removing it from the text.

  Noncharacter code points are reserved for internal use, such as for
   sentinel values. They should never be interchanged. They do, however,
   have well-formed representations in Unicode encoding forms and
   survive conversions between encoding forms. This allows sentinel
   values to be preserved internally across Unicode encoding forms, even
   though they are not designed to be used in open interchange.

So... this is not so clear-cut.  For open interchange UTF-8 should
disallow the non-characters.  However, for local storage of Unicode
stuff, non-characters should be allowed.
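The "well-formed representations ... survive conversions" part is
observable in other UTF-8 implementations too; for example Python's
codec (an illustration, not Perl's Encode):

```python
# Noncharacters are syntactically valid UTF-8, so they round-trip;
# refusing to *exchange* them is policy, not well-formedness.
text = "\ufdd0\ufffe\U0010fffe"
round_tripped = text.encode("utf-8").decode("utf-8")
```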

4. What 'UTF-8' Does
====================

Ill-formed sequences -- fine (mostly):

  * UTF-8 decode treats these as errors, and will stop or use fallback
decoding as required.

The default fallback is:

  - errors for sequences > 0x7FFF_FFFF -- replaced by U+FFFD

*** information is being lost, here :-(

  - anything else: each byte which is not recognised as being part
of a complete 2..6 byte sequence is replaced by U+FFFD

*** so one cannot distinguish ill-formed sequences from
out of range characters.

The PERLQQ, HTMLCREF and XMLCREF fallbacks are:

  - errors for sequences > 0x7FFF_FFFF -- replaced by the
    respective escape sequence for the character value.

This ought to work if the data is HTML or XML, where new escape
sequences fit right in if HTMLCREF or XMLCREF is used.

*** PERLQQ, however, may fail if '\' appears in the input and
the sender has not escaped it !

Perhaps PERLQQ should escape '\' that appear in the input ?

*** In all cases, however, all that's been achieved is that
non-UTF-8 characters have been transliterated.  It's still
a puzzle what may be done with these characters !

  - anything else: each byte which is not recognised as being part
of a complete up to 6 byte sequence is replaced by the
respective escape sequence for the byte value.

*** this is impossible to distinguish from escaped values which
could exist in the input !

  * UTF-8 encode will not generate ill-formed sequences and treats out
    of range character values as errors.  Errors will stop encoding or
    cause the fallback encoding to be used.

The default fallback is:

  - errored characters > 0x7FFF_FFFF -- replaced by