Re: utf8::valid and \x14_000 - \x1F_0000

2008-03-13 Thread Chris Hall

On Wed, 12 Mar 2008 Juerd Waalboer wrote

Chris Hall skribis 2008-03-12 20:49 (+0000):

  a. are you saying that characters in Perl are Unicode ?



Yes. They are called Unicode, at least. This has my preference for
explanation and documentation.



  b. or are you agreeing that characters in Perl take values
     0..0x7FFF_FFFF (or beyond), which are generally interpreted as
 UCS, where required and possible ?



This too. This is the more technically accurate explanation, and has my
preference for implementation.


'This too' ?  Goodness, superposition !  Perl and quantum mechanics ? 
Suddenly it all becomes clear.  Or at least as clear as the uncertainty 
principle will allow !-)


FWIW, I have tried some of the HTTP, HTML and XML modules.  The warnings 
that pop out every now and then about Unicode or UTF-8 or whatever are 
less than useful and more than irritating !



If (a) then characters with ordinals beyond 0x10_FFFF should throw
warnings (at least) since they clearly are not Unicode !



Perl just has a somewhat broad definition of unicode, that is not
the same as the official unicode character set.


BTW, in 2.14 'Conforming to the Unicode Standard' I found this gem:

  Unacceptable Behavior

  It is unacceptable for a conforming implementation:

   - To use unassigned codes.

   • U+2073 is unassigned and not usable for ‘3’ (superscript 3) or
 any other character.

This appears to say that unassigned codes should not be transmitted out, 
just like non-characters !  Which looks like hard work.  (On the other 
hand, applications are supposed to cope with future defined code 
points...)


Should 'UTF-8' be strict about unassigned codes as well ?  What should 
chr() and \x{...} etc. do ?


This reinforces my view that chr(n) is (a) wrong to whinge about 
surrogates and non-characters, and (b) wrong to return a character for n 
outside 0x0000..7FFF_FFFF.  IMO:


  - chr() shouldn't worry about strict UCS ...

  - ... and doesn't, in any case, do a complete job
[it does spot all non-characters and surrogates, but ignores
 unassigned codes.]

  - ... however, non-characters are perfectly legal UCS, at least for
internal use.  One can argue for jumping all over these when
outputting (strict) UTF-8 for external exchange.

  - ... and 0x11_FFFE is not defined by UCS to be a non-character,
it's not defined in UCS at all, any more than any other character
code > U+10_FFFF !

  - chr(n) doesn't whinge about characters > U+10_FFFF !  (Except for
the non-characters it has invented !)

  - the answer to chr(-1) is 'not a character at all' -- it isn't 'the
character that stands in place of some unknown character'

  - the utility of characters > 0x7FFF_FFFF is not worth (a) the kludge
required to extend utf8, or (b) the interoperability issues.

Even encode/decode 'utf8' take a dim view of chars > 0x7FFF_FFFF.

I note that utf8::valid() rejects characters > 0x7FFF_FFFF !

  - chr(n) accepts characters > 0x7FFF_FFFF, even though the result
is not valid per utf8::valid() !!

  - chr(n) warns about p + 0xFFFE and p + 0xFFFF for every value of 'p',
    even those which are beyond the Unicode range !  (A quick sketch of
    this behaviour follows the list.)
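
A rough sketch of the behaviour described in the list above, as observed on 
5.10-era perls (warning texts, and whether chr() warns at all, vary between 
Perl versions):

  use strict;
  use warnings;

  my $surrogate = chr(0xD800);     # warns (UTF-16 surrogate)
  my $nonchar   = chr(0xFFFF);     # warns (non-character)
  my $beyond    = chr(0x20_0000);  # > U+10_FFFF: no warning at all

  {
      no warnings 'utf8';          # the category that silences all of the above
      my $quiet = chr(0x1F_FFFE);  # one of the 'invented' non-characters
  }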


It has its own utf8, it can have its own unicode too :)


And there was I thinking that things were already sufficiently confused 
:-}


The 'utf8' decode does the Right Thing -- it decodes well-formed UTF-8 
up to 0x7FFF_FFFF and handles errors and incomplete sequences and 
doesn't concern itself with the minutiae of UCS (surrogates, 
non-characters and unassigned codes).


This is nicely consistent with utf8::valid().

[The only thing I would argue about is the separate treatment of each 
byte of an invalid sequence -- I'd be tempted to treat 0x00..0x7F and 
0xC0..0xFF as terminators of an invalid sequence and 0x80..0xBF as 
members of an invalid sequence.]
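
As a quick illustration of the lax/strict split (a sketch only -- the exact 
fallback behaviour has shifted a little between Encode releases):

  use strict;
  use warnings;
  use Encode ();

  my $bytes = "\xED\xA0\x80";             # UTF-8-style byte sequence for U+D800

  my $lax = Encode::decode('utf8', $bytes);
  printf "lax decode gives U+%04X\n", ord $lax;   # U+D800, no complaint

  my $strict = eval { Encode::decode('UTF-8', $bytes, Encode::FB_CROAK) };
  print defined $strict ? "strict decode accepted it\n"
                        : "strict decode rejected it: $@";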


If 'unicode' were to follow that model, then chr() and friends could 
stop throwing (spurious) warnings around the place.


Sadly, 'utf8' encode doesn't care, and outputs whatever is in the 
string -- including redundant sequences, invalid sequences, incomplete 
sequences and Perl's extended sequences for characters > 0x7FFF_FFFF.
That is, it 
will happily output something that utf8::valid would reject.  Note that 
this encoding is outputting something that 'utf8' decode won't accept.


If you really want what 'utf8' encode currently does you can force 
characters to octets (wax off) and output.  The reverse is to input the 
octets and force to characters (wax on).
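
One way to read the wax off / wax on remark, as a sketch (utf8::encode() and 
utf8::decode() are the in-place, no-questions-asked conversions; whether this 
is exactly what was meant is my assumption):

  use strict;
  use warnings;

  my $chars  = "caf\x{E9} \x{263A}";   # a character string
  my $octets = $chars;
  utf8::encode($octets);               # characters -> octets, no checks ('wax off')
  # ... $octets can now be printed to a :raw filehandle ...

  utf8::decode($octets);               # octets -> characters again ('wax on')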


Summary of Observations
---

  * chr(n) and friends are broken:

    - they whinge about things that are none of their business, which is
  not consistent with the notion of (lax) 'unicode'.

    - the whingeing about not-(strict)-Unicode is, moreover, incomplete
  (unassigned codes and codes beyond the UCS range are allowed !)

- non-characters are perfectly legal -- just not suitable for
  external exchange.

- 

Re: utf8::valid and \x14_000 - \x1F_0000

2008-03-12 Thread Juerd Waalboer
Chris Hall skribis 2008-03-12 13:20 (+0000):
  OK.  In the meantime IMHO chr(n) should be handling utf8 and has no
  business worrying about things which UTF-8 or UCS think aren't
  characters.
 It should do Unicode, not any specific byte encoding, like UTF-?8.
 IMHO chr(n) should do characters, which may be interpreted as per
 Unicode, but may not.
 When I said utf8 I was following the (sloppy) convention that utf8 means
 how Perl handles characters in strings...

I'm working hard to break this convention. I've changed a lot of Perl
documentation, and the result was released with Perl 5.10.

If in any place in Perl's official documentation it still reads UTF-8
or UTF8 for *characters in text strings*, it's wrong. Let me know and I
will fix it :)

   b. in a Perl string, characters are held in a UTF-8 like form.

I'd say *inside* a Perl string. This is the C implementation, but a Perl
programmer should not have to know the specific *internal* encoding of a
Perl string.

Likewise, in Perl you don't have to know whether your number is
internally encoded as a long integer or a double.

  Where UTF-8 (upper case, with hyphen) means the RFC 3629, 
  Unicode Consortium-defined byte-wise encoding.

That's the theory, but it's so often not entirely following spec.

  This form is referred to as utf8 (lower case, no hyphen).

Yes, but note that encoding names in Perl are case insensitive. I tend
to call it UTF8 sometimes.

  There is really no need to discuss this, except in the context of
  messing around in guts of Perl.

Exactly.

  String literals are represented by UCS code points.  Which
  reinforces the feeling that characters in Perl are Unicode.

Yes!

  'C' uses 'wide' to refer to characters that may have values
   > 255.  IMHO it's a shame that Perl did not follow this.

It does in some places, most notably warnings about wide characters.
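
For instance (a minimal sketch), printing a character above 255 to a handle 
with no encoding layer is where the word usually turns up:

  use strict;
  use warnings;
  print "\x{20AC}\n";   # warns: Wide character in print at ...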

   d. when exchanging character data with other systems one needs to
  deal with character set and encoding issues.

Not just other systems. All I/O is done in bytes, even with yourself,
for example if you forked.
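
A minimal sketch of that point: even a pipe back to yourself carries octets, 
so an encoding layer (here UTF-8) has to sit between the characters and the 
wire:

  use strict;
  use warnings;

  pipe(my $reader, my $writer) or die "pipe: $!";
  binmode $_, ':encoding(UTF-8)' for $reader, $writer;

  print {$writer} "\x{20AC}\n";   # one character in, three octets on the pipe
  close $writer;

  my $line = <$reader>;
  printf "read back %d character(s)\n", length($line) - 1;   # 1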

 Isolated surrogate code units have no interpretation on
  their own.
 (...)
Clearly these are illegal in UTF-8.

They have no interpretation, but this also doesn't say it's illegal.

Compare it with the undefined behavior of multiple ++ in a single
expression. There's no specification of what should happen, but it's not
illegal to do it.

 Applications are free to use any of these noncharacter code
  points internally but should never attempt to exchange
  them.

I think it's not Perl's job to prevent exchange. Simply because the
exchange could be internal, but between processes of the same program.

 I'm puzzled as to why 'UTF-8' (strict) doesn't treat U+FFFE (and
 friends) in the same way as U+FFFF (and friends).

My gut says it's out of ignorance of the rules, and certainly not an
intentional deviation.

 The result is Unicode.
 IMHO the result of chr(n) should just be a character.

We call that a unicode character in Perl. It is true that Perl allows
ordinal values outside the currently existing range, but it is still
called unicode by Perl's documentation.

 OK, sure.  I was using utf8 to mean any character value you like, and
 UTF-8 to imply a value which is recognised in UCS -- rather than the
 encoding.

Please use utf8 only for naming the byte encoding that allows any
character value you like, not for the ordinal values themselves.

 FWIW I note that printf %vX is suggested as a means to render IPv6
 addresses.  This implies the use of a string containing eight characters
  0..0xFFFF as the packed form of IPv6.  Building one of those using
  chr(n) will generate spurious warnings about 0xFFFE and 0xFFFF !

Interesting point.
-- 
Met vriendelijke groet,  Kind regards,  Korajn salutojn,

  Juerd Waalboer:  Perl hacker  [EMAIL PROTECTED]  http://juerd.nl/sig
  Convolution: ICT solutions and consultancy [EMAIL PROTECTED]


Re: utf8::valid and \x14_000 - \x1F_0000

2008-03-12 Thread Chris Hall

On Wed, 12 Mar 2008 Juerd Waalboer wrote

Chris Hall skribis 2008-03-12 13:20 (+0000):



 String literals are represented by UCS code points.  Which
 reinforces the feeling that characters in Perl are Unicode.



Yes!


OK.  For the avoidance of doubt:

  a. are you saying that characters in Perl are Unicode ?

  b. or are you agreeing that characters in Perl take values
     0..0x7FFF_FFFF (or beyond), which are generally interpreted as
 UCS, where required and possible ?

If (a) then characters with ordinals beyond 0x10_FFFF should throw 
warnings (at least) since they clearly are not Unicode !


[in the context of U+D800..U+DFFF]

Isolated surrogate code units have no interpretation on
 their own.
(...)
   Clearly these are illegal in UTF-8.



They have no interpretation, but this also doesn't say it's illegal.


The Unicode Standard defines the set of 'Unicode scalar values' which 
consists of U+0000..U+D7FF and U+E000..U+10_FFFF.  All Unicode 
encodings, including UTF-8, encode only the 'Unicode scalar values'.


The code points U+D800..U+DFFF exist, but do not contain any character 
assignments.  Given that no Unicode encoding exists that allows these 
code points, it's unclear how one would ever end up with one of these 
things on one's hands !


[in the context of U+FFFE, U+ etc.]

Applications are free to use any of these noncharacter code
 points internally but should never attempt to exchange
 them.



I think it's not Perl's job to prevent exchange. Simply because the
exchange could be internal, but between processes of the same program.


Well, 'UTF-8' is jumping all over U+FFFF (at least).  The warnings thrown 
by chr() and \x{h...h} suggest that Perl feels that exchanging these 
values ain't kosher.



I'm puzzled as to why 'UTF-8' (strict) doesn't treat U+FFFE (and
friends) in the same way as U+FFFF (and friends).



My gut says it's out of ignorance of the rules, and certainly not an
intentional deviation.


Well... I'm running some more tests on UTF-8 to see what it thinks is 
illegal.



The result is Unicode.
IMHO the result of chr(n) should just be a character.



We call that a unicode character in Perl. It is true that Perl allows
ordinal values outside the currently existing range, but it is still
called unicode by Perl's documentation.


OK.  This is the hair which I am splitting.

IMHO the things in strings and the things that chr() and ord() return or 
process should be plain characters (ordinal U_INT) -- so that these are 
general purpose.  Only when it's necessary to attach meaning to the 
characters in a string should Perl treat them as Unicode code points -- 
I accept that this is most of the time (but not *all* the time).



FWIW I note that printf %vX is suggested as a means to render IPv6
addresses.  This implies the use of a string containing eight characters
0..0xFFFF as the packed form of IPv6.  Building one of those using
chr(n) will generate spurious warnings about 0xFFFE and 0xFFFF !



Interesting point.


What's more, the Unicode standard suggests various *internal* uses for 
U+FFFE and U+FFFF (and friends), including, but not limited to, 
terminators and separators.  This will also generate spurious warnings 
from chr() or \x{...} !
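
To make that concrete (a sketch -- the address is made up, and the warning is 
the 5.10-era behaviour discussed in this thread):

  use strict;
  use warnings;

  # 2001:db8::ffff packed as eight characters, one per 16-bit group
  my $packed = join '', map { chr } 0x2001, 0x0DB8, 0, 0, 0, 0, 0, 0xFFFF;
  # the chr(0xFFFF) above is what draws the spurious warning

  printf "%*vX\n", ':', $packed;   # prints 2001:DB8:0:0:0:0:0:FFFF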


Chris
--
Chris Hall   highwayman.com




Re: utf8::valid and \x14_000 - \x1F_0000

2008-03-11 Thread Chris Hall

On Tue, 11 Mar 2008 you wrote

Chris Hall skribis 2008-03-11 18:48 (+0000):

I'm comfortable with the notion that perl characters are unsigned
integers that overlap UCS, and happen to be held internally as a
superset of UTF-8.
I wonder if perl is completely comfortable.



It isn't. There are some very unfortunate features.



chr(n) throws various runtime warnings where 'n' isn't kosher UCS, and
\x{h...h} throws the same ones at compile time.
(...) I'm not sure I see the point of picking on a few values to warn
about.



I don't see the point, but Perl's warnings are arbitrary in several
ways. Abigail has a lightning talk about the "interpreted as function"
warning that illustrates this.


OK.  In the meantime IMHO chr(n) should be handling utf8 and has no 
business worrying about things which UTF-8 or UCS think aren't 
characters.


Note that chr(n) is whingeing about 0xFFFE, which Encode::en/decode
(UTF-8) are happy with.  Unicode defines 0xFFFE and 0xFFFF as 
non-characters, not just 0xFFFF (which Encode::en/decode do deem 
invalid).



In any case, is chr(n) supposed to be utf8 or UTF-8 ?  AFAIKS, it's
neither.



It's supposed to be neither on the outside. Internally, it's utf8.


One can turn off the warnings and then chr(n) will happily take any +ve 
integer and give you the equivalent character -- so the result is utf8, 
but the warnings are some (very) small subset of checking for UTF-8 :-(


I wonder what happens for n >= 2^64.  The encoding runs out at 2^72 !


 If chr(-1) doesn't exist, then undef looks like a reasonable
 return value -- returning \x{FFFD} makes chr(-1)
 indistinguishable from chr(0xFFFD) -- where the first is
 nonsense and the second is entirely proper.



0xFFFD is the Unicode equivalent of undef. I think it makes sense in
this case.


Well...

Unicode says: REPLACEMENT CHARACTER: used to represent an incoming 
character whose value is unknown or unrepresentable in Unicode.


...so it has plenty to do without being used to represent a value which 
is completely beyond the range for characters, and for which perl has a 
perfectly good convention already.


...besides, if I want to see if chr(n) has worked I have to check that 
(a) the result is not \xFFFD and (b) that n is not 0xFFFD.
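
(A trivial, purely hypothetical wrapper -- the name is mine, not Perl's -- 
shows the undef behaviour I'd prefer, and then a simple defined() test is all 
the caller needs:

  sub chr_or_undef {
      my ($n) = @_;
      return undef if $n < 0;   # 'not a character at all'
      return chr($n);
  }
)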


So we'll have to differ on this :-)

Chris
--
Chris Hall   highwayman.com+44 7970 277 383




Re: utf8::valid and \x14_000 - \x1F_0000

2008-03-11 Thread Juerd Waalboer
Chris Hall skribis 2008-03-11 21:09 (+0000):
 OK.  In the meantime IMHO chr(n) should be handling utf8 and has no 
 business worrying about things which UTF-8 or UCS think aren't 
 characters.

It should do Unicode, not any specific byte encoding, like UTF-?8.

Internally, a byte encoding is needed. As a programmer I don't want to
be bothered with such implementation details.

 Note that chr(n) is whingeing about 0xFFFE, which Encode::en/decode
 (UTF-8) are happy with.  Unicode defines 0xFFFE and 0xFFFF as 
 non-characters, not just 0xFFFF (which Encode::en/decode do deem 
 invalid).

Personally, I think Perl should accept these characters without warning,
except when the strict UTF-8 encoding is requested (which differs from the
non-strict UTF8 encoding).
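
For the archives: the two names really are distinct encodings inside Encode, 
and (assuming find_encoding() reports canonical names the way its 
documentation says) a quick way to see it is:

  use Encode ();
  print Encode::find_encoding('utf8')->name,  "\n";   # utf8         (lax)
  print Encode::find_encoding('UTF-8')->name, "\n";   # utf-8-strict (strict)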

 In any case, is chr(n) supposed to be utf8 or UTF-8 ?  AFAIKS, it's
 neither.
 It's supposed to be neither on the outside. Internally, it's utf8.
 One can turn off the warnings and then chr(n) will happily take any +ve 
 integer and give you the equivalent character -- so the result is utf8, 

The result is Unicode. The difference between Unicode and UTF8 is not
always clear, but in this case it is: the character is Unicode, a single
codepoint, the internal implementation is UTF8.

Unicode: U+20AC    (one character: €)
UTF-8:   E2 82 AC  (three bytes)

I am under the impression that you know the difference and made an
honest mistake. My detailed expansion is also for lurkers and archives.
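
Since it's for the archives anyway, the same split in code (a small sketch):

  use strict;
  use warnings;
  use Encode ();

  my $euro  = "\x{20AC}";                      # Unicode: one character
  my $bytes = Encode::encode('UTF-8', $euro);  # UTF-8: "\xE2\x82\xAC"
  printf "%d character, %d octets\n", length($euro), length($bytes);
  # prints: 1 character, 3 octets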

 [replacement character]
 So we'll have to differ on this :-)

Yes, although my opinion on this is not strong. undef or replacement
character - both are good options. One argument in favor of the
replacement character would be backwards compatibility.
-- 
Met vriendelijke groet,  Kind regards,  Korajn salutojn,

  Juerd Waalboer:  Perl hacker  [EMAIL PROTECTED]  http://juerd.nl/sig
  Convolution: ICT solutions and consultancy [EMAIL PROTECTED]