utf8::valid and \x14_000 - \x1F_0000

2008-03-11 Thread Chris Hall

It appears that utf8::valid() and Encode::encode('utf8', ...)
do not agree for characters 0x14_0000 - 0x1F_FFFF.

I suggest utf8::valid() is broken.

The following:

  use strict ;

  use Encode qw(FB_QUIET LEAVE_SRC) ;

  printf "Perl v%vd  Encode %s\n", $^V, $Encode::VERSION ;

  my $c = 0x0000 ;
  while ($c < 0x8000_0000) {
    my $s = chr($c) ;

    my $v = utf8::valid($s) ? 1 : 0 ;
    my $o = Encode::encode('utf8', $s, FB_QUIET() | LEAVE_SRC()) ;

    my $r = $o ? 1 : 0 ;

    if ($v != $r) {
      printf "0x%04X_%04X: utf8::valid=%d but Encode::encode=%d  ",
                                   ($c >> 16), $c & 0xFFFF, $v, $r ;
      Encode::_utf8_off($s) ;
      print map { sprintf '\x%02X', ord($_) } split(//, $s) ;
      print "\n" ;
    } ;

    # Step so that only xxxx_0000 and xxxx_FFFF are tested
    if ($c & 0xFFFF) { $c += 1 ; } else { $c += 0xFFFF ; } ;
  } ;

Produces:

  Perl v5.8.8  Encode 2.23
  0x0014_0000: utf8::valid=0 but Encode::encode=1  \xF5\x80\x80\x80
  0x0014_FFFF: utf8::valid=0 but Encode::encode=1  \xF5\x8F\xBF\xBF
  0x0015_0000: utf8::valid=0 but Encode::encode=1  \xF5\x90\x80\x80
  0x0015_FFFF: utf8::valid=0 but Encode::encode=1  \xF5\x9F\xBF\xBF
  0x0016_0000: utf8::valid=0 but Encode::encode=1  \xF5\xA0\x80\x80
  0x0016_FFFF: utf8::valid=0 but Encode::encode=1  \xF5\xAF\xBF\xBF
  0x0017_0000: utf8::valid=0 but Encode::encode=1  \xF5\xB0\x80\x80
  0x0017_FFFF: utf8::valid=0 but Encode::encode=1  \xF5\xBF\xBF\xBF
  0x0018_0000: utf8::valid=0 but Encode::encode=1  \xF6\x80\x80\x80
  0x0018_FFFF: utf8::valid=0 but Encode::encode=1  \xF6\x8F\xBF\xBF
  0x0019_0000: utf8::valid=0 but Encode::encode=1  \xF6\x90\x80\x80
  0x0019_FFFF: utf8::valid=0 but Encode::encode=1  \xF6\x9F\xBF\xBF
  0x001A_0000: utf8::valid=0 but Encode::encode=1  \xF6\xA0\x80\x80
  0x001A_FFFF: utf8::valid=0 but Encode::encode=1  \xF6\xAF\xBF\xBF
  0x001B_0000: utf8::valid=0 but Encode::encode=1  \xF6\xB0\x80\x80
  0x001B_FFFF: utf8::valid=0 but Encode::encode=1  \xF6\xBF\xBF\xBF
  0x001C_0000: utf8::valid=0 but Encode::encode=1  \xF7\x80\x80\x80
  0x001C_FFFF: utf8::valid=0 but Encode::encode=1  \xF7\x8F\xBF\xBF
  0x001D_0000: utf8::valid=0 but Encode::encode=1  \xF7\x90\x80\x80
  0x001D_FFFF: utf8::valid=0 but Encode::encode=1  \xF7\x9F\xBF\xBF
  0x001E_0000: utf8::valid=0 but Encode::encode=1  \xF7\xA0\x80\x80
  0x001E_FFFF: utf8::valid=0 but Encode::encode=1  \xF7\xAF\xBF\xBF
  0x001F_0000: utf8::valid=0 but Encode::encode=1  \xF7\xB0\x80\x80
  0x001F_FFFF: utf8::valid=0 but Encode::encode=1  \xF7\xBF\xBF\xBF

And the same for: Perl v5.10.0  Encode 2.23
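
For anyone checking the listing by hand: the byte sequences follow the ordinary 4-byte UTF-8 bit pattern, applied to values beyond U+10FFFF. A minimal sketch of the arithmetic, using a hypothetical helper decode4() that is not part of the original post:

```perl
use strict ;
use warnings ;

# Hypothetical helper: decode a 4-byte sequence of the form
# 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx into its integer value.
sub decode4 {
  my @b = map { ord($_) } split(//, $_[0]) ;
  return (($b[0] & 0x07) << 18) | (($b[1] & 0x3F) << 12)
       | (($b[2] & 0x3F) <<  6) |  ($b[3] & 0x3F) ;
}

# First and last byte sequences from the listing
for my $bytes ("\xF5\x80\x80\x80", "\xF7\xBF\xBF\xBF") {
  my $c = decode4($bytes) ;
  printf "0x%04X_%04X\n", ($c >> 16), $c & 0xFFFF ;
}
# prints 0x0014_0000 and 0x001F_FFFF
```

The helper does no validation at all; it only shows that the reported bytes and the reported values correspond.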
-- 
Chris Hall   highwayman.com




Re: utf8::valid and \x14_000 - \x1F_0000

2008-03-11 Thread Chris Hall

On Tue, 11 Mar 2008 you wrote

> Chris Hall skribis 2008-03-11 18:48 (+0000):
>> I'm comfortable with the notion that perl characters are unsigned
>> integers that overlap UCS, and happen to be held internally as a
>> superset of UTF-8.
>> I wonder if perl is completely comfortable.
>
> It isn't. There are some very unfortunate features.
>
>> chr(n) throws various runtime warnings where 'n' isn't kosher UCS, and
>> \x{h...h} throws the same ones at compile time.
>> (...)I'm not sure I see the point of picking on a few values to warn
>> about.
>
> I don't see the point, but Perl's warnings are arbitrary in several
> ways. Abigail has a lightning talk about the "interpreted as function"
> warning, that illustrates this.


OK.  In the meantime IMHO chr(n) should be handling utf8 and has no 
business worrying about things which UTF-8 or UCS think aren't 
characters.


Note that chr(n) is whingeing about 0xFFFE, which Encode::en/decode
(UTF-8) are happy with.  Unicode defines 0xFFFE and 0xFFFF as
non-characters, not just 0xFFFF (which Encode::en/decode do deem
invalid).



>> In any case, is chr(n) supposed to be utf8 or UTF-8 ?  AFAIKS, it's
>> neither.
>
> It's supposed to be neither on the outside. Internally, it's utf8.


One can turn off the warnings and then chr(n) will happily take any +ve 
integer and give you the equivalent character -- so the result is utf8, 
but the warnings are some (very) small subset of checking for UTF-8 :-(


I wonder what happens for n = 2^64.  The encoding runs out at 2^72 !


>> If chr(-1) doesn't exist, then undef looks like a reasonable
>> return value -- returning \x{FFFD} makes chr(-1)
>> indistinguishable from chr(0xFFFD) -- where the first is
>> nonsense and the second is entirely proper.
>
> 0xFFFD is the Unicode equivalent of undef. I think it makes sense in
> this case.


Well...

Unicode says: REPLACEMENT CHARACTER: used to represent an incoming 
character whose value is unknown or unrepresentable in Unicode.


...so it has plenty to do without being used to represent a value which 
is completely beyond the range for characters, and for which perl has a 
perfectly good convention already.


...besides, if I want to see if chr(n) has worked I have to check
(a) whether the result is \x{FFFD} and (b) if so, whether n was 0xFFFD.
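
The ambiguity can be sketched as follows. It relies on the documented behaviour that chr() of a negative value gives the replacement character (outside the bytes pragma); checked_chr() is a hypothetical wrapper, not an existing function:

```perl
use strict ;
use warnings ;

# Hypothetical wrapper: return undef for values that cannot be characters,
# so the caller need not compare against the replacement character.
sub checked_chr {
  my ($n) = @_ ;
  return undef if $n < 0 ;
  return chr($n) ;
}

{
  no warnings ;                  # chr(-1) warns about the negative value
  my $bad  = chr(-1) ;
  my $good = chr(0xFFFD) ;
  print "indistinguishable\n" if $bad eq $good ;
}

print "undef for -1\n" unless defined checked_chr(-1) ;
```

With the wrapper a single defined() test is enough; with plain chr() the caller has to look at both the result and the argument.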


So we'll have to differ on this :-)

Chris
--
Chris Hall   highwayman.com   +44 7970 277 383




Re: utf8::valid and \x14_000 - \x1F_0000

2008-03-11 Thread Juerd Waalboer

Chris Hall skribis 2008-03-11 21:09 (+0000):
> OK.  In the meantime IMHO chr(n) should be handling utf8 and has no
> business worrying about things which UTF-8 or UCS think aren't
> characters.

It should do Unicode, not any specific byte encoding, like UTF-?8.

Internally, a byte encoding is needed. As a programmer I don't want to
be bothered with such implementation details.

> Note that chr(n) is whingeing about 0xFFFE, which Encode::en/decode
> (UTF-8) are happy with.  Unicode defines 0xFFFE and 0xFFFF as
> non-characters, not just 0xFFFF (which Encode::en/decode do deem
> invalid).

Personally, I think Perl should accept these characters without warning,
except when the strict UTF-8 encoding is requested (which differs from the
non-strict UTF8 encoding).
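
A rough sketch of that difference, assuming an Encode in which 'UTF-8' names the strict encoding and 'utf8' the lax one. hex_bytes() is a hypothetical display helper, and the exact strict result may depend on the Encode version, so no output is claimed for it here:

```perl
use strict ;
use warnings ;
use Encode qw(encode) ;

# Hypothetical helper, for display only.
sub hex_bytes { join ' ', map { sprintf '%02X', ord($_) } split(//, $_[0]) }

my ($lax, $strict) ;
{
  no warnings ;                                 # values above U+10FFFF may warn
  $lax    = encode('utf8',  chr(0x14_0000)) ;   # lax: Perl's extended form
  $strict = encode('UTF-8', chr(0x14_0000)) ;   # strict: not a Unicode code point
}

print "lax:    ", hex_bytes($lax), "\n" ;       # F5 80 80 80
print "strict: ", hex_bytes($strict), "\n" ;
```

The point is only that the two encodings give different answers for a value beyond U+10FFFF.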

>>> In any case, is chr(n) supposed to be utf8 or UTF-8 ?  AFAIKS, it's
>>> neither.
>> It's supposed to be neither on the outside. Internally, it's utf8.
> One can turn off the warnings and then chr(n) will happily take any +ve
> integer and give you the equivalent character -- so the result is utf8,

The result is Unicode. The difference between Unicode and UTF8 is not
always clear, but in this case it is: the character is Unicode, a single
codepoint; the internal implementation is UTF8.

Unicode: U+20AC    (one character: €)
UTF-8:   E2 82 AC  (three bytes)
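
The same distinction in Perl, as a small sketch (the variable names are illustrative only):

```perl
use strict ;
use warnings ;
use Encode qw(encode) ;

my $char  = chr(0x20AC) ;             # one character, U+20AC
my $bytes = encode('UTF-8', $char) ;  # its UTF-8 byte encoding

printf "characters: %d\n", length($char) ;
printf "bytes:      %d (%s)\n", length($bytes),
  join(' ', map { sprintf '%02X', ord($_) } split(//, $bytes)) ;
# prints "characters: 1" and "bytes:      3 (E2 82 AC)"
```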

I am under the impression that you know the difference and made an
honest mistake. My detailed expansion is also for lurkers and archives.

> [replacement character]
> So we'll have to differ on this :-)

Yes, although my opinion on this is not strong. undef or replacement
character - both are good options. One argument in favor of the
replacement character would be backwards compatibility.
-- 
Met vriendelijke groet,  Kind regards,  Korajn salutojn,

  Juerd Waalboer:  Perl hacker  [EMAIL PROTECTED]  http://juerd.nl/sig
  Convolution: ICT solutions and consultancy [EMAIL PROTECTED]