Re: Character (or byte?) escapes under utf8 pragma

2010-03-11 Thread Juerd Waalboer
Michael Ludwig skribis 2010-03-10 10:34 (+0100):
 Okay. Let me try to see if I have understood correctly. Without the utf8
 pragma in scope, "so\xa0ein\xa0Käse" with a-Umlaut stored as a sequence
 of two bytes in my source code will be stored internally as a sequence
 of 12 integers. With the utf8 pragma in scope, only 11 integers.

"so\xa0ein\xa0Käse" must be stored as either:

l1: 73 6f a0 65 69 6e a0 4b e4 73 65 (UTF8 flag off)

or:

u8: 73 6f c2-a0 65 69 6e c2-a0 4b c3-a4 73 65 (UTF8 flag on)

Both strings should be semantically equal, and have 11 characters, each
of which has an integer ordinal value.
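A minimal sketch of that equivalence, assuming nothing beyond core Perl: the same literal is kept once as-is and once after utf8::upgrade, and the two copies still compare equal and report the same length. The variable names are only for illustration.

```perl
use strict;
use warnings;

# Same 11 characters, once in each internal format.
my $l1 = "so\xa0ein\xa0K\xe4se";   # every character fits in one byte
my $u8 = $l1;
utf8::upgrade($u8);                # switch the copy to the UTF-8 internal format

printf "equal: %s, lengths: %d and %d\n",
    ($l1 eq $u8 ? "yes" : "no"), length($l1), length($u8);
# prints "equal: yes, lengths: 11 and 11"
```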

What happens is the following:

73 6f a0 65 69 6e a0 4b c3-a4 73 65 (UTF8 flag on)
      l1          l1    u8

This is wrong. It is a bug.
-- 
Met vriendelijke groet, // Kind regards, // Korajn salutojn,

Juerd Waalboer  ju...@tnx.nl
TNX


Re: Character (or byte?) escapes under utf8 pragma

2010-03-09 Thread Juerd Waalboer
Michael Ludwig skribis 2010-03-08 15:55 (+0100):
  Perl does not distinguish between bytes and characters. (...) You
  cannot tell what kind of data they contain just by looking at them
  and the UTF8 flag doesn’t tell you either.
 Okay. But unless I'm completely misled, you can tell whether a
 string is supposed to contain characters (- Encode::decode) or
 bytes (- Encode::encode)

The result of decode is a character string.

The result of encode is a byte string.

However, apart from looking at the source code and deducing the
intentions of the programmer, there is no way to tell whether a given
string is meant as a character or byte string, simply because there is
no technical representation of this intent in the string or its
metadata.

Note that characters are the general case: a string is made of
characters. When every character value fits in a single byte, the string
can be used as a byte string.
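As a rough illustration of the decode/encode distinction (the sample bytes are my own, not from the thread), the following decodes five UTF-8 bytes into four characters and encodes them back:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $bytes = "K\xc3\xa4se";             # UTF-8 encoded bytes for "Käse"
my $text  = decode("UTF-8", $bytes);   # character string
my $again = encode("UTF-8", $text);    # byte string

printf "%d characters, %d bytes\n", length($text), length($again);
# prints "4 characters, 5 bytes"
```

Nothing in $text or $again records which of the two is "text" and which is "binary"; only the code that produced them says so.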

  This is definitely a bug.
 Good. It looked like one to me. Thanks for logging it with the
 Perl maintainers.

This bug forces us to look at the internal encoding and flags to come to
the conclusion that it is indeed a bug. Don't mistake this as a sign
that looking at the internal encoding or flags should ever happen in
actual code. Even if you work around the bug, make sure that you don't
make anything conditional on the current formatting of the string.

Instead, coerce it to whatever you need by using utf8::downgrade or
utf8::upgrade. In your specific case, concatenation of two separate
parts is probably the most sane thing to do.
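A sketch of the concatenation approach, assuming the source file is saved as UTF-8 and the utf8 pragma is in effect. The escapes and the literal non-ASCII text live in separate string literals, so no single literal mixes the two notations.

```perl
use strict;
use warnings;
use utf8;   # this source text is assumed to be saved as UTF-8

# Escapes in one literal, literal non-ASCII text in another, then join them.
my $s = "so\xa0ein\xa0" . "Käse";

printf "%d characters, third is U+%04X, ninth is U+%04X\n",
    length($s), ord(substr($s, 2, 1)), ord(substr($s, 8, 1));
```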

  Am I mistaken in my expectation that while \xa0 should be
  a byte, \x{a0} and \x{00a0} should be characters?

Yes. These three escapes are supposed to be exactly the same. They
create a U+00A0 character, which happens to be perfectly usable as the
A0 byte when used as such, in a string that doesn't contain any
character greater than U+00FF.
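A quick way to check this for yourself; the array name is just for the example:

```perl
use strict;
use warnings;

my @forms = ("\xa0", "\x{a0}", "\x{00a0}");

# Each escape yields one character with the same ordinal value.
print join(" ", map { sprintf "U+%04X", ord($_) } @forms), "\n";
# prints "U+00A0 U+00A0 U+00A0"

print "all equal\n" if $forms[0] eq $forms[1] and $forms[1] eq $forms[2];
```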

  [perlre:]
  Unicode characters in the range of 128-255 use two hexadecimal
  digits with braces: \x{ab}. Note that this is different than
  \xab, which is just a hexadecimal byte with no Unicode
  significance.
 The documentation I referred to is outdated. Sorry for that.

Indeed this documentation is wrong. Current documentation, as of Perl
version 5.8.9 (December 2008), no longer has this paragraph.
-- 
Met vriendelijke groet,  Kind regards,  Korajn salutojn,

  Juerd Waalboer:  Perl hacker  ##...@juerd.nl  http://juerd.nl/sig
  Convolution: ICT solutions and consultancy sa...@convolution.nl



Re: encode from_to error

2009-09-18 Thread Juerd Waalboer
Yebba, Nick skribis 2009-09-17 13:14 (-0400):
 $sth10=$dbh_pweb->prepare($sql10);

It looks like some part of the code is missing; I can't see where
$dbh_pweb is created.

 Encode::from_to(
 $headingsubject, "utf-8", "ISO-2022-KR");

from_to combines decoding and encoding in a single function, so it
expects bytes as input. If it finds wide characters in $headingsubject,
the database has apparently already decoded it. All you have to do then
is encode it.

$headingsubject = Encode::encode("ISO-2022-KR", $headingsubject);
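For the archives, a self-contained sketch of the same step. The two Hangul characters are a hypothetical stand-in for whatever the database driver returned; the only assumption is that the value is already a character string.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical stand-in for a value the database driver already decoded:
# two Hangul characters, U+D55C and U+AE00.
my $headingsubject = "\x{D55C}\x{AE00}";

# Characters in, bytes out. No decoding step is needed (or wanted).
my $bytes = encode("ISO-2022-KR", $headingsubject);

printf "%d characters became %d bytes\n",
    length($headingsubject), length($bytes);
```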



Re: Unicode characters

2009-05-25 Thread Juerd Waalboer
Andreas J. Koenig skribis 2009-05-25  8:30 (+0200):
  On Sun, 24 May 2009 10:09:25 +0200, Juerd Waalboer 
  ju...@convolution.nl said:
Although it's safe on output, it's better to get used to using
:encoding(utf8) instead of :utf8. Using :utf8 on input can cause
stability and security issues.
 That's new to me. Do you have a link that backs this up?

http://www.perlmonks.org/?node_id=644786
http://www.perlfoundation.org/perl5/index.cgi?the_utf8_perlio_layer
http://perldoc.perl.org/perlunicode.html#Security-Implications-of-Unicode
 (perlunicode doesn't refer to :utf8 but does explain how malformed utf8
 can cause trouble.)


Perl change #32461 updated documentation to reflect the preference for
:encoding
http://perl5.git.perl.org/perl.git/commit/740d4bb23b722729f87a23733be98429529fd900


Re: Unicode characters

2009-05-24 Thread Juerd Waalboer
Andreas J. Koenig skribis 2009-05-24  6:44 (+0200):
 binmode $_, ":utf8" for *STDOUT, *TEMP_OUT;

Although it's safe on output, it's better to get used to using
:encoding(utf8) instead of :utf8. Using :utf8 on input can cause
stability and security issues.
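A small sketch of the :encoding(...) layer on output. It writes to an in-memory filehandle instead of STDOUT so the resulting bytes can be inspected; that choice is mine, purely for demonstration.

```perl
use strict;
use warnings;

my $buffer = "";
open my $out, ">:encoding(UTF-8)", \$buffer or die "open: $!";
print {$out} "\x{20AC}";        # one character: the euro sign
close $out or die "close: $!";

# $buffer now holds the encoded bytes.
printf "%vX\n", $buffer;        # prints "E2.82.AC"
```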


Re: /\w/ match with 'use locale' misses letters in utf8 locale

2008-07-11 Thread Juerd Waalboer
Peter Volkov skribis 2008-07-11 10:10 (+0400):
 The problem is that in Linux (Gentoo and Debian I've tried) /\w/ does
 not match Russian letter while I use locale and LC_COLLATE is set to
 ru_RU.UTF-8.

\w should match Cyrillic letters even without use locale. You might be
running into an annoying bug which makes \w lose its unicode support
depending on the *internal* state of a value. To work around this bug,
read Unicode::Semantics on CPAN and use it or utf8::upgrade.
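A sketch of the utf8::upgrade workaround, with a made-up sample word whose characters are all below U+0100 (the case where the bug bites):

```perl
use strict;
use warnings;

my $word = "\xe9t\xe9";    # "été", all characters below U+0100
utf8::upgrade($word);      # changes only the internal format, not the characters

printf "matched %d characters\n", length($1) if $word =~ /(\w+)/;
# prints "matched 3 characters"
```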

 Linux $ perl -e 'use locale; open(IN, "< test-file"); while(<IN>) { print if /\w/; }'
 string with spaces (not only with [:alnum:])
 English;
 hello_привет

Despite the above there's a slightly more important issue here. You're
opening a text file but you don't specify the character encoding.
Likewise, you need to specify the encoding for output.

Assuming utf8 for both:

perl -le'
    binmode STDOUT, ":encoding(utf8)";
    open my $in, "<:encoding(utf8)", "test-file";
    while (<$in>) {
        print "match: [$1]" if /(\w+)/;
    }
'

Which on my system prints:

match: [слово]
match: [строка]
match: [string]
match: [English]
match: [hello_привет]

I'm not sufficiently familiar with use encoding to say anything about
it, but you shouldn't need it just for this.

 Do I understand correctly that we should always supply encoding of
 streams?

Yes.

 If yes, why in FreeBSD this works without supplying any encoding and is
 it possible (good idea) to do the same in Linux?

I have no idea.


Re: utf8::valid and \x14_000 - \x1F_0000

2008-03-12 Thread Juerd Waalboer
Chris Hall skribis 2008-03-12 13:20 (+0000):
  OK.  In the meantime IMHO chr(n) should be handling utf8 and has no
  business worrying about things which UTF-8 or UCS think aren't
  characters.
 It should do Unicode, not any specific byte encoding, like UTF-?8.
 IMHO chr(n) should do characters, which may be interpreted as per
 Unicode, but may not.
 When I said utf8 I was following the (sloppy) convention that utf8 means
 how Perl handles characters in strings...

I'm working hard to break this convention. I've changed a lot of Perl
documentation, and the result was released with Perl 5.10.

If in any place in Perl's official documentation, it still reads UTF-8
or UTF8 for *characters in text strings*, it's wrong. Let me know and I
will fix it :)

   b. in a Perl string, characters are held in a UTF-8 like form.

I'd say *inside* a Perl string. This is the C implementation, but a Perl
programmer should not have to know the specific *internal* encoding of a
Perl string.

Likewise, in Perl you don't have to know whether your number is
internally encoded as a long integer or a double.

  Where UTF-8 (upper case, with hyphen) means the RFC 3629 
  Unicode Consortium defined byte-wise encoding.

That's the theory, but it's so often not entirely following spec.

  This form is referred to as utf8 (lower case, no hyphen).

Yes, but note that encoding names in Perl are case insensitive. I tend
to call it UTF8 sometimes.

  There is really no need to discuss this, except in the context of
  messing around in guts of Perl.

Exactly.

  String literals are represented by UCS code points.  Which
  reinforces the feeling that characters in Perl are Unicode.

Yes!

  'C' uses 'wide' to refer to characters that may have values
   > 255.  IMHO it's a shame that Perl did not follow this.

It does in some places, most notably warnings about wide characters.

   d. when exchanging character data with other systems one needs to
  deal with character set and encoding issues.

Not just other systems. All I/O is done in bytes, even with yourself,
for example if you forked.

 Isolated surrogate code units have no interpretation on
  their own.
 (...)
 Clearly these are illegal in UTF-8.

They have no interpretation, but this also doesn't say it's illegal.

Compare it with the undefined behavior of multiple ++ in a single
expression. There's no specification of what should happen, but it's not
illegal to do it.

 Applications are free to use any of these noncharacter code
  points internally but should never attempt to exchange
  them.

I think it's not Perl's job to prevent exchange. Simply because the
exchange could be internal, but between processes of the same program.

 I'm puzzled as to why 'UTF-8' (strict) doesn't treat U+FFFE (and
 friends) in the same way as U+FFFF (and friends).

My gut says it's out of ignorance of the rules, and certainly not an
intentional deviation.

 The result is Unicode.
 IMHO the result of chr(n) should just be a character.

We call that a unicode character in Perl. It is true that Perl allows
ordinal values outside the currently existing range, but it is still
called unicode by Perl's documentation.

 OK, sure.  I was using utf8 to mean any character value you like, and
 UTF-8 to imply a value which is recognised in UCS -- rather than the
 encoding.

Please use utf8 only for naming the byte encoding that allows any
character value you like, not for the ordinal values themselves.

 FWIW I note that printf %vX is suggested as a means to render IPv6
 addresses.  This implies the use of a string containing eight characters
 0..0xFFFF as the packed form of IPv6.  Building one of those using
 chr(n) will generate spurious warnings about 0xFFFE and 0xFFFF !

Interesting point.


Re: utf8::valid and \x14_000 - \x1F_0000

2008-03-11 Thread Juerd Waalboer
Chris Hall skribis 2008-03-11 21:09 (+0000):
 OK.  In the meantime IMHO chr(n) should be handling utf8 and has no 
 business worrying about things which UTF-8 or UCS think aren't 
 characters.

It should do Unicode, not any specific byte encoding, like UTF-?8.

Internally, a byte encoding is needed. As a programmer I don't want to
be bothered with such implementation details.

 Note that chr(n) is whingeing about 0xFFFE, which Encode::en/decode
 (UTF-8) are happy with.  Unicode defines 0xFFFE and 0xFFFF as
 non-characters, not just 0xFFFF (which Encode::en/decode do deem
 invalid).

Personally, I think Perl should accept these characters without warning,
except when the strict UTF-8 encoding is requested (which differs from
the non-strict UTF8 encoding).

 In any case, is chr(n) supposed to be utf8 or UTF-8 ?  AFAIKS, it's
 neither.
 It's supposed to be neither on the outside. Internally, it's utf8.
 One can turn off the warnings and then chr(n) will happily take any +ve 
 integer and give you the equivalent character -- so the result is utf8, 

The result is Unicode. The difference between Unicode and UTF8 is not
always clear, but in this case it is: the character is Unicode, a single
codepoint; the internal implementation is UTF8.

Unicode: U+20AC    (one character: €)
UTF-8:   E2 82 AC  (three bytes)
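The same distinction in runnable form, as a sketch: chr gives the one character, and only an explicit encode produces the three bytes.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char  = chr(0x20AC);             # one Unicode character
my $bytes = encode("UTF-8", $char);  # its UTF-8 byte encoding

printf "U+%04X: %d character, %d bytes (%s)\n",
    ord($char), length($char), length($bytes),
    join(" ", map { sprintf "%02X", ord($_) } split //, $bytes);
# prints "U+20AC: 1 character, 3 bytes (E2 82 AC)"
```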

I am under the impression that you know the difference and made an
honest mistake. My detailed expansion is also for lurkers and archives.

 [replacement character]
 So we'll have to differ on this :-)

Yes, although my opinion on this is not strong. undef or replacement
character - both are good options. One argument in favor of the
replacement character would be backwards compatibility.


Re: Use of encoding/decoding and 3-param open

2007-11-15 Thread Juerd Waalboer
Paul Bijnens skribis 2007-11-15 14:52 (+0100):
 Can you elaborate more on the subtle difference between:
   binmode(STDIN, ":utf8");
   binmode(STDIN, ":encoding(UTF-8)");

http://search.cpan.org/~rgarcia/perl-5.9.5/pod/perlunifaq.pod#Cheat?!_Tell_me,_how_can_I_cheat?

http://www.perlmonks.org/?node_id=644786

 For input, both get the correct characters, assuming the input 
 bytestream was indeed correct.

Yes, but if the bytestream is incorrect, you may have a security issue
if you used :utf8 instead of :encoding.
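To give an idea of the kind of strictness involved, here is a sketch that uses Encode directly rather than a PerlIO layer. It is not a demonstration of the layers themselves; the sample byte strings are made up, and FB_CROAK is used so that malformed input is reported instead of passed along.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $good = "K\xc3\xa4se";   # well-formed UTF-8
my $bad  = "K\xc3se";       # \xc3 is not followed by a continuation byte

for my $bytes ($good, $bad) {
    # FB_CROAK: die on malformed input. LEAVE_SRC: do not modify $bytes.
    my $text = eval {
        decode("UTF-8", $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    print defined $text ? "accepted\n" : "rejected\n";
}
# prints "accepted", then "rejected"
```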


Re: silent upgrading situations

2007-11-12 Thread Juerd Waalboer
E R skribis 2007-11-12 14:00 (-0600):
 Are there any other examples? I'm especially interested in cases where
 the non-utf8 string gets upgraded but actually doesn't change.

Although I strongly recommend against fussing over the internal
encoding except for performance reasons, encoding::warnings, available
from CPAN, is a valuable module for finding these cases.

Don't forget to remove it before releasing something for production.
Automatic scalar upgrading is a very important feature, and not at all
scary.

(It happens with numbers too: the integer 1 is automatically upgraded
internally to a float, whenever that is needed.)
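A small sketch of why the upgrade is harmless; the strings are invented for the example. Joining a string that fits in single bytes with one that does not changes the internal format of the result, but every character keeps its value.

```perl
use strict;
use warnings;

my $narrow = "caf\xe9";      # all characters below U+0100
my $wide   = "\x{263A}";     # a character above U+00FF

my $joined = $narrow . $wide;   # any upgrading happens behind the scenes

printf "%d characters, fourth is still U+%04X\n",
    length($joined), ord(substr($joined, 3, 1));
# prints "5 characters, fourth is still U+00E9"
```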


Re: Explaining this behavior (was Re: good name for characters matching [^\0-\377]?)

2007-10-22 Thread Juerd Waalboer
E R skribis 2007-10-22  7:01 (-0500):
 So this raises another interesting point... not only must
 Encode::encode et al. perform the proper encoding (as in translations
 to character ordinals), but they also must return a Perl string whose
 internal representation is, shall we say, the conventional one, i.e.
 one octet per Perl character.
 I'm sure this is already well understood, but it is interesting to
 come to this conclusion.

There's an alternative way of viewing this: there are two types of
strings: binary and text. If you encode text, you get binary.

While in practice there is only one string type, and there's no way for
perl internals to know if a certain string is binary or
text-that-is-encoded-as-latin1-internally, it can help to think of
things in terms of the following picture:
http://perlmonks.org/?node_id=645432


Re: Explaining this behavior (was Re: good name for characters matching [^\0-\377]?)

2007-10-19 Thread Juerd Waalboer
E R skribis 2007-10-19 17:14 (-0500):
 The problem I need to understand now is the following:
   # using mod_perl 1.28
   # note: binmode(STDOUT, ":utf8") has no effect
   $r->print($x); # emits 1 octet
   $r->print($y); # emits 2 octets
 I get similar behavior when storing $y into an Oracle DB - a string of
 length 2 is stored. Storing $x, however, results in a length 1 string.

These don't use a filehandle, so :utf8 or :encoding layers don't work.
That leaves two options: either use the encoding functionality by the
module (if any), or encode manually.

AFAIK, mod_perl does not provide transparent encoding for output.
DBD::Oracle does, but you need to enable it. (Don't ask me how; I bailed
out when I saw the complexity of Oracle's charset/encoding support.)

When doing the encoding manually, I strongly suggest that you subclass
the module in question, to keep the logic from being spread all over the
place. (And please release your subclass to CPAN :))

 So it seems that in light of this one should always use Encode::encode with
 these modules to ensure the data is represented the way you want it.

Encode::encode, Encode::encode_utf8, or utf8::encode.
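For reference, a sketch showing the three side by side on a one-character sample string. Note that utf8::encode works in place, so it is applied to a copy here.

```perl
use strict;
use warnings;
use Encode ();

my $text = "\x{20AC}";                          # character string

my $via_encode      = Encode::encode("UTF-8", $text);
my $via_encode_utf8 = Encode::encode_utf8($text);
my $via_utf8_encode = $text;
utf8::encode($via_utf8_encode);                 # modifies its argument

print "same bytes\n"
    if $via_encode eq $via_encode_utf8
    and $via_encode eq $via_utf8_encode;
# prints "same bytes"
```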

 Stated another way: if you use a module which converts a Perl string to
 an octet sequence, and there is no provision for specifying an encoding,
 that should be a red flag that you need to encode the string before you
 send it to the module.

Well stated. I have collected a summary at http://juerd.nl/perluniadvice
that is neither complete nor accurate, but it provides more information
than most documentation does. Unfortunately I lack tuits to send bug
reports and make patches.


Re: good name for characters matching [^\0-\377]?

2007-10-18 Thread Juerd Waalboer
Georg Bauhaus skribis 2007-10-18 17:01 (+0200):
 Isn't it about time to find a good name for crippled character sets
 with ordinals below 256 only?

These are single byte encodings. I prefer to add the word "legacy"
too.


Re: good name for characters matching [^\0-\377]?

2007-10-18 Thread Juerd Waalboer
E R skribis 2007-10-18 16:21 (-0500):
 I should have added that in my presentation I am attempting to present
 Perl strings from a character set agnostic perspective.

That is silly, because Perl itself is not at all character set agnostic.

It has unicode strings and it has binary strings, but those are your
tools.

 So, even though there is a strong bias for Perl to treat character
 ordinals > 255 as Unicode code-points

Er, no, all character ordinals, including 0..255, are Unicode
codepoints.

255 is unicode just like 256. There is no actual barrier in between!!

 I don't want people to automatically think Unicode when encountering
 one of these non-legacy characters.

If they don't automatically think of Unicode, they won't be using Perl's
functionality in the most efficient and time saving way. I'm hoping this
is not your desired goal.

To be honest, I'm not sure you know enough about Perl's string model to
be giving a presentation about Unicode in Perl. You just learnt very
important aspects, and from the things you write, I'd say you still have
some other important aspects to learn or accept. No offense meant.

 I'm just wondering if there is an established term. Perhaps
 extended/large character ordinal?

The established term for a character ordinal is code point.

 It would help as in the sentence: If your string contains a ___, Perl
 will assume your string represents Unicode code-points.

If you use your string for text operations, Perl will assume your string
is a Unicode string.

Note that there is a bug in uppercasing/lowercasing, and in some
built-in regular expression character classes, that causes Perl to look
at the internal encoding. This is a leak in the unicode abstraction, and
will probably be fixed with Perl 5.12.

It is very simple (and future proof) to work around this problem by
using the Unicode::Semantics module's up() function, or the built-in
utf8::upgrade().
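A sketch of the built-in workaround applied to uppercasing, using a single sample character below U+0100:

```perl
use strict;
use warnings;

my $s = "\xe9";        # U+00E9, LATIN SMALL LETTER E WITH ACUTE
utf8::upgrade($s);     # same character, different internal format

printf "U+%04X\n", ord(uc($s));
# prints "U+00C9"
```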


Re: de-utf8-ing a string

2007-10-17 Thread Juerd Waalboer
E R skribis 2007-10-17 15:56 (-0500):
   for (my $i = 0; $i < length($x); $i++) {
 $new .= chr(ord(substr($x, $i, 1)));
   }

utf8::downgrade($x);
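A slightly fuller sketch with made-up sample strings. utf8::is_utf8 is called here only to make the effect visible; real code should not depend on it.

```perl
use strict;
use warnings;

my $x = "caf\xe9";
utf8::upgrade($x);       # put it in the UTF-8 internal format first
utf8::downgrade($x);     # in place; would die if $x held a character above 0xFF
print utf8::is_utf8($x) ? "flag on\n" : "flag off\n";
# prints "flag off"

my $wide = "\x{263A}";
my $ok = utf8::downgrade($wide, 1);   # true second argument: return false, don't die
print $ok ? "downgraded\n" : "cannot downgrade\n";
# prints "cannot downgrade"
```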


Re: questions about encode/decode

2007-10-15 Thread Juerd Waalboer
E R skribis 2007-10-15 16:25 (-0500):
 1. What is the result of Encode::encode("iso-8559-1", $x) if $x is not
 a utf8 string (i.e. Encode::is_utf8($x) returns false.)

"utf8 string" is already confusing. It can be either one of the
following:

1. byte string with UTF8 encoded text
2. Perl Unicode string that at this point in time is encoded as UTF8
   *internally*

Encode::is_utf8 indicates that the latter is true. You should NOT have
to peek at the status of this internal flag, except for debugging perl
itself.

Encode::encode expects a Unicode string, which can be encoded as
ISO-8859-1 or UTF8 internally. If the Unicode string is ISO-8859-1
internally, is_utf8 returns false, and if it is UTF8 internally, it
returns true.

This is how Encode::encode knows, again: *internally*, how to convert
the string.

Assuming you meant 8859, not 8559, the answer to your question is: a
copy of $x is returned, because the encoding you used happens to equal
the encoding that Perl used internally.

 2. What is the result of $string = decode("iso-8859-1", $octets) if
 $octets is a utf8 string?

Do not use Encode::decode on unicode strings, but use it on bytestrings
only. Every individual byte of the bytestring is seen as a single
ISO-8859-1 character, so a multi-byte UTF8 sequence will *not* be
interpreted as a single character.
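To make the byte-by-byte interpretation concrete, a sketch with two sample bytes that happen to form one UTF-8 sequence:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $octets = "\xc3\xa4";    # two bytes; as UTF-8 they encode U+00E4

my $as_latin1 = decode("iso-8859-1", $octets);   # each byte becomes a character
my $as_utf8   = decode("UTF-8", $octets);        # the pair becomes one character

printf "%d %d\n", length($as_latin1), length($as_utf8);
# prints "2 1"
```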

Perhaps helpful: http://tnx.nl/perlunitut,perlunifaq


Re: questions about encode/decode

2007-10-15 Thread Juerd Waalboer
In mailing lists, please write your reply below quotation, and cut
quotation to the minimum required for context. Thanks!


E R skribis 2007-10-15 17:01 (-0500):
 As a follow-up, does anyone have any suggestions about optimizing a
 routine such as this: sub escapeHTML {

Probably the best optimization is to use the freely available
HTML::Entities module, which is part of the HTML-Parser distribution
(usually installed alongside LWP).

   $x =~ s/&/&amp;/g; $x =~ s/</&lt;/g;

Use a single regex, because every regex has to scan the entire string.
See HTML::Entities for inspiration if you don't want to use the module
(e.g. if you don't want the full spectrum of entities that it supports).
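A minimal sketch of the single-regex idea, not HTML::Entities' own code. The sub name escape_html and the lookup table are hypothetical, and only four characters are handled.

```perl
use strict;
use warnings;

# Hypothetical lookup table: only the characters that must be escaped.
my %entity = (
    '&' => '&amp;',
    '<' => '&lt;',
    '>' => '&gt;',
    '"' => '&quot;',
);

sub escape_html {
    my ($x) = @_;
    $x =~ s/([&<>"])/$entity{$1}/g;   # one pass over the string
    return $x;
}

print escape_html(q{<a href="x">R&D</a>}), "\n";
# prints "&lt;a href=&quot;x&quot;&gt;R&amp;D&lt;/a&gt;"
```

Because the string is scanned once, an ampersand produced by one replacement is never escaped again by a later one.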

   Encode::encode("iso-8859-1", $x);

It's very probably better to standardize on UTF-8 for your output. Doing
that now saves a lot of trouble when you will need it. And sooner or
later, you will.

 Basically I'm concerned about the overhead to constantly look up the
 encoder sub for every fragment of HTML I need to escape.

Encode your output once, when outputting. PerlIO layers help to automate
this and save a lot of development time:

binmode STDOUT, ":encoding(UTF-8)";
print $foo;  # automatically encoded!