Re: Character (or byte?) escapes under utf8 pragma
Michael Ludwig wrote 2010-03-10 10:34 (+0100):
> Okay. Let me try to see if I have understood correctly. Without the utf8 pragma in scope, so\xa0ein\xa0Käse with a-Umlaut stored as a sequence of two bytes in my source code will be stored internally as a sequence of 12 integers. With the utf8 pragma in scope, only 11 integers.

so\xa0ein\xa0Käse must be stored as either:

    l1: 73 6f a0 65 69 6e a0 4b e4 73 65             (UTF8 flag off)

or:

    u8: 73 6f c2-a0 65 69 6e c2-a0 4b c3-a4 73 65    (UTF8 flag on)

Both strings should be semantically equal, and have 11 characters, each of which has an integer ordinal value.

What happens is the following:

    73 6f a0 65 69 6e a0 4b c3-a4 73 65    (UTF8 flag on)
          l1          l1    u8

This is wrong. It is a bug.

--
Met vriendelijke groet, // Kind regards, // Korajn salutojn,
Juerd Waalboer ju...@tnx.nl TNX
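To make the "11 characters either way" point concrete, here is a minimal sketch. The variable names are made up for illustration, and utf8::upgrade is used only to force the second internal representation:

```perl
use strict;
use warnings;

# Hypothetical variable names; the string is the one discussed above.
my $latin1   = "so\xa0ein\xa0K\xe4se";   # normally stored as one byte per character
my $upgraded = $latin1;
utf8::upgrade($upgraded);                # same characters, UTF-8 internally

# Both representations have the same length and compare equal.
printf "%d %d\n", length($latin1), length($upgraded);
print $latin1 eq $upgraded ? "equal\n" : "different\n";

# The ordinals are identical regardless of the internal encoding.
print join(" ", map { sprintf "%02x", ord } split //, $upgraded), "\n";
```

The last line prints the ordinals from the "l1" row, even though the string is held in the "u8" form.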
Re: Character (or byte?) escapes under utf8 pragma
Michael Ludwig wrote 2010-03-08 15:55 (+0100):
>> Perl does not distinguish between bytes and characters. (...) You cannot tell what kind of data they contain just by looking at them and the UTF8 flag doesn't tell you either.
> Okay. But unless I'm completely misled, you can tell whether a string is supposed to contain characters (- Encode::decode) or bytes (- Encode::encode)

The result of decode is a character string. The result of encode is a byte string.

However, apart from looking at the source code and deducing the intentions of the programmer, there is no way to tell whether a given string is meant as a character or byte string, simply because there is no technical representation of this intent in the string or its metadata.

Note that characters are the general case: a string is made of characters. When every character value fits in a single byte, the string can be used as a byte string.

>> This is definitely a bug.
> Good. It looked like one to me. Thanks for logging it with the Perl maintainers.

This bug forces us to look at the internal encoding and flags to come to the conclusion that it is indeed a bug. Don't mistake this as a sign that looking at the internal encoding or flags should ever happen in actual code.

Even if you work around the bug, make sure that you don't make anything conditional on the current internal format of the string. Instead, coerce it to whatever you need by using utf8::downgrade or utf8::upgrade. In your specific case, concatenation of two separate parts is probably the most sane thing to do.

> Am I mistaken in my expectation that while \xa0 should be a byte, \x{a0} and \x{00a0} should be characters?

Yes. These three escapes are supposed to be exactly the same. They create a U+00A0 character, which happens to be perfectly usable as the A0 byte when used as such, in a string that doesn't contain any character greater than U+00FF.

> [perlre:] Unicode characters in the range of 128-255 use two hexadecimal digits with braces: \x{ab}. Note that this is different than \xab, which is just a hexadecimal byte with no Unicode significance.
> The documentation I referred to is outdated. Sorry for that.

Indeed this documentation is wrong. Current documentation, as of Perl version 5.8.9 (December 2008), no longer has this paragraph.

--
Juerd Waalboer
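A small sketch of the two claims in this reply: the three escapes denote the same character, and coercion (rather than inspection) is how you get the representation you need. Variable names are illustrative only:

```perl
use strict;
use warnings;

# The three spellings discussed above.
my @forms = ("\xa0", "\x{a0}", "\x{00a0}");

print "same character\n"
    if $forms[0] eq $forms[1] and $forms[1] eq $forms[2];
printf "U+%04X\n", ord $forms[0];

# Coerce instead of inspecting flags: neither call changes the characters.
my $s = "\x{a0}";
utf8::upgrade($s);      # UTF-8 internally
utf8::downgrade($s);    # back to one byte per character
print length($s), "\n";
```

Neither utf8::upgrade nor utf8::downgrade changes what the string means; they only change how it is stored.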
Re: encode from_to error
Yebba, Nick wrote 2009-09-17 13:14 (-0400):
> $sth10=$dbh_pweb->prepare($sql10);

It looks like some part of the code is missing; I can't see where $dbh_pweb is created.

> Encode::from_to( $headingsubject, "utf-8","ISO-2022-KR");

If it finds wide characters in $headingsubject, apparently the database has already decoded it. All you have to do then is encode it. from_to combines decoding and encoding in a single function.

    $headingsubject = Encode::encode("ISO-2022-KR", $headingsubject);

--
Juerd Waalboer
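The difference between the two calls can be sketched as follows. The sample text is a made-up stand-in for a value that a database driver has already decoded; it is not taken from the original report:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical data: two Hangul syllables, already decoded to characters.
my $headingsubject = "\x{D55C}\x{AE00}";

# Characters in, octets out: only the encoding half is needed.
my $octets = encode("ISO-2022-KR", $headingsubject);

# from_to is for the other situation: octets in one encoding that
# should become octets in another. It modifies its argument in place.
my $utf8_octets = encode("UTF-8", $headingsubject);
Encode::from_to($utf8_octets, "UTF-8", "ISO-2022-KR");

print $octets eq $utf8_octets ? "same octets\n" : "different octets\n";
```

Calling from_to on a string that already holds decoded characters makes it decode a second time, which is where the wide character complaint comes from.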
Re: Unicode characters
Andreas J. Koenig wrote 2009-05-25 8:30 (+0200):
> On Sun, 24 May 2009 10:09:25 +0200, Juerd Waalboer ju...@convolution.nl said:
>> Although it's safe on output, it's better to get used to using :encoding(utf8) instead of :utf8. Using :utf8 on input can cause stability and security issues.
> That's new to me. Do you have a link that backs this up?

http://www.perlmonks.org/?node_id=644786
http://www.perlfoundation.org/perl5/index.cgi?the_utf8_perlio_layer
http://perldoc.perl.org/perlunicode.html#Security-Implications-of-Unicode

(perlunicode doesn't refer to :utf8 but does explain how malformed utf8 can cause trouble.)

Perl change #32461 updated documentation to reflect the preference for :encoding:
http://perl5.git.perl.org/perl.git/commit/740d4bb23b722729f87a23733be98429529fd900

--
Juerd Waalboer
Re: Unicode characters
Andreas J. Koenig wrote 2009-05-24 6:44 (+0200):
> binmode $_, ":utf8" for *STDOUT, *TEMP_OUT;

Although it's safe on output, it's better to get used to using :encoding(utf8) instead of :utf8. Using :utf8 on input can cause stability and security issues.

--
Juerd Waalboer
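As a rough illustration of the recommended layer on an output handle, the sketch below writes one character through :encoding(UTF-8) into an in-memory file. The in-memory handle is only a stand-in for STDOUT or TEMP_OUT so that the result can be inspected:

```perl
use strict;
use warnings;

# An in-memory file stands in for a real output handle.
my $buffer = '';
open my $out, '>:encoding(UTF-8)', \$buffer or die "open failed: $!";

print {$out} "\x{20AC}";    # one character: the euro sign
close $out or die "close failed: $!";

# The layer turned the single character into its UTF-8 octets.
printf "%d octets: %s\n", length($buffer), uc unpack('H*', $buffer);
```

For an already open handle, binmode($handle, ":encoding(UTF-8)") applies the same layer.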
Re: /\w/ match with 'use locale' misses letters in utf8 locale
Peter Volkov wrote 2008-07-11 10:10 (+0400):
> The problem is that in Linux (Gentoo and Debian I've tried) /\w/ does not match Russian letter while I use locale and LC_COLLATE is set to ru_RU.UTF-8.

\w should match Cyrillic letters even without use locale. You might be running into an annoying bug which makes \w lose its unicode support depending on the *internal* state of a value. To work around this bug, read Unicode::Semantics on CPAN and use it or utf8::upgrade.

> Linux $ perl -e 'use locale; open(IN, "test-file"); while(<IN>) { print if /\w/; }'
> string with spaces (not only with [:alnum:])
> English; hello_привет

Despite the above there's a slightly more important issue here. You're opening a text file but you don't specify the character encoding. Likewise, you need to specify the encoding for output. Assuming utf8 for both:

    perl -le'
        binmode STDOUT, ":encoding(utf8)";
        open my $in, "<:encoding(utf8)", "test-file";
        while (<$in>) { print "match: [$1]" if /(\w+)/; }
    '

Which on my system prints:

    match: [слово]
    match: [строка]
    match: [string]
    match: [English]
    match: [hello_привет]

I'm not sufficiently familiar with use encoding to say anything about it, but you shouldn't need it just for this.

> Do I understand correctly that we should always supply encoding of streams?

Yes.

> If yes, why in FreeBSD this works without supplying any encoding and is it possible (good idea) to do the same in Linux?

I have no idea.

--
Juerd Waalboer
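The effect of decoding can be sketched without a file. The octet string below is a hand-written stand-in for one line of test-file (the UTF-8 encoding of hello_привет), and Encode::decode plays the role of the input layer:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Stand-in for a line read without any layer: raw UTF-8 octets.
my $octets = "hello_\xD0\xBF\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82";

# Decoding turns the octets into characters, as an :encoding layer would.
my $text = decode('UTF-8', $octets);

my ($word) = $text =~ /(\w+)/;
printf "matched %d characters out of %d\n", length($word), length($text);
```

On the decoded text, \w+ covers the Cyrillic letters as well as the Latin ones, so the whole line is one word.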
Re: utf8::valid and \x14_000 - \x1F_0000
Chris Hall wrote 2008-03-12 13:20 (+):
>>> OK. In the meantime IMHO chr(n) should be handling utf8 and has no business worrying about things which UTF-8 or UCS think aren't characters.
>> It should do Unicode, not any specific byte encoding, like UTF-?8.
> IMHO chr(n) should do characters, which may be interpreted as per Unicode, but may not. When I said utf8 I was following the (sloppy) convention that utf8 means how Perl handles characters in strings...

I'm working hard to break this convention. I've changed a lot of Perl documentation, and the result was released with Perl 5.10. If in any place in Perl's official documentation, it still reads UTF-8 or UTF8 for *characters in text strings*, it's wrong. Let me know and I will fix it :)

> b. in a Perl string, characters are held in a UTF-8 like form.

I'd say *inside* a Perl string. This is the C implementation, but a Perl programmer should not have to know the specific *internal* encoding of a Perl string. Likewise, in Perl you don't have to know whether your number is internally encoded as a long integer or a double.

> Where UTF-8 (upper case, with hyphen) means the RFC 3629 Unicode Consortium defined byte-wise encoding.

That's the theory, but it's so often not entirely following spec.

> This form is referred to as utf8 (lower case, no hyphen).

Yes, but note that encoding names in Perl are case insensitive. I tend to call it UTF8 sometimes.

> There is really no need to discuss this, except in the context of messing around in guts of Perl.

Exactly.

> String literals are represented by UCS code points. Which reinforces the feeling that characters in Perl are Unicode.

Yes!

> 'C' uses 'wide' to refer to characters that may have values > 255. IMHO it's a shame that Perl did not follow this.

It does in some places, most notably warnings about wide characters.

> d. when exchanging character data with other systems one needs to deal with character set and encoding issues.

Not just other systems. All I/O is done in bytes, even with yourself, for example if you forked.

> Isolated surrogate code units have no interpretation on their own. (...) Clearly these are illegal in UTF-8.

They have no interpretation, but this also doesn't say it's illegal. Compare it with the undefined behavior of multiple ++ in a single expression. There's no specification of what should happen, but it's not illegal to do it.

> Applications are free to use any of these noncharacter code points internally but should never attempt to exchange them.

I think it's not Perl's job to prevent exchange. Simply because the exchange could be internal, but between processes of the same program.

> I'm puzzled as to why 'UTF-8' (strict) doesn't treat U+FFFE (and friends) in the same way as U+FFFF (and friends).

My gut says it's out of ignorance of the rules, and certainly not an intentional deviation.

>> The result is Unicode.
> IMHO the result of chr(n) should just be a character.

We call that a unicode character in Perl. It is true that Perl allows ordinal values outside the currently existing range, but it is still called unicode by Perl's documentation.

> OK, sure. I was using utf8 to mean any character value you like, and UTF-8 to imply a value which is recognised in UCS -- rather than the encoding.

Please use utf8 only for naming the byte encoding that allows any character value you like, not for the ordinal values themselves.

> FWIW I note that printf "%vX" is suggested as a means to render IPv6 addresses. This implies the use of a string containing eight characters 0..0xFFFF as the packed form of IPv6. Building one of those using chr(n) will generate spurious warnings about 0xFFFE and 0xFFFF!

Interesting point.

--
Juerd Waalboer
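For readers who have not met the %vX idiom mentioned at the end, here is a minimal sketch. The address is an arbitrary example chosen to stay away from 0xFFFE and 0xFFFF, so it does not exercise the warnings under discussion:

```perl
use strict;
use warnings;

# Eight characters, one per 16-bit group of an example IPv6 address.
my $packed = join '', map { chr } 0x2001, 0xDB8, 0, 0, 0, 0, 0, 1;

# The * before v supplies the separator to use instead of the default ".".
my $text = sprintf "%*vX", ":", $packed;

print length($packed), " characters\n";
print "$text\n";
```

The string is treated purely as a sequence of ordinals here, which is why chr(n) complaining about particular values is surprising in this use.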
Re: utf8::valid and \x14_000 - \x1F_0000
Chris Hall wrote 2008-03-11 21:09 (+):
> OK. In the meantime IMHO chr(n) should be handling utf8 and has no business worrying about things which UTF-8 or UCS think aren't characters.

It should do Unicode, not any specific byte encoding, like UTF-?8. Internally, a byte encoding is needed. As a programmer I don't want to be bothered with such implementation details.

> Note that chr(n) is whingeing about 0xFFFE, which Encode::en/decode (UTF-8) are happy with. Unicode defines 0xFFFE and 0xFFFF as non-characters, not just 0xFFFF (which Encode::en/decode do deem invalid).

Personally, I think Perl should accept these characters without warning, except when the strict UTF-8 encoding is requested (which differs from the non-strict UTF8 encoding).

> In any case, is chr(n) supposed to be utf8 or UTF-8 ? AFAIKS, it's neither.

It's supposed to be neither on the outside. Internally, it's utf8.

> One can turn off the warnings and then chr(n) will happily take any +ve integer and give you the equivalent character -- so the result is utf8,

The result is Unicode. The difference between Unicode and UTF8 is not always clear, but in this case it is: the character is Unicode, a single codepoint; the internal implementation is UTF8.

    Unicode: U+20AC    (one character: €)
    UTF-8:   E2 82 AC  (three bytes)

I am under the impression that you know the difference and made an honest mistake. My detailed expansion is also for lurkers and archives.

> [replacement character]
> So we'll have to differ on this :-)

Yes, although my opinion on this is not strong. undef or replacement character - both are good options. One argument in favor of the replacement character would be backwards compatibility.

--
Juerd Waalboer
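The Unicode versus UTF-8 distinction above can be checked with a few lines. This is only a sketch using the Encode module; the variable names are illustrative:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $char   = chr 0x20AC;                 # one Unicode character
my $octets = encode('UTF-8', $char);     # its UTF-8 byte encoding

printf "character string: length %d, U+%04X\n", length($char), ord($char);
printf "byte string:      length %d, %s\n",
    length($octets),
    join ' ', map { sprintf '%02X', ord } split //, $octets;

# Decoding the octets gives the original character back.
print "round trip ok\n" if decode('UTF-8', $octets) eq $char;
```

chr(n) produces the first kind of value; the second kind only appears once something is explicitly encoded.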
Re: Use of encoding/decoding and 3-param open
Paul Bijnens wrote 2007-11-15 14:52 (+0100):
> Can you elaborate more on the subtle difference between:
>     binmode(STDIN, ":utf8");
>     binmode(STDIN, ":encoding(UTF-8)");

http://search.cpan.org/~rgarcia/perl-5.9.5/pod/perlunifaq.pod#Cheat?!_Tell_me,_how_can_I_cheat?
http://www.perlmonks.org/?node_id=644786

> For input, both get the correct characters, assuming the input bytestream was indeed correct.

Yes, but if the bytestream is incorrect, you may have a security issue if you used :utf8 instead of :encoding.

--
Juerd Waalboer
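What "incorrect bytestream" means can be sketched with Encode's strict decoder. This is an illustration under the assumption that it applies the same kind of validation as the :encoding(UTF-8) layer; the input is a hand-made malformed sequence, not data from the thread:

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# An overlong encoding of "/": a classic malformed UTF-8 sequence.
my $malformed = "\xC0\xAF";

# Work on a copy, because decode with a CHECK value may modify its argument.
my $copy   = $malformed;
my $result = eval { decode('UTF-8', $copy, FB_CROAK) };

print defined $result ? "accepted\n" : "rejected\n";
```

Sequences like this one are the reason to prefer validated decoding on input: a check for "/" done on the raw bytes would not see it.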
Re: silent upgrading situations
E R wrote 2007-11-12 14:00 (-0600):
> Are there any other examples? I'm especially interested in cases where the non-utf8 string gets upgraded but actually doesn't change.

Although I strongly recommend against fussing over the internal encoding except for performance reasons, encoding::warnings, available from CPAN, is a valuable module for finding these cases. Don't forget to remove it before releasing something for production.

Automatic scalar upgrading is a very important feature, and not at all scary. (It happens with numbers too: the integer 1 is automatically upgraded internally to a float, whenever that is needed.)

--
Juerd Waalboer
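A short sketch of why automatic upgrading is harmless: joining a string with a wide character changes how the result is stored, not what it contains. The strings are made-up examples:

```perl
use strict;
use warnings;

my $narrow = "K\xe4se";                 # every character fits in one byte
my $joined = $narrow . " \x{20AC}";     # the euro sign forces an internal upgrade

# The part that came from $narrow is still the same four characters.
my $prefix = substr $joined, 0, 4;
print $prefix eq $narrow ? "unchanged\n" : "changed\n";
printf "U+%04X\n", ord substr($joined, 1, 1);
```

Nothing in the program needs to know that the upgrade took place.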
Re: Explaining this behavior (was Re: good name for characters matching [^\0-\377]?)
E R wrote 2007-10-22 7:01 (-0500):
> So this raises another interesting point... not only must Encode::encode et al. perform the proper encoding (as in translations to character ordinals), but they also must return a Perl string whose internal representation is, shall we say, the conventional one, i.e. one octet per Perl character. I'm sure this is already well understood, but it is interesting to come to this conclusion.

There's an alternative way of viewing this: there are two types of strings: binary and text. If you encode text, you get binary.

While in practice there is only one string type, and there's no way for perl internals to know if a certain string is binary or text-that-is-encoded-as-latin1-internally, it can help to think of things in terms of the following picture: http://perlmonks.org/?node_id=645432

--
Juerd Waalboer
Re: Explaining this behavior (was Re: good name for characters matching [^\0-\377]?)
E R wrote 2007-10-19 17:14 (-0500):
> The problem I need to understand now is the following:
>     # using mod_perl 1.28
>     # note: binmode(STDOUT, ":utf8") has no effect
>     $r->print($x); # emits 1 octet
>     $r->print($y); # emits 2 octets
> I get similar behavior when storing $y into an Oracle DB - a string of length 2 is stored. Storing $x, however, results in a length 1 string.

These don't use a filehandle, so :utf8 or :encoding layers don't work. That leaves two options: either use the encoding functionality provided by the module (if any), or encode manually.

AFAIK, mod_perl does not provide transparent encoding for output. DBD::Oracle does, but you need to enable it. (Don't ask me how; I bailed out when I saw the complexity of Oracle's charset/encoding support.)

When doing the encoding manually, I strongly suggest that you subclass the module in question, to keep the logic from being spread all over the place. (And please release your subclass to CPAN :))

> So it seems that in light of this one should always use Encode::encode with these modules to ensure the data is represented the way you want it.

Encode::encode, Encode::encode_utf8, or utf8::encode.

> Stated another way: if you use a module which converts a Perl string to an octet sequence, and there is no provision for specifying an encoding, that should be a red flag that you need to encode the string before you send it to the module.

Well stated. I have collected a summary at http://juerd.nl/perluniadvice that is neither complete nor accurate, but it provides more information than most documentation does. Unfortunately I lack tuits to send bug reports and make patches.

--
Juerd Waalboer
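One way to keep the manual encoding in a single place is sketched below. It uses a closure instead of a real subclass, and make_text_writer and the array sink are hypothetical names standing in for something like $r->print or a database call:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: wraps a byte-oriented sink so callers pass text.
sub make_text_writer {
    my ($sink, $encoding) = @_;
    return sub { $sink->(encode($encoding, $_[0])) };
}

my @sent;                                   # stand-in for the real output
my $write = make_text_writer(sub { push @sent, $_[0] }, 'UTF-8');

my $x = "\xe9";                             # stored narrow internally
my $y = "\xe9"; utf8::upgrade($y);          # same character, stored wide
$write->($x);
$write->($y);

print join(",", map { length } @sent), "\n";
```

Because the text is encoded explicitly before it reaches the sink, both calls emit the same two octets, whatever the internal state of the string was.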
Re: good name for characters matching [^\0-\377]?
Georg Bauhaus wrote 2007-10-18 17:01 (+0200):
> Isn't it about time to find a good name for crippled character sets with ordinals below 256 only?

These are single byte encodings. I prefer to add the word legacy too.

--
Juerd Waalboer
Re: good name for characters matching [^\0-\377]?
E R wrote 2007-10-18 16:21 (-0500):
> I should have added that in my presentation I am attempting to present Perl strings from a character set agnostic perspective.

That is silly, because Perl itself is not at all character set agnostic. It has unicode strings and it has binary strings, but those are your tools.

> So, even though there is a strong bias for Perl to treat character ordinals > 255 as Unicode code-points

Er, no, all character ordinals, including 0..255, are Unicode codepoints. 255 is unicode just like 256. There is no actual barrier in between!!

> I don't want people to automatically think Unicode when encountering one of these non-legacy characters.

If they don't automatically think of Unicode, they won't be using Perl's functionality in the most efficient and time saving way. I'm hoping this is not your desired goal.

To be honest, I'm not sure you know enough about Perl's string model to be giving a presentation about Unicode in Perl. You just learnt very important aspects, and from the things you write, I'd say you still have some other important aspects to learn or accept. No offense meant.

> I'm just wondering if there is an established term. Perhaps extended/large character ordinal?

The established term for a character ordinal is code point.

> It would help as in the sentence: If your string contains a ___, Perl will assume your string represents Unicode code-points.

If you use your string for text operations, Perl will assume your string is a Unicode string.

Note that there is a bug in uppercasing/lowercasing, and in some built-in regular expression character classes, that causes Perl to look at the internal encoding. This is a leak in the unicode abstraction, and will probably be fixed with Perl 5.12. It is very simple (and future proof) to work around this problem by using the Unicode::Semantics module's up() function, or the built-in utf8::upgrade().

--
Juerd Waalboer
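The utf8::upgrade workaround mentioned at the end can be sketched like this. The sample character is an arbitrary Latin-1 letter chosen for illustration:

```perl
use strict;
use warnings;

my $s = "\xe4";        # U+00E4, a letter with an ordinal below 256
utf8::upgrade($s);     # same character, but now stored as UTF-8 internally

# With the upgraded string, case mapping and \w follow Unicode rules.
print uc($s) eq "\xc4" ? "uppercased\n" : "not uppercased\n";
print $s =~ /\w/ ? "word character\n" : "not a word character\n";
```

The upgrade does not change the string's value, so it is safe to apply before any text operation that might be affected by the bug.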
Re: de-utf8-ing a string
E R wrote 2007-10-17 15:56 (-0500):
> for (my $i = 0; $i < length($x); $i++) { $new .= chr(ord(substr($x, $i, 1))); }

utf8::downgrade();

--
Juerd Waalboer
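A minimal sketch of using utf8::downgrade in place of the loop. The strings are invented examples, and utf8::is_utf8 appears only to make the effect visible; it is not something production code should depend on:

```perl
use strict;
use warnings;

my $x = "caf\xe9";
utf8::upgrade($x);       # simulate a string that ended up UTF-8 internally

utf8::downgrade($x);     # in place; dies if a character does not fit in a byte
print utf8::is_utf8($x) ? "still upgraded\n" : "downgraded\n";

# A true second argument makes failure return false instead of dying.
my $wide = "\x{263A}";
my $ok   = utf8::downgrade($wide, 1);
print $ok ? "downgraded\n" : "cannot downgrade\n";
```

Unlike the character-by-character loop, the call works in place and reports when the string contains something that cannot be a byte.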
Re: questions about encode/decode
E R wrote 2007-10-15 16:25 (-0500):
> 1. What is the result of Encode::encode("iso-8559-1", $x) if $x is not a utf8 string (i.e. Encode::is_utf8($x) returns false.)

"utf8 string" is already confusing. It can be either one of the following:

1. byte string with UTF8 encoded text
2. Perl Unicode string that at this point in time is encoded as UTF8 *internally*

Encode::is_utf8 indicates that the latter is true. You should NOT have to peek at the status of this internal flag, except for debugging perl itself.

Encode::encode expects a Unicode string, which can be encoded as ISO-8859-1 or UTF8 internally. If the Unicode string is ISO-8859-1 internally, is_utf8 returns false, and if it is UTF8 internally, it returns true. This is how Encode::encode knows, again: *internally*, how to convert the string.

Assuming you meant 8859, not 8559, the answer to your question is: a copy of $x is returned, because the encoding you used happens to equal the encoding that Perl used internally.

> 2. What is the result of $string = decode("iso-8859-1", $octets) if $octets is a utf8 string?

Do not use Encode::decode on unicode strings, but use it on bytestrings only.

Every individual byte of the bytestring is seen as a single ISO-8859-1 character, so a multi-byte UTF8 sequence will *not* be interpreted as a single character.

Perhaps helpful: http://tnx.nl/perlunitut,perlunifaq

--
Juerd Waalboer
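Both answers can be checked with a short sketch. The sample character and variable names are illustrative only:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Question 1: the internal representation does not affect the result of encode.
my $narrow = "\xe9";
my $wide   = "\xe9";
utf8::upgrade($wide);
print encode('iso-8859-1', $narrow) eq encode('iso-8859-1', $wide)
    ? "same octets\n" : "different octets\n";

# Question 2: decoding UTF-8 octets as ISO-8859-1 treats every byte as a character.
my $utf8_octets = "\xC3\xA9";     # U+00E9 encoded as UTF-8
printf "as iso-8859-1: %d characters\n", length decode('iso-8859-1', $utf8_octets);
printf "as UTF-8:      %d character\n",  length decode('UTF-8', $utf8_octets);
```

The second half shows the usual symptom of decoding with the wrong encoding: one intended character becomes two.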
Re: questions about encode/decode
In mailing lists, please write your reply below the quotation, and cut the quotation to the minimum required for context. Thanks!

E R wrote 2007-10-15 17:01 (-0500):
> As a follow-up, does anyone have any suggestions about optimizing a routine such as this:
> sub escapeHTML {

Probably the best optimization is to use the freely available HTML::Entities module that comes with LWP.

> $x =~ s/&/&amp;/g; $x =~ s/</&lt;/g;

Use a single regex, because every regex has to scan the entire string. See HTML::Entities for inspiration if you don't want to use the module (e.g. if you don't want the full spectrum of entities that it supports).

> Encode::encode("iso-8859-1", $x);

It's very probably better to standardize on UTF-8 for your output. Doing that now saves a lot of trouble when you will need it. And sooner or later, you will.

> Basically I'm concerned about the overhead to constantly look up the encoder sub for every fragment of HTML I need to escape.

Encode your output once, when outputting. PerlIO layers help to automate this and save a lot of development time:

    binmode STDOUT, ":encoding(UTF-8)";
    print $foo; # automatically encoded!

--
Juerd Waalboer
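The single-regex suggestion could look roughly like the sketch below. The entity table and the name escape_html are illustrative choices, not the original poster's routine and not the HTML::Entities implementation:

```perl
use strict;
use warnings;

# Illustrative table covering only the four characters that matter in HTML text.
my %entity = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');

sub escape_html {
    my ($text) = @_;
    # One pass over the string: each special character is looked up in the table.
    $text =~ s/([&<>"])/$entity{$1}/g;
    return $text;
}

print escape_html('a < b && "c"'), "\n";
```

Escaping works on characters and leaves encoding to the output layer, so no Encode call is needed inside the routine.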