Re: select a variable as stdout and utf8 flag behaviour
* Gert Brinkmann <g...@netcologne.de> [2016-11-09 16:00]:
> open(my $fh, '>:encoding(UTF-8)', \$html);
> my $orig_stdout = select( $fh );
> print "Ümläut Test ßaß; 使用下列语言\n";

Think of it this way: Those three lines of code are an elaborate way of doing this:

    $html = Encode::encode('UTF-8', "Ümläut Test ßaß; 使用下列语言\n");

If you wrote that code, would you be surprised that $html does not have the UTF8 flag set afterwards?

Bonus question if you are not surprised then: what is the difference between these two cases that makes your argument that “perl knows what I put in there so it should know to set the UTF8 flag on it” not apply to this?

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>
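The equivalence claimed above can be checked with a short sketch. The variable names `$via_handle` and `$via_encode` are made up for illustration, and the text is written with escapes so that the example does not depend on the encoding of the source file. It also uses `print {$fh}` rather than `select`, on the assumption that the encoding layer, not the selected handle, is what matters here.

```perl
use strict;
use warnings;
use Encode ();

# Text containing characters above U+00FF, written with escapes.
my $text = "\x{DC}ml\x{E4}ut \x{4F7F}\x{7528}\n";

# Route 1: print through an encoding layer into an in-memory file.
my $via_handle = '';
open(my $fh, '>:encoding(UTF-8)', \$via_handle) or die "open: $!";
print {$fh} $text;
close($fh) or die "close: $!";

# Route 2: encode the characters to bytes directly.
my $via_encode = Encode::encode('UTF-8', $text);

printf "same bytes: %s\n", $via_handle eq $via_encode ? 'yes' : 'no';
printf "UTF8 flag (handle): %s\n", utf8::is_utf8($via_handle) ? 'on' : 'off';
printf "UTF8 flag (encode): %s\n", utf8::is_utf8($via_encode) ? 'on' : 'off';
```

Both routes should produce the same encoded byte string, which is why neither result carries the UTF8 flag.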
Re: Encode UTF-8 optimizations
* Karl Williamson [2016-08-21 03:12]:
> That should be done anyway to make sure we've got less buggy Unicode
> handling code available to older modules.

I think you meant “available to older perls”?
Re: UTF-8 encoding & decoding
* Pali Rohár <pali.ro...@gmail.com> [2016-05-12 20:23]:
> If both functions should do same thing, why we have duplicity?

Encode.pm is big and fairly slow, because it handles a zillion encodings and has lots of options for handling invalid input data. Perl needs only UTF-8 transcoding and needs it fast, so it has code for just that. Since that code is there anyway, it can just as well be exposed to Perl space.

> And which one is preferred to use?

Well, either you need Encode.pm or you don’t. The built-ins are faster and always loaded, but they only do UTF-8 and if you have invalid data then all you get is a false return value and no other help. If you need anything else you pay the memory and take the speed hit of Encode.pm. (If you are working on a large application, chances are high that you have Encode.pm loaded anyway.)

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>
Re: UTF-8 encoding & decoding
* Pali Rohár <pali.ro...@gmail.com> [2016-05-06 14:50]:
> 1. What is difference between those two calls?
>
> utf8::encode($str);
>
> and
>
> $str = Encode::encode('utf8', $str);
>
> 2. What is difference between those?
>
> utf8::decode($str);
> $str = Encode::decode_utf8($str);

They do the same thing with different interfaces. utf8::encode/decode modify a string in-place and return a boolean to signal success or not. Encode.pm returns a copy and can be configured to do a range of things with invalid input, from converting invalid bytes to replacement marks to throwing an exception.

> 3. Where is implementation of utf8::encode/decode functions? It is not
> in utf8.pm, nor in utf8_heavy.pl and also not in unicore/Heavy.pl. And
> what those functions doing?

They are part of the perl interpreter and defined in universal.c as thin wrappers around code ultimately from sv.c.

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>
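As a rough illustration of the two interfaces described above, the following sketch encodes the same string both ways and then feeds each decoder an invalid byte. All variable names are invented for the example. The choice of `Encode::FB_CROAK` is just one of the CHECK settings Encode.pm offers.

```perl
use strict;
use warnings;
use Encode ();

my $chars = "caf\x{E9}";    # 4 characters

# Encode.pm returns an encoded copy and (with the default CHECK
# value) leaves its argument alone.
my $copy = Encode::encode('UTF-8', $chars);

# utf8::encode rewrites the string in place.
my $in_place = $chars;
utf8::encode($in_place);

# utf8::decode also works in place and reports success as a boolean.
my $round_trip = $in_place;
my $ok = utf8::decode($round_trip);

# Invalid input: the built-in only says "no" ...
my $bad    = "\xFF";
my $bad_ok = utf8::decode($bad);

# ... whereas Encode.pm can be asked to throw an exception instead.
my $bad_copy = "\xFF";
my $croaked  = !eval { Encode::decode('UTF-8', $bad_copy, Encode::FB_CROAK); 1 };

printf "copy and in-place agree: %s\n", $copy eq $in_place ? 'yes' : 'no';
printf "round trip ok: %s\n", ($ok && $round_trip eq $chars) ? 'yes' : 'no';
printf "built-in accepted bad input: %s\n", $bad_ok ? 'yes' : 'no';
printf "Encode croaked on bad input: %s\n", $croaked ? 'yes' : 'no';
```

Note that with a CHECK value such as `FB_CROAK`, Encode.pm may modify the source string it is given, which is why the sketch passes it a scratch copy.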
Re: Choice of BOM for UTF-16 encoding
* Geoffrey Leach <ge...@hughes.net> [2014-02-10 07:35]:
> Is there a way to force (from my module) the choice to be LE? It turns out that the library I'm supporting (taglib) works in LE.

Does it need a BOM prepended? If not, just do the obvious and `encode('UTF-16LE', $str)`. Cf. `perldoc Encode::Unicode`.

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>
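A small sketch of the options, for comparison. The helper `hex_dump` and the sample string are made up for the example. Prepending U+FEFF by hand is one way to get a little-endian BOM if the consuming library turns out to want one, but whether taglib does is not something this thread settles.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: show a byte string as space-separated hex.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

my $str = "A\x{263A}";

# Explicit byte order, no BOM.
my $le_no_bom = Encode::encode('UTF-16LE', $str);

# Explicit byte order with a BOM: encode U+FEFF along with the text.
my $le_with_bom = Encode::encode('UTF-16LE', "\x{FEFF}" . $str);

# Plain 'UTF-16' picks big-endian and prepends a BOM by itself.
my $generic = Encode::encode('UTF-16', $str);

print 'UTF-16LE:       ', hex_dump($le_no_bom),   "\n";
print 'UTF-16LE + BOM: ', hex_dump($le_with_bom), "\n";
print 'UTF-16:         ', hex_dump($generic),     "\n";
```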
Re: Matching upper ASCII characters in RE patterns
* Jonathan Pool <p...@utilika.org> [2010-11-30 23:50]:
> As documented in http://rt.perl.org/rt3/Public/Bug/Display.html?id=80030 there seems to be a problem when use encoding 'utf8' is removed and replaced with use utf8, so the problem is not limited to the encoding pragma.

However, you can expect the `utf8` pragma to be fixed – though that won’t help you right now. The `encoding` pragma OTOH is irretrievably broken.

(There is also consensus that source files in arbitrary encodings are not a sane idea anyway; if you need more than ASCII, your code should be in UTF-8 and you should `use utf8`. So no, there is no replacement for that aspect of the `encoding` pragma coming down the pipe either, now or ever.)

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>
Re: Am I correct in thinking that the only way to get ord() to return a value over 256 is to send the character as a Unicode string instead of a byte string?
* Dan Muey <d...@cpanel.net> [2010-10-28 21:55]:
> For example, note the differences in output between a unicode string and a byte string regarding character 257, as a unicode string it is 257, as a byte string it is 196.

That is not what’s going on.

    $ perl -E'say ord 1234'
    49

When you pass a multi-character string to `ord`, you get the code point of the first character.

    $ perl -E'say chr 49'
    1

In your case you get 196. That is 0xC4, or the character Ä. It is not the character ā (U+0101 = code point 257).

0xC4 is the value of the first byte in the two-byte UTF-8 sequence that encodes the character 257. You are passing a string containing a representation of those bytes as two characters to `ord`, and `ord` is giving you the code point of the first byte-as-character.

You are missing the rest of the bytes from the UTF-8 encoding. You are losing data. If you try this on more code points you will find that there are *lots* of different characters that are reported as 196 – because they get encoded as multi-byte sequences that all start with the byte value 0xC4.

-- 
*AUTOLOAD=*_;sub _{s/::([^:]*)$/print$1,(,$\/, )[defined wantarray]/e;chop;$_}
Just-another-Perl-hack;
#Aristotle Pagaltzis // http://plasmasturm.org/
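The point can be reproduced in a few lines. This is only a sketch with invented variable names. It builds the character U+0101, encodes a copy to UTF-8, and compares what `ord` reports for each.

```perl
use strict;
use warnings;

my $char  = chr 257;          # one character, U+0101
my $bytes = $char;
utf8::encode($bytes);         # now the two UTF-8 bytes C4 81

printf "ord of character string: %d\n", ord $char;
printf "ord of byte string:      %d\n", ord $bytes;
printf "length of byte string:   %d\n", length $bytes;
printf "second byte:             0x%02X\n", ord substr($bytes, 1, 1);

# Every code point from U+0100 to U+013F encodes with the same lead byte.
my $other = chr 0x13F;
utf8::encode($other);
printf "lead byte for U+013F:    %d\n", ord $other;
```

Looking only at the first byte therefore cannot distinguish any of the 64 characters in that range.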
Re: Silence “Wide character” warning globally one time
* Michael Ludwig <mil...@gmx.de> [2010-07-30 01:20]:
> You need the equivalent of -CO in your script:
>
> binmode STDOUT, ':utf8';

Argh, no, unfortunately not. The `:utf8` layer is bad. It does the equivalent of `_utf8_on` on input and `_utf8_off` on output, without actually decoding or (worse) encoding anything. You want `:encoding(UTF-8)`.

-- 
*AUTOLOAD=*_;sub _{s/(.*)::(.*)/print$2,(,$\/, )[defined wantarray]/e;$1}
Just-another-Perl-hack;
#Aristotle Pagaltzis // http://plasmasturm.org/
Re: Don't use the \C escape in regexes - Why not?
* Michael Ludwig <michael.lud...@xing.com> [2010-05-04 14:55]:
> But wait a second: While URIs are meant to be made of characters, they're also meant to go over the wire, and there are no characters on the wire, only bytes. There is no standard encoding defined for the wire, although UTF-8 has come to be seen as the standard encoding for URIs containing non-ASCII characters.
>
> Perl having two standard encodings (UTF-8 and ISO-8859-1) for text and relying on the internal flag to tell which one is meant to matter, shouldn't the URI module either only accept bytes or only characters? Or rather, provide two different constructors instead of only one trying to be intelligent?
>
>     URI->bytes( $bytes ); # byte string
>     URI->chars( $chars ); # character string
>
> And, in addition, define the character encoding used for serialization.

Yes, exactly. And both methods would use the moral equivalent of a plain `split //` – no trickery such as with `\C`. The only difference between them is that the `chars` method would `encode_utf8` the string first and then encode it blindly, whereas the `bytes` method would leave it as is but then croak if it found a codepoint > 0xFF (since the string is supposed to represent an octet sequence already).

Notably absent in both cases: any dependence on the state of the UTF8 flag of the string.

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>
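To make the proposal concrete, here is a minimal sketch of the two entry points as plain functions. Nothing here is part of the real URI module. The names `escape_chars`, `escape_bytes` and `_escape_octets` are hypothetical, and the set of characters left unescaped is an assumption chosen for brevity.

```perl
use strict;
use warnings;
use Carp qw(croak);
use Encode qw(encode_utf8);

# Hypothetical helper: percent-escape every octet that is not an
# ASCII letter, digit, or one of . _ ~ - (an assumed "safe" set).
sub _escape_octets {
    my ($octets) = @_;
    my $out = '';
    for my $c (split //, $octets) {
        $out .= $c =~ /\A[A-Za-z0-9._~-]\z/ ? $c : sprintf('%%%02X', ord $c);
    }
    return $out;
}

# Character string in: encode to UTF-8 first, then escape blindly.
sub escape_chars {
    my ($chars) = @_;
    return _escape_octets(encode_utf8($chars));
}

# Octet string in: refuse anything that cannot be an octet.
sub escape_bytes {
    my ($bytes) = @_;
    croak 'code point above 0xFF in byte string' if $bytes =~ /[^\x00-\xFF]/;
    return _escape_octets($bytes);
}

print escape_chars("caf\x{E9}"), "\n";
print escape_bytes("caf\xE9"),   "\n";
```

Both functions look only at code points, so upgrading or downgrading the argument beforehand should not change the result.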
Re: Effect of -C command line switch on `warn` and `die`
Hi Michael,

* Michael Ludwig <michael.lud...@xing.com> [2010-04-22 17:00]:
> Consider the following script, the source of which is encoded in UTF-8:

I can’t answer your question, but I do want to suggest that you re-post it to perl5-port...@perl.org – it’s much more likely that someone over there will be able to tell you what’s up with this.

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>
Re: Use case for utf8::upgrade?
* Michael Ludwig <michael.lud...@xing.com> [2010-04-08 09:25]:
> > since upgrading a string increases memory consumption and can significantly slow down regex matches against it.
>
> Is it some copying behind the scenes that increases memory consumption?

Just the simple fact that some characters take multiple bytes to encode in the UTF8-based format.

> Why does that have the potential to significantly slow down regex matches?

Because one byte and one character are no longer the same thing, so if you know you want the 17th character in the string, you can’t say where in memory that is. You have to scan the string.

This sort of access pattern is rare in practice – most operations either just copy the entire string or scan over it one character at a time. But the regex engine is one of those things that sometimes needs to jump around in the string rather than merely scanning linearly. (Perl’s regex engine does some caching to avoid the worst penalties with this, but that in itself also causes slowdown, so there’s a balance to strike.)

> Does that mean that when doing lots of matching, it might be preferable to use byte strings and byte semantics, not character strings and character semantics?

Almost all of the time the performance cost is negligible and not worth sweating at the application code level.

Trying to work on text using byte semantics is a recipe for massive headaches, and an invitation for bugs. It’s doable if you are careful and disciplined, absolutely. But why punish yourself? You gain little, at significant effort.

On 5.12, though, you can get a tiny potential improvement en passant, with basically zero effort. In that case – and only in that case: why not? The gain is small; but the cost is also.

In the other direction, that doesn’t translate. Don’t go micro-optimising your code for this.

> > Under older perls, it’s a question of getting the wrong results in less time and memory, so that’s not an option.
>
> Wrong results? Could you clarify?
> Thanks :-)

Well, you get Latin-1 semantics, eg. upper-/lowercasing will ignore accented characters that fall outside the Latin-1 charset.

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>
Re: Use case for utf8::upgrade?
* Michael Ludwig <michael.lud...@xing.com> [2010-04-07 15:00]:
> Having read Juerd's list of useful advice, I don't understand the reason for its last three items:
>
> • utf8::upgrade before doing lc/lcfirst/uc
> • utf8::upgrade before doing case insensitive matching
> • utf8::upgrade before matching predefined character classes like \w and \s
>
> Can anyone enlighten me on the background of using utf8::upgrade here?

Perl versions up to the upcoming 5.12.0 (I think) are buggy in that they apply ISO-8859-1 semantics to downgraded strings and Unicode semantics to upgraded strings, even when they contain the same data. By upgrading your strings, you make sure that you get Unicode semantics consistently.

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>
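The inconsistency is easy to observe. The sketch below uses invented variable names and assumes it is run without `use feature 'unicode_strings'`, without `use locale`, and without a `use v5.12` or later line that would switch the feature on. It takes one string in both internal representations and applies `uc` and `\w` to each.

```perl
use strict;
use warnings;

my $down = "\xE9";            # e-acute, stored as a single byte
my $up   = $down;
utf8::upgrade($up);           # same character, UTF-8 based storage

printf "strings compare equal:    %s\n", $down eq $up ? 'yes' : 'no';
printf "uc changes downgraded:    %s\n", uc($down) eq "\xC9" ? 'yes' : 'no';
printf "uc changes upgraded:      %s\n", uc($up)   eq "\xC9" ? 'yes' : 'no';
printf "\\w matches downgraded:    %s\n", $down =~ /\w/ ? 'yes' : 'no';
printf "\\w matches upgraded:      %s\n", $up   =~ /\w/ ? 'yes' : 'no';
```

Two strings that compare equal behave differently, which is the reason for the advice to call `utf8::upgrade` before case changes, case-insensitive matching, and character-class matching.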
Re: Use case for utf8::upgrade?
* Gisle Aas <gi...@aas.no> [2010-04-08 00:00]:
> This fix was withdrawn from 5.12.0. Currently you have to use feature 'unicode_strings' to get the sane behaviour in the current lexical scope. Current 'perldoc perlunicode' also says:
>
>     The use feature 'unicode_strings' pragma is intended to always, regardless of platform, force Unicode semantics in a particular lexical scope. In release 5.12, it is partially implemented, applying only to case changes. See The Unicode Bug below.
>
> This means that the utf8::upgrade() advice also applies to perl-5.12.0.

Oh right! That was it. (I couldn’t remember the specifics.)

Well, using `use feature 'unicode_strings';` and not upgrading strings is a better strategy for code that doesn’t need to work under earlier perl versions, since upgrading a string increases memory consumption and can significantly slow down regex matches against it.

Under older perls, it’s a question of getting the wrong results in less time and memory, so that’s not an option.

If you want both, I guess you could do something like

    use constant UNICODE_BUG => ( $] < 5.012 );
    # "!" rather than "not": "not" binds more loosely than the comma
    use if !UNICODE_BUG, feature => 'unicode_strings';
    # ...
    utf8::upgrade( $some_str ) if UNICODE_BUG;

(Note to readers who don’t already know: using a constant here will cause either the conditional or the entire statement to get optimised away depending on its truthiness, so that there won’t be any runtime penalty for the conditionals.)

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>
Re: Character (or byte?) escapes under utf8 pragma
Hi Michael,

I just noticed I never replied to this…

* Michael Ludwig <michael.lud...@xing.com> [2010-03-08 15:50]:
> On 07.03.2010 at 07:39, Aristotle Pagaltzis wrote:
> > Use the \U escape to indicate that you always mean a Unicode code point. Due to other quirks in how \U is implemented, it ends up not triggering the bug that \x would.
>
> How would I use that? I only know about the U specifier for pack:
>
>     my $smiley = pack 'U', 0x263a;

Sorry – I meant \N. Eg in that case,

    my $smiley = "\N{U+263A}";

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>
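For reference, a short sketch comparing the two spellings. The variable names are invented, and the non-breaking space is included because it is the character the original question was about.

```perl
use strict;
use warnings;

my $smiley = "\N{U+263A}";        # escape names the code point directly
my $packed = pack 'U', 0x263A;    # the pack-based spelling from the question
my $nbsp   = "\N{U+A0}";          # non-breaking space

printf "same string either way: %s\n", $smiley eq $packed ? 'yes' : 'no';
printf "smiley code point:      U+%04X\n", ord $smiley;
printf "nbsp code point:        U+%04X\n", ord $nbsp;
printf "nbsp length:            %d\n", length $nbsp;
```

The `\N{U+...}` form needs no `use charnames` and always denotes a code point, whichever width the number has.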
Re: Character (or byte?) escapes under utf8 pragma
Hi Michael,

[ perlbug readers, you will find the nut of the issue in the section marked BUG ]

* Michael Ludwig <michael.lud...@xing.com> [2010-03-03 14:05]:
> For convenience, I have test script source code in UTF-8. The test also deals with non-breaking spaces, which I prefer to keep as character references since they are not visible and might be mistaken by the casual onlooker for ordinary spaces. So I write them as \xa0. Or \x{a0}, or \x{00a0}.
>
> Now I find that they seem to be byte references, not character references.

Perl does not distinguish between bytes and characters. It does distinguish between strings that use a packed byte buffer for storage vs strings that use a variable-width integer sequence for storage, but this is an implementation detail and does not mean anything in terms of semantics. Strings are simply strings in Perl. You cannot tell what kind of data they contain just by looking at them and the UTF8 flag doesn’t tell you either.

> Consider the following test script:
>
>     use strict;
>     use warnings;
>     use utf8; # source code in UTF-8 (Zurück)
>     use open OUT => ':encoding(UTF-8)', ':std';
>     my $str1 = "\xa0Zurück\n";      # byte - bad
>     my $str2 = "\x{a0}Zurück\n";    # should be character, but isn't
>     my $str3 = "\x{00a0}Zurück\n";  # ditto
>     my $str4 = "\xa0" . "Zurück\n"; # upgrading hack, works
>     print $str1, $str2, $str3, $str4;
>     $str1 ne $str2 and die "won't die";
>     $str1 ne $str3 and die "won't die";
>     $str1 ne $str4 and die 'die now, somewhat counter-intuitively';
>
>     "\x{00a0}" does not map to utf8 at t.pl line 11.
>     \xA0Zurück
>     "\x{00a0}" does not map to utf8 at t.pl line 11.
>     \xA0Zurück
>     "\x{00a0}" does not map to utf8 at t.pl line 11.
>     \xA0Zurück
>     Zurück
>     die now, somewhat counter-intuitively at t.pl line 15.

This is definitely a bug.

> The correct version of the string uses implicit upgrading of the byte escape \xa0 to a Unicode character. I've read upgrading should rather be avoided, but here it does the job.

No, upgrading is perfectly fine.

Mixing byte and character data is what should be avoided, because then Perl will assume it’s all characters, which will result in mangling of one of the two kinds of data. Usually the byte data is encoded text, in which case the problem becomes apparent as double-encoded text. But it’s really a problem both ways.

> Am I mistaken in my expectation that while \xa0 should be a byte, \x{a0} and \x{00a0} should be characters? Note that perlretut(1) seems to support this assumption:
>
>     Unicode characters in the range of 128-255 use two hexadecimal digits with braces: \x{ab}. Note that this is different than \xab, which is just a hexadecimal byte with no Unicode significance.
>
>     http://perl.active-venture.com/pod/perlretut-morecharacter.html
>
> But maybe this only refers to these escapes inside regular expressions.

The documentation appears to be wrong. Unfortunately a lot of the documentation of Perl itself is wrong or confused about Perl’s string model.

> Or maybe the utf8 pragma breaks things here? Don't think so, though. If I comment it out, I have to recode my script to Latin1 in order for the strings to be valid.

Yes. This appears to be a utf8 pragma bug or a bug in the parser that shows up in interaction with the utf8 pragma.

== BUG ==

What happens is that the presence of the ü under the utf8 pragma triggers using the variable-width integer sequence format for the string, but the 0xA0 byte from the \x escape gets written into that buffer verbatim, as if it were a packed byte array string. This is wrong and completely broken.

== BUG ==

> Note that the reason I use the utf8 pragma is so I can write Zurück in my source code and automatically have Perl informed that these are characters, not bytes - which is a great convenience. Yeah, it would also work in Latin1, and our editors handle various encodings just fine - but we have a good UTF-8 development environment and there might be characters not representable in Latin1 that I'd like to add to the script source.

Writing source in UTF-8 is a perfectly sane practice. No need to justify it.

> What's your advice for handling this situation more elegantly?

Use the \U escape to indicate that you always mean a Unicode code point. Due to other quirks in how \U is implemented, it ends up not triggering the bug that \x would.

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>
Re: use encoding 'utf8' and \x{00e4} notation
* Michael Ludwig <michael.lud...@xing.com> [2010-02-02 17:35]:
> use encoding 'utf8';

The `encoding` pragma is broken. Do not use it. You want

    use open ':encoding(UTF-8)', ':std';

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>
Re: Determining IO layer set on filehandle
* Michael Ludwig <michael.lud...@xing.com> [2010-01-29 18:30]:
> It appears you can use that information to restore a filehandle configuration:
>
>     # Good: duplicate STDOUT and reconfigure the duplicate.
>     # The global STDOUT is left untouched.
>     sub out_bin_good {
>         open my $fh, '>&STDOUT' or die "dup STDOUT: $!";
>         binmode $fh, ':raw' or die "binmode: $!";
>         print $fh "BINÄR 3\t", @_;
>         print STDERR "* layer: $_\n" for PerlIO::get_layers( $fh );
>     }
>
>     # Also good: save the I/O mode and restore it afterwards.
>     sub out_bin_also_good {
>         my @layers = PerlIO::get_layers( STDOUT );
>         binmode STDOUT, ':raw' or die "binmode: $!";
>         print "BINÄR 4\t", @_;
>         print STDERR "* layer: $_\n" for PerlIO::get_layers( STDOUT );
>         my $layers = join '', map ":$_", @layers;
>         binmode STDOUT, $layers;
>         print STDERR "reset STDOUT to $layers\n";
>         print STDERR "* layer: $_\n" for PerlIO::get_layers( STDOUT );
>     }

Considering the relative complexities of the approaches and the fact that conservation of filehandle state is not a concern in your case, I know which solution *I* would favour…

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>