Re: encoding(UTF16-LE) on Windows
Michael Ludwig (mil...@gmx.de) writes:
>> For instance, I use Windows exclusively, so Unicode in file names is
>> no problem.
>
> Did a quick test:
>
> (v5.12.1) built for MSWin32-x86-multi-thread (so ActiveState)
> * a…b.txt
> * not correct
> * doesn't have anything with uni or utf in perl -V

OK, so the implementation would have to know that on this platform
filenames are in UTF-16, on that one they are UTF-8, and so on.

Not that it is a terribly big deal. In the program where I want to
support Unicode names, I've already written a module around
Win32API::File, which lets me open a file in Windows and then associate
it with a file handle.

--
Erland Sommarskog, Stockholm, esq...@sommarskog.se
Re: encoding(UTF16-LE) on Windows
Michael Ludwig (mil...@gmx.de) writes:
> Erland Sommarskog wrote on 29.01.2011 at 14:02 (+0100):
>> Yes, there certainly seems to be some more stuff to do in the Unicode
>> support in Perl. For instance, support for Unicode filenames in open
>> or opendir.
>
> I think there is no portable answer here, as it depends on the
> filesystem's support for Unicode.

Did I say it has to be portable? :-) Obviously, Unicode cannot happen on
systems which do not support Unicode.

For instance, I use Windows exclusively, so Unicode in file names is no
problem. On the other hand, it's a dead case for system() and backticks
as far as I can make out. (That is, I have not been able to run Unicode
BAT files.)

--
Erland Sommarskog, Stockholm, esq...@sommarskog.se
Re: encoding(UTF16-LE) on Windows
Erland Sommarskog wrote on 31.01.2011 at 23:42 (+0100):
> Michael Ludwig (mil...@gmx.de) writes:
>> Erland Sommarskog wrote on 29.01.2011 at 14:02 (+0100):
>>> Yes, there certainly seems to be some more stuff to do in the
>>> Unicode support in Perl. For instance, support for Unicode filenames
>>> in open or opendir.
>>
>> I think there is no portable answer here, as it depends on the
>> filesystem's support for Unicode.
>
> Did I say it have to be portable? :-)

No … but Perl did. :-)

> For instance, I use Windows exclusively, so Unicode in file names is
> no problem.

Did a quick test:

      \,,,/
      (o o)
--oOOo-(_)-oOOo--
use strict;
use warnings;
use utf8;

my $fn = 'a…b.txt'; # with a Unicode character
open my $fh, '>:encoding(UTF-8)', $fn or die "open $fn: $!";
print $fh "$fn\n";
close $fh;
-----------------

v5.10.1 (*) built for i686-cygwin-thread-multi-64int
* a…b.txt
* correct (in Explorer, cmd.exe, MinTTY)
* has: CYG17 utf8-paths (which might be responsible)

(v5.12.1) built for MSWin32-x86-multi-thread (so ActiveState)
* a…b.txt
* not correct
* doesn't have anything with uni or utf in perl -V

--
Michael Ludwig
Re: encoding(UTF16-LE) on Windows
Erland Sommarskog wrote on 29.01.2011 at 14:02 (+0100):
> Yes, there certainly seems to be some more stuff to do in the Unicode
> support in Perl. For instance, support for Unicode filenames in open
> or opendir.

I think there is no portable answer here, as it depends on the
filesystem's support for Unicode. Or what exactly are you referring to?

--
Michael Ludwig
RE: encoding(UTF16-LE) on Windows
Jan Dubois (j...@activestate.com) writes:
> You need to stack the I/O layers in the right order. The :encoding()
> layer needs to come last (be at the bottom of the stack), *after* the
> :crlf layer adds the additional carriage returns. The way to pop the
> default :crlf layer is to start out with the :raw pseudo-layer:
>
>     open(my $fh, ">:raw:encoding(UTF-16LE):crlf", $filename) or die $!;

So this works. But this does not:

    use strict;
    open F, '>slask.out';
    binmode(F, ':raw:encoding(UTF16-LE):crlf');
    print F "Alfa\nBeta\nGamma\n";

Looking at the file in a binary editor, I see:

    41 00 6C 00 66 00 61 00 0D 0A 00
    42 00 65 00 74 00 61 00 0D 0A 00
    47 00 61 00 6D 00 6D 00 61 00 0D 0A 00

In total 35 bytes, which is a very odd number for a UTF-16 file.

--
Erland Sommarskog, Stockholm, esq...@sommarskog.se
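The "odd number of bytes" observation can be turned into a quick check that needs no hex editor. The following is only a sketch; the helper names `hexdump` and `has_even_length` are made up for illustration, and the byte string is simply the dump quoted in the message above.

```perl
use strict;
use warnings;

# Hypothetical helper: format raw bytes the way a hex editor shows them.
sub hexdump {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

# Hypothetical helper: a well-formed UTF-16 stream has an even byte count.
sub has_even_length {
    my ($bytes) = @_;
    return length($bytes) % 2 == 0;
}

# The bytes reported for "Alfa\nBeta\nGamma\n" in the message above.
my $broken = pack 'H*', join '', qw(
    41 00 6C 00 66 00 61 00 0D 0A 00
    42 00 65 00 74 00 61 00 0D 0A 00
    47 00 61 00 6D 00 6D 00 61 00 0D 0A 00
);

printf "%d bytes, even length: %s\n",
    length($broken), has_even_length($broken) ? 'yes' : 'no';
print hexdump(substr($broken, 0, 11)), "\n";
```

An even length does not prove the file is valid UTF-16, but an odd one is enough to show that a byte-oriented step ran after the encoding.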
RE: encoding(UTF16-LE) on Windows
Jan Dubois (j...@activestate.com) writes:
> Now when you print a string to the filehandle, then it will be passed
> to the top-most layer first (:crlf), which will s/\n/\r\n/g on the
> string, and then passes it on to the next lower layer :encoding, which
> will do the encoding, and when it reaches the bottom of the stack the
> data is actually written to the filesystem.
>
> Files opened on Windows already have the :crlf layer pushed by
> default, so you somehow need to get the :encoding layer *below* it.
> If you have it on top, then the crlf substitution happens *after* the
> encoding, leading to incorrect data.

There is still one thing that is not clear to me. The incorrect
end-of-line was

    0D 00 0A

But the way you describe it, I would expect it to be

    0D 0A 00

That is, first the string is encoded in UTF-16LE and the newline gets
expanded from 0A to 0A 00. Next, the crlf layer jumps in and blindly
adds a carriage return, but somehow it does manage to get the \r
character correct nevertheless, but loses the high byte of the \n.

--
Erland Sommarskog, Stockholm, esq...@sommarskog.se
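The two possible orders can be compared by hand, without PerlIO, using Encode and a plain substitution. This is a simulation under the assumption that :crlf behaves like a byte-level replacement of 0A when it sits below :encoding; it is not a description of PerlIO internals. The helper `hex_bytes` is a made-up name.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: space-separated hex dump of a byte string.
sub hex_bytes { join ' ', map { sprintf('%02X', ord($_)) } split //, $_[0] }

my $text = "1\n";

# Right order: expand the newline on characters, then encode.
(my $crlf_text = $text) =~ s/\n/\r\n/g;
my $good = encode('UTF-16LE', $crlf_text);

# Wrong order: encode first, then expand every 0A byte, as a
# byte-oriented crlf step running after the encoding would do.
my $bad = encode('UTF-16LE', $text);
$bad =~ s/\x0A/\x0D\x0A/g;

print 'crlf then encode: ', hex_bytes($good), "\n";   # 31 00 0D 00 0A 00
print 'encode then crlf: ', hex_bytes($bad),  "\n";   # 31 00 0D 0A 00
```

The second line shows the three-byte ending 0D 0A 00: a single 0D byte is inserted in front of the 0A, and the 00 that follows is the high byte of the original newline.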
RE: encoding(UTF16-LE) on Windows
On Fri, 21 Jan 2011, Erland Sommarskog wrote:
> There is still one thing that is not clear to me. The incorrect
> end-of-line was
>
>     0D 00 0A
>
> But the way you describe it, I would expect it to be
>
>     0D 0A 00

I went back to the very first message in the thread, where you write:

| When I open the output in a hex editor I see
|
|     31 00 0D 0A 00 32 00 0D 0A 00 33 0D 0A 00
|
| I would expect to see:
|
|     31 00 0D 00 0A 00 32 00 0D 00 0A 00 33 0D 00 0A 00
|
| That is, I expect \n to be translated to 0D 00 0A 00, now it is
| translated to three bytes.

( from http://code.activestate.com/lists/perl-unicode/3256/ )

So it looks like what you saw is exactly what you expect to see based on
my explanation. :) I couldn't find any example where you had \r\0\n as a
line ending.

Cheers,
-Jan
RE: encoding(UTF16-LE) on Windows
I wrote:
> I saw some discussion today that the :raw pseudo-layer in the open()
> call will also remove the buffering layer (it doesn’t do that when you
> use it in a binmode() call). I’ll try to remember to send a followup
> once I actually understand what is going on.

That seems indeed to be the case right now. The bug is filed here:

    http://rt.perl.org/rt3//Public/Bug/Display.html?id=80764

A workaround is to use :raw:perlio instead of :raw to switch to binary
mode without losing the buffering.

Cheers,
-Jan
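A small sketch of the :raw:perlio workaround, for anyone who wants to see which layers end up on the handle. The temporary file and its contents are invented for the example. PerlIO::get_layers is a core function, but the list it reports differs between Perl versions and platforms, so no particular output is assumed here.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small scratch file to read back (contents are arbitrary).
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "line 1\nline 2\n";
close $tmp or die "close $path: $!";

# :raw switches off any translation; the explicit :perlio asks for a
# buffering layer again, following the workaround described above.
open my $fh, '<:raw:perlio', $path or die "open $path: $!";

# Report the layers actually present on this handle.
my @layers = PerlIO::get_layers($fh);
print "layers: @layers\n";

my @lines = <$fh>;
close $fh;
print scalar(@lines), " lines read\n";
```

Comparing the reported layers for '<:raw' and '<:raw:perlio' on the Perl in question is a simple way to tell whether that build is affected by the bug.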
Re: encoding(UTF16-LE) on Windows
Erland Sommarskog wrote on 20.01.2011 at 08:29 (-):
> Jan Dubois (j...@activestate.com) writes:
>> You need to stack the I/O layers in the right order. The :encoding()
>> layer needs to come last (be at the bottom of the stack), *after* the
>> :crlf layer adds the additional carriage returns. The way to pop the
>> default :crlf layer is to start out with the :raw pseudo-layer:
>>
>>     open(my $fh, ">:raw:encoding(UTF-16LE):crlf", $filename) or die $!;
>
> Certainly not anywhere close to intuitive. And the explanation is even
> more muddy. "Needs to come last" - it is smack in the middle. "after
> the :crlf layer" - it comes before.

The explanation makes sense; so much so that I overlooked the fact that
this is simply not how it works. Luckily, you were being vigilant. :-)

What I can imagine is that handling the logical entity \n is some sort
of post-processing step, which would explain why it needs to come last.

Here's a short demo script to show various layer combinations and how
they go wrong:

      \,,,/
      (o o)
--oOOo-(_)-oOOo--
use strict;

my $str = "1\n2\n3\n"; # string to print
my $fno = 1;           # counter for filenames

sub out {
    my $fn = sprintf 'u%02u-%s.txt', $fno++, (join '-', @_) || 'NONE';
    my $layers = join '', map ":$_", @_;
    printf STDERR "%30s => %-40s\n", $layers, $fn;
    open my $fh, ">$layers", $fn or die "open $fn: $!";
    print $fh $str;
    close $fh;
}

my $e = 'encoding(UTF-16LE)';
my $r = 'raw';
my $n = 'crlf';

out;             # default layers
out $r;          # reset default layers
out $r, $n;      # same as default on Windows
out $n, $r;      # :raw at the end resets *all* layers
out $e, $r;      # ditto
out $n, $e, $r;  # ditto
out $e, $n, $r;  # ditto
out $r, $e, $n;  # appears illogical, but correct result
out $r, $n, $e;  # appears logical, but wrong result
out $e, $n;
out $n, $e;
out $n, $r, $e;  # :crlf reset
-----------------

--
Michael Ludwig
RE: encoding(UTF16-LE) on Windows
On Thu, 20 Jan 2011, Erland Sommarskog wrote:
> One can sense some potential for improvements. Not the least in the
> documentation area.

This is open source. Patches welcome! This is how things get better.

Cheers,
-Jan
Re: encoding(UTF16-LE) on Windows
Jan Dubois wrote on 20.01.2011 at 12:45 (-0800):
> On Thu, 20 Jan 2011, Michael Ludwig wrote:
>> Erland Sommarskog wrote on 20.01.2011 at 08:29 (-):
>>> Jan Dubois (j...@activestate.com) writes:
>>>> You need to stack the I/O layers in the right order. The
>>>> :encoding() layer needs to come last (be at the bottom of the
>>>> stack), *after* the :crlf layer adds the additional carriage
>>>> returns. The way to pop the default :crlf layer is to start out
>>>> with the :raw pseudo-layer:
>>>>
>>>>     open(my $fh, ">:raw:encoding(UTF-16LE):crlf", $filename) or die $!;
>>>
>>> Certainly not anywhere close to intuitive. And the explanation is
>>> even more muddy. "Needs to come last" - it is smack in the middle.
>>> "after the :crlf layer" - it comes before.
>>
>> The explanation makes sense; so much so that I overlooked the fact
>> that this is simply not how it works. Luckily, you were being
>> vigilant. :-)
>
> Would you mind explaining how it is *not* working the way I described
> it above?

Sorry - it works exactly the way you described above. I didn't read
properly. I got confused by the uniform look of real and pseudo layers.
The :raw pseudo layer is not a layer, but rather, as you write, an
instruction to clear the stack, like this:

    :raw                -> clear()
    :encoding(UTF-16LE) -> push( encoding(UTF-16LE) )
    :crlf               -> push( crlf )

I was *wrongly* thinking this, as if :raw were another layer, and not a
clearing instruction:

    :raw                -> push( raw ) # wrong!
    :encoding(UTF-16LE) -> push( encoding(UTF-16LE) )
    :crlf               -> push( crlf )

Regarding your explanation:

> I realize that the fact that layers work as a stack may be confusing,
> which is why I annotated "last" with "bottom of the stack". Of course
> the one last on the stack is the first in the list of layers passed to
> open() because stacks are LIFO (last in/first out):
>
>   :raw - clears the existing :crlf layer from the stack
>          could have used :pop instead, but :raw is more robust
>
>   :encoding(UTF-16LE) - pushes the :encoding layer to the stack.
>          This makes it the "last" layer on the stack (and also still
>          the "first", for now).
>
>   :crlf - pushes the :crlf layer on the stack. :encoding is still the
>          "last" layer, but :crlf is now the "first".
>
> Now when you print a string to the filehandle, then it will be passed
> to the top-most layer first (:crlf), which will s/\n/\r\n/g on the
> string, and then passes it on to the next lower layer :encoding, which
> will do the encoding, and when it reaches the bottom of the stack the
> data is actually written to the filesystem.
>
> Files opened on Windows already have the :crlf layer pushed by
> default, so you somehow need to get the :encoding layer *below* it.
> If you have it on top, then the crlf substitution happens *after* the
> encoding, leading to incorrect data.

I think you've clarified it for all eternity. What would be the best
place to add your explanation to the docs?

    http://perldoc.perl.org/functions/binmode.html
    http://perldoc.perl.org/functions/open.html
    http://perldoc.perl.org/perlunicode.html
    http://perldoc.perl.org/PerlIO.html

Judging from existing content, I think PerlIO would be a good place for
this addition. It already has a lot of great information. However, it
starts in medias res instead of first providing an overview and
introducing the stack picture. This could be improved.

On the downside, it is buried in the Modules section. And the title [1]
is just too technical and might scare novice readers away.

Can you think of a better place for your user-friendly doc addition? You
obviously know the docs far better than I do … :-)

[1] PerlIO - On demand loader for PerlIO layers and root of PerlIO::*
    name space

--
Michael Ludwig
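The clear/push picture can be written down as a toy model. This is purely illustrative: `build_stack` and `write_order` are invented names, the model ignores the bottom-most OS layer, and it assumes that ':raw' always empties the stack, which (per the binmode() example earlier in the thread) is not what happens in every real situation.

```perl
use strict;
use warnings;

# Toy model of a layer list: ':raw' clears the stack, every other
# name is pushed. The last element is the top of the stack, which is
# the layer that sees printed data first.
sub build_stack {
    my ($spec, @default) = @_;
    my @stack = @default;
    for my $name (grep { length } split /:/, $spec) {
        if ($name eq 'raw') { @stack = () }
        else                { push @stack, $name }
    }
    return @stack;
}

# Order in which layers process output: top of the stack first.
sub write_order { return reverse @_ }

# 'crlf' stands in for the default layer on Windows.
my @stack = build_stack(':raw:encoding(UTF-16LE):crlf', 'crlf');
print 'stack (bottom to top): ', join(' ', @stack), "\n";
print 'write path:            ', join(' -> ', write_order(@stack)), "\n";
```

Calling `build_stack(':encoding(UTF-16LE)', 'crlf')` in the same model leaves crlf at the bottom, below the encoding, which is the arrangement that produced the broken line endings.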
Re: encoding(UTF16-LE) on Windows
Jan Dubois wrote:
> Files opened on Windows already have the :crlf layer pushed by
> default, so you somehow need to get the :encoding layer *below* it.

Is it possible to rewrite the working statement

    open(my $fh, ">:raw:encoding(UTF-16LE):crlf", $filename) or die $!;

in a way that works correctly on any platform (without referring to
$^O)?

Bob
Re: encoding(UTF16-LE) on Windows
Erland Sommarskog wrote on 17.01.2011 at 13:57 (-):
> I'm on Windows and I have this small script:
>
>     use strict;
>     open F, '>:encoding(UTF-16LE)', "slask2.txt";
>     print F "1\n2\n3\n";
>     close F;
>
> When I open the output in a hex editor I see
>
>     31 00 0D 0A 00 32 00 0D 0A 00 33 0D 0A 00

In other words (od -c):

    1 \0 \r \n \0 2 \0 \r \n \0 3 \0 \r \n \0

> I would expect to see:
>
>     31 00 0D 00 0A 00 32 00 0D 00 0A 00 33 0D 00 0A 00

Guess you would even expect: … 33 00 0D 00 0A 00

> That is, I expect \n to be translated to 0D 00 0A 00, now it is
> translated to three bytes.

It looks like a bug to me. I'm getting the same result as you for:

* ActivePerl 5.10.1
* ActivePerl 5.12.1
* Strawberry 5.12.0

All three participants show correspondingly wrong results for UTF-16BE.
And also for UTF-16, which just adds the BOM.

Perl/Cygwin 5.10.1 does fine because its OS is cygwin, so it doesn't
translate \n to CRLF.

--
Michael Ludwig
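For reference, the byte sequences one would expect for a single "1" followed by CRLF in the three encodings mentioned above can be computed directly with Encode. This is just a sketch; `hex_bytes` is a made-up helper name.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: space-separated hex dump of a byte string.
sub hex_bytes { join ' ', map { sprintf('%02X', ord($_)) } split //, $_[0] }

# Encode the same three characters in each variant and dump the bytes.
for my $enc ('UTF-16LE', 'UTF-16BE', 'UTF-16') {
    printf "%-8s %s\n", $enc, hex_bytes(encode($enc, "1\r\n"));
}
```

With Encode, the plain 'UTF-16' name produces big-endian output with a leading FE FF byte order mark when encoding; the other two names write no BOM.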
RE: encoding(UTF16-LE) on Windows
On Wed, 19 Jan 2011, Michael Ludwig wrote:
> Erland Sommarskog wrote on 17.01.2011 at 13:57 (-):
>> I'm on Windows and I have this small script:
>>
>>     use strict;
>>     open F, '>:encoding(UTF-16LE)', "slask2.txt";
>>     print F "1\n2\n3\n";
>>     close F;
>>
>> When I open the output in a hex editor I see
>>
>>     31 00 0D 0A 00 32 00 0D 0A 00 33 0D 0A 00
>
> It looks like a bug to me. I'm getting the same result as you for:
>
> * ActivePerl 5.10.1
> * ActivePerl 5.12.1
> * Strawberry 5.12.0
>
> All three participants show correspondingly wrong results for
> UTF-16BE. And also for UTF-16, which just adds the BOM.
>
> Perl/Cygwin 5.10.1 does fine because its OS is cygwin, so it doesn't
> translate \n to CRLF.

You need to stack the I/O layers in the right order. The :encoding()
layer needs to come last (be at the bottom of the stack), *after* the
:crlf layer adds the additional carriage returns. The way to pop the
default :crlf layer is to start out with the :raw pseudo-layer:

    open(my $fh, ">:raw:encoding(UTF-16LE):crlf", $filename) or die $!;

Cheers,
-Jan
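A self-contained way to check the recommended layer string is to write a scratch file and read it back as raw bytes. The sketch below assumes CRLF line endings are wanted regardless of platform; the temporary file is created only for the test, and nothing in it refers to $^O.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file for the experiment.
my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp;

# :raw clears the default layers, :encoding goes to the bottom and
# :crlf on top, so "\n" becomes "\r\n" before it is encoded.
open my $out, '>:raw:encoding(UTF-16LE):crlf', $path
    or die "open $path: $!";
print {$out} "1\n2\n3\n";
close $out or die "close $path: $!";

# Read the file back as plain bytes and dump it in hex.
open my $in, '<:raw', $path or die "open $path: $!";
my $bytes = do { local $/; <$in> };
close $in;

my $hex = join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
print length($bytes), " bytes: $hex\n";
```

If the layers are stacked as described, every line ending in the dump should appear as 0D 00 0A 00 and the total length should be even.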
Re: encoding(UTF16-LE) on Windows
Jan Dubois wrote on 19.01.2011 at 11:08 (-0800):
> You need to stack the I/O layers in the right order. The :encoding()
> layer needs to come last (be at the bottom of the stack), *after* the
> :crlf layer adds the additional carriage returns. The way to pop the
> default :crlf layer is to start out with the :raw pseudo-layer:
>
>     open(my $fh, ">:raw:encoding(UTF-16LE):crlf", $filename) or die $!;

Cool, that works. Thanks! :-)

--
Michael Ludwig