Re: encoding(UTF16-LE) on Windows

2011-02-02 Thread Erland Sommarskog
Michael Ludwig (mil...@gmx.de) writes:
 For instance, I use Windows exclusively, so Unicode in file names is
 no problem.
 
 Did a quick test:
 
 (v5.12.1) built for MSWin32-x86-multi-thread (so ActiveState)
 
 * a…b.txt
 * not correct
 * doesn't have anything with uni or utf in perl -V
 
OK, so the implementation would have to know that on this platform 
filenames are in UTF-16, on that one they are in UTF-8, and so on.

Not that it is a terribly big deal. In the program where I want to 
support Unicode names, I've already written a module around Win32API::File,
which lets me open a file in Windows and then associate it with 
a file handle.


-- 
Erland Sommarskog, Stockholm, esq...@sommarskog.se


Re: encoding(UTF16-LE) on Windows

2011-01-31 Thread Erland Sommarskog
Michael Ludwig (mil...@gmx.de) writes:
 Erland Sommarskog schrieb am 29.01.2011 um 14:02 (+0100):
 
 Yes, there certainly seems to be some more stuff to do in the Unicode
 support in Perl. For instance, support for Unicode filenames in open
 or opendir.
 
 I think there is no portable answer here, as it depends on the
 filesystem's support for Unicode.
 
Did I say it has to be portable? :-)

Obviously, Unicode cannot happen on systems which do not support Unicode.

For instance, I use Windows exclusively, so Unicode in file names is no 
problem. On the other hand, it's a dead case for system() and backticks 
as far as I can make out. (That is, I have not been able to run Unicode 
BAT files.)



-- 
Erland Sommarskog, Stockholm, esq...@sommarskog.se


Re: encoding(UTF16-LE) on Windows

2011-01-31 Thread Michael Ludwig
Erland Sommarskog schrieb am 31.01.2011 um 23:42 (+0100):
 Michael Ludwig (mil...@gmx.de) writes:
  Erland Sommarskog schrieb am 29.01.2011 um 14:02 (+0100):
  
  Yes, there certainly seems to be some more stuff to do in the
  Unicode support in Perl. For instance, support for Unicode
  filenames in open or opendir.
  
  I think there is no portable answer here, as it depends on the
  filesystem's support for Unicode.
  
 Did I say it has to be portable? :-)

No … but Perl did. :-)

 For instance, I use Windows exclusively, so Unicode in file names is
 no problem.

Did a quick test:

  \,,,/
  (o o)
--oOOo-(_)-oOOo--
use strict;
use warnings;
use utf8;
my $fn = 'a…b.txt'; # with a Unicode character
open my $fh, '>:encoding(UTF-8)', $fn or die "open $fn: $!";
print $fh "$fn\n";
close $fh;
-

v5.10.1 (*) built for i686-cygwin-thread-multi-64int

* a…b.txt
* correct (in Explorer, cmd.exe, MinTTY)
* has: CYG17 utf8-paths (which might be responsible)

(v5.12.1) built for MSWin32-x86-multi-thread (so ActiveState)

* a…b.txt
* not correct
* doesn't have anything with uni or utf in perl -V

-- 
Michael Ludwig


Re: encoding(UTF16-LE) on Windows

2011-01-30 Thread Michael Ludwig
Erland Sommarskog schrieb am 29.01.2011 um 14:02 (+0100):

 Yes, there certainly seems to be some more stuff to do in the Unicode
 support in Perl. For instance, support for Unicode filenames in open
 or opendir.

I think there is no portable answer here, as it depends on the
filesystem's support for Unicode.

Or what exactly are you referring to?

-- 
Michael Ludwig


RE: encoding(UTF16-LE) on Windows

2011-01-23 Thread Erland Sommarskog
Jan Dubois (j...@activestate.com) writes:
 You need to stack the I/O layers in the right order.  The :encoding()
 layer needs to come last (be at the bottom of the stack), *after* the
 :crlf layer adds the additional carriage returns.  The way to pop the
 default :crlf layer is to start out with the :raw pseudo-layer: 
 
   open(my $fh, ">:raw:encoding(UTF-16LE):crlf", $filename) or die $!;
 
So this works. But this does not:

   use strict;

   open F, '>slask.out';
   binmode(F, ':raw:encoding(UTF16-LE):crlf');
   print F "Alfa\nBeta\nGamma\n";

Looking at the file in a binary editor, I see:

  41 00 6C 00 66 00 61 00  0D 0A 00 42 00 65 00 74
  00 61 00 0D 0A 00 47 00  61 00 6D 00 6D 00 61 00
  0D 0A 00

In total 35 bytes. Which is a very odd number for a UTF16 file.


-- 
Erland Sommarskog, Stockholm, esq...@sommarskog.se


RE: encoding(UTF16-LE) on Windows

2011-01-21 Thread Erland Sommarskog
Jan Dubois (j...@activestate.com) writes:
 Now when you print a string to the filehandle, then it will be passed
 to the top-most layer first (:crlf), which will s/\n/\r\n/g on the
 string, and then passes it on to the next lower layer :encoding, which
 will do the encoding, and when it reaches the bottom of the stack the
 data is actually written to the filesystem.
 
 Files opened on Windows already have the :crlf layer pushed by default,
 so you somehow need to get the :encoding layer *below* it.  If
 you have it on top, then the crlf substitution happens *after* the
 encoding, leading to incorrect data.
 
There is still one thing that is not clear to me. The incorrect end-of-line
was

  0D 00 0A

But the way you describe it, I would expect it to be 

  0D 0A 00

That is, first the string is encoded in UTF-16LE and the newline gets
expanded from 0A to 0A 00. 

Next, the crlf layer jumps in and blindly adds a carriage return. Yet 
somehow it manages to get the \r character right, while losing the 
high byte of the \n.

-- 
Erland Sommarskog, Stockholm, esq...@sommarskog.se


RE: encoding(UTF16-LE) on Windows

2011-01-21 Thread Jan Dubois
On Fri, 21 Jan 2011, Erland Sommarskog wrote:
 
 There is still one thing that is not clear to me. The incorrect end-of-line
 was
 
   0D 00 0A
 
 But the way you describe it, I would expect it to be
 
   0D 0A 00

I went back to the very first message in the thread, where you write:

| When I open the output in a hex editor I see
|
|   31 00 0D 0A 00 32 00 0D 0A 00 33 0D 0A 00
|
| I would expect to see:
|
|   31 00 0D 00 0A 00 32 00 0D 00 0A 00 33 0D 00 0A 00
|
| That is, I expect \n to be translated to 0D 00 0A 00, now it is translated 
| to three bytes.

  ( from http://code.activestate.com/lists/perl-unicode/3256/ )

So it looks like what you saw is exactly what you expect to see
based on my explanation. :)

I couldn't find any example where you had \r\0\n as a line ending.
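
The byte sequences discussed above can be reproduced without any file I/O. The following is a minimal sketch that imitates the two layer orders with plain string operations and the Encode module. It is not how PerlIO is implemented; `hex_bytes` is a hypothetical helper introduced only for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as upper-case hex pairs.
sub hex_bytes { return uc(join ' ', unpack '(H2)*', $_[0]) }

my $text = "1\n";

# Wrong order: encode first, then expand LF to CR LF on the raw bytes.
(my $wrong = encode('UTF-16LE', $text)) =~ s/\x0A/\x0D\x0A/g;

# Right order: expand the newline first, then encode the result.
(my $right = $text) =~ s/\n/\r\n/g;
$right = encode('UTF-16LE', $right);

print hex_bytes($wrong), "\n";   # 31 00 0D 0A 00
print hex_bytes($right), "\n";   # 31 00 0D 00 0A 00
```

The first line shows the three-byte line ending 0D 0A 00 from the first message of the thread; the second shows the four-byte 0D 00 0A 00 that a UTF-16LE reader expects.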

Cheers,
-Jan




RE: encoding(UTF16-LE) on Windows

2011-01-21 Thread Jan Dubois
I wrote:
 I saw some discussion today that the :raw pseudo-layer in the open()
 call will also remove the buffering layer (it doesn’t do that when you
 use it in a binmode() call). I’ll try to remember to send a followup
 once I actually understand what is going on.

That seems indeed to be the case right now.  The bug is filed here:

http://rt.perl.org/rt3//Public/Bug/Display.html?id=80764

A workaround is to use :raw:perlio instead of :raw to switch to
binary mode without losing the buffering.
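
A minimal sketch of that workaround, assuming a temporary file from File::Temp stands in for the real output file. The file is read back in :raw mode only to inspect the bytes that were written.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Assumption for this sketch: a throwaway temp file instead of a real target.
my (undef, $filename) = tempfile(UNLINK => 1);

# :raw clears the default layers, :perlio puts the buffering layer back,
# then :encoding and :crlf are pushed in the order discussed in this thread.
open(my $fh, '>:raw:perlio:encoding(UTF-16LE):crlf', $filename)
    or die "open $filename: $!";
print $fh "1\n2\n";
close $fh or die "close $filename: $!";

# Read the bytes back untouched to check the result.
open(my $in, '<:raw', $filename) or die "open $filename: $!";
my $bytes = do { local $/; <$in> };
close $in;

print uc(join ' ', unpack '(H2)*', $bytes), "\n";
```

Each line should end in the four bytes 0D 00 0A 00 if the layers stack as described.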

Cheers,
-Jan




Re: encoding(UTF16-LE) on Windows

2011-01-20 Thread Michael Ludwig
Erland Sommarskog schrieb am 20.01.2011 um 08:29 (-):
 Jan Dubois (j...@activestate.com) writes:
  You need to stack the I/O layers in the right order.  The :encoding()
  layer needs to come last (be at the bottom of the stack), *after* the
  :crlf layer adds the additional carriage returns.  The way to pop the
  default :crlf layer is to start out with the :raw pseudo-layer: 
  
    open(my $fh, ">:raw:encoding(UTF-16LE):crlf", $filename) or die $!;
 
 Certainly not anywhere close to intuitive. And the explanation is even
 more muddy. "Needs to come last" - it is smack in the middle. "after
 the :crlf layer" - it comes before.

The explanation makes sense; so much so that I overlooked the fact that
this is simply not how it works. Luckily, you were being vigilant. :-)

What I can imagine is that handling the logical entity \n is some sort
of post-processing step, which would explain why it needs to come last.

Here's a short demo script to show various layer combinations and how
they go wrong:

  \,,,/
  (o o)
--oOOo-(_)-oOOo--
use strict;

my $str = "1\n2\n3\n";  # string to print
my $fno = 1;            # counter for filenames

sub out {
  my $fn = sprintf 'u%02u-%s.txt', $fno++, (join '-', @_) || 'NONE';
  my $layers = join '', map ":$_", @_;
  printf STDERR "%30s => %-40s\n", $layers, $fn;
  open my $fh, ">$layers", $fn or die "open $fn: $!";
  print $fh $str;
  close $fh;
}

my $e = 'encoding(UTF-16LE)';
my $r = 'raw';
my $n = 'crlf';

out;            # default layers
out $r;         # reset default layers
out $r, $n;     # same as default on Windows
out $n, $r;     # :raw at the end resets *all* layers
out $e, $r;     # ditto
out $n, $e, $r; # ditto
out $e, $n, $r; # ditto
out $r, $e, $n; # appears illogical, but correct result
out $r, $n, $e; # appears logical, but wrong result
out $e, $n;
out $n, $e;
out $n, $r, $e; # :crlf reset
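
To compare the files such a script produces, their raw bytes have to be inspected. The following is a small sketch of one way to do that. `hex_dump` is a hypothetical helper, not part of the script above, and the glob pattern assumes the u<NN>-*.txt naming used there.

```perl
use strict;
use warnings;

# Hypothetical helper: return the raw bytes of a file as upper-case hex pairs.
sub hex_dump {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "open $path: $!";
    my $bytes = do { local $/; <$fh> };
    close $fh;
    return uc(join ' ', unpack '(H2)*', $bytes);
}

# Dump every file in the current directory that matches the naming pattern.
for my $path (sort glob 'u[0-9][0-9]-*.txt') {
    printf "%-45s %s\n", $path, hex_dump($path);
}
```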

-- 
Michael Ludwig


RE: encoding(UTF16-LE) on Windows

2011-01-20 Thread Jan Dubois
On Thu, 20 Jan 2011, Erland Sommarskog wrote:
 One can sense some potential for improvements. Not the least in the
 documentation area.

This is open source.  Patches welcome!  This is how things get better.

Cheers,
-Jan



Re: encoding(UTF16-LE) on Windows

2011-01-20 Thread 'Michael Ludwig'
[RE: encoding(UTF16-LE) on Windows]
Jan Dubois schrieb am 20.01.2011 um 12:45 (-0800):
 On Thu, 20 Jan 2011, Michael Ludwig wrote:
  Erland Sommarskog schrieb am 20.01.2011 um 08:29 (-):
   Jan Dubois (j...@activestate.com) writes:
    You need to stack the I/O layers in the right order.  The :encoding()
    layer needs to come last (be at the bottom of the stack), *after* the
    :crlf layer adds the additional carriage returns.  The way to pop the
    default :crlf layer is to start out with the :raw pseudo-layer:
   
      open(my $fh, ">:raw:encoding(UTF-16LE):crlf", $filename) or die $!;
  
   Certainly not anywhere close to intuitive. And the explanation is even
   more muddy. "Needs to come last" - it is smack in the middle. "after
   the :crlf layer" - it comes before.
  
  The explanation makes sense; so much so that I overlooked the fact that
  this is simply not how it works. Luckily, you were being vigilant. :-)
 
 Would you mind explaining how it is *not* working the way I
 described it above?

Sorry - it works exactly the way you described above. I didn't read
properly. I got confused by the uniform look of real and pseudo layers.
The :raw pseudo layer is not a layer, but rather, as you write, an
instruction to clear the stack, like this:

  :raw                  -> clear()
  :encoding(UTF-16LE)   -> push( encoding(UTF-16LE) )
  :crlf                 -> push( crlf )

I was *wrongly* thinking this, as if :raw were another layer, and not a
clearing instruction:

  :raw                  -> push( raw )  # wrong!
  :encoding(UTF-16LE)   -> push( encoding(UTF-16LE) )
  :crlf                 -> push( crlf )
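
The clear-and-push reading can be written down as a toy model. This is only a sketch of the mental picture, not real PerlIO: `apply_layers` is a hypothetical function, the default stack is simplified to a single :crlf layer (assumed Windows default), and :raw is modeled the way it behaves in open(), not in binmode().

```perl
use strict;
use warnings;

# Toy model: apply a layer string to a list of layer names.
# Index 0 is the bottom of the stack (closest to the file).
sub apply_layers {
    my ($spec, @stack) = @_;
    for my $layer (grep { length } split /:/, $spec) {
        if ($layer eq 'raw') { @stack = () }           # pseudo-layer: clear
        else                 { push @stack, $layer }   # real layer: push
    }
    return @stack;
}

my @default = ('crlf');   # simplified stand-in for the Windows default

# Printed bottom first, top last.
print join(' | ', apply_layers(':raw:encoding(UTF-16LE):crlf', @default)), "\n";
print join(' | ', apply_layers(':encoding(UTF-16LE)', @default)), "\n";
```

In this model the first spec leaves encoding(UTF-16LE) at the bottom with crlf on top, while the second leaves crlf at the bottom, which is the order that produces the broken line endings.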

Regarding your explanation:

 I realize that the fact that layers work as a stack may be
 confusing, which is why I annotated "last" with "bottom of the stack".
 Of course the one last on the stack is the first in the list of layers
 passed to open() because stacks are LIFO (last in/first out):
 
   :raw                - clears the existing :crlf layer from the stack
                         could have used :pop instead, but :raw is more robust
 
   :encoding(UTF-16LE) - pushes the :encoding layer to the stack.  This makes
                         it the last layer on the stack (and also still the
                         first, for now).
 
   :crlf               - pushes the :crlf layer on the stack.  :encoding is
                         still the last layer, but :crlf is now the first.
 
 Now when you print a string to the filehandle, then it will be passed
 to the top-most layer first (:crlf), which will s/\n/\r\n/g on the
 string, and then passes it on to the next lower layer :encoding, which
 will do the encoding, and when it reaches the bottom of the stack the
 data is actually written to the filesystem.
 
 Files opened on Windows already have the :crlf layer pushed by default,
 so you somehow need to get the :encoding layer *below* it.  If
 you have it on top, then the crlf substitution happens *after* the
 encoding, leading to incorrect data.

I think you've clarified it for all eternity.

What would be the best place to add your explanation to the docs?

http://perldoc.perl.org/functions/binmode.html
http://perldoc.perl.org/functions/open.html
http://perldoc.perl.org/perlunicode.html
http://perldoc.perl.org/PerlIO.html

Judging from existing content, I think PerlIO would be a good place for
this addition. It already has a lot of great information. However, it
starts in medias res instead of first providing an overview and
introducing the stack picture. This could be improved.

On the downside, it is buried in the Modules Section. And the title [1]
is just too technical and might scare novice readers away.

Can you think of a better place for your user-friendly doc addition? You
obviously know the docs far better than I do … :-)

[1] PerlIO - On demand loader for PerlIO layers
and root of PerlIO::* name space

-- 
Michael Ludwig


Re: encoding(UTF16-LE) on Windows

2011-01-20 Thread Bob Hallissy


Jan Dubois wrote:

Files opened on Windows already have the :crlf layer pushed by default,
so you somehow need to get the :encoding layer *below* it.


Is it possible to re-write the working statement

  open(my $fh, ">:raw:encoding(UTF-16LE):crlf", $filename) or die $!;

in a way that works correctly on any platform (without referring to $^O) ?

Bob


Re: encoding(UTF16-LE) on Windows

2011-01-19 Thread Michael Ludwig
Erland Sommarskog schrieb am 17.01.2011 um 13:57 (-):
 I'm on Windows and I have this small script:
 
   use strict;
   open F, '>:encoding(UTF-16LE)', "slask2.txt";
   print F "1\n2\n3\n";
   close F;
 
 When I open the output in a hex editor I see
 
   31 00 0D 0A 00 32 00 0D 0A 00 33 0D 0A 00

In other words (od -c):

1  \0  \r  \n  \0   2  \0  \r  \n  \0   3  \0  \r  \n  \0

 I would expect to see:
 
   31 00 0D 00 0A 00 32 00 0D 00 0A 00 33 0D 00 0A 00

Guess you would even expect:

…   33 00 0D 00 0A 00

 That is, I expect \n to be translated to 0D 00 0A 00, now it is
 translated to three bytes.

It looks like a bug to me. I'm getting the same result as you for:

* ActivePerl 5.10.1
* ActivePerl 5.12.1
* Strawberry 5.12.0

All three distributions show correspondingly wrong results for UTF-16BE.
And also for UTF-16, which just adds the BOM.

Perl/Cygwin 5.10.1 does fine because its OS is cygwin, so it doesn't
translate \n to CRLF.

-- 
Michael Ludwig


RE: encoding(UTF16-LE) on Windows

2011-01-19 Thread Jan Dubois
On Wed, 19 Jan 2011, Michael Ludwig wrote:
 Erland Sommarskog schrieb am 17.01.2011 um 13:57 (-):
  I'm on Windows and I have this small script:
 
    use strict;
    open F, '>:encoding(UTF-16LE)', "slask2.txt";
    print F "1\n2\n3\n";
    close F;
 
  When I open the output in a hex editor I see
 
31 00 0D 0A 00 32 00 0D 0A 00 33 0D 0A 00
 
 
 It looks like a bug to me. I'm getting the same result as you for:
 
 * ActivePerl 5.10.1
 * ActivePerl 5.12.1
 * Strawberry 5.12.0
 
 All three distributions show correspondingly wrong results for UTF-16BE.
 And also for UTF-16, which just adds the BOM.
 
 Perl/Cygwin 5.10.1 does fine because its OS is cygwin, so it doesn't
 translate \n to CRLF.

You need to stack the I/O layers in the right order.  The :encoding() layer
needs to come last (be at the bottom of the stack), *after* the :crlf layer
adds the additional carriage returns.  The way to pop the default :crlf
layer is to start out with the :raw pseudo-layer:

  open(my $fh, ">:raw:encoding(UTF-16LE):crlf", $filename) or die $!;
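
The effect of the layer order can be checked by writing the same string through both orders and reading the bytes back. This is a sketch under stated assumptions: `written_bytes` is a hypothetical helper, temporary files stand in for the real output, and the second call spells out :raw:crlf explicitly to imitate the Windows default on any platform.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write $text through the given layers and
# return the resulting file contents as upper-case hex pairs.
sub written_bytes {
    my ($layers, $text) = @_;
    my (undef, $filename) = tempfile(UNLINK => 1);
    open(my $out, ">$layers", $filename) or die "open $filename: $!";
    print $out $text;
    close $out or die "close $filename: $!";
    open(my $in, '<:raw', $filename) or die "open $filename: $!";
    my $bytes = do { local $/; <$in> };
    close $in;
    return uc(join ' ', unpack '(H2)*', $bytes);
}

# :encoding at the bottom, :crlf on top, as recommended above.
print written_bytes(':raw:encoding(UTF-16LE):crlf', "1\n"), "\n";

# :crlf at the bottom, :encoding on top, imitating the Windows default
# stack with :encoding pushed onto it.
print written_bytes(':raw:crlf:encoding(UTF-16LE)', "1\n"), "\n";
```

The first call should yield 31 00 0D 00 0A 00 and the second the broken 31 00 0D 0A 00 reported at the start of the thread.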

Cheers,
-Jan



Re: encoding(UTF16-LE) on Windows

2011-01-19 Thread 'Michael Ludwig'
Jan Dubois schrieb am 19.01.2011 um 11:08 (-0800):

 You need to stack the I/O layers in the right order.  The :encoding()
 layer needs to come last (be at the bottom of the stack), *after* the
 :crlf layer adds the additional carriage returns.  The way to pop the
 default :crlf layer is to start out with the :raw pseudo-layer:
 
   open(my $fh, ">:raw:encoding(UTF-16LE):crlf", $filename) or die $!;

Cool, that works. Thanks! :-)

-- 
Michael Ludwig