RE: encoding(UTF16-LE) on Windows

2011-01-21 Thread Jan Dubois
On Fri, 21 Jan 2011, Erland Sommarskog wrote:
 
 There is still one thing that is not clear to me. The incorrect end-of-line
 was
 
   0D 00 0A
 
 But the way you describe it, I would expect it to be
 
   0D 0A 00

I went back to the very first message in the thread, where you write:

| When I open the output in a hex editor I see
|
|   31 00 0D 0A 00 32 00 0D 0A 00 33 00 0D 0A 00
|
| I would expect to see:
|
|   31 00 0D 00 0A 00 32 00 0D 00 0A 00 33 00 0D 00 0A 00
|
| That is, I expect \n to be translated to 0D 00 0A 00, now it is translated 
| to three bytes.

  ( from http://code.activestate.com/lists/perl-unicode/3256/ )

So it looks like what you saw is exactly what you would expect
based on my explanation. :)

I couldn't find any example where you had \r\0\n as a line ending.

Cheers,
-Jan




RE: encoding(UTF16-LE) on Windows

2011-01-21 Thread Jan Dubois
I wrote:
 I saw some discussion today that the :raw pseudo-layer in the open()
 call will also remove the buffering layer (it doesn’t do that when you
 use it in a binmode() call). I’ll try to remember to send a followup
 once I actually understand what is going on.

That seems indeed to be the case right now.  The bug is filed here:

http://rt.perl.org/rt3//Public/Bug/Display.html?id=80764

A workaround is to use :raw:perlio instead of :raw to turn on
binary mode without losing the buffering.

Cheers,
-Jan




RE: encoding(UTF16-LE) on Windows

2011-01-20 Thread Jan Dubois
On Thu, 20 Jan 2011, Erland Sommarskog wrote:
 One can sense some potential for improvements. Not the least in the
 documentation area.

This is open source.  Patches welcome!  This is how things get better.

Cheers,
-Jan



RE: encoding(UTF16-LE) on Windows

2011-01-19 Thread Jan Dubois
On Wed, 19 Jan 2011, Michael Ludwig wrote:
 Erland Sommarskog schrieb am 17.01.2011 um 13:57 (-):
  I'm on Windows and I have this small script:
 
 use strict;
 open F, '>:encoding(UTF-16LE)', 'slask2.txt';
 print F "1\n2\n3\n";
 close F;
 
  When I open the output in a hex editor I see
 
   31 00 0D 0A 00 32 00 0D 0A 00 33 00 0D 0A 00
 
 
 It looks like a bug to me. I'm getting the same result as you for:
 
 * ActivePerl 5.10.1
 * ActivePerl 5.12.1
 * Strawberry 5.12.0
 
 All three participants show correspondingly wrong results for UTF-16BE.
 And also for UTF-16, which just adds the BOM.
 
 Perl/Cygwin 5.10.1 does fine because its OS is cygwin, so it doesn't
 translate \n to CRLF.

You need to stack the I/O layers in the right order.  The :encoding() layer
needs to come last (be at the bottom of the stack), *after* the :crlf layer
adds the additional carriage returns.  The way to pop the default :crlf
layer is to start out with the :raw pseudo-layer:

  open(my $fh, '>:raw:encoding(UTF-16LE):crlf', $filename) or die $!;

Cheers,
-Jan



RE: utf8 pragma, lexical scope

2010-09-09 Thread Jan Dubois
On Thu, 09 Sep 2010, Michael Ludwig wrote:
 
 What does not work, however, is to have a variable $käse under utf8
 and then try to refer to it from inside a no utf8 block, using either
 encoding. Without the utf8 pragma, identifiers are not allowed to have
 funny characters. (Yes, it was a stupid exercise.)

The Perl parser is internally not UTF8-clean, so I would recommend not
to use non-ASCII characters in variable names for now, even if it looks
like it mostly works under utf8.

From perltodo.pod:

| =head2 Properly Unicode safe tokeniser and pads.
|
| The tokeniser isn't actually very UTF-8 clean. C<use utf8;> is a hack -
| variable names are stored in stashes as raw bytes, without the utf-8 flag
| set. The pad API only takes a C<char *> pointer, so that's all bytes too. The
| tokeniser ignores the UTF-8-ness of C<PL_rsfp>, or any SVs returned from
| source filters.  All this could be fixed.

Cheers,
-Jan




RE: Win32 *W functions and old -C behavior

2006-10-31 Thread Jan Dubois
Oleg writes:
 Once upon a time using -C on Win32 made Perl use *W functions, but
 after several versions it was removed, causing all kinds of headaches
 for people who used it in their programs and hoped that they wouldn't
 have problems any longer with accessing filenames written in different
 scripts. Right now I'm writing a module that has to do all kinds
 of unperlish stuff like direct access to memory, pointer arithmetic
 and API calls to get that functionality back, and I often wonder
 just why it was removed without any alternative way to ask Perl
 to use native calls (since all *A calls on any NT system are
 just wrappers around *W).
[...]

This has already been discussed some time ago:

http://www.mhonarc.org/archive/html/perl-unicode/2004-02/msg00016.html

Please read the whole thread to see why the code was first disabled
and then removed, and why it is a big task to make Perl work properly
with Unicode filenames (not just on Windows, on all platforms).

Here is a message that shows how you can open a file whose
name can only be represented in Unicode:

http://www.mhonarc.org/archive/html/perl-unicode/2005-02/msg00010.html

Cheers,
-Jan




RE: Perl and unicode file names

2005-02-24 Thread Jan Dubois
On Thu, 24 Feb 2005, Ed Batutis wrote:
 So the problem I have is how to proceed. Should I give up with
 Perl and use Java or C? Any suggestions gratefully received.


 I started a really 'fun' flame war on this topic several months ago,
 so I hesitate to say anything more. But, yes, you should give up on
 Perl - or run your script on Linux with a utf-8 locale. On Win32, Perl
 internals are converting the filename characters to the system default
 code page. So, you are SOL for what you are trying to do.

Actually, you *can* work around the problems on Windows by using the
Win32API::File and the Encode module.  Here is a sample program
Gisle came up with:

#!perl -w

use strict;
use Fcntl qw(O_RDONLY);

use Win32API::File qw(CreateFileW OsFHandleOpenFd :FILE_ OPEN_EXISTING);
use Encode qw(encode);

binmode(STDOUT, ":utf8");

my $h = CreateFileW(encode("UTF-16LE", "\x{2030}.txt\0"), FILE_READ_DATA,
                    0, [], OPEN_EXISTING, 0, []);

my $fd = OsFHandleOpenFd($h, O_RDONLY);
die if $fd < 0;
open(my $fh, "<&=$fd");
binmode($fh, ":encoding(UTF-16LE)");
while (<$fh>) {
    print $_;
}
close($fh) || die;
__END__

It may be possible to do similar readdir() emulation as well.

Win32API::File is part of libwin32 and already included in ActivePerl.

Cheers,
-Jan




RE: Unicode filenames on Windows with Perl >= 5.8.2

2004-06-21 Thread Jan Dubois
 I'm trying to figure out if I can handle Unicode filenames on 
 Windows using Perl 5.8.4, and if so, how.

[...]
 
 So my question is: How can I deal with these files?
 
 I've tried using Perl scalars containing UTF-8, UTF-16LE and 
 UTF-16BE encodings of the filenames, but none of them work 
 either.  Indeed, if I try to write a new file with a name 
 constructed in those ways, then the name of the file actually 
 created is simply the sequence of bytes that make up those encodings.

I don't think this is possible from Perl code right now.  You need to
call CreateFileW() to open a file with a Unicode name.  If you want to
hack something, then I would suggest to write a little XS module that
just swaps out the file handle in a PerlIO* structure.  Look at
PerlIOWin32_open() in win32/win32io.c to see how Perl currently opens
a file.

Another quick-and-dirty solution would be to build a custom Perl
by hacking win32/win32.h.  If you change the USING_WIDE definition
to 1 then you end up with a version of Perl that has the old -C
behavior hardcoded.  Remember that this is not really compatible with
Perl's Unicode handling.

Cheers,
-Jan




RE: Unicode filenames on Windows with Perl >= 5.8.2

2004-06-21 Thread Jan Dubois
On Mon, 21 Jun 2004, Steve Hay wrote:
 Jan Dubois wrote:
 You need to call CreateFileW() to open a file with a Unicode name. If
 you want to hack something, then I would suggest to write a little XS
 module that just swaps out the file handle in a PerlIO* structure.
 Look at PerlIOWin32_open() in win32/win32io.c to see how Perl
 currently opens a file.

 I really need all of Perl's filename handling to be Unicode-savvy, not
 just open(). Or have I mis-understood you?

No, you are correct.  I assumed you just wanted to solve a specific problem:
reading a bunch of files with Unicode names.

 Another quick-and-dirty solution would be to build a custom Perl by
 hacking win32/win32.h. If you change the USING_WIDE definition to 1
 then you end up with a version of Perl that has the old -C behavior
 hardcoded. Remember that this is not really compatible with Perl's
 Unicode handling.

 Reading a previous e-mail from you on this subject
 (http://www.mail-archive.com/[EMAIL PROTECTED]/msg02127.html), it
 seems that there are at least four issues with the old -C behaviour:

 1. It didn't do anything with the UTF8 flag in SV's;
 2. There are no wide API functions on Win95/98/ME;
 3. Some core Perl API's take char *'s, not SV *'s;
 4. Non-core modules would be affected too.

 I would guess that 1 is maybe not too much work? (Just a wild guess -
 I don't really know.)

Probably, but it relies on 3 being implemented first. The char* doesn't
carry the UTF8 flag.

 I must confess that 2 doesn't really bother me since the 9x type
 systems are now a thing of the past (XP onwards are all NT type
 systems, even XP Home Edition).

While I also wish that Win 9x would just cease to exist, I don't think
any core Perl patches would be accepted if they would render Perl
inoperable on those systems. You would have to provide at least a
fallback solution, even if it means creating separate binaries for 9x
and NT Windows systems.

 How much work is involved in 3?

Perl internals pretend to use C runtime routines like open(), fstat()
etc., but reimplement them on some systems to get consistent behavior.
You will need to define a different API that uses SV* instead of char*
for all file/directory name arguments and use that one exclusively. Of
course they all need to be indirected through the PERL_IMPLICIT_SYS
system so that they can be redefined for individual operating systems.

I'm not sure how much work the implementation is, but I'm afraid you
would also need to spend significant time arguing about it.

 Regarding 4, is it only Win32 modules that would be affected (where
 A functions would need replacing with W functions), or would
 others be affected too?

Others would be too, because they use the redefined open() function from
Perl if they are opening a file. Of course they would need to be changed
anyways to make them support Unicode filenames, so maybe this isn't an
issue. You would still need to provide an ASCII interface for the new
Perl Unicode API so that you could funnel the module's open() call
through it.

 Given that 3 at least would probably break binary compatibility, I
 guess this sort of thing won't be done any sooner than 5.10 at the
 earliest, but having something done in time for that would be great.
 Is that a realistic possibility, or just wishful thinking?

I think it is possible, but it requires someone to both do the work and
to argue for it on P5P. Without this champion, I don't see it
happening at all.

Cheers,
-Jan




Re: encoding...

2003-11-02 Thread Jan Dubois
On Sun, 2 Nov 2003 23:24:41 +, John Delacour [EMAIL PROTECTED] wrote:

Question 1.

In this script I would like for convenience' sake to use variables in 
the second line, but I don't seem to be able to do so.  Am I missing 
something or is it simply not possible?


$source = 'MacRoman';  # I want to use this in the next line
use encoding qw( MacRoman ), STDOUT => qw( utf-8 );

Should work if you initialize the variable in a BEGIN block:

BEGIN { $source = 'MacRoman'; }
use encoding $source, STDOUT => 'utf-8';

use is executed at compile time, so variables initialized at runtime
won't be usable.

$text = "café";
print $text;


Question 2

Is there a way, without using q(), to single-quote a block of text as 
one can double-quote it this way:

$text = <<EOT;

Yes, put single quotes around your EOT marker:

$text = <<'EOT';

$ome$tuff
$ome$tuff
$ome$tuff
EOT
#

I want to be able to quote a block of JIS-encoded stuff (which 
contains lots of $)

Cheers,
-Jan