RE: encoding(UTF16-LE) on Windows
On Fri, 21 Jan 2011, Erland Sommarskog wrote:

| There is still one thing that is not clear to me. The incorrect
| end-of-line was 0D 00 0A. But the way you describe it, I would expect
| it to be 0D 0A 00.

I went back to the very first message in the thread, where you write:

| When I open the output in a hex editor I see
|
|     31 00 0D 0A 00 32 00 0D 0A 00 33 00 0D 0A 00
|
| I would expect to see:
|
|     31 00 0D 00 0A 00 32 00 0D 00 0A 00 33 00 0D 00 0A 00
|
| That is, I expect \n to be translated to 0D 00 0A 00; now it is
| translated to three bytes.

(from http://code.activestate.com/lists/perl-unicode/3256/)

So it looks like what you saw is exactly what you expected to see based
on my explanation. :) I couldn't find any example where you had \r\0\n
as a line ending.

Cheers,
-Jan
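The byte sequences discussed above can be double-checked outside of Perl. The following is a small Python sketch (used here purely as a convenient way to inspect bytes; the bug itself is in Perl's I/O layer stacking): it reproduces the buggy output by mimicking a CRLF translation applied *after* UTF-16LE encoding, and the expected output by translating before encoding.

```python
# The bug under discussion: the :crlf layer runs on the already-encoded
# byte stream, inserting a lone 0x0D before each 0x0A *byte*.
buggy = "1\n2\n3\n".encode("utf-16-le").replace(b"\n", b"\r\n")
print(buggy.hex(" "))    # 31 00 0d 0a 00 32 00 0d 0a 00 33 00 0d 0a 00

# Correct order: translate "\n" to CRLF first, then encode to UTF-16LE,
# so CR and LF each get their own two-byte code unit.
correct = "1\r\n2\r\n3\r\n".encode("utf-16-le")
print(correct.hex(" "))  # 31 00 0d 00 0a 00 32 00 0d 00 0a 00 33 00 0d 00 0a 00
```

The single-byte 0x0D inserted into two-byte code units is exactly the `0D 0A 00` pattern quoted in the thread.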
RE: encoding(UTF16-LE) on Windows
I wrote:

| I saw some discussion today that the :raw pseudo-layer in the open()
| call will also remove the buffering layer (it doesn't do that when you
| use it in a binmode() call). I'll try to remember to send a followup
| once I actually understand what is going on.

That does indeed seem to be the case right now. The bug is filed here:

    http://rt.perl.org/rt3//Public/Bug/Display.html?id=80764

A workaround is to use :raw:perlio instead of :raw to switch to binary
mode without losing the buffering.

Cheers,
-Jan
RE: encoding(UTF16-LE) on Windows
On Thu, 20 Jan 2011, Erland Sommarskog wrote:

| One can sense some potential for improvements. Not the least in the
| documentation area.

This is open source. Patches welcome! This is how things get better.

Cheers,
-Jan
RE: encoding(UTF16-LE) on Windows
On Wed, 19 Jan 2011, Michael Ludwig wrote:

| Erland Sommarskog schrieb am 17.01.2011 um 13:57 (-):
|
| | I'm on Windows and I have this small script:
| |
| |     use strict;
| |     open F, '>:encoding(UTF-16LE)', 'slask2.txt';
| |     print F "1\n2\n3\n";
| |     close F;
| |
| | When I open the output in a hex editor I see
| |
| |     31 00 0D 0A 00 32 00 0D 0A 00 33 00 0D 0A 00
| |
| | It looks like a bug to me.
|
| I'm getting the same result as you for:
|
| * ActivePerl 5.10.1
| * ActivePerl 5.12.1
| * Strawberry 5.12.0
|
| All three participants show correspondingly wrong results for
| UTF-16BE. And also for UTF-16, which just adds the BOM.
|
| Perl/Cygwin 5.10.1 does fine because its OS is cygwin, so it doesn't
| translate \n to CRLF.

You need to stack the I/O layers in the right order. The :encoding()
layer needs to come last (be at the bottom of the stack), *after* the
:crlf layer adds the additional carriage returns.

The way to pop the default :crlf layer is to start out with the :raw
pseudo-layer:

    open(my $fh, '>:raw:encoding(UTF-16LE):crlf', $filename) or die $!;

Cheers,
-Jan
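The ordering requirement, newline translation first, then encoding, is not Perl-specific. As a cross-check, here is a small Python sketch (the filename `slask2.txt` is reused from the thread; Python's text layer happens to apply the `newline` translation before the encoder, which is the same ordering the corrected Perl layer stack produces):

```python
import os
import tempfile

# Write "1\n2\n3\n" through a text layer that first translates "\n" to
# CRLF and then encodes to UTF-16LE.
path = os.path.join(tempfile.mkdtemp(), "slask2.txt")
with open(path, "w", encoding="utf-16-le", newline="\r\n") as f:
    f.write("1\n2\n3\n")

# Read back the raw bytes: each of CR and LF occupies a full two-byte
# UTF-16LE code unit.
with open(path, "rb") as f:
    data = f.read()
print(data.hex(" "))  # 31 00 0d 00 0a 00 32 00 0d 00 0a 00 33 00 0d 00 0a 00
```

This matches the `0D 00 0A 00` line endings the original poster expected to see.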
RE: utf8 pragma, lexical scope
On Thu, 09 Sep 2010, Michael Ludwig wrote:

| What does not work, however, is to have a variable $käse under utf8
| and then try to refer to it from inside a "no utf8" block, using
| either encoding. Without the utf8 pragma, identifiers are not allowed
| to have funny characters. (Yes, it was a stupid exercise.)

The Perl parser is internally not UTF8-clean, so I would recommend not
using non-ASCII characters in variable names for now, even if it looks
like it mostly works under utf8. From perltodo.pod:

| =head2 Properly Unicode safe tokeniser and pads.
|
| The tokeniser isn't actually very UTF-8 clean. C<use utf8;> is a hack -
| variable names are stored in stashes as raw bytes, without the utf-8
| flag set. The pad API only takes a C<char *> pointer, so that's all
| bytes too. The tokeniser ignores the UTF-8-ness of C<PL_rsfp>, or any
| SVs returned from source filters. All this could be fixed.

Cheers,
-Jan
RE: Win32 *W functions and old -C behavior
Oleg writes:

| Once upon a time using -C on Win32 made Perl use *W functions, but
| after several versions it was removed, causing all kinds of headaches
| to people who used it in their programs and hoped that they wouldn't
| have problems any longer with accessing filenames written in different
| scripts. Right now I'm writing a module that has to do all kinds of
| unperlish stuff like direct access to memory, pointer arithmetic and
| API calls to get such functionality back, and I often wonder just why
| it was removed without any alternative way to ask Perl to use native
| calls (since all *A calls on any NT system are just wrappers around
| *W). [...]

This has already been discussed some time ago:

    http://www.mhonarc.org/archive/html/perl-unicode/2004-02/msg00016.html

Please read the whole thread to see why the code was first disabled and
then removed, and why it is a big task to make Perl work properly with
Unicode filenames (not just on Windows, but on all platforms).

Here is a message that shows how you can open a file whose name can only
be represented in Unicode:

    http://www.mhonarc.org/archive/html/perl-unicode/2005-02/msg00010.html

Cheers,
-Jan
RE: Perl and unicode file names
On Thu, 24 Feb 2005, Ed Batutis wrote:

| | So the problem I have is how to proceed. Should I give up with Perl
| | and use Java or C? Any suggestions gratefully received.
|
| I started a really 'fun' flame war on this topic several months ago,
| so I hesitate to say anything more. But, yes, you should give up on
| Perl - or run your script on Linux with a utf-8 locale. On Win32, Perl
| internals are converting the filename characters to the system default
| code page. So, you are SOL for what you are trying to do.

Actually, you *can* work around the problems on Windows by using the
Win32API::File and Encode modules. Here is a sample program Gisle came
up with:

    #!perl -w
    use strict;
    use Fcntl qw(O_RDONLY);
    use Win32API::File qw(CreateFileW OsFHandleOpenFd :FILE_ OPEN_EXISTING);
    use Encode qw(encode);

    binmode(STDOUT, ":utf8");

    my $h = CreateFileW(encode("UTF-16LE", "\x{2030}.txt\0"),
                        FILE_READ_DATA, 0, [], OPEN_EXISTING, 0, []);
    my $fd = OsFHandleOpenFd($h, O_RDONLY);
    die if $fd < 0;
    open(my $fh, "<&=$fd");
    binmode($fh, ":encoding(UTF-16LE)");
    while (<$fh>) {
        print $_;
    }
    close($fh) || die;
    __END__

It may be possible to do a similar readdir() emulation as well.
Win32API::File is part of libwin32 and already included in ActivePerl.

Cheers,
-Jan
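For contrast, a runtime whose file APIs take Unicode strings end to end can open such a file directly, with no CreateFileW() workaround. A minimal Python sketch (shown only as a comparison; it assumes a filesystem and locale that can represent the name, e.g. a UTF-8 locale on Linux or NTFS on Windows):

```python
import os
import tempfile

# A name containing U+2030 (PER MILLE SIGN), as in Gisle's sample.
d = tempfile.mkdtemp()
name = "\u2030.txt"

# Create and read back the file using the Unicode name directly.
with open(os.path.join(d, name), "w", encoding="utf-8") as f:
    f.write("hello\n")

assert name in os.listdir(d)  # the name round-trips as Unicode
```

This is essentially the behavior the thread is asking Perl to provide natively.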
RE: Unicode filenames on Windows with Perl >= 5.8.2
| I'm trying to figure out if I can handle Unicode filenames on Windows
| using Perl 5.8.4, and if so, how. [...] So my question is: How can I
| deal with these files? I've tried using Perl scalars containing UTF-8,
| UTF-16LE and UTF-16BE encodings of the filenames, but none of them
| work either. Indeed, if I try to write a new file with a name
| constructed in those ways, then the name of the file actually created
| is simply the sequence of bytes that make up those encodings.

I don't think this is possible from Perl code right now. You need to
call CreateFileW() to open a file with a Unicode name.

If you want to hack something, then I would suggest writing a little XS
module that just swaps out the file handle in a PerlIO* structure. Look
at PerlIOWin32_open() in win32/win32io.c to see how Perl currently
opens a file.

Another quick-and-dirty solution would be to build a custom Perl by
hacking win32/win32.h. If you change the USING_WIDE definition to 1,
you end up with a version of Perl that has the old -C behavior
hardcoded. Remember that this is not really compatible with Perl's
Unicode handling.

Cheers,
-Jan
RE: Unicode filenames on Windows with Perl >= 5.8.2
On Mon, 21 Jun 2004, Steve Hay wrote:

| Jan Dubois wrote:
| | You need to call CreateFileW() to open a file with a Unicode name.
| | If you want to hack something, then I would suggest writing a little
| | XS module that just swaps out the file handle in a PerlIO*
| | structure. Look at PerlIOWin32_open() in win32/win32io.c to see how
| | Perl currently opens a file.
|
| I really need all of Perl's filename handling to be Unicode-savvy, not
| just open(). Or have I mis-understood you?

No, you are correct. I assumed you just wanted to solve a specific
problem, reading a bunch of files with Unicode names.

| | Another quick-and-dirty solution would be to build a custom Perl by
| | hacking win32/win32.h. If you change the USING_WIDE definition to 1,
| | you end up with a version of Perl that has the old -C behavior
| | hardcoded. Remember that this is not really compatible with Perl's
| | Unicode handling.
|
| Reading a previous e-mail from you on this subject
| (http://www.mail-archive.com/[EMAIL PROTECTED]/msg02127.html), it
| seems that there are at least four issues with the old -C behaviour:
|
| 1. It didn't do anything with the UTF8 flag in SVs;
| 2. There are no wide API functions on Win95/98/ME;
| 3. Some core Perl APIs take char *'s, not SV *'s;
| 4. Non-core modules would be affected too.
|
| I would guess that 1 is maybe not too much work? (Just a wild guess -
| I don't really know.)

Probably, but it relies on 3 being implemented first. The char* doesn't
carry the UTF8 flag.

| I must confess that 2 doesn't really bother me since the 9x type
| systems are now a thing of the past (XP onwards are all NT type
| systems, even XP Home Edition).

While I also wish that Win 9x would just cease to exist, I don't think
any core Perl patches would be accepted if they would render Perl
inoperable on those systems. You would have to provide at least a
fallback solution, even if it means creating separate binaries for 9x
and NT Windows systems.

| How much work is involved in 3?

Perl internals pretend to use C runtime routines like open(), fstat()
etc., but reimplement them on some systems to get consistent behavior.
You would need to define a different API that uses SV* instead of char*
for all file/directory name arguments and use that one exclusively. Of
course these would all need to be indirected through the
PERL_IMPLICIT_SYS system so that they can be redefined for individual
operating systems. I'm not sure how much work the implementation is,
but I'm afraid you would also need to spend significant time arguing
about it.

| Regarding 4, is it only Win32 modules that would be affected (where A
| functions would need replacing with W functions), or would others be
| affected too?

Others would be affected too, because they use the redefined open()
function from Perl when they open a file. Of course they would need to
be changed anyway to support Unicode filenames, so maybe this isn't an
issue. You would still need to provide an ASCII interface for the new
Perl Unicode API so that you could funnel the modules' open() calls
through it.

| Given that 3 at least would probably break binary compatibility, I
| guess this sort of thing won't be done any sooner than 5.10 at the
| earliest, but having something done in time for that would be great.
| Is that a realistic possibility, or just wishful thinking?

I think it is possible, but it requires someone to both do the work and
to argue for it on P5P. Without such a champion, I don't see it
happening at all.

Cheers,
-Jan
Re: encoding...
On Sun, 2 Nov 2003 23:24:41 +0000, John Delacour [EMAIL PROTECTED] wrote:

| Question 1. In this script I would like for convenience's sake to use
| variables in the second line, but I don't seem to be able to do so. Am
| I missing something or is it simply not possible?
|
|     $source = 'MacRoman';  # I want to use this in the next line
|     use encoding qw( MacRoman ), STDOUT => qw( utf-8 );
|     $text = "café";
|     print $text;

Should work if you initialize the variable in a BEGIN block:

    BEGIN { $source = 'MacRoman'; }
    use encoding $source, STDOUT => 'utf-8';

"use" is executed at compile time, so variables initialized at runtime
won't be usable.

| Question 2. Is there a way, without using q(), to single-quote a block
| of text as one can double-quote it this way:
|
|     $text = <<EOT;
|     $ome$tuff
|     $ome$tuff
|     $ome$tuff
|     EOT
|
| # I want to be able to quote a block of JIS-encoded stuff (which
| # contains lots of $)

Yes, put single quotes around your EOT marker:

    $text = <<'EOT';

Cheers,
-Jan