Parsing UTF8 files with wide characters

2005-06-15 Thread Robin
I thought I'd understood how to use unicode support in perl, but 
evidently not. In the script below, I'm stumped as to:


1) why the regex won't match ''.
2) why the substitution is carried out, but the result isn't in UTF8, 
nor is it UTF8 re-encoded in UTF8 (uncomment #require Encode; 
... #Encode::decode_utf8($_); to test this )




TIA


Robin



#!/usr/bin/perl -w

use strict;
use diagnostics -verbose;
#require Encode;

binmode (DATA,":utf8");
binmode (STDOUT,":utf8");

for (<DATA>){

if (/([EMAIL PROTECTED])/gs){
print "match: ",$1,"\n";
#Encode::decode_utf8($_);
s/$1//gs;

}elsif(/()/gs){
print "match: ",$1,"\n";
s/$1/12/gs;

}

print;

}




__DATA__
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<HTML>
<HEAD>
<META HTTP-EQUIV="content-type" CONTENT="text/html; charset=utf-8">
<TITLE> A Web Page</TITLE>
</HEAD>
<BODY>
<BLOCKQUOTE>
<H3>news<FONT COLOR="#FF3300">1</FONT></H3>
... and this is a web page.
<P>
<IMG ALT="A Filler" WIDTH=450 HEIGHT=296>
<P>
hidden marker here -<FONT COLOR="#FF3300">[EMAIL PROTECTED]</FONT>--<BR>

</BLOCKQUOTE>
</BODY>
</HTML>




Re: Parsing UTF8 files with wide characters

2005-06-15 Thread Andrew Mace

Try "use utf8" - it lets Perl know that your script contains utf8 chars.

More info: http://perlpod.com/5.9.1/lib/utf8.html


Andrew




Re: Parsing UTF8 files with wide characters

2005-06-15 Thread Sherm Pendley

On Jun 15, 2005, at 2:48 PM, Robin wrote:

I thought I'd understood how to use unicode support in perl, but  
evidently not. In the script below, I'm stumped as to:


1) why the regex won't match ''.
2) why the substitution is carried out, but the result isn't in  
UTF8, nor is it UTF8 re-encoded in UTF8 (uncomment #require  
Encode; ... #Encode::decode_utf8($_); to test this )


The binmode() calls you've included tell Perl that the data coming  
from and going to those file handles is UTF8 encoded.


But, you have UTF8-encoded text in your code, too. To tell Perl about  
that, you need to use the "use utf8;" pragma.
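
A minimal sketch of that advice, not the original script: the sample line and the variable names are made up for illustration, and an in-memory filehandle stands in for DATA. With "use utf8" in effect, the literal in the regex is a single character, so it lines up with input that a UTF-8 layer has already decoded.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;    # this source file is saved as UTF-8, so literals are characters

# Hypothetical sample input: the UTF-8 bytes for U+6708 (月) as they would
# sit in a file on disk.
my $bytes = "news \xE6\x9C\x88\n";

# Reading through a UTF-8 layer decodes the bytes into characters, which is
# the job binmode(DATA, ":utf8") does for the DATA handle.
open my $in, '<:encoding(UTF-8)', \$bytes or die "cannot open: $!";
my $line = <$in>;
close $in;

# The literal below is one character under "use utf8", so it is compared
# with the decoded input character for character.
my $matched      = $line =~ /(月)/ ? 1 : 0;
my $found_length = $matched ? length($1) : 0;

# Encode characters back to UTF-8 bytes on the way out.
binmode STDOUT, ':encoding(UTF-8)';
print "matched: $matched, length of \$1: $found_length\n";
```

Saved as UTF-8 and run, this should report a match whose captured text is one character long.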


sherm--

Cocoa programming in Perl: http://camelbones.sourceforge.net
Hire me! My resume: http://www.dot-app.org



Re: Parsing UTF8 files with wide characters

2005-06-15 Thread Robin

thanks Andrew and Sherm

I went back to look at perluniintro because I was sure I could remember 
reading that the "use utf8" pragma was no longer needed, right under 
where it says this it continues "Only one case remains where an 
explicit "use utf8" is needed: if your Perl script itself is encoded in 
UTF-8"


*sigh*

Robin



Re: Parsing UTF8 files with wide characters

2005-06-15 Thread John Delacour

At 4:26 am +0900 16/6/05, Robin wrote:

I went back to look at perluniintro because I was sure I could 
remember reading that the "use utf8" pragma was no longer needed, 
right under where it says this it continues "Only one case remains 
where an explicit "use utf8" is needed: if your Perl script itself 
is encoded in UTF-8"


Nevertheless (Perl 5.8.6) if you simply comment

#binmode (DATA,":utf8");
#binmode (STDOUT,":utf8");

provided your script is UTF-8 encoded, there is no need for 'use 
utf8'.  The script you posted works fine in that case, as does



$f = "$ENV{HOME}/junk.txt";
open F, ">$f";
print F "月";
close F;
open F, $f;
for (<F>) {/月/ and print}
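
A sketch of why that works, under the assumption that neither side is ever decoded: the literal and the file contents are both plain bytes, so they compare equal byte for byte. Mixing the two views is what breaks the match. The variable names are illustrative, and byte escapes are used so the result does not depend on how this snippet itself is saved.

```perl
use strict;
use warnings;
use Encode qw(decode);

# What perl sees for a literal 月 in a UTF-8 source file without "use utf8":
# three separate bytes.
my $literal = "\xE6\x9C\x88";

# The same three bytes as read from a handle with no :utf8 layer.
my $raw_line = "news \xE6\x9C\x88\n";

# The same line after decoding, as a :utf8 layer would deliver it.
my $decoded_line = decode('UTF-8', $raw_line);

my $raw_match     = $raw_line     =~ /\Q$literal\E/ ? 1 : 0;  # bytes vs bytes
my $decoded_match = $decoded_line =~ /\Q$literal\E/ ? 1 : 0;  # chars vs bytes

printf "raw: %d, decoded: %d, literal length: %d\n",
    $raw_match, $decoded_match, length($literal);
```

The byte-only comparison matches, the decoded line does not, and the "one character" literal has length 3, which is the price of staying in bytes.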

JD


Re: Parsing UTF8 files with wide characters

2005-06-15 Thread Joel Rees


On 2005.6.16, at 05:13 AM, John Delacour wrote:

 At 4:26 am +0900 16/6/05, Robin wrote:

 I went back to look at perluniintro because I was sure I could 
 remember reading that the "use utf8" pragma was no longer needed, 
 right under where it says this it continues "Only one case remains 
 where an explicit "use utf8" is needed: if your Perl script itself is 
 encoded in UTF-8"


 Nevertheless (Perl 5.8.6) if you simply comment

 #binmode (DATA,":utf8");
 #binmode (STDOUT,":utf8");

 provided your script is UTF-8 encoded, there is no need for 'use 
 utf8'.  The script you posted works fine in that case,


Not a good idea. For the time being, and until UTF-8 is established as 
the default encoding for perl (should that ever happen), when your 
source code includes multibyte characters, tell perl so.


I suppose, in a context where you have automatic encoding conversion 
taking place whenever you move code from one environment to another, 
this rule of thumb would not be a rule of thumb. But otherwise, you 
want to do what you can to tell the various things that interpret your 
code what the encoding is. (And blind automatic conversion has its own 
set of problems.)
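
One concrete way an undeclared source encoding can bite, as a rough sketch: a literal that perl takes for three Latin-1 characters is encoded a second time on output. Here encode() from the Encode module stands in for a :utf8 output layer, and the variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The literal 月 as perl sees it without "use utf8": three bytes, each
# treated as its own Latin-1 character.
my $undeclared = "\xE6\x9C\x88";

# The same literal as perl sees it with "use utf8": one character.
my $declared = "\x{6708}";

# encode() stands in for what a :utf8 layer does when the string is printed.
my $hex_undeclared = unpack 'H*', encode('UTF-8', $undeclared);
my $hex_declared   = unpack 'H*', encode('UTF-8', $declared);

print "undeclared: $hex_undeclared\n";
print "declared:   $hex_declared\n";
```

The undeclared literal comes out as c3a6c29cc288, six bytes of double-encoded text, while the declared one comes out as the expected e69c88.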


  as does


 $f = "$ENV{HOME}/junk.txt";
 open F, ">$f";
 print F "月";
 close F;
 open F, $f;
 for (<F>) {/月/ and print}

 JD

--
Joel Rees
I've already left the building. You don't really see me here.