Parsing UTF8 files with wide characters

2005-06-15 Thread Robin

I thought I'd understood how to use unicode support in perl, but 
evidently not. In the script below, I'm stumped as to:


1) why the regex won't match ''.
2) why the substitution is carried out, but the result isn't in UTF8, 
nor is it UTF8 re-encoded in UTF8 (uncomment #require Encode; 
... #Encode::decode_utf8($_); to test this )




TIA


Robin



#!/usr/bin/perl -w

use strict;
use diagnostics -verbose;
#require Encode;

binmode (DATA,":utf8");
binmode (STDOUT,":utf8");

for (<DATA>){
    if (/([EMAIL PROTECTED])/gs){
        print "match: ",$1,"\n";
        #Encode::decode_utf8($_);
        s/$1//gs;
    }elsif(/()/gs){
        print "match: ",$1,"\n";
        s/$1/12/gs;
    }
    print;
}




__DATA__
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<HTML>
<HEAD>
<META HTTP-EQUIV="content-type" CONTENT="text/html; charset=utf-8">
<TITLE> A Web Page</TITLE>
</HEAD>
<BODY>
<BLOCKQUOTE>
<H3>news<FONT COLOR="#FF3300">1</FONT></H3>
... and this is a web page.
<P>
<IMG ALT="A Filler" WIDTH="450" HEIGHT="296">
<P>
hidden marker here -<FONT
COLOR="#FF3300">[EMAIL PROTECTED]</FONT>--<BR>

</BLOCKQUOTE>
</BODY>
</HTML>

Re: Parsing UTF8 files with wide characters

2005-06-15 Thread Andrew Mace


Try "use utf8" - it lets Perl know that your script contains UTF-8 characters.

More info: http://perlpod.com/5.9.1/lib/utf8.html


Andrew




Re: Parsing UTF8 files with wide characters

2005-06-15 Thread Sherm Pendley


On Jun 15, 2005, at 2:48 PM, Robin wrote:

> I thought I'd understood how to use unicode support in perl, but
> evidently not. In the script below, I'm stumped as to:
>
> 1) why the regex won't match ''.
> 2) why the substitution is carried out, but the result isn't in
> UTF8, nor is it UTF8 re-encoded in UTF8 (uncomment #require
> Encode; ... #Encode::decode_utf8($_); to test this )


The binmode() calls you've included tell Perl that the data coming  
from and going to those file handles is UTF8 encoded.


But you have UTF8-encoded text in your code, too. To tell Perl about
that, you need to use the "use utf8;" pragma.
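
A minimal sketch of that combination, assuming the script file is saved as UTF-8: the :utf8 layer decodes the incoming data and "use utf8" decodes the literals in the source, so both sides of the match are character strings. The helper name find_marker, the in-memory filehandle, and the sample character 月 are illustrative assumptions, not taken from the original script.

```perl
use strict;
use warnings;
use utf8;                      # this source file is saved as UTF-8

binmode(STDOUT, ":utf8");      # encode characters as UTF-8 on output

# Hypothetical helper: decode UTF-8 bytes through a :utf8 layer and
# collect every occurrence of the literal character.
sub find_marker {
    my ($bytes) = @_;
    open(my $in, "<:utf8", \$bytes) or die "cannot open in-memory file: $!";
    my @found;
    while (my $line = <$in>) {
        # Thanks to "use utf8", the literal below is one character,
        # just like the decoded input, so the two can match.
        push @found, $1 while $line =~ /(月)/g;
    }
    close($in);
    return @found;
}

# "\xE6\x9C\x88" is the UTF-8 byte sequence for 月 (U+6708).
print "match: $_\n" for find_marker("news \xE6\x9C\x88\n");
```

Dropping the "use utf8;" line from this sketch leaves the literal as three separate bytes, which no longer match the single decoded character.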


sherm--

Cocoa programming in Perl: http://camelbones.sourceforge.net
Hire me! My resume: http://www.dot-app.org

Re: Parsing UTF8 files with wide characters

2005-06-15 Thread Robin


Thanks, Andrew and Sherm.

I went back to look at perluniintro because I was sure I could remember
reading that the "use utf8" pragma was no longer needed, right under
where it says this it continues "Only one case remains where an
explicit "use utf8" is needed: if your Perl script itself is encoded in
UTF-8"


*sigh*

Robin

Re: Parsing UTF8 files with wide characters

2005-06-15 Thread John Delacour


At 4:26 am +0900 16/6/05, Robin wrote:

> I went back to look at perluniintro because I was sure I could
> remember reading that the "use utf8" pragma was no longer needed,
> right under where it says this it continues "Only one case remains
> where an explicit "use utf8" is needed: if your Perl script itself
> is encoded in UTF-8"


Nevertheless (Perl 5.8.6) if you simply comment

#binmode (DATA,":utf8");
#binmode (STDOUT,":utf8");

provided your script is UTF-8 encoded, there is no need for 'use
utf8'.  The script you posted works fine in that case, as does



$f = "$ENV{HOME}/junk.txt";
open F, ">$f";
print F "月";
close F;
open F, $f;
for (<F>) {/月/ and print}

JD

Re: Parsing UTF8 files with wide characters

2005-06-15 Thread Joel Rees



On 2005.6.16, at 05:13 AM, John Delacour wrote:

> At 4:26 am +0900 16/6/05, Robin wrote:
>
>> I went back to look at perluniintro because I was sure I could
>> remember reading that the "use utf8" pragma was no longer needed,
>> right under where it says this it continues "Only one case remains
>> where an explicit "use utf8" is needed: if your Perl script itself is
>> encoded in UTF-8"
>
> Nevertheless (Perl 5.8.6) if you simply comment
>
> #binmode (DATA,":utf8");
> #binmode (STDOUT,":utf8");
>
> provided your script is UTF-8 encoded, there is no need for 'use
> utf8'.  The script you posted works fine in that case,

Not a good idea. For the time being, and until UTF-8 is established as
the default encoding for perl (should that ever happen), when your
source code includes multibyte characters, tell perl so.

I suppose, in a context where you have automatic encoding conversion
taking place whenever you move code from one environment to another,
this rule of thumb would not be a rule of thumb. But otherwise, you
want to do what you can to tell the various things that interpret your
code what the encoding is. (And blind automatic conversion has its own
set of problems.)
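
A rough sketch of why the pragma-free version can appear to work, under the assumption that no I/O layers and no "use utf8" are in effect: the literal and the file data are then both raw UTF-8 bytes, and bytes happen to match bytes. The sample bytes (for U+6708) and the variable names are illustrative assumptions; the literal is spelled with \x escapes so the sketch does not depend on how the file itself is saved.

```perl
use strict;
use warnings;
use Encode qw(decode);

# What perl holds for a one-character UTF-8 literal without "use utf8":
# three separate bytes.
my $bytes = "\xE6\x9C\x88";

# What perl holds for the same text once it has been decoded, for
# example by a :utf8 layer or by "use utf8": a single character.
my $chars = decode("UTF-8", $bytes);

print "length as bytes:      ", length($bytes), "\n";   # 3
print "length as characters: ", length($chars), "\n";   # 1

# Undecoded literal against undecoded data: matches, byte for byte.
print "bytes match bytes\n" if $bytes =~ /\Q$bytes\E/;

# Undecoded literal against decoded data: no match.
print "bytes do not match characters\n" unless $chars =~ /\Q$bytes\E/;
```

The match in the undecoded case is only byte-level agreement; length(), character classes, and any decoded input all see something different, which is the argument for declaring the encoding explicitly.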

>  as does
>
> $f = "$ENV{HOME}/junk.txt";
> open F, ">$f";
> print F "月";
> close F;
> open F, $f;
> for (<F>) {/月/ and print}
>
> JD

--
Joel Rees
I've already left the building. You don't really see me here.
