Parsing UTF8 files with wide characters
I thought I'd understood how to use Unicode support in Perl, but evidently not. In the script below, I'm stumped as to:

1) why the regex won't match ''.
2) why the substitution is carried out, but the result isn't in UTF-8, nor is it UTF-8 re-encoded in UTF-8 (uncomment "#require Encode;" ... "#Encode::decode_utf8($_);" to test this).

TIA
Robin

#!/usr/bin/perl -w
use strict;
use diagnostics -verbose;
#require Encode;
binmode (DATA,":utf8");
binmode (STDOUT,":utf8");
for (<DATA>){
    if (/([EMAIL PROTECTED])/gs){
        print "match: ",$1,"\n";
        #Encode::decode_utf8($_);
        s/$1//gs;
    }elsif(/()/gs){
        print "match: ",$1,"\n";
        s/$1/12/gs;
    }
    print;
}
__DATA__
!DOCTYPE HTML PUBLIC -//W3C//DTD HTML 4.01 Transitional//EN http://www.w3.org/TR/html4/loose.dtd; HTML HEAD META HTTP-EQUIV=content-type CONTENT=text/html; charset=utf-8 TITLE A Web Page/TITLE /HEAD BODY BLOCKQUOTE H3newsFONT COLOR=#FF33001/FONT/H3 ... and this is a web page. P IMG ALT=A Filler WIDTH=450 HEIGHT=296 P hidden marker here -FONT COLOR=#FF3300[EMAIL PROTECTED]/FONT--BR /BLOCKQUOTE /BODY /HTML
Re: Parsing UTF8 files with wide characters
Try "use utf8;" - it lets Perl know that your script contains UTF-8 characters.

More info: http://perlpod.com/5.9.1/lib/utf8.html

Andrew

On Jun 15, 2005, at 2:48 PM, Robin wrote:
> I thought I'd understood how to use unicode support in perl, but evidently not. [...]
Re: Parsing UTF8 files with wide characters
On Jun 15, 2005, at 2:48 PM, Robin wrote:
> I thought I'd understood how to use unicode support in perl, but evidently not. In the script below, I'm stumped as to:
> 1) why the regex won't match ''.
> 2) why the substitution is carried out, but the result isn't in UTF8, nor is it UTF8 re-encoded in UTF8 (uncomment #require Encode; ... #Encode::decode_utf8($_); to test this )

The binmode() calls you've included tell Perl that the data coming from and going to those file handles is UTF-8 encoded. But you have UTF-8-encoded text in your code, too. To tell Perl about that, you need to use the "use utf8;" pragma.

sherm--
Cocoa programming in Perl: http://camelbones.sourceforge.net
Hire me! My resume: http://www.dot-app.org
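The distinction between the two settings can be sketched in a few lines. This is a minimal illustration, not the original script: it assumes the file is saved as UTF-8, uses Encode's decode() to stand in for what an input layer set by binmode() does, and the helper name has_moon is hypothetical.

```perl
use strict;
use warnings;
use utf8;                      # the source file itself is assumed to be saved as UTF-8
use Encode qw(decode);

# Hypothetical helper: report whether a string contains the character 月 (U+6708).
sub has_moon {
    my ($text) = @_;
    return $text =~ /月/ ? 1 : 0;
}

# Raw input as it arrives from a handle with no decoding layer: 月 as three UTF-8 bytes.
my $bytes = "hidden marker here: \xE6\x9C\x88\n";

# Roughly what a decoding input layer does: bytes in, characters out.
my $chars = decode('UTF-8', $bytes);

binmode(STDOUT, ':encoding(UTF-8)');   # encode characters again on the way out
print "decoded input matches the literal\n"  if has_moon($chars);
print "raw bytes do not match the literal\n" unless has_moon($bytes);
```

With "use utf8;" in effect, the literal in the pattern is one character, so it can only match input that has also been decoded into characters.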
Re: Parsing UTF8 files with wide characters
Thanks, Andrew and Sherm.

I went back to look at perluniintro because I was sure I could remember reading that the "use utf8" pragma was no longer needed. Right under where it says this, it continues:

"Only one case remains where an explicit "use utf8" is needed: if your Perl script itself is encoded in UTF-8"

*sigh*

Robin
Re: Parsing UTF8 files with wide characters
At 4:26 am +0900 16/6/05, Robin wrote:
> I went back to look at perluniintro because I was sure I could remember reading that the "use utf8" pragma was no longer needed, right under where it says this it continues "Only one case remains where an explicit "use utf8" is needed: if your Perl script itself is encoded in UTF-8"

Nevertheless (Perl 5.8.6), if you simply comment out

#binmode (DATA,":utf8");
#binmode (STDOUT,":utf8");

then, provided your script is UTF-8 encoded, there is no need for 'use utf8'. The script you posted works fine in that case, as does

$f = "$ENV{HOME}/junk.txt";
open F, ">$f"; print F "月"; close F;
open F, $f;
for (<F>) {/月/ and print}

JD
Re: Parsing UTF8 files with wide characters
On 2005.6.16, at 05:13 AM, John Delacour wrote:
> At 4:26 am +0900 16/6/05, Robin wrote:
>> I went back to look at perluniintro because I was sure I could remember reading that the "use utf8" pragma was no longer needed, right under where it says this it continues "Only one case remains where an explicit "use utf8" is needed: if your Perl script itself is encoded in UTF-8"
>
> Nevertheless (Perl 5.8.6) if you simply comment
>
> #binmode (DATA,":utf8");
> #binmode (STDOUT,":utf8");
>
> provided your script is UTF-8 encoded, there is no need for 'use utf8'. The script you posted works fine in that case,

Not a good idea.

For the time being, and until UTF-8 is established as the default encoding for perl (should that ever happen), when your source code includes multibyte characters, tell perl so.

I suppose, in a context where you have automatic encoding conversion taking place whenever you move code from one environment to another, this rule of thumb would not be a rule of thumb. But otherwise, you want to do what you can to tell the various things that interpret your code what the encoding is. (And blind automatic conversion has its own set of problems.)

> as does
>
> $f = "$ENV{HOME}/junk.txt";
> open F, ">$f"; print F "月"; close F;
> open F, $f;
> for (<F>) {/月/ and print}
>
> JD

--
Joel Rees
I've already left the building. You don't really see me here.
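One way to see why the bytes-only approach is fragile is to compare how perl parses the same literal with and without the pragma. The following is only a sketch, assuming the file is saved as UTF-8; the helper names moon_as_bytes and moon_as_chars are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: the literal parsed WITHOUT the pragma (raw source bytes).
sub moon_as_bytes {
    no utf8;
    return "月";
}

# Hypothetical helper: the same literal parsed WITH the pragma (one character).
sub moon_as_chars {
    use utf8;
    return "月";
}

printf "without use utf8: length %d\n", length(moon_as_bytes());
printf "with use utf8:    length %d\n", length(moon_as_chars());

# A pattern that expects exactly one character only fits the decoded form.
print "decoded literal is one character\n"  if     moon_as_chars() =~ /\A.\z/;
print "byte literal is not one character\n" unless moon_as_bytes() =~ /\A.\z/;
```

A byte-for-byte match such as /月/ against undecoded input still succeeds without the pragma, which is why the script appears to work, but length(), "." and character classes then count bytes rather than characters.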