Re: removing accents

2004-01-02 Thread Eric Cholet
On 28 Dec 03, at 04:45, SADAHIRO Tomoyuki wrote: On Sat, 27 Dec 2003 13:30:19 +0100 Eric Cholet [EMAIL PROTECTED] wrote: Here's another naive question from a unicode newbie: Is there a way, using perl's unicode support, to remove accents from a string? I looked at \pM but can't figure out how

Re: \W and [\W]

2004-01-02 Thread Nick Ing-Simmons
Eric Cholet [EMAIL PROTECTED] writes: On 1 Jan 04, at 17:50, Rafael Garcia-Suarez wrote: +(However, and as a limitation of the current implementation, using +C<\w> or C<\W> I<inside> a C<[...]> character class will still match +with byte semantics.) I don't think it applies to \w, only \W. \x{df}

Re: \W and [\W]

2004-01-02 Thread Jarkko Hietaniemi
Do negated classes work at all? What does /[^\w]/ do? (I looked at this stuff ages ago and I thought unicode classes (including negated ones) worked; if that is true then the fix may just be the magical \W expander expanding to the wrong thing...) I think it's the evil characters in the 0x80..0xFF

Keeping byte-wise processing as an option

2004-01-02 Thread Martin Duerst
Dear Perl Unicode experts, http://www.perldoc.com/perl5.8.0/pod/perlunicode.html says: In future, Perl-level operations will be expected to work with characters rather than bytes. I very much appreciate all your hard work on the internationalization of Perl. However, recently I have been

Re: Keeping byte-wise processing as an option

2004-01-02 Thread Jarkko Hietaniemi
In future, Perl-level operations will be expected to work with characters rather than bytes. I very much appreciate all your hard work on the internationalization of Perl. However, recently I have been working on some things that let me think that the above statement, if taken directly, may be

Re: Invalid Uicode characters

2004-01-02 Thread John Delacour
At 11:31 am +0100 16/9/03, [EMAIL PROTECTED] wrote: I am running Perl 5.8. and trying to filter out some invalid Unicode characters from Unicoded texts of some South Asian languages. There are 28 such characters in my data (all control characters): 0x1, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15,

Sorry for the noise (was Re: Invalid Uicode characters)

2004-01-02 Thread John Delacour
At 11:47 pm + 2/1/04, I wrote: $f = "/tmp/zili.txt"; open F, $f ;... Sorry. I had my mailbox sorted by sender rather than by date, so this message appeared at the bottom unread. My memory's not good enough to recall I'd read it and actually replied 4 months ago :) Happy new year! JD

Re: Keeping byte-wise processing as an option

2004-01-02 Thread Martin Duerst
Hello Jarkko, Many thanks for your very quick answer. At 00:31 04/01/03 +0200, Jarkko Hietaniemi wrote: In future, Perl-level operations will be expected to work with characters rather than bytes. I very much appreciate all your hard work on the internationalization of Perl. However, recently

Re: Keeping byte-wise processing as an option

2004-01-02 Thread Andreas J Koenig
On Fri, 02 Jan 2004 18:17:13 -0500, Martin Duerst [EMAIL PROTECTED] said: Jungshik has also reported that it fails with Perl 5.8.0 with an UTF-8 locale. Perl 5.8.0 was very broken with UTF-8 locales since it auto-PERL_UNICODEd. We saw (keep seeing) a lot of that since RedHat 8 and 9

Re: Keeping byte-wise processing as an option

2004-01-02 Thread Daisuke Maki
if (eval "use bytes;") { use bytes; } That would be use if $] >= 5.006, "bytes"; But you would have to make sure that if.pm is available, no option IMO. I think what was used in AxKit by the Matt/axkit-dev folks was to put this line $INC{"bytes.pm"}++ if $] < 5.006; before any mention of
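
The %INC approach quoted above might look roughly like the following. This is only a sketch, not the actual AxKit code: the truncated snippet does not say what the line must precede, so placing it before "use bytes" is an assumption, and byte_length is a hypothetical helper added for illustration.

```perl
use strict;
use warnings;

BEGIN {
    # On perls older than 5.6 there is no bytes.pm. Marking it as already
    # loaded turns a later "use bytes" into a harmless no-op there.
    $INC{'bytes.pm'}++ if $] < 5.006;
}

# Hypothetical helper: length of a string in bytes rather than characters.
sub byte_length {
    my ($string) = @_;
    use bytes;              # lexically scoped to this sub
    return length $string;
}

# U+263A is one character, stored internally as three UTF-8 bytes.
print byte_length("\x{263A}"), "\n";   # prints 3 on Perl 5.6 and later
```

Where if.pm is known to be installed, use if $] >= 5.006, 'bytes'; expresses the same intent in a single line.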

Re: removing accents

2004-01-02 Thread SADAHIRO Tomoyuki
On Fri, 2 Jan 2004 11:56:12 +0100 Eric Cholet [EMAIL PROTECTED] wrote: Thanks for your detailed reply. I looked into this and found that I can use Unicode::Normalize to decompose a string in NFD form and then remove the accents with a regex removing \pM. I wonder if I overlooked a
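
The approach described in this message (decompose to NFD, then delete everything matching \pM) can be sketched as follows. This is a minimal illustration, not code from the thread: strip_accents is a hypothetical name, and the final NFC step is an added assumption to recompose whatever remains.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC);

# Hypothetical helper: remove combining marks from a string.
sub strip_accents {
    my ($text) = @_;
    my $decomposed = NFD($text);   # e.g. U+00E9 becomes "e" + U+0301
    $decomposed =~ s/\pM//g;       # drop every character with the Mark property
    return NFC($decomposed);       # recompose whatever is left
}

# Input written with \x{} escapes so the source file stays pure ASCII.
print strip_accents("d\x{e9}j\x{e0} vu"), "\n";   # prints "deja vu"
```

Letters with no canonical decomposition, such as U+00F8 (o with stroke), pass through unchanged, so this removes combining accents only and is not a general transliteration.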