Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-18 Thread John Delacour
At 00:27 +0100 18/6/10, I wrote: If I save the file and undo the second decoding I get the proper output In this case all talk of iso-8859-1 and cp1252 is a red herring. I read several Italian websites where this same problem is manifest in external material such as ads. The news page

Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-17 Thread John Delacour
At 13:24 -0700 17/6/10, David E. Wheeler wrote: On Jun 17, 2010, at 12:30 PM, Henning Michael Møller Just wrote: So the original character \x{010d} is represented by the bytes \x{c4} and \x{8d}, an application thinks those are in fact characters and encodes them again as \x{c3} + \x{84} and
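
The double encoding described in this thread can be reproduced in a few lines. This is a minimal sketch, assuming the second pass took the UTF-8 bytes for ISO-8859-1 characters; the helper `hex_bytes` and the variable names are illustrative and do not come from the thread.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Illustrative helper (not from the thread): show a byte string as hex.
sub hex_bytes { return join ' ', map { sprintf '%02x', ord($_) } split //, $_[0] }

my $char = "\x{010d}";               # LATIN SMALL LETTER C WITH CARON
my $once = encode('UTF-8', $char);   # correct encoding: c4 8d

# The mistake: the two bytes are taken for two Latin-1 characters
# and encoded a second time, giving c3 84 c2 8d.
my $twice = encode('UTF-8', decode('ISO-8859-1', $once));

# Undoing it: decode once, map the characters back to bytes, decode again.
my $fixed = decode('UTF-8', encode('ISO-8859-1', decode('UTF-8', $twice)));

print hex_bytes($once),  "\n";
print hex_bytes($twice), "\n";
```

The repair only works when the second pass really was Latin-1 to UTF-8; a different intermediate charset would need a different middle step.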

Re: good name for characters matching [^\0-\377]?

2007-10-18 Thread John Delacour
Juerd Waalboer wrote: E R skribis 2007-10-18 9:50 (-0500): I'm preparing a presentation about Perl and Unicode support, and I'd like to give a name for characters with ordinals above 255. Is there a good name for that class? They are characters outside the latin-1 range. Latin-1 has
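
Whatever name is chosen for the class, membership is easy to test with the pattern from the subject line. A small sketch; `has_wide_char` is a hypothetical name, not something from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: true if any character has an ordinal above 255,
# i.e. lies outside the Latin-1 range.
sub has_wide_char { return $_[0] =~ /[^\0-\377]/ ? 1 : 0 }

print has_wide_char("caf\x{e9}"), "\n";   # e-acute is still within Latin-1
print has_wide_char("\x{010d}"),  "\n";   # c-caron is outside it
```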

Re: Problem with Encode module

2006-07-07 Thread John Delacour
At 10:31 am -0700 23/6/06, Jianyang Tai wrote: I encountered a problem with the Encode module when converting some Japanese content from shift-jis to utf-8. Basically I am using the from_to subroutine to do the job. All works well except for the number-inside-a-circle characters (8740 ~

RE: Problem with Encode module

2006-07-07 Thread John Delacour
At 2:40 pm -0700 7/7/06, Jianyang Tai wrote: Thanks for the reply. Are you sure those characters don't exist in shift-jis? Please take a look at the attached text file. It contains two characters (1 in a circle and 2 in a circle). The file is in shift-jis encoding. Not possible. Here is an

RE: Problem with Encode module

2006-07-07 Thread John Delacour
At 4:20 pm -0700 7/7/06, Jianyang Tai wrote: Thanks John. The original characters came from Japan; I don't know if they use some proprietary extension of shift_jis. Attached is the zipped file. Hope it comes across correctly this time. It should contain the characters 0x8740 0x8741. That's windows
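
The byte pairs 0x8740 and 0x8741 can be checked directly. The sketch below assumes that the Windows encoding meant in the reply is code page 932, which Encode provides as `cp932`; that reading is an assumption, since the snippet is cut off.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $bytes = "\x87\x40\x87\x41";        # the two byte pairs from the thread
my $chars = decode('cp932', $bytes);   # assumption: Microsoft's Shift_JIS superset
my $utf8  = encode('UTF-8', $chars);   # the conversion the poster wanted

# Print the code point of each decoded character.
printf "U+%04X\n", ord($_) for split //, $chars;
```

If this reading is right, the same fix applies to `from_to`: pass `cp932` rather than `shiftjis` as the source encoding.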

Re: Encode the subject line in MIME header using Perl 5.6

2005-12-30 Thread John Delacour
At 1:01 pm + 30/12/05, Nick Ing-Simmons wrote: That isn't quite right. MIME::QuotedPrint does NOT encode space or tab. All the more reason to forget about QP, which is a great way to triple the size of any message in non-european languages, and use base64. QP is designed for text that

Re: Encode the subject line in MIME header using Perl 5.6

2005-12-29 Thread John Delacour
At 11:44 am +0800 28/12/05, wing wrote: Thanks for your prompt reply. The subject line contains some Chinese or Japanese characters in UTF8. Can they be encoded as UTF8 with MIME::Base64? The script below creates a file containing the following 4 characters 谷神不死 as utf8 bytes
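
The script itself is cut off in this listing, so the following is only a sketch of the general approach and not the author's code. It assumes the subject is already available as UTF-8 bytes and uses nothing beyond MIME::Base64, so it does not depend on Encode; `b_encode_utf8` is a hypothetical helper name.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64);

# UTF-8 bytes of the four characters U+8C37 U+795E U+4E0D U+6B7B,
# written as escapes so the source file itself stays plain ASCII.
my $subject_bytes = "\xE8\xB0\xB7\xE7\xA5\x9E\xE4\xB8\x8D\xE6\xAD\xBB";

# Hypothetical helper: wrap UTF-8 bytes as a single "B" encoded-word.
sub b_encode_utf8 {
    my ($bytes) = @_;
    # The second argument '' suppresses the line breaks encode_base64 adds.
    return '=?UTF-8?B?' . encode_base64($bytes, '') . '?=';
}

my $header = 'Subject: ' . b_encode_utf8($subject_bytes);
print "$header\n";
```

A long subject would have to be split into several encoded-words on character boundaries; this sketch does not attempt that.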

Re: Encode the subject line in MIME header using Perl 5.6

2005-12-27 Thread John Delacour
At 12:42 am +0800 28/12/05, wing wrote: I need to encode the subject line in a MIME header in UTF8 (something like Subject: =?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=). I know that this can be done by using Encode in Perl 5.8. However, in my production environment, we can only use Perl

Re: Matching encoded strings and file names

2005-12-20 Thread John Delacour
At 10:46 am +0100 20/12/05, [EMAIL PROTECTED] wrote: ...Let's say I have a txt file which contains a list of strings. Some of these strings contain characters encoded in this fashion: R\xC3\xA9union (\xC3\xA9 is one character - e with an accent). ...Now, this fails, even though when I look
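
One possible reading of the question is that the text file holds the escapes literally, as a backslash followed by `xC3`, and not the bytes themselves. Under that assumption, which the truncated snippet does not confirm, the strings can be converted before comparison; `unescape_utf8` is a hypothetical helper.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn literal "\xHH" escapes into bytes,
# then decode those bytes as UTF-8.
sub unescape_utf8 {
    my ($text) = @_;
    (my $bytes = $text) =~ s/\\x([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return decode('UTF-8', $bytes);
}

# Single quotes keep the backslashes literal, as they would be in the file.
my $name = unescape_utf8('R\xC3\xA9union');
print length($name), "\n";
```

A file name read from the file system would need the same decoding step before the two strings are compared.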

Re: Bug in Encode::encode('MIME-Q', $iso_8859_1_string)

2005-12-19 Thread John Delacour
At 3:54 pm +0100 23/11/05, Sven Neuhaus wrote: this seems to be a bug: b) perl -MHTML::Entities -MEncode -e '$a="abc&Auml;def"; print encode("MIME-Q", HTML::Entities::decode($a)), "\n";' Result: =?UTF-8?Q?abc=C4def?= What about this: perl -MHTML::Entities -MEncode -e 'use encoding ("iso-8859-1");

Re: is it utf8 or unicode?

2005-03-16 Thread John Delacour
At 8:03 pm + 9/3/05, [EMAIL PROTECTED] wrote: here's my perl -V Summary of my perl5 (revision 5 version 8 subversion 6) configuration: So ignore anything you've been told about previous versions. Basically I have xC3 x84 and let perl think it is utf-8. It is valid utf-8, i.e. A with diaeresis.

Re: filtering out non-Japanese

2004-12-15 Thread John Delacour
At 10:22 am +0100 15/12/04, Marco Baroni wrote: I have a long text ostensibly in utf-8, and I would like to get rid of all the lines that contain anything BUT kanji, katakana or hiragana (thus, throwing away Latin, but also digits, punctuation, etc.) There's probably a better way to do it but
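
The reply is truncated here, so the following is not the author's solution but one way to write such a filter with Unicode script properties. It assumes the lines have already been decoded into Perl character strings; `japanese_only` is a hypothetical name.

```perl
use strict;
use warnings;
use utf8;   # the sample strings below are written as UTF-8 literals

# Hypothetical helper: keep only lines that consist entirely of
# kanji, hiragana or katakana.
sub japanese_only {
    return grep { /\A[\p{Han}\p{Hiragana}\p{Katakana}]+\z/ } @_;
}

my @lines = ('日本語', 'ひらがな', 'カタカナ', 'abc', '日本1');
my @kept  = japanese_only(@lines);

binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for @kept;
```

When reading from a file, opening it with the `:encoding(UTF-8)` layer and chomping each line gives strings in the form this pattern expects.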

Re: filtering out non-Japanese

2004-12-15 Thread John Delacour
At 12:39 pm +0100 15/12/04, Marco Baroni wrote: ... where can I find the hexadecimal hiragana, katakana and kanji ranges? Get UnicodeChecker: http://www.earthlingsoft.net/UnicodeChecker/index.html Freeware AND you won't regret it! eg. Do command-f and type hirag JD

Re: About HTML unicode

2004-12-03 Thread John Delacour
At 12:31 am +0800 3/12/04, He Zhiqiang wrote: Now I have encountered another problem: a few files contain not just one charset but two or more. For example, file1 contains Japanese and Chinese; if I use open() to load the data into memory, ord, length etc. can't work correctly!

Re: Website encoding

2004-11-27 Thread John Delacour
At 10:33 am +1100 18/11/04, Rick Measham wrote: That being the case, I grab the charset and use Encode's decode function to turn it into 'perl's internal format' .. which in 5.8.5 is utf8 right? I then store that in the db. What happens if you do something like this? : my $uri =

Re: Converting string to UTF-16LE

2004-02-29 Thread John Delacour
At 6:19 pm +0100 25/2/04, Sebastian Lehmann wrote: Can anybody tell me how to work with UTF8 and UTF16 in the same script? Any help would be greatly appreciated. Suppose that /tmp/iba.txt contains the text ibañez in UCS-2, preceded by the BOM, then this works here (Perl 5.8.3) use Encode
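
The code that followed is cut off in this listing. As a sketch of the same idea with Encode, the example below builds the file contents in memory instead of reading /tmp/iba.txt, and assumes a little-endian BOM.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Stand-in for the file contents: a little-endian BOM followed by "ibañez".
my $raw = "\xFF\xFE" . encode('UTF-16LE', "iba\x{f1}ez");

my $text  = decode('UTF-16', $raw);      # 'UTF-16' reads the BOM and drops it
my $utf8  = encode('UTF-8', $text);      # same text as UTF-8 bytes
my $u16le = encode('UTF-16LE', $text);   # the -LE variant writes no BOM

printf "%d characters, %d UTF-8 bytes, %d UTF-16LE bytes\n",
    length($text), length($utf8), length($u16le);
```

Keeping everything as decoded text in the middle of the script, and encoding only on output, is what lets UTF-8 and UTF-16 coexist.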

Re: How to convert base64 string to utf-8

2004-02-05 Thread John Delacour
At 4:21 pm +0200 5/2/04, ALexander N. Treyner wrote: Hi John, Your code works perfectly. But I found one strange thing. For example, I have the following string: hello שלום hello world, which is converted by the mail client to hello =?windows-1255?Q?=F9=EC=E5=ED_hello_world?= After converting it by

Re: How to convert base64 string to utf-8

2004-02-02 Thread John Delacour
At 5:14 pm +0200 2/2/04, ALexander N. Treyner wrote: Hello All, I'm using a utf-8 Postgres database, where I save strings in many languages. I have to match the database against strings encoded in MIME base64 or quoted-printable format, such as the following: =?utf-8?B?15TXoNeUINee16nXlNeZINeR16LXkdeo15nXqi4=?=
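
The reply is not visible in this listing. One way to do the conversion with current versions of Encode is its `MIME-Header` encoding, which handles both the B and Q forms; this is a sketch of that route and not necessarily what was proposed in the thread.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $encoded = '=?utf-8?B?15TXoNeUINee16nXlNeZINeR16LXkdeo15nXqi4=?=';

my $text = decode('MIME-Header', $encoded);   # Perl character string
my $utf8 = encode('UTF-8', $text);            # bytes for a UTF-8 database column

print length($text), " characters, ", length($utf8), " bytes\n";
```

Comparing decoded character strings on both sides avoids mismatches caused by one side being bytes and the other text.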

Re: How to convert base64 string to utf-8

2004-02-02 Thread John Delacour
At 7:36 pm +0100 2/2/04, Guido Flohr wrote: Unfortunately, you will be out of luck for the somewhat common case of UTF-7 (unless it is available in Encode by now). It is: use Encode; for ( Encode->encodings(":all") ) { print "$_$/" } 7bit-jis AdobeStandardEncoding AdobeSymbol AdobeZdingbat ascii
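
The list above is truncated before it reaches UTF-7. A short check, assuming an Encode installation that ships its UTF-7 module, confirms availability and does a round trip:

```perl
use strict;
use warnings;
use Encode qw(encode decode find_encoding);

# Look for UTF-7 among all encodings Encode knows about.
my @all      = Encode->encodings(':all');
my $has_utf7 = grep { $_ eq 'UTF-7' } @all;

# Round trip one non-ASCII character through UTF-7.
my $bytes = encode('UTF-7', "\x{e9}");
my $back  = decode('UTF-7', $bytes);

print $has_utf7 ? "UTF-7 is available\n" : "UTF-7 is missing\n";
```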

Re: Invalid Uicode characters

2004-01-02 Thread John Delacour
At 11:31 am +0100 16/9/03, [EMAIL PROTECTED] wrote: I am running Perl 5.8. and trying to filter out some invalid Unicode characters from Unicoded texts of some South Asian languages. There are 28 such characters in my data (all control characters): 0x1, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15,
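
The list of 28 characters is cut off in this listing, so the exact set is unknown. The sketch below assumes it is the C0 control characters 0x01 to 0x1F minus tab, line feed and carriage return, which happens to be 28 characters; `strip_controls` is a hypothetical helper, and the range should be adjusted to the real list.

```perl
use strict;
use warnings;

# Hypothetical helper: delete C0 control characters other than
# tab (0x09), line feed (0x0A) and carriage return (0x0D).
# The range is an assumption; the original list is truncated.
sub strip_controls {
    my ($text) = @_;
    $text =~ tr/\x01-\x08\x0B\x0C\x0E-\x1F//d;
    return $text;
}

print strip_controls("a\x01b\x10c\td\n");
```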

Sorry for the noise (was Re: Invalid Uicode characters

2004-01-02 Thread John Delacour
At 11:47 pm + 2/1/04, I wrote: $f = "/tmp/zili.txt"; open F, $f ;... Sorry. I had my mailbox sorted by sender rather than by date, so this message appeared at the bottom unread. My memory's not good enough to recall I'd read it and actually replied 4 months ago :) Happy new year! JD

Re: unicode on windows

2003-11-21 Thread John Delacour
[ sent as utf8 ] At 6:13 pm -0800 20/11/03, Neelima Bandla wrote: I am trying to create a Japanese file on a Windows machine. Below is the code I am using to do so. my @array = (0x5f89, 0x623f, 0x5f89, 0x623f); my $str1 = pack("U*", @array); open(FD, $filepath\\$str1) or die(

encoding...

2003-11-02 Thread John Delacour
Question 1. In this script I would like for convenience' sake to use variables in the second line, but I don't seem to be able to do so. Am I missing something or is it simply not possible? $source = 'MacRoman'; # I want to use this in the next line use encoding qw( MacRoman ), STDOUT =>

Re: encoding...

2003-11-02 Thread John Delacour
At 3:36 pm -0800 2/11/03, Jan Dubois wrote: Should work if you initialize the variable in a BEGIN block: BEGIN { $source = 'MacRoman'; } use encoding $source, STDOUT => 'utf-8'; Ah! Yes, put single quotes around your EOT marker: $text = <<'EOT'; $ome$tuff $ome$tuff $ome$tuff
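
Both tips in this message come down to when Perl evaluates things. The sketch below shows them with `use constant` standing in for `use encoding`, because the encoding pragma is deprecated in later Perl releases; the substitution and the name `SOURCE` are this example's, not the thread's.

```perl
use strict;
use warnings;

my $source;
BEGIN { $source = 'MacRoman' }     # runs at compile time, before the next "use"
use constant SOURCE => $source;    # "use" also runs at compile time, so $source is set

# Single quotes around the marker switch off interpolation in the
# here-document, so the dollar signs stay literal.
my $text = <<'EOT';
$ome$tuff $ome$tuff
EOT

print SOURCE, "\n", $text;
```

Without the BEGIN block the assignment would happen at run time, after `use` had already seen an undefined value.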

Re: Malformed UTF-8 character

2003-10-26 Thread John Delacour
At 1:12 am +0200 26/10/03, Marco Baroni wrote: I am new to (explicit) unicode handling, and right now I am facing this problem. I have some data (lots of data) that in theory should be in ascii (with entity references in place of non-ascii characters). I have no easy way to get to know
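
The reply is not shown in this listing. As a general sketch for data of this kind, the helpers below classify raw lines as pure ASCII, valid UTF-8, or neither; both names are hypothetical, and the input is assumed to be undecoded bytes.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true if the line contains only 7-bit characters.
sub is_ascii { return $_[0] !~ /[^\x00-\x7F]/ ? 1 : 0 }

# Hypothetical helper: true if the bytes form well-formed UTF-8.
# FB_CROAK makes decode die on malformed input; eval catches that.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $ok = eval { decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC); 1 };
    return $ok ? 1 : 0;
}

for my $line ("plain", "caf\xC3\xA9", "caf\xE9") {
    printf "ascii=%d utf8=%d\n", is_ascii($line), is_valid_utf8($line);
}
```

Lines that fail both tests are the ones most likely to trigger "Malformed UTF-8 character" warnings when treated as UTF-8.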

Re: Invalid Uicode characters

2003-09-17 Thread John Delacour
At 11:31 am +0100 16/9/03, [EMAIL PROTECTED] wrote: Dear PERLists, I am running Perl 5.8. and trying to filter out some invalid Unicode characters from Unicoded texts of some South Asian languages. There are 28 such characters in my data (all control characters): 0x1, 0x10, 0x11, 0x12, 0x13,

Re: bytes::substr() ?

2003-09-02 Thread John Delacour
At 9:07 am -0500 27/8/03, [EMAIL PROTECTED] wrote: I'm working with a byte oriented protocol, and need to extract byte n1 through byte n2 from a string. Problem is, the string can be UTF8, and substr() is character oriented. What (if anything) is the best way to do this in Perl?
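
The answer is missing from this listing. One common approach, given here as a sketch and not as the reply that was sent, is to encode the string to UTF-8 bytes first and apply substr() to the result; `byte_range` is a hypothetical helper with 0-based, inclusive offsets.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: bytes $n1..$n2 (0-based, inclusive) of the
# UTF-8 encoding of a character string.
sub byte_range {
    my ($str, $n1, $n2) = @_;
    my $bytes = encode('UTF-8', $str);   # one element per byte from here on
    return substr($bytes, $n1, $n2 - $n1 + 1);
}

my $slice = byte_range("a\x{e9}b", 1, 2);   # the two bytes of e-acute
printf "%vd\n", $slice;
```

A slice taken this way can end in the middle of a multi-byte character, so the result should be handled as bytes and not decoded blindly.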