Re: auto-detecting file encoding

2006-06-19 Thread
From: Axel Mock
Sent: 6/19/2006 7:59:48 AM
To: [EMAIL PROTECTED]
Cc: activeperl@listserv.ActiveState.com
Subject: Re: auto-detecting file encoding

> Hi,
>
> i was just reading this thread concerning detecting/guessing Unicode, while I was debugging my little module that, among

Re: auto-detecting file encoding

2006-06-19 Thread Axel Mock
Hi, I was just reading this thread about detecting/guessing Unicode while debugging my little module that, among other file-related things, should read in a file and convert it to internal UTF-8. Things I came across: Encode::Guess was obviously written with non-UTF input in mind

Re: auto-detecting file encoding

2006-06-19 Thread DZ-Jay
Hello: The problem is that Unicode files do not have to contain the byte order mark (BOM) header, and in fact the ones I'm attempting to decode don't. As I said before, most of them are Windows-1252, some of them are Latin-1, and still others are UTF-8 -- without a byte order mark. The problem is that o
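
For BOM-less input of the kind described here (Windows-1252, Latin-1, or UTF-8), one common heuristic is to try a strict UTF-8 decode first and fall back to the single-byte encodings only when that fails. The following is a minimal Python sketch of that idea, not the approach taken in this thread (the message is cut off before any solution is given); `decode_without_bom` is a hypothetical name, and the fallback order is an assumption.

```python
def decode_without_bom(data):
    """Return (encoding_name, text) for raw bytes that carry no BOM."""
    # Order matters: strict UTF-8 rejects most non-UTF-8 input,
    # whereas Latin-1 accepts any byte sequence and so must come last.
    for encoding in ("utf-8", "cp1252", "latin-1"):
        try:
            # bytes.decode uses strict error handling by default.
            return encoding, data.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise AssertionError("unreachable: latin-1 decodes every byte string")
```

This cannot reliably tell Windows-1252 from Latin-1, because nearly every byte sequence is valid in both; it only reports which decoder accepted the data first. Pure ASCII input is reported as UTF-8, which is harmless since all three encodings agree on ASCII.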

Re: auto-detecting file encoding

2006-06-19 Thread DZ-Jay
On Jun 18, 2006, at 22:06, Jerry Yang wrote:

> Hi,
> The file in UTF-8 should have a BOM like this "EF BB BF"
>
> Bytes          Encoding Form
> 00 00 FE FF    UTF-32, big-endian
> FF FE 00 00    UTF-32, little-endian
> FE FF          UTF-16, big-endian
> FF FE          UTF-16, little-endian
> EF BB BF       UTF-8

Should, but don't have t

Re: auto-detecting file encoding

2006-06-19 Thread Torsten . Werner
Hi, on Windows each Unicode file has a special header. The following headers are in use:

UTF-16: \xFF\xFE
UTF-16BE: \xFE\xFF
UTF-8: \xEF\xBB\xBF

For an automatic check, open the file in binary mode, read the first 3 bytes, and compare them with the given patterns. If they do not match the patterns, win
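
The check described in this message (open in binary mode, read the first three bytes, compare against the known patterns) could be sketched roughly as follows in Python. `sniff_bom`, `BOM_PATTERNS`, and the returned labels are hypothetical names chosen for illustration; the byte patterns are the three listed above, taken from the constants in Python's standard `codecs` module.

```python
import codecs

# The three headers listed in the message, paired with illustrative labels.
BOM_PATTERNS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]

def sniff_bom(path):
    """Return the encoding named by the file's BOM, or None if none matches."""
    with open(path, "rb") as handle:  # binary mode: bytes are read untranslated
        head = handle.read(3)
    for bom, encoding in BOM_PATTERNS:
        if head.startswith(bom):
            return encoding
    return None
```

A result of `None` means only that no BOM was found, not that the file is not Unicode. A three-byte read also cannot tell UTF-16 little-endian (FF FE) from UTF-32 little-endian (FF FE 00 00); covering UTF-32 would need a four-byte read with the longer patterns tested first.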

Re: auto-detecting file encoding

2006-06-18 Thread Jerry Yang
Hi, The file in UTF-8 should have a BOM like this "EF BB BF"

Bytes          Encoding Form
00 00 FE FF    UTF-32, big-endian
FF FE 00 00    UTF-32, little-endian