On Tuesday, July 15, 2003 12:51 AM, Patrick Andries <[EMAIL PROTECTED]> wrote:
> ----- Original message -----
> From: "Philippe Verdy" <[EMAIL PROTECTED]>
>
> > On Monday, July 14, 2003 11:42 PM, Patrick Andries
> > <[EMAIL PROTECTED]> wrote:
> >
> > > In any case, I believe Peter has an idea how these libraries work
> > > and their limitations, he is rather looking for one with its
> > > limitations.
> >
> > Including the Chinese limitations? It will become tricky when
> > managing traditional or scientific texts with many rare ideographs,
> > because it's difficult to create an exhaustive morphological
> > analysis with Chinese.
>
> This product does no morphological analysis but uses a hidden Markov
> Model. Did you try it? (I just checked http://www.gov.tw/sars/ with
> http://quebec.alis.com/castil/essai_silc.cgi and it gave me Chinese,
> Big-5.)

I was able to find a few technical plain-text documents that are obviously English ASCII but were nonetheless identified as Chinese GBK. This includes some technical pages that I have (such as SpamCop.net abuse analysis pages, which are also sometimes interpreted as Chinese by Internet Explorer, or some Unicode text tables that contain pure US English ASCII among other numeric data).

I admit that such errors occur mostly on very technical documents. But technical documents that need correct identification of their encoding also include database table exports in flat files, where it is sometimes hard for a human to spot an incorrect encoding identification when the file contains lots of digits and separators, or people's names and addresses, with a large majority of lines using a familiar script and orthography.

For this reason, when I now import data from multiple files into a joint database, I add a tracking import id to each batch, which allows discovering later whether some batches require a special full export and re-encoding (a sketch of the idea appears further below): there is nothing worse than feeding a database with data incorrectly interpreted from multiple encodings, notably when the database is also very active and used in parallel by internal applications whose encoding is well controlled.

I know that there are some products that can identify a language/encoding pair on a fragment of a file or on a subselection of a database, but they are expensive. Having a human review each record of a large database is also costly, slow and error-prone.
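For what it's worth, fragment-level detection of that kind can at least be sketched with freely available code. The example below uses the open-source chardet library (a port of Mozilla's universal charset detector) purely as an illustrative stand-in; it is not one of the commercial products alluded to above, and the file names and declared encodings in it are hypothetical.

    # Illustrative sketch only: chardet stands in for the commercial
    # encoding detectors mentioned above; file names are hypothetical.
    import chardet

    def guess_encoding(path, sample_size=64 * 1024):
        """Guess the encoding of a file from a fragment of its bytes."""
        with open(path, "rb") as f:
            fragment = f.read(sample_size)
        result = chardet.detect(fragment)   # e.g. {'encoding': 'GB2312', 'confidence': 0.73}
        return result["encoding"], result["confidence"]

    # Flag files whose detected encoding disagrees with what the export
    # process claims, so only those files go to a human reviewer.
    for path, declared in [("export_batch1.txt", "ISO-8859-1"),
                           ("export_batch2.txt", "Big5")]:
        detected, confidence = guess_encoding(path)
        if detected and detected.lower() != declared.lower():
            print(f"{path}: declared {declared}, detected {detected} ({confidence:.0%})")

Of course such statistical detection is fallible precisely on the mostly-ASCII or numeric exports described above, which is why the batch-tracking precaution still matters.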
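And here is a minimal sketch of the tracking-import-id arrangement mentioned earlier, assuming a simple two-table layout; every name in it (import_batches, records, assumed_encoding, and so on) is a hypothetical illustration, not the schema of any real database discussed in this thread.

    # Minimal sketch of the "tracking import id" idea: every imported
    # flat file becomes one batch, and every record remembers its batch.
    # All table and column names are hypothetical.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- One row per imported flat file: where the data came from and
        -- which encoding was assumed at import time.
        CREATE TABLE import_batches (
            import_id        INTEGER PRIMARY KEY,
            source_file      TEXT NOT NULL,
            assumed_encoding TEXT NOT NULL,
            imported_at      TEXT DEFAULT CURRENT_TIMESTAMP
        );
        -- Every record points back to its batch, so a batch found to be
        -- mis-decoded later can be exported and re-encoded as a whole.
        CREATE TABLE records (
            record_id  INTEGER PRIMARY KEY,
            import_id  INTEGER NOT NULL REFERENCES import_batches(import_id),
            payload    TEXT NOT NULL
        );
    """)

    def import_file(path, assumed_encoding):
        """Load one flat file as a single batch tagged with an import id."""
        cur = conn.execute(
            "INSERT INTO import_batches (source_file, assumed_encoding) VALUES (?, ?)",
            (path, assumed_encoding))
        import_id = cur.lastrowid
        with open(path, "rb") as f:
            for raw_line in f:
                text = raw_line.decode(assumed_encoding, errors="replace").rstrip("\n")
                conn.execute(
                    "INSERT INTO records (import_id, payload) VALUES (?, ?)",
                    (import_id, text))
        conn.commit()
        return import_id

    def export_batch(import_id):
        """Pull back every record of a suspect batch for review or re-encoding."""
        return [row[0] for row in conn.execute(
            "SELECT payload FROM records WHERE import_id = ? ORDER BY record_id",
            (import_id,))]

Because each batch records its source file and the encoding assumed at import time, a batch later found to be mis-decoded can be deleted and re-imported from the original file with the correct encoding, without touching the other batches.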

