On Tuesday, July 15, 2003 12:51 AM, Patrick Andries <[EMAIL PROTECTED]> wrote:
> ----- Original message -----
> From: "Philippe Verdy" <[EMAIL PROTECTED]>
>
> > On Monday, July 14, 2003 11:42 PM, Patrick Andries
> > <[EMAIL PROTECTED]> wrote:
> >
> > > In any case, I believe Peter has an idea how these libraries work
> > > and their limitations, he is rather looking for one with its
> > > limitations.
> >
> > Including the Chinese limitations? It will become tricky when
> > managing traditional or scientific texts with many rare ideographs,
> > because it's difficult to create an exhaustive morphological
> > analysis with Chinese.
>
> This product does no morphological analysis but uses a hidden Markov
> Model. Did you try it? (I just checked http://www.gov.tw/sars/ with
> http://quebec.alis.com/castil/essai_silc.cgi and it gave me Chinese,
> Big-5.)

I was able to find a few technical plain-text documents that are obviously English ASCII but were nonetheless identified as Chinese GBK. This includes some technical pages that I have (such as SpamCop.net abuse analysis pages, which are also sometimes interpreted as Chinese by Internet Explorer, or some Unicode text tables that contain pure US English ASCII among other numeric data).

I admit that such errors occur mostly on very technical documents. But technical documents that need correct identification of their encoding also include database table exports in flat files, where it is sometimes hard for a human to spot an incorrect encoding identification when the file contains lots of digits and separators, or people's names and addresses, with a large majority of lines using a familiar script and orthography.

For this reason, when I now import data from multiple files into a joint database, I add a tracking import id to each batch, which allows discovering later whether some batches require a special full export and re-encoding (a sketch of the idea appears further below): there is nothing worse than feeding a database with data incorrectly interpreted from multiple encodings, notably when the database is also very active and used in parallel by internal applications whose encoding is well controlled.

I know that there are some products that can identify a language/encoding pair on a fragment of a file or on a subselection of a database, but they are expensive. Having a human review each record of a large database is also costly, slow and error-prone.
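For what it's worth, fragment-level detection of that kind can at least be sketched with freely available code. The example below uses the open-source chardet library (a port of Mozilla's universal charset detector) purely as an illustrative stand-in; it is not one of the commercial products alluded to above, and the file names and declared encodings in it are hypothetical.

    # Illustrative sketch only: chardet stands in for the commercial
    # encoding detectors mentioned above; file names are hypothetical.
    import chardet

    def guess_encoding(path, sample_size=64 * 1024):
        """Guess the encoding of a file from a fragment of its bytes."""
        with open(path, "rb") as f:
            fragment = f.read(sample_size)
        result = chardet.detect(fragment)   # e.g. {'encoding': 'GB2312', 'confidence': 0.73}
        return result["encoding"], result["confidence"]

    # Flag files whose detected encoding disagrees with what the export
    # process claims, so only those files go to a human reviewer.
    for path, declared in [("export_batch1.txt", "ISO-8859-1"),
                           ("export_batch2.txt", "Big5")]:
        detected, confidence = guess_encoding(path)
        if detected and detected.lower() != declared.lower():
            print(f"{path}: declared {declared}, detected {detected} ({confidence:.0%})")

Of course such statistical detection is fallible precisely on the mostly-ASCII or numeric exports described above, which is why the batch-tracking precaution still matters.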
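And here is a minimal sketch of the tracking-import-id arrangement mentioned earlier, assuming a simple two-table layout; every name in it (import_batches, records, assumed_encoding, and so on) is a hypothetical illustration, not the schema of any real database discussed in this thread.

    # Minimal sketch of the "tracking import id" idea: every imported
    # flat file becomes one batch, and every record remembers its batch.
    # All table and column names are hypothetical.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- One row per imported flat file: where the data came from and
        -- which encoding was assumed at import time.
        CREATE TABLE import_batches (
            import_id        INTEGER PRIMARY KEY,
            source_file      TEXT NOT NULL,
            assumed_encoding TEXT NOT NULL,
            imported_at      TEXT DEFAULT CURRENT_TIMESTAMP
        );
        -- Every record points back to its batch, so a batch found to be
        -- mis-decoded later can be exported and re-encoded as a whole.
        CREATE TABLE records (
            record_id  INTEGER PRIMARY KEY,
            import_id  INTEGER NOT NULL REFERENCES import_batches(import_id),
            payload    TEXT NOT NULL
        );
    """)

    def import_file(path, assumed_encoding):
        """Load one flat file as a single batch tagged with an import id."""
        cur = conn.execute(
            "INSERT INTO import_batches (source_file, assumed_encoding) VALUES (?, ?)",
            (path, assumed_encoding))
        import_id = cur.lastrowid
        with open(path, "rb") as f:
            for raw_line in f:
                text = raw_line.decode(assumed_encoding, errors="replace").rstrip("\n")
                conn.execute(
                    "INSERT INTO records (import_id, payload) VALUES (?, ?)",
                    (import_id, text))
        conn.commit()
        return import_id

    def export_batch(import_id):
        """Pull back every record of a suspect batch for review or re-encoding."""
        return [row[0] for row in conn.execute(
            "SELECT payload FROM records WHERE import_id = ? ORDER BY record_id",
            (import_id,))]

Because each batch records its source file and the encoding assumed at import time, a batch later found to be mis-decoded can be deleted and re-imported from the original file with the correct encoding, without touching the other batches.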

