Re: encoding sniffing

Philippe Verdy Mon, 14 Jul 2003 15:50:53 -0700

On Monday, July 14, 2003 11:42 PM, Patrick Andries <[EMAIL PROTECTED]> wrote:


> In any case, I believe Peter has an idea how these libraries work and
> their limitations, he is rather looking for one with its limitations.

Including the limitations with Chinese? Things get tricky when handling traditional 
or scientific texts with many rare ideographs, because it is difficult to build an 
exhaustive morphological analysis for Chinese, even with the three-step approach. So 
a simple recognizer without any morphological or lexical database would be even more 
likely to fail unless it is given hints about the language, or at least about the 
main script (for example, excluding the Han script from the statistical results).

With the GB18030 encoding, this would be a real challenge because of its even larger 
overlap with the ASCII range. However, it's quite easy to determine which encoding a 
Chinese text uses from the byte or double-byte statistics alone.
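
To illustrate what I mean by double-byte statistics, here is a minimal sketch in 
Python, not a production detector. It assumes the input is mostly GB2312 (EUC-CN) or 
Big5, and it only counts how many byte pairs fall in each encoding's block of 
frequently used ideographs (lead bytes 0xB0-0xD7 for GB2312 level 1, lead bytes 
0xA4-0xC6 for the Big5 frequent set). The function names are mine, and the GBK 
extensions and four-byte GB18030 sequences are deliberately ignored.

```python
def double_byte_pairs(data: bytes):
    """Yield (lead, trail) for every byte >= 0x81 and the byte that follows it."""
    i = 0
    while i < len(data) - 1:
        if data[i] >= 0x81:
            yield data[i], data[i + 1]
            i += 2          # treat the two bytes as one double-byte character
        else:
            i += 1          # ASCII byte, skip it


def guess_chinese_encoding(data: bytes):
    """Return "gb2312", "big5", or None when the statistics are inconclusive."""
    pairs = list(double_byte_pairs(data))
    if not pairs:
        return None

    # GB2312 level-1 (frequent) hanzi: rows 16-55, trail bytes always >= 0xA1.
    gb_hits = sum(
        1 for lead, trail in pairs
        if 0xB0 <= lead <= 0xD7 and 0xA1 <= trail <= 0xFE
    )
    # Big5 frequent hanzi: 0xA440-0xC67E, trail bytes may fall in the ASCII range.
    big5_hits = sum(
        1 for lead, trail in pairs
        if 0xA4 <= lead <= 0xC6
        and (0x40 <= trail <= 0x7E or 0xA1 <= trail <= 0xFE)
    )

    if gb_hits == big5_hits:
        return None
    return "gb2312" if gb_hits > big5_hits else "big5"


if __name__ == "__main__":
    print(guess_chinese_encoding("中文编码检测的统计方法".encode("gb2312")))
    print(guess_chinese_encoding("中文編碼檢測的統計方法".encode("big5")))
```

On a very short sample the two counts can be close, so a real recognizer would want 
much more text and a proper frequency table rather than two fixed ranges.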
