At 06:43 PM 4/5/2002 -0800, you wrote:

>I'm working on a multi language spider, and I've come to a point where I'm
>not sure what assumption to make.

<BIG SNIP>

The solution to your problem is to use a language identifier.
A language identifier recognizes not only which language a text
is written in but also which character set it uses.  So all you
need to do is download the page and hand it to a language
identifier, which will tell you the language and character set.
Or you could do it a paragraph at a time, in case you are
dealing with a mixed-language document.
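
To make the paragraph-at-a-time idea concrete, here is a rough
Python sketch.  It is only a toy and not how our product works:
the stopword lists, the candidate charset list, and the function
names (guess_charset, guess_language, identify_by_paragraph) are
all made up for illustration.  A real identifier uses much richer
statistical models and many more character sets.

```python
import re

# Toy stopword profiles (assumption for illustration only);
# a real identifier uses far richer statistical models.
STOPWORDS = {
    "english": {"the", "and", "is", "of", "to", "in", "that", "it"},
    "french": {"le", "la", "et", "est", "les", "des", "une", "dans"},
    "german": {"der", "die", "und", "ist", "das", "nicht", "ein", "mit"},
}

# Candidate encodings tried in order. latin-1 accepts any byte
# sequence, so it acts as the last-resort fallback.
CANDIDATE_CHARSETS = ("utf-8", "latin-1")


def guess_charset(raw):
    """Return (charset, decoded text) for the first candidate that decodes."""
    for charset in CANDIDATE_CHARSETS:
        try:
            return charset, raw.decode(charset)
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate charset matched")


def guess_language(text):
    """Return the language whose stopwords occur most often, or None."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(word in stops for word in words)
        for lang, stops in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def identify_by_paragraph(raw):
    """Split a page on blank lines and identify each paragraph separately."""
    charset, text = guess_charset(raw)
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return charset, [(guess_language(p), p) for p in paragraphs]


if __name__ == "__main__":
    page = (
        "The cat is in the house and it is warm.\n\n"
        "Le chat est dans la maison et les enfants."
    ).encode("utf-8")
    charset, results = identify_by_paragraph(page)
    print(charset)
    for language, paragraph in results:
        print(language, "|", paragraph)
```

Working per paragraph costs a little more, but a spider can then
tag each block of a mixed-language page on its own instead of
forcing one label onto the whole document.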

It just so happens that we market one. ;-)  It supports ~230
languages in a variety of character sets, in addition to UTF-8
and big- and little-endian Unicode.  You can play with a simple
demo at: http://www.languageidentifier.com/ (though Chinese isn't
included in the demo).

We originally developed it to assist with language-specific
crawling, among other things.  Interestingly enough, we are
finishing up work on a Chinese text segmentation system.
(This puts spaces between the words in Chinese text so that you
can index and search it more efficiently.)
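
For a feel of what segmentation means, here is a minimal Python
sketch of greedy forward maximum matching, a common dictionary
baseline.  It is not a description of our system: the tiny
DICTIONARY and the segment function are assumptions made up for
the example, and a real segmenter needs a large lexicon plus
handling for ambiguity and unknown words.

```python
# Toy dictionary (assumption for illustration only).
DICTIONARY = {"我们", "喜欢", "搜索", "搜索引擎", "引擎"}
MAX_WORD_LEN = max(len(word) for word in DICTIONARY)


def segment(text):
    """Greedy forward maximum matching over DICTIONARY."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; a single character is
        # always accepted so the loop is guaranteed to advance.
        for size in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in DICTIONARY:
                words.append(candidate)
                i += size
                break
    return words


if __name__ == "__main__":
    # Joining with spaces gives text a word-based indexer can handle.
    print(" ".join(segment("我们喜欢搜索引擎")))
```

The space-separated output can then be fed to an ordinary
word-based indexer.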

If interested, please contact me at: [EMAIL PROTECTED]

-Art
-- 
Art Pollard
http://www.lextek.com/
Suppliers of High Performance Text Retrieval Engines.

