Re: [Tracker] Automatic Language Detection

Edward Duffy Mon, 05 Mar 2007 18:55:20 -0800

On 3/5/07, jamie <[EMAIL PROTECTED]> wrote:
> On Mon, 2007-03-05 at 18:19 -0500, Edward Duffy wrote:
> > Hi Guys -
> >
> > I just wrote a patch for #377891[1], could I get some of you to test
> > it.  I ran some pdfs I found with google.fr and google.it, and it
> > seems to be working correctly...but more eyes the better.
>


Both quotes below are from http://software.wise-guys.nl/libtextcat/languages.html
> great stuff but we only support utf-8 - are all those language modules
> utf-8 based?
>
"""Our main focus will be on compiling a list of fingerprints of UTF-8
encoded languages, since Unicode is clearly the way to go and UTF-8 is
usually the best way to do Unicode."""

It works (for my tests) if I encode the buffer to UTF-8 first, and
I've been able to get away with just sending the first 1K of the file.
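
A minimal sketch of that preparation step, for anyone who wants to reproduce the test. It is not the code from the patch: the function name `utf8_sample`, the default source encoding, and the 1024-byte limit are assumptions for illustration. It re-encodes the buffer as UTF-8 and keeps only the first 1K, taking care not to cut a multibyte character in half.

```python
def utf8_sample(raw: bytes, source_encoding: str = "latin-1", limit: int = 1024) -> bytes:
    """Return at most `limit` bytes of UTF-8 text taken from the start of `raw`.

    `source_encoding` is an assumption: the caller has to know (or guess)
    how the original buffer was encoded.
    """
    # Decode from the source encoding, then re-encode as UTF-8.
    text = raw.decode(source_encoding, errors="replace")
    encoded = text.encode("utf-8")
    if len(encoded) <= limit:
        return encoded
    # Cut at the byte limit, then drop any trailing partial multibyte sequence
    # by decoding with errors="ignore" and encoding again.
    return encoded[:limit].decode("utf-8", errors="ignore").encode("utf-8")


if __name__ == "__main__":
    sample = utf8_sample("déjà vu ".encode("latin-1") * 500)
    print(len(sample))
```

The resulting bytes are what would be handed to the classifier; the classifier call itself is left out here.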

> Also of interest is detecting CJK langs so we can automatically use
> pango to word break them.
>
After running about a dozen or so (supposedly) Japanese PDFs through
it with no luck, I saw this:

"""We were told that the East Asian language models (notably Chinese,
Korean, Japanese) may be less than adequate because of white space
issues. If you are a native speaker, you might be able to shed some
light on this issue."""

So... no, for now.
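
To illustrate the white space issue that note mentions, here is a small sketch. The sample strings and the helper name `whitespace_tokens` are made up for illustration and have nothing to do with libtextcat's internals. It only shows that splitting on white space yields separate words for English but a single undivided run for typical Japanese text, which is also why a dedicated word breaker such as pango is wanted for CJK.

```python
def whitespace_tokens(text: str) -> list:
    """Split text on runs of white space, as a naive tokenizer would."""
    return text.split()


english = "the quick brown fox"
japanese = "これは日本語の文章です"  # "This is a Japanese sentence", written without spaces

print(len(whitespace_tokens(english)))   # 4
print(len(whitespace_tokens(japanese)))  # 1
```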



> jamie.
_______________________________________________
tracker-list mailing list
[email protected]
http://mail.gnome.org/mailman/listinfo/tracker-list