On 3/5/07, jamie <[EMAIL PROTECTED]> wrote:
> On Mon, 2007-03-05 at 18:19 -0500, Edward Duffy wrote:
> > Hi Guys -
> >
> > I just wrote a patch for #377891[1], could I get some of you to test
> > it.  I ran some pdfs I found with google.fr and google.it, and it
> > seems to be working correctly...but more eyes the better.
>
Both quotes below are from
http://software.wise-guys.nl/libtextcat/languages.html

> great stuff but we only support utf-8 - are all those language modules
> utf-8 based?
>

"""Our main focus will be on compiling a list of fingerprints of UTF-8
encoded languages, since Unicode is clearly the way to go and UTF-8 is
usually the best way to do Unicode."""

It works (for my tests) if I encode the buffer to UTF-8 first, and I've
been able to get away with just sending the first 1K of the file.

> Also of interest is detecting CJK langs so we can automatically use
> pango to word break them.
>

After running about a dozen or so (supposedly) Japanese PDFs through it
with no luck, I saw this:

"""We were told that the East Asian language models (notably Chinese,
Korean, Japanese) may be less than adequate because of white space
issues. If you are a native speaker, you might be able to shed some
light on this issue."""

So... no, for now.

> jamie.

_______________________________________________
tracker-list mailing list
[email protected]
http://mail.gnome.org/mailman/listinfo/tracker-list
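For anyone who wants to try the same thing, here is a rough sketch of the
"encode the buffer to UTF-8 first, send only the first 1K" step mentioned
in the reply above. It is Python for illustration only and is not the
patch itself. The function name utf8_sample and the assumption that the
source encoding is already known are hypothetical, and the call into
libtextcat is left out. The one subtlety it shows is that cutting a UTF-8
buffer at a fixed byte count can split a multibyte character, so the
partial tail is dropped.

```python
def utf8_sample(raw: bytes, source_encoding: str, limit: int = 1024) -> bytes:
    """Re-encode raw text as UTF-8 and return at most `limit` bytes.

    Hypothetical helper: assumes the caller already knows the source
    encoding of the extracted text.
    """
    # Decode from the original encoding; undecodable bytes become U+FFFD.
    text = raw.decode(source_encoding, errors="replace")
    encoded = text.encode("utf-8")
    if len(encoded) <= limit:
        return encoded
    # Cut at the byte limit, then drop any incomplete multibyte sequence
    # left at the end so the sample is still valid UTF-8.
    return encoded[:limit].decode("utf-8", errors="ignore").encode("utf-8")


if __name__ == "__main__":
    latin1_bytes = "détection de la langue ".encode("latin-1") * 100
    sample = utf8_sample(latin1_bytes, "latin-1")
    print(len(sample), "bytes of valid UTF-8 ready for the classifier")
```

The returned sample is what would be handed to the language classifier.
1K is only the size that happened to work in the tests described above,
not a recommended value.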
