Re: Fails to detect language for UTF-8 file, but it works for ISO-latin

Paul Jakubik Tue, 24 Aug 2010 08:25:18 -0700

+1

Hopefully it will be easier to find an efficient algorithm for detecting
character encoding if language estimates are dropped.
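
As a rough illustration of what encoding detection without a language
estimate could look like, here is a minimal sketch in Python. It is not
what Tika or ICU4J actually do: the function name guess_encoding is made
up for this example, and it only tells UTF-8 apart from an assumed
ISO-8859-1 fallback by checking whether the bytes decode as valid UTF-8.

```python
from typing import Optional, Tuple


def guess_encoding(data: bytes) -> Tuple[str, Optional[str]]:
    """Hypothetical helper: return (encoding, language).

    The language slot is always None, because the encoding alone
    is not treated as evidence for any particular language.
    """
    # A UTF-8 byte order mark is an unambiguous signal.
    if data.startswith(b"\xef\xbb\xbf"):
        return "UTF-8", None
    try:
        # Strict decoding fails on byte sequences that are not valid UTF-8.
        # Pure ASCII input also passes, since ASCII is a subset of UTF-8.
        data.decode("utf-8")
        return "UTF-8", None
    except UnicodeDecodeError:
        # Assumption for this sketch: anything else is ISO-8859-1.
        # A real detector would weigh many more candidate encodings.
        return "ISO-8859-1", None


if __name__ == "__main__":
    for raw in ("Høydahl".encode("utf-8"), "Høydahl".encode("iso-8859-1")):
        print(guess_encoding(raw))
```

The point of the sketch is only that the encoding guess and the language
guess are separate questions, so the second can be left unset.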


On Tue, Aug 24, 2010 at 10:03 AM, Mattmann, Chris A (388J) <
[email protected]> wrote:

>  +1...
>
>
>
> On 8/24/10 8:00 AM, "Jukka Zitting" <[email protected]> wrote:
>
> Hi,
>
> On Sat, Aug 21, 2010 at 5:55 PM, Jan Høydahl / Cominvent
> <[email protected]> wrote:
> > Detected as english. The same is true for the other test language files.
> > It does not detect language for UTF-8 encoded files.
>
> The tika-app jar doesn't do language detection by default. The
> language metadata you're seeing is a result of the encoding-based
> language estimate that we get from the ICU4J code we're using.
> Apparently that data set categorizes ISO-8859-1 as an English-specific
> character encoding.
>
> We already dropped encoding-based language estimates from the HTML
> parser, and I think we should do the same also for plain text
> documents.
>
> BR,
>
> Jukka Zitting
>
>
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: [email protected]
> WWW:   http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
