+1

Hopefully it will be easier to find an efficient algorithm for detecting character encoding if language estimates are dropped.
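To illustrate why an encoding alone says little about language, here is a minimal Python sketch. It is not Tika or ICU4J code; the sample sentences and the helper names (`SAMPLES`, `encodable_as`, `languages_fitting`) are made up for this example. It only checks which sample texts can be represented in a given encoding.

```python
# Hypothetical sample sentences, chosen only for illustration.
SAMPLES = {
    "en": "The quick brown fox jumps over the lazy dog.",
    "fr": "Portez ce vieux whisky au juge blond qui fume.",
    "de": "Zwölf Boxkämpfer jagen Viktor quer über den großen Sylter Deich.",
    "no": "Høvdingens kjære squaw får litt pizza i Mexico by.",
}


def encodable_as(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def languages_fitting(encoding):
    """List the sample languages whose text can be stored in the encoding."""
    return sorted(
        lang for lang, text in SAMPLES.items() if encodable_as(text, encoding)
    )


if __name__ == "__main__":
    # Every sample fits ISO-8859-1, so the encoding cannot single out English.
    print(languages_fitting("iso-8859-1"))
```

Since all four samples fit ISO-8859-1, treating that encoding as a signal for English would mislabel the French, German and Norwegian text, which matches the behaviour Jan reported.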
On Tue, Aug 24, 2010 at 10:03 AM, Mattmann, Chris A (388J) <[email protected]> wrote:

> +1...
>
> On 8/24/10 8:00 AM, "Jukka Zitting" <[email protected]> wrote:
>
> Hi,
>
> On Sat, Aug 21, 2010 at 5:55 PM, Jan Høydahl / Cominvent
> <[email protected]> wrote:
> > Detected as english. The same is true for the other test language files.
> > It does not detect language for UTF-8 encoded files.
>
> The tika-app jar doesn't do language detection by default. The
> language metadata you're seeing is a result of the encoding-based
> language estimate that we get from the ICU4J code we're using.
> Apparently that data set categorizes ISO-8859-1 as an English-specific
> character encoding.
>
> We already dropped encoding-based language estimates from the HTML
> parser, and I think we should do the same also for plain text
> documents.
>
> BR,
>
> Jukka Zitting
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
