Tika language extraction

2010-06-10 Thread Sandhya Agarwal
Hello,

We have observed that Tika does not extract the Content-Language for
documents encoded in UTF-8. For natively encoded documents, it works fine.
Any idea how we can resolve this?

Thanks,
Sandhya


Re: Tika language extraction

2010-06-10 Thread Ken Krugler

Hi Sandhya,

> We have observed that Tika does not extract the Content-Language for
> documents encoded in UTF-8. For natively encoded documents, it works
> fine. Any idea how we can resolve this?


I would post this question to the u...@tika.apache.org mailing list,
and include more details about the type of document involved.


Tika's language detection is fairly weak. When the encoding is
universal (language-independent), such as UTF-8, the encoding itself
gives no hint about the language, so the resulting confidence level is
often too low for Tika to assume it has a good match, and it doesn't
report a language.
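
To illustrate the general idea of "no language reported when confidence is low", here is a minimal, self-contained sketch. It is not Tika's algorithm: the class name, the tiny stop-word lists, and the 0.2 cut-off are all hypothetical, chosen only to show a detector that returns nothing when its best score falls below a threshold.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

public class ThresholdLanguageGuess {
    // Hypothetical, tiny stop-word lists; a real detector uses far richer profiles.
    static final Map<String, Set<String>> STOP_WORDS = new LinkedHashMap<>();
    static {
        STOP_WORDS.put("en", Set.of("the", "and", "is", "of", "to", "in"));
        STOP_WORDS.put("de", Set.of("der", "die", "und", "ist", "das", "nicht"));
    }

    // Hypothetical cut-off: below this score, no language is reported.
    static final double MIN_CONFIDENCE = 0.2;

    // Fraction of the text's tokens that appear in the given stop-word list.
    static double score(String text, Set<String> stopWords) {
        String[] tokens = text.toLowerCase(Locale.ROOT).split("\\W+");
        long total = Arrays.stream(tokens).filter(t -> !t.isEmpty()).count();
        if (total == 0) {
            return 0.0;
        }
        long hits = Arrays.stream(tokens).filter(stopWords::contains).count();
        return (double) hits / total;
    }

    // Returns the best-scoring language, or empty if the best score is too low.
    static Optional<String> detect(String text) {
        String best = null;
        double bestScore = 0.0;
        for (Map.Entry<String, Set<String>> entry : STOP_WORDS.entrySet()) {
            double s = score(text, entry.getValue());
            if (s > bestScore) {
                bestScore = s;
                best = entry.getKey();
            }
        }
        return bestScore >= MIN_CONFIDENCE ? Optional.ofNullable(best) : Optional.empty();
    }

    public static void main(String[] args) {
        // Plenty of English stop words: a language is reported.
        System.out.println(detect("The cat is in the garden and the dog is asleep"));
        // No stop-word hits: the score stays below the cut-off, so nothing is reported.
        System.out.println(detect("Tika 0.7 PDF UTF-8 metadata"));
    }
}
```

In Tika itself, as far as I know, the comparable check is exposed through LanguageIdentifier.isReasonablyCertain(), so calling the identifier directly on the extracted text should show whether a guess was made but rejected as too uncertain.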


-- Ken


Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g