Html Parser autodetect charset

Dave French Thu, 20 Jun 2013 08:34:07 -0700

Hey,

In my use case of Tika, I render a webpage, take the contents of the page, 
and feed them into Tika.  The contents are already encoded in UTF-8 when I 
feed them in, but the HtmlParser uses the AutoDetectReader to try to 
determine the charset.  This means Tika relies on the page's meta tag to 
determine the charset, rather than on the encoding I already know.
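
To make the mismatch concrete, here is a minimal sketch that uses only the 
Java standard library.  It does not call Tika at all; the class 
CharsetMismatchDemo and its helpers declaredCharset and decode are 
hypothetical names that only mimic a "trust the meta tag" decision, assuming 
a page whose meta tag says ISO-8859-1 while the bytes are really UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetMismatchDemo {

    // Simplified stand-in for a meta-tag-based detector (NOT Tika's code).
    private static final Pattern META_CHARSET =
            Pattern.compile("charset=[\"']?([\\w-]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in the markup, or null if none is declared.
    static Charset declaredCharset(byte[] html) {
        // ISO-8859-1 maps every byte to one char, so the ASCII markup survives.
        String markup = new String(html, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(markup);
        return m.find() ? Charset.forName(m.group(1)) : null;
    }

    static String decode(byte[] html, Charset charset) {
        return new String(html, charset);
    }

    public static void main(String[] args) {
        // Assumed example page: the meta tag claims ISO-8859-1.
        String page = "<html><head><meta charset=\"ISO-8859-1\"></head>"
                + "<body>caf\u00e9</body></html>";
        // The bytes handed to the parser are UTF-8, as in my use case.
        byte[] utf8Bytes = page.getBytes(StandardCharsets.UTF_8);

        Charset declared = declaredCharset(utf8Bytes);
        String viaMetaTag = decode(utf8Bytes, declared);
        String viaKnownCharset = decode(utf8Bytes, StandardCharsets.UTF_8);

        System.out.println("declared charset: " + declared);
        System.out.println("decodings agree: " + viaMetaTag.equals(viaKnownCharset));
    }
}
```

Decoding with the declared charset turns the two UTF-8 bytes of the accented 
character into two separate characters, which is the kind of corruption I 
want to avoid by supplying the charset myself.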


Is there a way to bypass the AutoDetectReader and specify the charset 
directly?  Or, better yet, to inject the Detector that will be used?

Thanks for your help,
Dave
