Hi Markus,
  
My guess is that the ~32,000 characters of mostly ASCII-ish <script/> content 
are what is actually being used for encoding detection.  The HtmlEncodingDetector 
only looks at the first 8,192 characters, and the other encoding detectors have 
similar (though longer?) limits.
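
A quick way to confirm would be a check like the one below (just a sketch; the 
class name and "mixkasser.html" are placeholders for a saved copy of the page). 
It reports where the first charset= declaration sits in the bytes and what the 
HTML detector returns:

    import java.io.ByteArrayInputStream;
    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.html.HtmlEncodingDetector;

    public class CharsetWindowCheck {
        public static void main(String[] args) throws Exception {
            // saved copy of the problem page; the path is just a placeholder
            byte[] bytes = Files.readAllBytes(Paths.get("mixkasser.html"));

            // ISO-8859-1 maps each byte to exactly one char, so offsets line up
            String raw = new String(bytes, StandardCharsets.ISO_8859_1);
            int pos = raw.toLowerCase().indexOf("charset=");
            System.out.println("first 'charset=' at offset " + pos
                    + (pos > 8192 ? " -- beyond the 8,192-char window" : ""));

            // what the HTML detector returns on the raw bytes
            Charset detected = new HtmlEncodingDetector()
                    .detect(new ByteArrayInputStream(bytes), new Metadata());
            System.out.println("HtmlEncodingDetector: " + detected);
        }
    }

If that offset lands past 8,192, the meta declaration never gets seen and the 
statistical detectors are left to guess.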
 
At some point, I had a dev version of a stripper that removed the contents of 
<script/> and <style/> before trying to detect the encoding [0]... perhaps it is 
time to resurrect that code and integrate it?
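
Not the TIKA-2038 code, but roughly the idea -- a regex-based sketch (class and 
method names are just for illustration) that blanks out the bodies, keeps the 
tags, and round-trips through ISO-8859-1 so the underlying bytes are untouched:

    import java.nio.charset.StandardCharsets;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class ScriptStyleStripper {

        // bodies of <script>...</script> and <style>...</style> blocks
        private static final Pattern SCRIPT_STYLE = Pattern.compile(
                "(?is)(<(script|style)[^>]*>).*?(</\\2\\s*>)");

        /** Returns a copy of the bytes with script/style bodies removed. */
        public static byte[] strip(byte[] html) {
            // ISO-8859-1 is a lossless byte<->char round trip
            String s = new String(html, StandardCharsets.ISO_8859_1);
            Matcher m = SCRIPT_STYLE.matcher(s);
            StringBuffer sb = new StringBuffer(s.length());
            while (m.find()) {
                // keep the open/close tags, drop everything between them
                m.appendReplacement(sb,
                        Matcher.quoteReplacement(m.group(1) + m.group(3)));
            }
            m.appendTail(sb);
            return sb.toString().getBytes(StandardCharsets.ISO_8859_1);
        }
    }

Feeding strip(bytes) to the detectors instead of the raw stream would leave a 
lot more meaningful markup inside that 8,192-char window.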

Or, given that HTML has been, um, blossoming, perhaps, more simply, we should 
expand how far we look into a stream for detection?

Cheers,

               Tim

[0] https://issues.apache.org/jira/browse/TIKA-2038
   

-----Original Message-----
From: Markus Jelsma [mailto:[email protected]] 
Sent: Friday, October 27, 2017 8:39 AM
To: [email protected]
Subject: Incorrect encoding detected

Hello,

We have a problem with Tika, encoding and pages on this website: 
https://www.aarstiderne.com/frugt-groent-og-mere/mixkasser

Using Nutch with Tika 1.12, and also with Tika 1.16, we found that the regular 
HTML parser does a fine job, but our TikaParser has a tough time dealing with 
this HTML. For some reason Tika thinks Content-Encoding=windows-1252 is what 
this webpage declares, even though the page properly identifies itself as 
UTF-8.

Of all the websites we index, this is so far the only one giving us trouble 
with accents: we get fÃ¥ instead of a regular få.
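
That fÃ¥ looks exactly like the page's UTF-8 bytes being decoded as 
windows-1252; this little demo (class name is just for illustration) prints 
exactly that:

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class MojibakeDemo {
        public static void main(String[] args) {
            // "få" encoded as UTF-8, then decoded as windows-1252 -> prints "fÃ¥"
            byte[] utf8 = "få".getBytes(StandardCharsets.UTF_8);
            System.out.println(new String(utf8, Charset.forName("windows-1252")));
        }
    }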

Any tips you can spare?

Many many thanks!
Markus
