Hi Markus,
My guess is that the ~32,000 characters of mostly ASCII-ish <script/> content are
what the encoding detection is actually looking at. The HTMLEncodingDetector only
looks at the first 8,192 characters, so the page's UTF-8 <meta> declaration is
never seen, and the other encoding detectors have similar (but longer?) limits.
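To illustrate what I suspect is happening, here is a minimal, self-contained sketch
(the class and file name are made up; it assumes the 1.x HtmlEncodingDetector from
tika-parsers): a large ASCII <script> block pushes the <meta charset> past the
detector's read window, so detection comes back empty or wrong:

    import java.io.ByteArrayInputStream;
    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.html.HtmlEncodingDetector;

    public class DetectionWindowDemo {
        public static void main(String[] args) throws Exception {
            StringBuilder html = new StringBuilder("<html><head><script>");
            // ~32,000 ASCII characters of script, pushing the meta tag
            // well past the 8,192-character detection window
            for (int i = 0; i < 32_000; i++) {
                html.append('x');
            }
            html.append("</script><meta charset=\"utf-8\"/></head>")
                .append("<body>få</body></html>");

            Charset detected = new HtmlEncodingDetector().detect(
                    new ByteArrayInputStream(
                            html.toString().getBytes(StandardCharsets.UTF_8)),
                    new Metadata());

            // If the guess above is right, this prints null (or a fallback
            // guess) because the meta charset is never reached.
            System.out.println("detected = " + detected);
        }
    }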
At some point, I had a dev version of a stripper that removed contents of
<script/> and <style/> before trying to detect the encoding[0]...perhaps it is
time to resurrect that code and integrate it?
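For the record, the idea was roughly the following (a from-memory sketch in plain
Java, not the actual TIKA-2038 code; a real version would have to work on the raw
byte stream and be more careful about case, comments and CDATA): blank out the
bodies of <script> and <style> so any charset-bearing <meta> tags land inside
whatever window the detectors read.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class ScriptStyleStripper {

        // Match <script>...</script> and <style>...</style>, capturing the
        // opening and closing tags so only the contents get dropped.
        private static final Pattern SCRIPT_OR_STYLE = Pattern.compile(
                "(<(script|style)\\b[^>]*>).*?(</\\2\\s*>)",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

        public static String stripForDetection(String leadingHtml) {
            Matcher m = SCRIPT_OR_STYLE.matcher(leadingHtml);
            StringBuffer out = new StringBuffer();
            while (m.find()) {
                // keep the tags themselves, drop the (often huge) contents
                m.appendReplacement(out,
                        Matcher.quoteReplacement(m.group(1) + m.group(3)));
            }
            m.appendTail(out);
            return out.toString();
        }
    }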
Or, given that HTML has been, um, blossoming, perhaps, more simply, we should
expand how far we look into a stream for detection?
Cheers,
Tim
[0] https://issues.apache.org/jira/browse/TIKA-2038
-----Original Message-----
From: Markus Jelsma [mailto:[email protected]]
Sent: Friday, October 27, 2017 8:39 AM
To: [email protected]
Subject: Incorrect encoding detected
Hello,
We have a problem with Tika's encoding detection on pages of this website:
https://www.aarstiderne.com/frugt-groent-og-mere/mixkasser
Using Nutch with Tika 1.12, and also with Tika 1.16, we found that the regular
HTML parser does a fine job, but our TikaParser has a tough time dealing with
this HTML. For some reason Tika thinks Content-Encoding=windows-1252 is what
this webpage says it is, even though the page properly identifies itself as
UTF-8.
Of all the websites we index, this is so far the only one giving us trouble with
accents: we get fÃ¥ instead of the regular få.
Any tips to share?
Many many thanks!
Markus