Hello,

We have a problem with Tika and character encoding on pages of this website:
https://www.aarstiderne.com/frugt-groent-og-mere/mixkasser

Using Nutch with Tika 1.12, and also with Tika 1.16, we found that the 
regular HTML parser handles this page fine, but our TikaParser struggles 
with it. For some reason Tika reports Content-Encoding=windows-1252 for this 
page, even though the page correctly identifies itself as UTF-8.

Of all the websites we index, this is so far the only one that gives us 
trouble with accented characters: we get fÃ¥ instead of a regular få.
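
For reference, the symptom is consistent with the page's UTF-8 bytes being 
decoded as windows-1252. Below is a minimal Python sketch of that byte-level 
effect. It is only an illustration, not Nutch or Tika code, and the variable 
names are made up for the example.

```python
# Illustration only: how UTF-8 bytes turn into mojibake when they are
# decoded with the wrong charset (windows-1252 instead of UTF-8).
original = "få"

# "å" (U+00E5) is two bytes in UTF-8: 0xC3 0xA5.
utf8_bytes = original.encode("utf-8")

# Decoding those bytes as windows-1252 maps 0xC3 to "Ã" and 0xA5 to "¥".
garbled = utf8_bytes.decode("windows-1252")

# Reversing the wrong decode step recovers the original text, which
# suggests the bytes themselves are fine and only the detected charset is off.
repaired = garbled.encode("windows-1252").decode("utf-8")

print(garbled)
print(repaired)
```

If that is what happens here, the fetched bytes are intact and only the 
charset Tika settles on is wrong.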

Any tips to spare? 

Many many thanks!
Markus
