Unfortunately there is no way to do this now. _I think_ we could make this configurable, though, fairly easily. Please open a ticket.
The next RC for PDFBox might be out next week, and we'll try to release Tika 1.17 shortly after that...so there should be time to get this in.

-----Original Message-----
From: Markus Jelsma [mailto:[email protected]]
Sent: Friday, October 27, 2017 9:12 AM
To: [email protected]
Subject: RE: Incorrect encoding detected

Hello Tim,

Getting rid of script and style contents sounds plausible indeed. But to work around the problem for now, can I instruct HTMLEncodingDetector from within Nutch to look beyond the limit?

Thanks!
Markus

-----Original message-----
> From: Allison, Timothy B. <[email protected]>
> Sent: Friday 27th October 2017 14:53
> To: [email protected]
> Subject: RE: Incorrect encoding detected
>
> Hi Markus,
>
> My guess is that the ~32,000 characters of mostly ascii-ish <script/> are
> what is actually being used for encoding detection. The HTMLEncodingDetector
> only looks in the first 8,192 characters, and the other encoding detectors
> have similar (but longer?) restrictions.
>
> At some point, I had a dev version of a stripper that removed contents of
> <script/> and <style/> before trying to detect the encoding[0]...perhaps it
> is time to resurrect that code and integrate it?
>
> Or, given that HTML has been, um, blossoming, perhaps, more simply, we should
> expand how far we look into a stream for detection?
>
> Cheers,
>
> Tim
>
> [0] https://issues.apache.org/jira/browse/TIKA-2038
>
>
> -----Original Message-----
> From: Markus Jelsma [mailto:[email protected]]
> Sent: Friday, October 27, 2017 8:39 AM
> To: [email protected]
> Subject: Incorrect encoding detected
>
> Hello,
>
> We have a problem with Tika, encoding and pages on this website:
> https://www.aarstiderne.com/frugt-groent-og-mere/mixkasser
>
> Using Nutch and Tika 1.12, but also using Tika 1.16, we found out that the
> regular HTML parser does a fine job, but our TikaParser has a tough job
> dealing with this HTML. For some reason Tika thinks
> Content-Encoding=windows-1252 is what this webpage says it is, whereas the
> page identifies itself properly as UTF-8.
>
> Of all websites we index, this is so far the only one giving trouble indexing
> accents, getting fÃ¥ instead of a regular få.
>
> Any tips to spare?
>
> Many many thanks!
> Markus
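
For anyone reading this thread later, here is a rough Python sketch of the failure mode being discussed. It is not Tika's code and does not reproduce how Tika chains its detectors. The names MARK_LIMIT and sniff_meta_charset, the regular expression, the synthetic page, and the fallback to windows-1252 are all assumptions made up for illustration. It only shows that a charset declaration sitting behind roughly 32,000 bytes of script is invisible to a scan capped at 8,192 bytes, and that decoding UTF-8 bytes as windows-1252 produces the "fÃ¥" reported above.

```python
import re

# Hypothetical cap, mirroring the 8,192 figure quoted in the thread.
MARK_LIMIT = 8192

# Simplified pattern for a <meta ... charset=...> declaration (illustrative only).
META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE
)


def sniff_meta_charset(raw, limit=MARK_LIMIT):
    """Return the charset declared in a <meta> tag within the first
    `limit` bytes of `raw`, or None if no declaration is found there."""
    match = META_CHARSET.search(raw[:limit])
    return match.group(1).decode("ascii").lower() if match else None


# Synthetic page: about 32,000 bytes of ASCII script ahead of the declaration.
script = b"<script>" + b"var x = 1;" * 3200 + b"</script>"
body = "få".encode("utf-8")
page = (
    b"<html><head>" + script + b'<meta charset="utf-8"></head><body>'
    + body + b"</body></html>"
)

# With the cap, the declaration is never seen.
declared = sniff_meta_charset(page)

# Scanning the whole page finds it.
declared_wide = sniff_meta_charset(page, limit=len(page))

# Assumed fallback when nothing is declared: guess windows-1252.
fallback = declared or "windows-1252"

text_wrong = body.decode(fallback)       # UTF-8 bytes read as windows-1252
text_right = body.decode(declared_wide)  # UTF-8 bytes read as UTF-8

print(declared, declared_wide, text_wrong, text_right)
```

Raising the limit (or stripping script and style contents before detection, as suggested above) are two ways to let the declaration be seen; which one Tika ends up using is for the ticket to settle.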
