I haven't had time to dig into the problem: neither how to pass a tika-config file, nor why parse-html actually detects the encoding correctly although it also only looks at the first 8192 characters (see CHUNK_SIZE).
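To make the prefix limit concrete, here is a minimal sketch of a detector that only scans a fixed-size window for a meta charset declaration. It is not the code of parse-html or of Tika's HTMLEncodingDetector: the regular expression and the name `sniff_meta_charset` are hypothetical, and only the 8192 limit and the roughly 32,000 characters of script come from this thread.

```python
import re

CHUNK_SIZE = 8192  # prefix length mentioned in the thread

# Hypothetical, simplified pattern for a charset declared inside a <meta> tag.
META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE
)


def sniff_meta_charset(content: bytes, limit: int = CHUNK_SIZE):
    """Return the charset declared in a <meta> tag within the first
    `limit` bytes of `content`, or None if none is found there."""
    match = META_CHARSET.search(content[:limit])
    return match.group(1).decode("ascii").lower() if match else None


meta = b'<meta charset="utf-8">'
filler = b"x" * 32000  # stands in for a long inline script

# Declaration before the script: inside the scanned window.
early = b"<html><head>" + meta + b"<script>" + filler + b"</script></head></html>"
# Declaration after the script: beyond the scanned window.
late = b"<html><head><script>" + filler + b"</script>" + meta + b"</head></html>"

print(sniff_meta_charset(early))
print(sniff_meta_charset(late))
print(sniff_meta_charset(late, limit=65536))
```

With the default limit the declaration in `late` is never seen, which is the situation a configurable detection limit would address.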
Just one point: for the MIME detection we also pass the Content-Type sent by the web server to Tika. Could it also help to pass it as an additional clue for encoding detection? In the concrete example the server sends

  Content-Type: text/html; charset=utf-8

Sebastian

On 11/01/2017 07:06 PM, Markus Jelsma wrote:
> Any ideas?
>
> Thanks!
>
>
>
> -----Original message-----
>> From:Markus Jelsma <[email protected]>
>> Sent: Tuesday 31st October 2017 13:14
>> To: User <[email protected]>
>> Subject: FW: Incorrect encoding detected
>>
>> I actually don't know, can we specify a tika-config file in Nutch?
>>
>> Thanks,
>> Markus
>>
>> -----Original message-----
>>> From:Allison, Timothy B. <[email protected]>
>>> Sent: Tuesday 31st October 2017 13:11
>>> To: [email protected]
>>> Subject: RE: Incorrect encoding detected
>>>
>>> For 1.17, the simplest solution, I think, is to allow users to configure
>>> extending the detection limit via our @Field config methods, that is, via
>>> tika-config.xml.
>>>
>>> To confirm, Nutch will allow users to specify a tika-config file? Will
>>> this work for you and Nutch?
>>>
>>> -----Original Message-----
>>> From: Markus Jelsma [mailto:[email protected]]
>>> Sent: Tuesday, October 31, 2017 5:47 AM
>>> To: [email protected]
>>> Subject: RE: Incorrect encoding detected
>>>
>>> Hello Timothy - what would be your preferred solution? Increase detection
>>> limit or skip inline styles and possibly other useless head information?
>>>
>>> Thanks,
>>> Markus
>>>
>>>
>>>
>>> -----Original message-----
>>>> From:Markus Jelsma <[email protected]>
>>>> Sent: Friday 27th October 2017 15:37
>>>> To: [email protected]
>>>> Subject: RE: Incorrect encoding detected
>>>>
>>>> Hi Tim,
>>>>
>>>> I have opened TIKA-2485 to track the problem.
>>>>
>>>> Thank you very very much!
>>>> Markus
>>>>
>>>>
>>>>
>>>> -----Original message-----
>>>>> From:Allison, Timothy B. <[email protected]>
>>>>> Sent: Friday 27th October 2017 15:33
>>>>> To: [email protected]
>>>>> Subject: RE: Incorrect encoding detected
>>>>>
>>>>> Unfortunately there is no way to do this now. _I think_ we could make
>>>>> this configurable, though, fairly easily. Please open a ticket.
>>>>>
>>>>> The next RC for PDFBox might be out next week, and we'll try to release
>>>>> Tika 1.17 shortly after that...so there should be time to get this in.
>>>>>
>>>>> -----Original Message-----
>>>>> From: Markus Jelsma [mailto:[email protected]]
>>>>> Sent: Friday, October 27, 2017 9:12 AM
>>>>> To: [email protected]
>>>>> Subject: RE: Incorrect encoding detected
>>>>>
>>>>> Hello Tim,
>>>>>
>>>>> Getting rid of script and style contents sounds plausible indeed. But to
>>>>> work around the problem for now, can i instruct HTMLEncodingDetector from
>>>>> within Nutch to look beyond the limit?
>>>>>
>>>>> Thanks!
>>>>> Markus
>>>>>
>>>>>
>>>>>
>>>>> -----Original message-----
>>>>>> From:Allison, Timothy B. <[email protected]>
>>>>>> Sent: Friday 27th October 2017 14:53
>>>>>> To: [email protected]
>>>>>> Subject: RE: Incorrect encoding detected
>>>>>>
>>>>>> Hi Markus,
>>>>>>
>>>>>> My guess is that the ~32,000 characters of mostly ascii-ish <script/>
>>>>>> are what is actually being used for encoding detection. The
>>>>>> HTMLEncodingDetector only looks in the first 8,192 characters, and the
>>>>>> other encoding detectors have similar (but longer?) restrictions.
>>>>>>
>>>>>> At some point, I had a dev version of a stripper that removed contents
>>>>>> of <script/> and <style/> before trying to detect the
>>>>>> encoding[0]...perhaps it is time to resurrect that code and integrate it?
>>>>>>
>>>>>> Or, given that HTML has been, um, blossoming, perhaps, more simply, we
>>>>>> should expand how far we look into a stream for detection?
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Tim
>>>>>>
>>>>>> [0] https://issues.apache.org/jira/browse/TIKA-2038
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Markus Jelsma [mailto:[email protected]]
>>>>>> Sent: Friday, October 27, 2017 8:39 AM
>>>>>> To: [email protected]
>>>>>> Subject: Incorrect encoding detected
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> We have a problem with Tika, encoding and pages on this website:
>>>>>> https://www.aarstiderne.com/frugt-groent-og-mere/mixkasser
>>>>>>
>>>>>> Using Nutch and Tika 1.12, but also using Tika 1.16, we found out that
>>>>>> the regular HTML parser does a fine job, but our TikaParser has a tough
>>>>>> job dealing with this HTML. For some reason Tika thinks
>>>>>> Content-Encoding=windows-1252 is what this webpage says it is, instead
>>>>>> the page identifies itself properly as UTF-8.
>>>>>>
>>>>>> Of all websites we index, this is so far the only one giving trouble
>>>>>> indexing accents, getting fÃ¥ instead of a regular få.
>>>>>>
>>>>>> Any tips to spare?
>>>>>>
>>>>>> Many many thanks!
>>>>>> Markus
>>>>>>
>>>>>
>>>>
>>>
>>
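
The symptom reported in the original message can be reproduced outside Nutch and Tika. The sketch below decodes the UTF-8 bytes of "få" as windows-1252, and then shows how the charset parameter of the server's Content-Type header could serve as a hint. The helper `charset_from_content_type` is hypothetical and is not part of Nutch or Tika; whether and how Tika would accept such a hint is left open here.

```python
def charset_from_content_type(header: str):
    """Extract the charset parameter from a Content-Type header value.

    Hypothetical helper for illustration; returns None when the header
    carries no charset parameter.
    """
    for param in header.split(";")[1:]:
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset":
            return value.strip().strip("\"'").lower() or None
    return None


raw = "få".encode("utf-8")  # bytes as served by the page

# Decoding UTF-8 bytes as windows-1252 yields the garbled text from the report.
mojibake = raw.decode("windows-1252")
print(mojibake)  # fÃ¥

# Using the header's charset as a hint gives the intended text.
hint = charset_from_content_type("text/html; charset=utf-8")
repaired = raw.decode(hint or "windows-1252")
print(repaired)  # få
```

The fallback to windows-1252 when no charset is present is only an assumption made for this sketch, chosen to mirror the wrong guess described in the thread.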

