Hello Sebastian,

I just spotted tika.config.file in the TikaParser, so that is how we can point it at a specific config file.
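Something like this in nutch-site.xml should do it. Untested, and I am assuming the value is resolved like other Nutch conf resources, i.e. a file name found on the classpath / in the conf directory:

  <property>
    <name>tika.config.file</name>
    <value>tika-config.xml</value>
    <description>Custom Tika configuration used by the parse-tika plugin.</description>
  </property>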
Meanwhile Timothy Allison committed a fix. I will try the nightly build tomorrow.

Thanks,
Markus

-----Original message-----
> From:Sebastian Nagel <[email protected]>
> Sent: Thursday 2nd November 2017 13:32
> To: [email protected]
> Subject: Re: Incorrect encoding detected
>
> I haven't had the time to dig into the problem:
> neither into how to pass a tika-config file, nor into why
> parse-html actually detects the encoding correctly
> although it also only looks at the first 8192
> characters (see CHUNK_SIZE).
>
> Just one point: for MIME detection we also
> pass the Content-Type sent by the web server to Tika.
> Could it also help to pass it as an additional clue?
> In the concrete example the server sends
> Content-Type: text/html; charset=utf-8
>
> Sebastian
>
> On 11/01/2017 07:06 PM, Markus Jelsma wrote:
> > Any ideas?
> >
> > Thanks!
> >
> > -----Original message-----
> >> From:Markus Jelsma <[email protected]>
> >> Sent: Tuesday 31st October 2017 13:14
> >> To: User <[email protected]>
> >> Subject: FW: Incorrect encoding detected
> >>
> >> I actually don't know: can we specify a tika-config file in Nutch?
> >>
> >> Thanks,
> >> Markus
> >>
> >> -----Original message-----
> >>> From:Allison, Timothy B. <[email protected]>
> >>> Sent: Tuesday 31st October 2017 13:11
> >>> To: [email protected]
> >>> Subject: RE: Incorrect encoding detected
> >>>
> >>> For 1.17, the simplest solution, I think, is to allow users to configure
> >>> extending the detection limit via our @Field config methods, that is, via
> >>> tika-config.xml.
> >>>
> >>> To confirm, Nutch will allow users to specify a tika-config file? Will
> >>> this work for you and Nutch?
> >>>
> >>> -----Original Message-----
> >>> From: Markus Jelsma [mailto:[email protected]]
> >>> Sent: Tuesday, October 31, 2017 5:47 AM
> >>> To: [email protected]
> >>> Subject: RE: Incorrect encoding detected
> >>>
> >>> Hello Timothy - what would be your preferred solution? Increase the
> >>> detection limit, or skip inline styles and possibly other useless head
> >>> information?
> >>>
> >>> Thanks,
> >>> Markus
> >>>
> >>> -----Original message-----
> >>>> From:Markus Jelsma <[email protected]>
> >>>> Sent: Friday 27th October 2017 15:37
> >>>> To: [email protected]
> >>>> Subject: RE: Incorrect encoding detected
> >>>>
> >>>> Hi Tim,
> >>>>
> >>>> I have opened TIKA-2485 to track the problem.
> >>>>
> >>>> Thank you very very much!
> >>>> Markus
> >>>>
> >>>> -----Original message-----
> >>>>> From:Allison, Timothy B. <[email protected]>
> >>>>> Sent: Friday 27th October 2017 15:33
> >>>>> To: [email protected]
> >>>>> Subject: RE: Incorrect encoding detected
> >>>>>
> >>>>> Unfortunately there is no way to do this now. _I think_ we could make
> >>>>> this configurable, though, fairly easily. Please open a ticket.
> >>>>>
> >>>>> The next RC for PDFBox might be out next week, and we'll try to release
> >>>>> Tika 1.17 shortly after that...so there should be time to get this in.
> >>>>>
> >>>>> -----Original Message-----
> >>>>> From: Markus Jelsma [mailto:[email protected]]
> >>>>> Sent: Friday, October 27, 2017 9:12 AM
> >>>>> To: [email protected]
> >>>>> Subject: RE: Incorrect encoding detected
> >>>>>
> >>>>> Hello Tim,
> >>>>>
> >>>>> Getting rid of script and style contents sounds plausible indeed. But
> >>>>> to work around the problem for now, can I instruct HTMLEncodingDetector
> >>>>> from within Nutch to look beyond the limit?
> >>>>>
> >>>>> Thanks!
> >>>>> Markus
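[Adding a note inline: now that Tim's fix is committed, this should become possible by pointing tika.config.file at a custom tika-config.xml. A rough sketch of what I plan to try against the nightly. I am assuming the new knob is exposed as a markLimit parameter on HtmlEncodingDetector, and that listing detectors explicitly replaces the default chain, so I list the other default detectors too; neither assumption is verified yet:

  <?xml version="1.0" encoding="UTF-8"?>
  <properties>
    <encodingDetectors>
      <!-- markLimit is the assumed parameter name: how many bytes the detector may inspect -->
      <encodingDetector class="org.apache.tika.parser.html.HtmlEncodingDetector">
        <params>
          <param name="markLimit" type="int">65536</param>
        </params>
      </encodingDetector>
      <!-- keep the remaining default detectors in the chain -->
      <encodingDetector class="org.apache.tika.parser.txt.UniversalEncodingDetector"/>
      <encodingDetector class="org.apache.tika.parser.txt.Icu4jEncodingDetector"/>
    </encodingDetectors>
  </properties>

If that works, it also answers my question above.]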
> >>>>>
> >>>>> -----Original message-----
> >>>>>> From:Allison, Timothy B. <[email protected]>
> >>>>>> Sent: Friday 27th October 2017 14:53
> >>>>>> To: [email protected]
> >>>>>> Subject: RE: Incorrect encoding detected
> >>>>>>
> >>>>>> Hi Markus,
> >>>>>>
> >>>>>> My guess is that the ~32,000 characters of mostly ascii-ish <script/>
> >>>>>> are what is actually being used for encoding detection. The
> >>>>>> HTMLEncodingDetector only looks in the first 8,192 characters, and the
> >>>>>> other encoding detectors have similar (but longer?) restrictions.
> >>>>>>
> >>>>>> At some point, I had a dev version of a stripper that removed the contents
> >>>>>> of <script/> and <style/> before trying to detect the
> >>>>>> encoding [0]... perhaps it is time to resurrect that code and integrate
> >>>>>> it?
> >>>>>>
> >>>>>> Or, given that HTML has been, um, blossoming, perhaps, more simply, we
> >>>>>> should expand how far we look into a stream for detection?
> >>>>>>
> >>>>>> Cheers,
> >>>>>>
> >>>>>> Tim
> >>>>>>
> >>>>>> [0] https://issues.apache.org/jira/browse/TIKA-2038
> >>>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Markus Jelsma [mailto:[email protected]]
> >>>>>> Sent: Friday, October 27, 2017 8:39 AM
> >>>>>> To: [email protected]
> >>>>>> Subject: Incorrect encoding detected
> >>>>>>
> >>>>>> Hello,
> >>>>>>
> >>>>>> We have a problem with Tika, encoding and pages on this website:
> >>>>>> https://www.aarstiderne.com/frugt-groent-og-mere/mixkasser
> >>>>>>
> >>>>>> Using Nutch with Tika 1.12, but also with Tika 1.16, we found that
> >>>>>> the regular HTML parser does a fine job, but our TikaParser has a
> >>>>>> tough time dealing with this HTML. For some reason Tika decides this
> >>>>>> page is Content-Encoding=windows-1252, even though the page
> >>>>>> identifies itself properly as UTF-8.
> >>>>>>
> >>>>>> Of all websites we index, this is so far the only one giving trouble
> >>>>>> indexing accents: we get fÃ¥ instead of a regular få.
> >>>>>>
> >>>>>> Any tips you can spare?
> >>>>>>
> >>>>>> Many many thanks!
> >>>>>> Markus
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
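PS, Sebastian: on your point about passing the server's Content-Type as an additional clue, on the Tika side that would presumably just mean setting it on the Metadata before parsing. A minimal, untested sketch; I am assuming the charset parameter in CONTENT_TYPE is picked up as a declared-encoding hint by the detector chain, and I do not know whether it would win over what the meta-tag detector finds in this page:

  import java.io.InputStream;
  import java.nio.file.Files;
  import java.nio.file.Paths;

  import org.apache.tika.metadata.Metadata;
  import org.apache.tika.parser.AutoDetectParser;
  import org.apache.tika.parser.ParseContext;
  import org.apache.tika.sax.BodyContentHandler;

  public class EncodingHintTest {
      public static void main(String[] args) throws Exception {
          AutoDetectParser parser = new AutoDetectParser();
          BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
          Metadata metadata = new Metadata();
          // Hand Tika the Content-Type the web server sent, charset included,
          // so the detectors can use it as a hint.
          metadata.set(Metadata.CONTENT_TYPE, "text/html; charset=utf-8");
          // mixkasser.html: a locally saved copy of the problematic page (hypothetical name)
          try (InputStream in = Files.newInputStream(Paths.get("mixkasser.html"))) {
              parser.parse(in, handler, metadata, new ParseContext());
          }
          // Should print utf-8 if the hint (or a raised mark limit) is honoured.
          System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
      }
  }

Either way, I will report back after trying the nightly.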

