Hello Tim,

Thanks! I will try the nightly build tomorrow!
Nutch probably already has support for tika-config. I couldn't find it in the config, but in the code I spotted support for tika.config.file.

Many, many thanks!
Markus

-----Original message-----
> From: Allison, Timothy B. <[email protected]>
> Sent: Thursday 2nd November 2017 14:56
> To: [email protected]
> Subject: RE: Incorrect encoding detected
>
> Hi Markus,
> I just committed TIKA-2485. See the issue for the commit, if you have to make these changes on your local Tika build.
>
> Looks like tika-config.xml was not added here: https://issues.apache.org/jira/browse/NUTCH-577. I wonder if it was added to Nutch later. If it wasn't, I'd highly encourage re-opening this issue and adding it back in!
>
> To build an AutoDetectParser from a tika-config.xml file, do something like this (but with correct exception handling/closing!!!):
>
> TikaConfig tikaConfig = new TikaConfig(
>     getResourceAsStream("/org/apache/tika/config/TIKA-2485-encoding-detector-mark-limits.xml"));
>
> AutoDetectParser p = new AutoDetectParser(tikaConfig);
>
> Note that the order of the encoding detectors matters! The first one that returns a non-null result is the one that Tika uses. The default encoding detector order is as I specified it in "TIKA-2485-encoding-detector-mark-limits.xml": HTML, Universal, ICU4j. The default order is specified via SPI in tika-parsers/src/main/resources/META-INF/services/o.a.t.detect.EncodingDetector.
>
> Let us know if there's anything else we can do.
>
> Best,
>
> Tim
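A slightly fuller sketch of the snippet above, for reference: the stream is closed via try-with-resources, and the resource path is the TIKA-2485 test config Tim mentions; in a real deployment you would point it at your own tika-config.xml. The class name ConfiguredParserExample is made up for the example, and the exception handling here is illustrative, not the only correct way.

import java.io.InputStream;

import org.apache.tika.config.TikaConfig;
import org.apache.tika.parser.AutoDetectParser;

public class ConfiguredParserExample {

    // Builds an AutoDetectParser from a tika-config.xml on the classpath.
    // The config controls, among other things, which EncodingDetectors run
    // and in what order; the first detector returning non-null wins.
    public static AutoDetectParser buildParser() throws Exception {
        try (InputStream is = ConfiguredParserExample.class.getResourceAsStream(
                "/org/apache/tika/config/TIKA-2485-encoding-detector-mark-limits.xml")) {
            TikaConfig tikaConfig = new TikaConfig(is);
            return new AutoDetectParser(tikaConfig);
        }
    }
}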
> -----Original Message-----
> From: Markus Jelsma [mailto:[email protected]]
> Sent: Wednesday, November 1, 2017 5:32 PM
> To: [email protected]
> Subject: RE: Incorrect encoding detected
>
> Alright, the Nutch list could not provide an answer and I don't know myself. But, if Nutch can't, we can make it happen. Can you direct me to a page that explains how tika-config has to be passed to Tika? We have full control over what we put into the Parser, e.g. ContentHandler, Context, etc.
>
> If we can do that, I just need to know what to set to increase the limit. I am unaware of Tika having @Field config methods; it's new to me.
>
> But you said it was not supported yet, so that would mean the content limit would not be configurable via @Field config?
>
> That is fine too, but I really need a short-term solution. If needed, I can manually patch Tika and have our (I am not speaking as a Nutch committer right now) parser use an in-house compiled Tika.
>
> I checked the encoding package in tika-core. There are many detection classes there, but I really have no idea which detector Nutch (or Tika by default) uses under the hood. I could not easily find the file in which I could increase the limit.
>
> I am happy with this hack for a brief time, until it is supported by a new Tika version. Can you direct me to the class I should modify?
>
> Many, many thanks!
> Markus
>
> -----Original message-----
> > From: Allison, Timothy B. <[email protected]>
> > Sent: Tuesday 31st October 2017 13:11
> > To: [email protected]
> > Subject: RE: Incorrect encoding detected
> >
> > For 1.17, the simplest solution, I think, is to allow users to configure extending the detection limit via our @Field config methods, that is, via tika-config.xml.
> >
> > To confirm, Nutch will allow users to specify a tika-config file? Will this work for you and Nutch?
> >
> > -----Original Message-----
> > From: Markus Jelsma [mailto:[email protected]]
> > Sent: Tuesday, October 31, 2017 5:47 AM
> > To: [email protected]
> > Subject: RE: Incorrect encoding detected
> >
> > Hello Timothy - what would be your preferred solution? Increase the detection limit, or skip inline styles and possibly other useless head information?
> >
> > Thanks,
> > Markus
> >
> > -----Original message-----
> > > From: Markus Jelsma <[email protected]>
> > > Sent: Friday 27th October 2017 15:37
> > > To: [email protected]
> > > Subject: RE: Incorrect encoding detected
> > >
> > > Hi Tim,
> > >
> > > I have opened TIKA-2485 to track the problem.
> > >
> > > Thank you very very much!
> > > Markus
> > >
> > > -----Original message-----
> > > > From: Allison, Timothy B. <[email protected]>
> > > > Sent: Friday 27th October 2017 15:33
> > > > To: [email protected]
> > > > Subject: RE: Incorrect encoding detected
> > > >
> > > > Unfortunately there is no way to do this now. _I think_ we could make this configurable, though, fairly easily. Please open a ticket.
> > > >
> > > > The next RC for PDFBox might be out next week, and we'll try to release Tika 1.17 shortly after that... so there should be time to get this in.
> > > >
> > > > -----Original Message-----
> > > > From: Markus Jelsma [mailto:[email protected]]
> > > > Sent: Friday, October 27, 2017 9:12 AM
> > > > To: [email protected]
> > > > Subject: RE: Incorrect encoding detected
> > > >
> > > > Hello Tim,
> > > >
> > > > Getting rid of script and style contents sounds plausible indeed. But to work around the problem for now, can I instruct HTMLEncodingDetector from within Nutch to look beyond the limit?
> > > >
> > > > Thanks!
> > > > Markus
> > > >
> > > > -----Original message-----
> > > > > From: Allison, Timothy B. <[email protected]>
> > > > > Sent: Friday 27th October 2017 14:53
> > > > > To: [email protected]
> > > > > Subject: RE: Incorrect encoding detected
> > > > >
> > > > > Hi Markus,
> > > > >
> > > > > My guess is that the ~32,000 characters of mostly ascii-ish <script/> are what is actually being used for encoding detection. The HTMLEncodingDetector only looks in the first 8,192 characters, and the other encoding detectors have similar (but longer?) restrictions.
> > > > >
> > > > > At some point, I had a dev version of a stripper that removed contents of <script/> and <style/> before trying to detect the encoding [0]... perhaps it is time to resurrect that code and integrate it?
> > > > >
> > > > > Or, given that HTML has been, um, blossoming, perhaps, more simply, we should expand how far we look into a stream for detection?
> > > > >
> > > > > Cheers,
> > > > >
> > > > > Tim
> > > > >
> > > > > [0] https://issues.apache.org/jira/browse/TIKA-2038
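A rough illustration of the stripper idea Tim describes, not the TIKA-2038 code itself: remove <script> and <style> contents from the buffered head of the document before handing the bytes to an encoding detector, so a large inline script cannot push the <meta charset> declaration past the detector's limit. The class and method names below are made up, and the regex approach is deliberately naive.

import java.nio.charset.StandardCharsets;

public class ScriptStrippingSketch {

    // Removes the bodies of <script> and <style> elements from the first bytes of an
    // HTML document. ISO-8859-1 maps every byte to a single char and back, so the
    // remaining bytes (including any multi-byte UTF-8 sequences) pass through untouched.
    public static byte[] stripScriptAndStyle(byte[] head) {
        String s = new String(head, StandardCharsets.ISO_8859_1);
        s = s.replaceAll("(?is)<script\\b[^>]*>.*?</script>", "<script></script>");
        s = s.replaceAll("(?is)<style\\b[^>]*>.*?</style>", "<style></style>");
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }
}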
> > > > > -----Original Message-----
> > > > > From: Markus Jelsma [mailto:[email protected]]
> > > > > Sent: Friday, October 27, 2017 8:39 AM
> > > > > To: [email protected]
> > > > > Subject: Incorrect encoding detected
> > > > >
> > > > > Hello,
> > > > >
> > > > > We have a problem with Tika, encoding and pages on this website: https://www.aarstiderne.com/frugt-groent-og-mere/mixkasser
> > > > >
> > > > > Using Nutch and Tika 1.12, but also using Tika 1.16, we found out that the regular HTML parser does a fine job, but our TikaParser has a tough job dealing with this HTML. For some reason Tika thinks Content-Encoding=windows-1252 is what this webpage says it is, while the page properly identifies itself as UTF-8.
> > > > >
> > > > > Of all websites we index, this is so far the only one giving trouble indexing accents, getting fÃ¥ instead of a regular få.
> > > > >
> > > > > Any tips to spare?
> > > > >
> > > > > Many many thanks!
> > > > > Markus
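For what it's worth, the fÃ¥ symptom is exactly what decoding UTF-8 bytes as windows-1252 produces: "å" is the byte pair 0xC3 0xA5 in UTF-8, and windows-1252 maps those two bytes to "Ã" and "¥". A few lines of Java reproduce the failure mode; this is only an illustration of the symptom, not anything Tika or Nutch does internally.

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeCheck {
    public static void main(String[] args) {
        // "få" encoded as UTF-8 is the bytes 0x66 0xC3 0xA5; decoding those bytes as
        // windows-1252 yields "fÃ¥", the mangled accents described above.
        byte[] utf8Bytes = "få".getBytes(StandardCharsets.UTF_8);
        String misread = new String(utf8Bytes, Charset.forName("windows-1252"));
        System.out.println(misread); // prints fÃ¥
    }
}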
