Hello Tim,

Thanks! I will try the nightly build tomorrow!
Nutch probably already has support for tika-config. I couldn't find it in the config, but in the code I spotted support for tika.config.file.

Many, many thanks!
Markus

-----Original message-----
> From: Allison, Timothy B. <[email protected]>
> Sent: Thursday 2nd November 2017 14:56
> To: [email protected]
> Subject: RE: Incorrect encoding detected
>
> Hi Markus,
> I just committed TIKA-2485. See the issue for the commit, if you have to make these changes on your local Tika build.
>
> Looks like tika-config.xml was not added here: https://issues.apache.org/jira/browse/NUTCH-577. I wonder if it was added to Nutch later. If it wasn't, I'd highly encourage re-opening this issue and adding it back in!
>
> To build an AutoDetectParser from a tika-config.xml file, do something like this (but with correct exception handling/closing!!!):
>
> TikaConfig tikaConfig = new TikaConfig(
>     getResourceAsStream("/org/apache/tika/config/TIKA-2485-encoding-detector-mark-limits.xml"));
>
> AutoDetectParser p = new AutoDetectParser(tikaConfig);
>
> Note that the order of the encoding detectors matters! The first one that returns a non-null result is the one that Tika uses. The default encoding detector order is as I specified it in "TIKA-2485-encoding-detector-mark-limits.xml": HTML, Universal, ICU4j. The default order is specified via SPI in tika-parsers/src/main/resources/META-INF/services/o.a.t.detect.EncodingDetector.
>
> Let us know if there's anything else we can do.
>
> Best,
>
> Tim
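A slightly fuller sketch of the snippet above, for reference: the stream is closed via try-with-resources, and the resource path is the TIKA-2485 test config Tim mentions; in a real deployment you would point it at your own tika-config.xml. The class name ConfiguredParserExample is made up for the example, and the exception handling here is illustrative, not the only correct way.

import java.io.InputStream;

import org.apache.tika.config.TikaConfig;
import org.apache.tika.parser.AutoDetectParser;

public class ConfiguredParserExample {

    // Builds an AutoDetectParser from a tika-config.xml on the classpath.
    // The config controls, among other things, which EncodingDetectors run
    // and in what order; the first detector returning non-null wins.
    public static AutoDetectParser buildParser() throws Exception {
        try (InputStream is = ConfiguredParserExample.class.getResourceAsStream(
                "/org/apache/tika/config/TIKA-2485-encoding-detector-mark-limits.xml")) {
            TikaConfig tikaConfig = new TikaConfig(is);
            return new AutoDetectParser(tikaConfig);
        }
    }
}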
> -----Original Message-----
> From: Markus Jelsma [mailto:[email protected]]
> Sent: Wednesday, November 1, 2017 5:32 PM
> To: [email protected]
> Subject: RE: Incorrect encoding detected
>
> Alright, the Nutch list could not provide an answer and I don't know myself. But, if Nutch can't, we can make it happen. Can you direct me to a page that explains how tika-config has to be passed to Tika? We have full control over what we put into the Parser, e.g. ContentHandler, Context, etc.
>
> If we can do that, I just need to know what to set to increase the limit. I am unaware of Tika having @Field config methods; it's new to me.
>
> But you said it was not supported yet, so that would mean the content limit would not be configurable via @Field config?
>
> That is fine too, but I really need a short-term solution. If needed, I can manually patch Tika and have our (I am not speaking as a Nutch committer right now) parser use an in-house compiled Tika.
>
> I checked the encoding package in tika-core. There are many detection classes there, but I really have no idea which detector Nutch (or Tika by default) uses under the hood. I could not easily find the file in which I could increase the limit.
>
> I am happy with this hack for a brief time, until it is supported by a new Tika version. Can you direct me to the class I should modify?
>
> Many, many thanks!
> Markus
>
> -----Original message-----
> > From: Allison, Timothy B. <[email protected]>
> > Sent: Tuesday 31st October 2017 13:11
> > To: [email protected]
> > Subject: RE: Incorrect encoding detected
> >
> > For 1.17, the simplest solution, I think, is to allow users to configure extending the detection limit via our @Field config methods, that is, via tika-config.xml.
> >
> > To confirm, Nutch will allow users to specify a tika-config file? Will this work for you and Nutch?
> >
> > -----Original Message-----
> > From: Markus Jelsma [mailto:[email protected]]
> > Sent: Tuesday, October 31, 2017 5:47 AM
> > To: [email protected]
> > Subject: RE: Incorrect encoding detected
> >
> > Hello Timothy - what would be your preferred solution? Increase the detection limit, or skip inline styles and possibly other useless head information?
> >
> > Thanks,
> > Markus
> >
> > -----Original message-----
> > > From: Markus Jelsma <[email protected]>
> > > Sent: Friday 27th October 2017 15:37
> > > To: [email protected]
> > > Subject: RE: Incorrect encoding detected
> > >
> > > Hi Tim,
> > >
> > > I have opened TIKA-2485 to track the problem.
> > >
> > > Thank you very very much!
> > > Markus
> > >
> > > -----Original message-----
> > > > From: Allison, Timothy B. <[email protected]>
> > > > Sent: Friday 27th October 2017 15:33
> > > > To: [email protected]
> > > > Subject: RE: Incorrect encoding detected
> > > >
> > > > Unfortunately there is no way to do this now. _I think_ we could make this configurable, though, fairly easily. Please open a ticket.
> > > >
> > > > The next RC for PDFBox might be out next week, and we'll try to release Tika 1.17 shortly after that... so there should be time to get this in.
> > > >
> > > > -----Original Message-----
> > > > From: Markus Jelsma [mailto:[email protected]]
> > > > Sent: Friday, October 27, 2017 9:12 AM
> > > > To: [email protected]
> > > > Subject: RE: Incorrect encoding detected
> > > >
> > > > Hello Tim,
> > > >
> > > > Getting rid of script and style contents sounds plausible indeed. But to work around the problem for now, can I instruct HTMLEncodingDetector from within Nutch to look beyond the limit?
> > > >
> > > > Thanks!
> > > > Markus
> > > >
> > > > -----Original message-----
> > > > > From: Allison, Timothy B. <[email protected]>
> > > > > Sent: Friday 27th October 2017 14:53
> > > > > To: [email protected]
> > > > > Subject: RE: Incorrect encoding detected
> > > > >
> > > > > Hi Markus,
> > > > >
> > > > > My guess is that the ~32,000 characters of mostly ascii-ish <script/> are what is actually being used for encoding detection. The HTMLEncodingDetector only looks in the first 8,192 characters, and the other encoding detectors have similar (but longer?) restrictions.
> > > > >
> > > > > At some point, I had a dev version of a stripper that removed contents of <script/> and <style/> before trying to detect the encoding [0]... perhaps it is time to resurrect that code and integrate it?
> > > > >
> > > > > Or, given that HTML has been, um, blossoming, perhaps, more simply, we should expand how far we look into a stream for detection?
> > > > >
> > > > > Cheers,
> > > > >
> > > > > Tim
> > > > >
> > > > > [0] https://issues.apache.org/jira/browse/TIKA-2038
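A rough illustration of the stripper idea Tim describes, not the TIKA-2038 code itself: remove <script> and <style> contents from the buffered head of the document before handing the bytes to an encoding detector, so a large inline script cannot push the <meta charset> declaration past the detector's limit. The class and method names below are made up, and the regex approach is deliberately naive.

import java.nio.charset.StandardCharsets;

public class ScriptStrippingSketch {

    // Removes the bodies of <script> and <style> elements from the first bytes of an
    // HTML document. ISO-8859-1 maps every byte to a single char and back, so the
    // remaining bytes (including any multi-byte UTF-8 sequences) pass through untouched.
    public static byte[] stripScriptAndStyle(byte[] head) {
        String s = new String(head, StandardCharsets.ISO_8859_1);
        s = s.replaceAll("(?is)<script\\b[^>]*>.*?</script>", "<script></script>");
        s = s.replaceAll("(?is)<style\\b[^>]*>.*?</style>", "<style></style>");
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }
}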
> > > > > -----Original Message-----
> > > > > From: Markus Jelsma [mailto:[email protected]]
> > > > > Sent: Friday, October 27, 2017 8:39 AM
> > > > > To: [email protected]
> > > > > Subject: Incorrect encoding detected
> > > > >
> > > > > Hello,
> > > > >
> > > > > We have a problem with Tika, encoding and pages on this website: https://www.aarstiderne.com/frugt-groent-og-mere/mixkasser
> > > > >
> > > > > Using Nutch and Tika 1.12, but also using Tika 1.16, we found out that the regular HTML parser does a fine job, but our TikaParser has a tough job dealing with this HTML. For some reason Tika thinks Content-Encoding=windows-1252 is what this webpage says it is, while the page properly identifies itself as UTF-8.
> > > > >
> > > > > Of all websites we index, this is so far the only one giving trouble indexing accents, getting fÃ¥ instead of a regular få.
> > > > >
> > > > > Any tips to spare?
> > > > >
> > > > > Many many thanks!
> > > > > Markus
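For what it's worth, the fÃ¥ symptom is exactly what decoding UTF-8 bytes as windows-1252 produces: "å" is the byte pair 0xC3 0xA5 in UTF-8, and windows-1252 maps those two bytes to "Ã" and "¥". A few lines of Java reproduce the failure mode; this is only an illustration of the symptom, not anything Tika or Nutch does internally.

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeCheck {
    public static void main(String[] args) {
        // "få" encoded as UTF-8 is the bytes 0x66 0xC3 0xA5; decoding those bytes as
        // windows-1252 yields "fÃ¥", the mangled accents described above.
        byte[] utf8Bytes = "få".getBytes(StandardCharsets.UTF_8);
        String misread = new String(utf8Bytes, Charset.forName("windows-1252"));
        System.out.println(misread); // prints fÃ¥
    }
}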
