RE: Incorrect encoding detected

Allison, Timothy B. Tue, 31 Oct 2017 05:12:02 -0700

For 1.17, the simplest solution, I think, is to let users extend the 
detection limit through our @Field config methods, that is, via 
tika-config.xml.


To confirm, Nutch will allow users to specify a tika-config file?  Will this 
work for you and Nutch?
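
To illustrate why extending the detection limit helps, here is a minimal Python sketch. It is not Tika code: `sniff_meta_charset`, `mark_limit` and the synthetic page are hypothetical names made up for illustration, and the sketch only assumes that a detector which reads a fixed-size prefix cannot see a charset declaration that sits beyond that prefix.

```python
import re

# Hypothetical helper, not part of Tika: look for a <meta ... charset=...>
# declaration, but only within the first `mark_limit` bytes of the stream.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)

def sniff_meta_charset(raw: bytes, mark_limit: int = 8192):
    """Return the declared charset found in the prefix, or None."""
    match = META_CHARSET.search(raw[:mark_limit])
    return match.group(1).decode("ascii").lower() if match else None

# Synthetic page (an assumption for the demo): about 32,000 bytes of ASCII
# inline script push the <meta> tag well past the first 8,192 bytes.
page = (
    b"<html><head><script>" + b"var x = 0;" * 3200 + b"</script>"
    + b'<meta charset="utf-8"></head><body>f\xc3\xa5</body></html>'
)

print(sniff_meta_charset(page))                    # None
print(sniff_meta_charset(page, mark_limit=65536))  # utf-8
```

With the default limit the declaration is never seen, so a real detector would have to fall back to guessing from mostly-ASCII bytes; with a larger limit the declared UTF-8 is found.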

-----Original Message-----
From: Markus Jelsma [mailto:[email protected]] 
Sent: Tuesday, October 31, 2017 5:47 AM
To: [email protected]
Subject: RE: Incorrect encoding detected

Hello Timothy - what would be your preferred solution? Increase detection limit 
or skip inline styles and possibly other useless head information?

Thanks,
Markus

 
 
-----Original message-----
> From:Markus Jelsma <[email protected]>
> Sent: Friday 27th October 2017 15:37
> To: [email protected]
> Subject: RE: Incorrect encoding detected
> 
> Hi Tim,
> 
> I have opened TIKA-2485 to track the problem. 
> 
> Thank you very very much!
> Markus
> 
>  
>  
> -----Original message-----
> > From:Allison, Timothy B. <[email protected]>
> > Sent: Friday 27th October 2017 15:33
> > To: [email protected]
> > Subject: RE: Incorrect encoding detected
> > 
> > Unfortunately there is no way to do this now.  _I think_ we could make this 
> > configurable, though, fairly easily.  Please open a ticket.
> > 
> > The next RC for PDFBox might be out next week, and we'll try to release 
> > Tika 1.17 shortly after that...so there should be time to get this in.
> > 
> > -----Original Message-----
> > From: Markus Jelsma [mailto:[email protected]] 
> > Sent: Friday, October 27, 2017 9:12 AM
> > To: [email protected]
> > Subject: RE: Incorrect encoding detected
> > 
> > Hello Tim,
> > 
> > Getting rid of script and style contents sounds plausible indeed. But to 
> > > work around the problem for now, can I instruct HTMLEncodingDetector from 
> > within Nutch to look beyond the limit?
> > 
> > Thanks!
> > Markus
> > 
> >  
> >  
> > -----Original message-----
> > > From:Allison, Timothy B. <[email protected]>
> > > Sent: Friday 27th October 2017 14:53
> > > To: [email protected]
> > > Subject: RE: Incorrect encoding detected
> > > 
> > > Hi Markus,
> > >   
> > > My guess is that the ~32,000 characters of mostly ascii-ish <script/> are 
> > > what is actually being used for encoding detection.  The 
> > > HTMLEncodingDetector only looks in the first 8,192 characters, and the 
> > > other encoding detectors have similar (but longer?) restrictions.
> > >  
> > > At some point, I had a dev version of a stripper that removed contents of 
> > > <script/> and <style/> before trying to detect the encoding[0]...perhaps 
> > > it is time to resurrect that code and integrate it?
> > > 
> > > Or, given that HTML has been, um, blossoming, perhaps, more simply, we 
> > > should expand how far we look into a stream for detection?
> > > 
> > > Cheers,
> > > 
> > >                Tim
> > > 
> > > [0] https://issues.apache.org/jira/browse/TIKA-2038
> > >    
> > > 
> > > -----Original Message-----
> > > From: Markus Jelsma [mailto:[email protected]] 
> > > Sent: Friday, October 27, 2017 8:39 AM
> > > To: [email protected]
> > > Subject: Incorrect encoding detected
> > > 
> > > Hello,
> > > 
> > > We have a problem with Tika, encoding and pages on this website: 
> > > https://www.aarstiderne.com/frugt-groent-og-mere/mixkasser
> > > 
> > > > Using Nutch with Tika 1.12, and also with Tika 1.16, we found that 
> > > > the regular HTML parser does a fine job, but our TikaParser has a tough 
> > > > time dealing with this HTML. For some reason Tika reports 
> > > > Content-Encoding=windows-1252 for this page, even though 
> > > > the page correctly identifies itself as UTF-8.
> > > 
> > > Of all websites we index, this is so far the only one giving trouble 
> > > indexing accents, getting fÃ¥ instead of a regular få.
> > > 
> > > Any tips to spare? 
> > > 
> > > Many many thanks!
> > > Markus
> > > 
> > 
>
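
The garbled accents in the original report are consistent with UTF-8 bytes being decoded as windows-1252. A minimal Python sketch of that effect follows; it uses only the standard codecs, and the variable names are chosen for illustration.

```python
# "f" followed by U+00E5 (a with ring above), as it should appear on the page.
text = "f\u00e5"

# The page serves this word as UTF-8: two bytes, 0xC3 0xA5, for the accent.
utf8_bytes = text.encode("utf-8")

# Decoding those bytes with the wrong charset turns each byte into its own
# character: 0xC3 becomes U+00C3 and 0xA5 becomes U+00A5.
garbled = utf8_bytes.decode("windows-1252")

# As long as nothing was lost, reversing the wrong decode recovers the text.
repaired = garbled.encode("windows-1252").decode("utf-8")

print(utf8_bytes)
print(garbled)
print(repaired)
```

The round trip at the end only works while the mis-decoded text is still intact; once it has been indexed or normalized, fixing the detection step is the more reliable route.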
