RE: Incorrect encoding detected

Markus Jelsma Thu, 02 Nov 2017 06:58:45 -0700

Hello Sebastian,

I just spotted tika.config.file in the TikaParser, so that's how we can 
specify a particular Tika config file.


Meanwhile Timothy Allison committed a fix. I will try the nightly build 
tomorrow.

Thanks,
Markus 
 
-----Original message-----
> From:Sebastian Nagel <[email protected]>
> Sent: Thursday 2nd November 2017 13:32
> To: [email protected]
> Subject: Re: Incorrect encoding detected
> 
> I haven't had the time to dig into the problem:
> neither how to pass a tika-config file, nor why
> parse-html actually detects the encoding correctly
> although it also only looks at the first 8192
> characters (see CHUNK_SIZE).
> 
> Just one point: for the MIME detection we also
> pass the Content-Type sent by the web server to Tika.
> Could it also help to pass it as an additional clue?
> In the concrete example the server sends
>   Content-Type: text/html; charset=utf-8
> 
> Sebastian
> 
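A minimal sketch of the idea raised above: read the charset from the server's Content-Type header and use it as a hint before falling back to content sniffing. The function names and the priority order (header first, then sniffed value, then a default) are assumptions for illustration only, not how Nutch or Tika actually combine these signals.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


def pick_charset(header_value, sniffed=None, default="windows-1252"):
    """Hypothetical priority order: HTTP header hint, then sniffed value, then default."""
    return charset_from_content_type(header_value) or sniffed or default


header_hint = charset_from_content_type("text/html; charset=utf-8")
no_hint = charset_from_content_type("text/html")
chosen = pick_charset("text/html", sniffed=None)
print(header_hint, no_hint, chosen)
```

With the header from the example in this thread, the hint alone would already point at UTF-8, regardless of how far into the document the detector looks.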
> On 11/01/2017 07:06 PM, Markus Jelsma wrote:
> > Any ideas?
> > 
> > Thanks!
> > 
> >  
> >  
> > -----Original message-----
> >> From:Markus Jelsma <[email protected]>
> >> Sent: Tuesday 31st October 2017 13:14
> >> To: User <[email protected]>
> >> Subject: FW: Incorrect encoding detected
> >>
> >> I actually don't know, can we specify a tika-config file in Nutch?
> >>
> >> Thanks,
> >> Markus
> >>  
> >> -----Original message-----
> >>> From:Allison, Timothy B. <[email protected]>
> >>> Sent: Tuesday 31st October 2017 13:11
> >>> To: [email protected]
> >>> Subject: RE: Incorrect encoding detected
> >>>
> >>> For 1.17, the simplest solution, I think, is to allow users to configure 
> >>> extending the detection limit via our @Field config methods, that is, via 
> >>> tika-config.xml.
> >>>
> >>> To confirm, Nutch will allow users to specify a tika-config file?  Will 
> >>> this work for you and Nutch?
> >>>
> >>> -----Original Message-----
> >>> From: Markus Jelsma [mailto:[email protected]] 
> >>> Sent: Tuesday, October 31, 2017 5:47 AM
> >>> To: [email protected]
> >>> Subject: RE: Incorrect encoding detected
> >>>
> >>> Hello Timothy - what would be your preferred solution? Increase detection 
> >>> limit or skip inline styles and possibly other useless head information?
> >>>
> >>> Thanks,
> >>> Markus
> >>>
> >>>  
> >>>  
> >>> -----Original message-----
> >>>> From:Markus Jelsma <[email protected]>
> >>>> Sent: Friday 27th October 2017 15:37
> >>>> To: [email protected]
> >>>> Subject: RE: Incorrect encoding detected
> >>>>
> >>>> Hi Tim,
> >>>>
> >>>> I have opened TIKA-2485 to track the problem. 
> >>>>
> >>>> Thank you very very much!
> >>>> Markus
> >>>>
> >>>>  
> >>>>  
> >>>> -----Original message-----
> >>>>> From:Allison, Timothy B. <[email protected]>
> >>>>> Sent: Friday 27th October 2017 15:33
> >>>>> To: [email protected]
> >>>>> Subject: RE: Incorrect encoding detected
> >>>>>
> >>>>> Unfortunately there is no way to do this now.  _I think_ we could make 
> >>>>> this configurable, though, fairly easily.  Please open a ticket.
> >>>>>
> >>>>> The next RC for PDFBox might be out next week, and we'll try to release 
> >>>>> Tika 1.17 shortly after that...so there should be time to get this in.
> >>>>>
> >>>>> -----Original Message-----
> >>>>> From: Markus Jelsma [mailto:[email protected]] 
> >>>>> Sent: Friday, October 27, 2017 9:12 AM
> >>>>> To: [email protected]
> >>>>> Subject: RE: Incorrect encoding detected
> >>>>>
> >>>>> Hello Tim,
> >>>>>
> >>>>> Getting rid of script and style contents sounds plausible indeed. But 
> >>>>> to work around the problem for now, can I instruct HTMLEncodingDetector 
> >>>>> from within Nutch to look beyond the limit?
> >>>>>
> >>>>> Thanks!
> >>>>> Markus
> >>>>>
> >>>>>  
> >>>>>  
> >>>>> -----Original message-----
> >>>>>> From:Allison, Timothy B. <[email protected]>
> >>>>>> Sent: Friday 27th October 2017 14:53
> >>>>>> To: [email protected]
> >>>>>> Subject: RE: Incorrect encoding detected
> >>>>>>
> >>>>>> Hi Markus,
> >>>>>>   
> >>>>>> My guess is that the ~32,000 characters of mostly ascii-ish <script/> 
> >>>>>> are what is actually being used for encoding detection.  The 
> >>>>>> HTMLEncodingDetector only looks in the first 8,192 characters, and the 
> >>>>>> other encoding detectors have similar (but longer?) restrictions.
> >>>>>>  
> >>>>>> At some point, I had a dev version of a stripper that removed contents 
> >>>>>> of <script/> and <style/> before trying to detect the 
> >>>>>> encoding[0]...perhaps it is time to resurrect that code and integrate 
> >>>>>> it?
> >>>>>>
> >>>>>> Or, given that HTML has been, um, blossoming, perhaps, more simply, we 
> >>>>>> should expand how far we look into a stream for detection?
> >>>>>>
> >>>>>> Cheers,
> >>>>>>
> >>>>>>                Tim
> >>>>>>
> >>>>>> [0] https://issues.apache.org/jira/browse/TIKA-2038
> >>>>>>    
> >>>>>>
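The two options discussed above (look further into the stream, or strip <script/> and <style/> first) can be sketched as follows. This is a simplified illustration, not Tika's HTMLEncodingDetector: the regular expressions, the function name sniff_meta_charset and the sample page are all hypothetical, and only the 8,192 limit comes from the thread.

```python
import re

# Hypothetical, simplified patterns; real HTML charset sniffing is more involved.
META_CHARSET = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)", re.IGNORECASE
)
SCRIPT_STYLE = re.compile(
    rb"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)


def sniff_meta_charset(html, limit=8192, strip_blocks=False):
    """Look for a <meta> charset declaration within the first `limit` bytes."""
    data = SCRIPT_STYLE.sub(b"", html) if strip_blocks else html
    match = META_CHARSET.search(data[:limit])
    return match.group(1).decode("ascii").lower() if match else None


# A page whose <meta charset> sits behind roughly 40,000 bytes of inline script.
page = (
    b"<html><head><script>" + b"var x = 1;" * 4000 + b"</script>"
    b'<meta charset="utf-8"></head><body>f\xc3\xa5</body></html>'
)

default_window = sniff_meta_charset(page)               # declaration is past the limit
wider_window = sniff_meta_charset(page, limit=65536)    # option 1: look further
stripped = sniff_meta_charset(page, strip_blocks=True)  # option 2: strip script/style
print(default_window, wider_window, stripped)
```

Note that this sketch strips the whole document before applying the limit, which is simple but not cheap; a streaming stripper would avoid reading everything up front.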
> >>>>>> -----Original Message-----
> >>>>>> From: Markus Jelsma [mailto:[email protected]] 
> >>>>>> Sent: Friday, October 27, 2017 8:39 AM
> >>>>>> To: [email protected]
> >>>>>> Subject: Incorrect encoding detected
> >>>>>>
> >>>>>> Hello,
> >>>>>>
> >>>>>> We have a problem with Tika, encoding and pages on this website: 
> >>>>>> https://www.aarstiderne.com/frugt-groent-og-mere/mixkasser
> >>>>>>
> >>>>>> Using Nutch with Tika 1.12, and also with Tika 1.16, we found that 
> >>>>>> the regular HTML parser does a fine job, but our TikaParser has a 
> >>>>>> tough job dealing with this HTML. For some reason Tika reports 
> >>>>>> Content-Encoding=windows-1252 for this webpage, even though 
> >>>>>> the page identifies itself properly as UTF-8.
> >>>>>>
> >>>>>> Of all websites we index, this is so far the only one giving trouble 
> >>>>>> indexing accents, getting fÃ¥ instead of a regular få.
> >>>>>>
> >>>>>> Any tips to spare? 
> >>>>>>
> >>>>>> Many many thanks!
> >>>>>> Markus
> >>>>>>
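The symptom described above is what happens when UTF-8 bytes are decoded as windows-1252. A minimal sketch reproducing it with Python's standard codecs (cp1252 is Python's name for windows-1252); the variable names are illustrative only:

```python
text = "f\u00e5"                        # the word "få"
utf8_bytes = text.encode("utf-8")       # "å" becomes the two bytes 0xC3 0xA5

misread = utf8_bytes.decode("cp1252")   # each byte is read as its own character
correct = utf8_bytes.decode("utf-8")    # the declared charset gives back "få"

# The damage is reversible as long as no bytes were dropped along the way.
repaired = misread.encode("cp1252").decode("utf-8")

print(misread, correct, repaired)
```

Here 0xC3 maps to "Ã" and 0xA5 to "¥" in windows-1252, which matches the garbled form reported in the message.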
> >>>>>
> >>>>
> >>>
> >>
> 
>
