Can Nutch not supply Tika with the value of the HTTP Content-Type response header, e.g. "text/html; charset=utf-8"?
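A minimal sketch (not Nutch or Tika code) of the idea: a crawler could parse the charset parameter out of the Content-Type response header and pass it along as a decoding hint, falling back to content-based detection only when the header carries no charset.

```python
# Sketch: extract the charset parameter from an HTTP Content-Type header
# value, as a hint to pass to the content parser. The function name is
# illustrative, not part of any Nutch/Tika API.
from email.message import Message

def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()  # None if no charset parameter

print(charset_from_content_type("text/html; charset=utf-8"))  # utf-8
print(charset_from_content_type("text/html"))                 # None
```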
On 1 November 2017 at 00:49, Allison, Timothy B. <[email protected]> wrote:

> Conal,
>
> I think I largely agree with this, but the problem with the example file
> that Markus shared with us is that the page doesn't get around to
> claiming UTF-8 until 32,000 characters in. 😊
>
> Longer term, I want to incorporate my script/style stripper, but I don't
> want to do that without some serious testing on our regression corpus, or
> perhaps an expanded corpus that focuses on a greater diversity of
> languages/encodings, along the lines of where we are/were headed on
> TIKA-2038.
>
> *From:* Conal Tuohy [mailto:[email protected]]
> *Sent:* Tuesday, October 31, 2017 9:55 AM
> *To:* [email protected]
> *Subject:* Re: Incorrect encoding detected
>
> If the parser found some non-UTF data in the first 8k bytes, then it would
> make sense to guess a different encoding; but I think that if the page
> claims to be UTF-8, and the parser can find nothing in the first 8k bytes
> that contradicts that, it should assume that it really is in UTF-8?
>
> On 31 October 2017 at 22:11, Allison, Timothy B. <[email protected]> wrote:
>
> For 1.17, the simplest solution, I think, is to allow users to configure
> extending the detection limit via our @Field config methods, that is, via
> tika-config.xml.
>
> To confirm, Nutch will allow users to specify a tika-config file? Will
> this work for you and Nutch?
>
> -----Original Message-----
> From: Markus Jelsma [mailto:[email protected]]
> Sent: Tuesday, October 31, 2017 5:47 AM
> To: [email protected]
> Subject: RE: Incorrect encoding detected
>
> Hello Timothy - what would be your preferred solution? Increase the
> detection limit, or skip inline styles and possibly other useless head
> information?
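For reference, a tika-config.xml along the lines Tim describes might look like the sketch below. The `markLimit` parameter name and the detector class names are assumptions here, illustrating the @Field-style configuration rather than a confirmed 1.17 API; check the resolution of TIKA-2485 for the actual parameter.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <encodingDetectors>
    <encodingDetector class="org.apache.tika.parser.html.HtmlEncodingDetector">
      <params>
        <!-- look further than the default 8,192 bytes for a meta charset -->
        <param name="markLimit" type="int">65536</param>
      </params>
    </encodingDetector>
    <encodingDetector class="org.apache.tika.parser.txt.UniversalEncodingDetector"/>
    <encodingDetector class="org.apache.tika.parser.txt.Icu4jEncodingDetector"/>
  </encodingDetectors>
</properties>
```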
> Thanks,
> Markus
>
> -----Original message-----
> > From: Markus Jelsma <[email protected]>
> > Sent: Friday 27th October 2017 15:37
> > To: [email protected]
> > Subject: RE: Incorrect encoding detected
> >
> > Hi Tim,
> >
> > I have opened TIKA-2485 to track the problem.
> >
> > Thank you very very much!
> > Markus
> >
> > -----Original message-----
> > > From: Allison, Timothy B. <[email protected]>
> > > Sent: Friday 27th October 2017 15:33
> > > To: [email protected]
> > > Subject: RE: Incorrect encoding detected
> > >
> > > Unfortunately, there is no way to do this now. _I think_ we could make
> > > this configurable, though, fairly easily. Please open a ticket.
> > >
> > > The next RC for PDFBox might be out next week, and we'll try to
> > > release Tika 1.17 shortly after that...so there should be time to get
> > > this in.
> > >
> > > -----Original Message-----
> > > From: Markus Jelsma [mailto:[email protected]]
> > > Sent: Friday, October 27, 2017 9:12 AM
> > > To: [email protected]
> > > Subject: RE: Incorrect encoding detected
> > >
> > > Hello Tim,
> > >
> > > Getting rid of script and style contents sounds plausible indeed. But
> > > to work around the problem for now, can I instruct HTMLEncodingDetector
> > > from within Nutch to look beyond the limit?
> > >
> > > Thanks!
> > > Markus
> > >
> > > -----Original message-----
> > > > From: Allison, Timothy B. <[email protected]>
> > > > Sent: Friday 27th October 2017 14:53
> > > > To: [email protected]
> > > > Subject: RE: Incorrect encoding detected
> > > >
> > > > Hi Markus,
> > > >
> > > > My guess is that the ~32,000 characters of mostly ASCII-ish
> > > > <script/> are what is actually being used for encoding detection.
> > > > The HTMLEncodingDetector only looks in the first 8,192 characters,
> > > > and the other encoding detectors have similar (but longer?)
> > > > restrictions.
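A toy illustration (not Tika's implementation) of the failure mode Tim describes: when the meta charset declaration sits after ~32,000 characters of <script>, a detector that only inspects a fixed-size prefix of the stream never sees it.

```python
# Simulate a detector with a fixed read window: scan only the first
# mark_limit bytes for a <meta charset> declaration.
import re

META_RE = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)

def sniff_charset(html_bytes, mark_limit=8192):
    """Return the declared charset found within the first mark_limit bytes."""
    m = META_RE.search(html_bytes[:mark_limit])
    return m.group(1).decode("ascii").lower() if m else None

# ~32 KB of script before the charset declaration, like the problem page.
page = (b"<html><head><script>" + b"var x=1;" * 4000 +
        b'</script><meta charset="utf-8"></head><body>f\xc3\xa5</body></html>')

print(sniff_charset(page))                    # None: declaration lies past 8,192 bytes
print(sniff_charset(page, mark_limit=65536))  # utf-8: a larger window finds it
```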
> > > > At some point, I had a dev version of a stripper that removed the
> > > > contents of <script/> and <style/> before trying to detect the
> > > > encoding [0]...perhaps it is time to resurrect that code and
> > > > integrate it?
> > > >
> > > > Or, given that HTML has been, um, blossoming, perhaps, more simply,
> > > > we should expand how far we look into a stream for detection?
> > > >
> > > > Cheers,
> > > >
> > > > Tim
> > > >
> > > > [0] https://issues.apache.org/jira/browse/TIKA-2038
> > > >
> > > > -----Original Message-----
> > > > From: Markus Jelsma [mailto:[email protected]]
> > > > Sent: Friday, October 27, 2017 8:39 AM
> > > > To: [email protected]
> > > > Subject: Incorrect encoding detected
> > > >
> > > > Hello,
> > > >
> > > > We have a problem with Tika, encoding, and pages on this website:
> > > > https://www.aarstiderne.com/frugt-groent-og-mere/mixkasser
> > > >
> > > > Using Nutch with Tika 1.12, but also with Tika 1.16, we found that
> > > > the regular HTML parser does a fine job, but our TikaParser has a
> > > > tough time dealing with this HTML. For some reason Tika decides
> > > > this webpage is Content-Encoding=windows-1252, whereas the page
> > > > properly identifies itself as UTF-8.
> > > >
> > > > Of all the websites we index, this is so far the only one giving us
> > > > trouble indexing accents: we get fÃ¥ instead of the regular få.
> > > >
> > > > Any tips to share?
> > > >
> > > > Many many thanks!
> > > > Markus

--
Conal Tuohy
http://conaltuohy.com/
@conal_tuohy
+61-466-324297
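Two quick checks tying the thread together. First, the symptom Markus reports is classic UTF-8-read-as-windows-1252 mojibake. Second, a rough sketch of the TIKA-2038 idea: stripping <script>/<style> bodies before detection pulls the meta declaration back inside the detection window. The regex here is a crude stand-in for whatever Tim's stripper actually does.

```python
import re

# 1) The symptom: the UTF-8 bytes for "få" decoded as windows-1252
#    come out as "fÃ¥", exactly as reported.
utf8_bytes = "få".encode("utf-8")          # b'f\xc3\xa5'
print(utf8_bytes.decode("windows-1252"))   # fÃ¥

# 2) The stripper idea, sketched: remove <script>/<style> contents so the
#    charset declaration falls within the first 8,192 bytes.
SCRIPT_STYLE_RE = re.compile(rb"<(script|style)[^>]*>.*?</\1>",
                             re.IGNORECASE | re.DOTALL)

def strip_script_style(html_bytes):
    return SCRIPT_STYLE_RE.sub(b"", html_bytes)

page = (b"<head><script>" + b"var x=1;" * 4000 +  # ~32 KB of script
        b'</script><meta charset="utf-8"></head>')
stripped = strip_script_style(page)
print(b"charset" in page[:8192])      # False: buried past the window
print(b"charset" in stripped[:8192])  # True: now near the start
```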
