Can Nutch not supply Tika with the value of the HTTP Content-Type response header, e.g. "text/html; charset=utf-8"?
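A minimal sketch (not Nutch or Tika code) of the idea: a crawler could parse the charset parameter out of the Content-Type response header and pass it along as a decoding hint, falling back to content-based detection only when the header carries no charset.

```python
# Sketch: extract the charset parameter from an HTTP Content-Type header
# value, as a hint to pass to the content parser. The function name is
# illustrative, not part of any Nutch/Tika API.
from email.message import Message

def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()  # None if no charset parameter

print(charset_from_content_type("text/html; charset=utf-8"))  # utf-8
print(charset_from_content_type("text/html"))                 # None
```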
On 1 November 2017 at 00:49, Allison, Timothy B. <[email protected]> wrote:

> Conal,
>
> I think I largely agree with this, but the problem with the example file
> that Markus shared with us is that the page doesn't get around to
> claiming UTF-8 until 32,000 characters in. 😊
>
> Longer term, I want to incorporate my script/style stripper, but I don't
> want to do that without some serious testing on our regression corpus, or
> perhaps an expanded corpus that focuses on a greater diversity of
> languages/encodings, along the lines of where we are/were headed on
> TIKA-2038.
>
> *From:* Conal Tuohy [mailto:[email protected]]
> *Sent:* Tuesday, October 31, 2017 9:55 AM
> *To:* [email protected]
> *Subject:* Re: Incorrect encoding detected
>
> If the parser found some non-UTF data in the first 8k bytes, then it would
> make sense to guess a different encoding; but I think that if the page
> claims to be UTF-8, and the parser can find nothing in the first 8k bytes
> that contradicts that, it should assume that it really is in UTF-8?
>
> On 31 October 2017 at 22:11, Allison, Timothy B. <[email protected]> wrote:
>
> For 1.17, the simplest solution, I think, is to allow users to configure
> extending the detection limit via our @Field config methods, that is, via
> tika-config.xml.
>
> To confirm, Nutch will allow users to specify a tika-config file? Will
> this work for you and Nutch?
>
> -----Original Message-----
> From: Markus Jelsma [mailto:[email protected]]
> Sent: Tuesday, October 31, 2017 5:47 AM
> To: [email protected]
> Subject: RE: Incorrect encoding detected
>
> Hello Timothy - what would be your preferred solution? Increase the
> detection limit, or skip inline styles and possibly other useless head
> information?
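For reference, a tika-config.xml along the lines Tim describes might look like the sketch below. The `markLimit` parameter name and the detector class names are assumptions here, illustrating the @Field-style configuration rather than a confirmed 1.17 API; check the resolution of TIKA-2485 for the actual parameter.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <encodingDetectors>
    <encodingDetector class="org.apache.tika.parser.html.HtmlEncodingDetector">
      <params>
        <!-- look further than the default 8,192 bytes for a meta charset -->
        <param name="markLimit" type="int">65536</param>
      </params>
    </encodingDetector>
    <encodingDetector class="org.apache.tika.parser.txt.UniversalEncodingDetector"/>
    <encodingDetector class="org.apache.tika.parser.txt.Icu4jEncodingDetector"/>
  </encodingDetectors>
</properties>
```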
> Thanks,
> Markus
>
> -----Original message-----
> > From: Markus Jelsma <[email protected]>
> > Sent: Friday 27th October 2017 15:37
> > To: [email protected]
> > Subject: RE: Incorrect encoding detected
> >
> > Hi Tim,
> >
> > I have opened TIKA-2485 to track the problem.
> >
> > Thank you very very much!
> > Markus
> >
> > -----Original message-----
> > > From: Allison, Timothy B. <[email protected]>
> > > Sent: Friday 27th October 2017 15:33
> > > To: [email protected]
> > > Subject: RE: Incorrect encoding detected
> > >
> > > Unfortunately, there is no way to do this now. _I think_ we could make
> > > this configurable, though, fairly easily. Please open a ticket.
> > >
> > > The next RC for PDFBox might be out next week, and we'll try to
> > > release Tika 1.17 shortly after that...so there should be time to get
> > > this in.
> > >
> > > -----Original Message-----
> > > From: Markus Jelsma [mailto:[email protected]]
> > > Sent: Friday, October 27, 2017 9:12 AM
> > > To: [email protected]
> > > Subject: RE: Incorrect encoding detected
> > >
> > > Hello Tim,
> > >
> > > Getting rid of script and style contents sounds plausible indeed. But
> > > to work around the problem for now, can I instruct HTMLEncodingDetector
> > > from within Nutch to look beyond the limit?
> > >
> > > Thanks!
> > > Markus
> > >
> > > -----Original message-----
> > > > From: Allison, Timothy B. <[email protected]>
> > > > Sent: Friday 27th October 2017 14:53
> > > > To: [email protected]
> > > > Subject: RE: Incorrect encoding detected
> > > >
> > > > Hi Markus,
> > > >
> > > > My guess is that the ~32,000 characters of mostly ASCII-ish
> > > > <script/> are what is actually being used for encoding detection.
> > > > The HTMLEncodingDetector only looks in the first 8,192 characters,
> > > > and the other encoding detectors have similar (but longer?)
> > > > restrictions.
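A toy illustration (not Tika's implementation) of the failure mode Tim describes: when the meta charset declaration sits after ~32,000 characters of <script>, a detector that only inspects a fixed-size prefix of the stream never sees it.

```python
# Simulate a detector with a fixed read window: scan only the first
# mark_limit bytes for a <meta charset> declaration.
import re

META_RE = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)

def sniff_charset(html_bytes, mark_limit=8192):
    """Return the declared charset found within the first mark_limit bytes."""
    m = META_RE.search(html_bytes[:mark_limit])
    return m.group(1).decode("ascii").lower() if m else None

# ~32 KB of script before the charset declaration, like the problem page.
page = (b"<html><head><script>" + b"var x=1;" * 4000 +
        b'</script><meta charset="utf-8"></head><body>f\xc3\xa5</body></html>')

print(sniff_charset(page))                    # None: declaration lies past 8,192 bytes
print(sniff_charset(page, mark_limit=65536))  # utf-8: a larger window finds it
```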
> > > > At some point, I had a dev version of a stripper that removed the
> > > > contents of <script/> and <style/> before trying to detect the
> > > > encoding [0]...perhaps it is time to resurrect that code and
> > > > integrate it?
> > > >
> > > > Or, given that HTML has been, um, blossoming, perhaps, more simply,
> > > > we should expand how far we look into a stream for detection?
> > > >
> > > > Cheers,
> > > >
> > > > Tim
> > > >
> > > > [0] https://issues.apache.org/jira/browse/TIKA-2038
> > > >
> > > > -----Original Message-----
> > > > From: Markus Jelsma [mailto:[email protected]]
> > > > Sent: Friday, October 27, 2017 8:39 AM
> > > > To: [email protected]
> > > > Subject: Incorrect encoding detected
> > > >
> > > > Hello,
> > > >
> > > > We have a problem with Tika, encoding, and pages on this website:
> > > > https://www.aarstiderne.com/frugt-groent-og-mere/mixkasser
> > > >
> > > > Using Nutch with Tika 1.12, but also with Tika 1.16, we found that
> > > > the regular HTML parser does a fine job, but our TikaParser has a
> > > > tough time dealing with this HTML. For some reason Tika decides
> > > > this webpage is Content-Encoding=windows-1252, whereas the page
> > > > properly identifies itself as UTF-8.
> > > >
> > > > Of all the websites we index, this is so far the only one giving us
> > > > trouble indexing accents: we get fÃ¥ instead of the regular få.
> > > >
> > > > Any tips to share?
> > > >
> > > > Many many thanks!
> > > > Markus

--
Conal Tuohy
http://conaltuohy.com/
@conal_tuohy
+61-466-324297
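Two quick checks tying the thread together. First, the symptom Markus reports is classic UTF-8-read-as-windows-1252 mojibake. Second, a rough sketch of the TIKA-2038 idea: stripping <script>/<style> bodies before detection pulls the meta declaration back inside the detection window. The regex here is a crude stand-in for whatever Tim's stripper actually does.

```python
import re

# 1) The symptom: the UTF-8 bytes for "få" decoded as windows-1252
#    come out as "fÃ¥", exactly as reported.
utf8_bytes = "få".encode("utf-8")          # b'f\xc3\xa5'
print(utf8_bytes.decode("windows-1252"))   # fÃ¥

# 2) The stripper idea, sketched: remove <script>/<style> contents so the
#    charset declaration falls within the first 8,192 bytes.
SCRIPT_STYLE_RE = re.compile(rb"<(script|style)[^>]*>.*?</\1>",
                             re.IGNORECASE | re.DOTALL)

def strip_script_style(html_bytes):
    return SCRIPT_STYLE_RE.sub(b"", html_bytes)

page = (b"<head><script>" + b"var x=1;" * 4000 +  # ~32 KB of script
        b'</script><meta charset="utf-8"></head>')
stripped = strip_script_style(page)
print(b"charset" in page[:8192])      # False: buried past the window
print(b"charset" in stripped[:8192])  # True: now near the start
```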
