I'm using Nutch 1.10 and one site (http://www.choice-hifi.com) is failing to parse. Nutch doesn't seem to correctly determine that the pages are of mimetype text/html. It then passes each page to the Tika parser, which also fails and returns it as an octet stream, and no further processing of the page occurs.

Some logs from parsing one particular HTML page on this site:

2016-02-14 21:48:21,571 INFO parse.ParseSegment - ParseSegment: starting at 2016-02-14 21:48:21
2016-02-14 21:48:21,572 INFO parse.ParseSegment - ParseSegment: segment: /home/arthur/nutch/crawl/segments/20160214214815
2016-02-14 21:48:21,797 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-02-14 21:48:22,404 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/octet-stream, but they are not mapped to it in the parse-plugins.xml file
2016-02-14 21:48:23,241 ERROR tika.TikaParser - Can't retrieve Tika parser for mime-type application/octet-stream
2016-02-14 21:48:23,248 WARN parse.ParseSegment - Error parsing: http://www.choice-hifi.com/Hi-Fi-Exchange-FREE-ADS-Used-Second-Hand-HiFi-Equipment/page/34: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
2016-02-14 21:48:23,250 INFO crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
2016-02-14 21:48:23,272 INFO parse.ParseSegment - Parsed (25ms):http://www.choice-hifi.com/Hi-Fi-Exchange-FREE-ADS-Used-Second-Hand-HiFi-Equipment/page/34
2016-02-14 21:48:24,084 INFO parse.ParseSegment - ParseSegment: finished at 2016-02-14 21:48:24, elapsed: 00:00:02


I can't work out what's going wrong here. The page in question is a little old-school HTML-wise, but it does have:

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">

My mimetype/Tika config is pretty stock, with the exception that I have a mimetype-filter to only accept text/html. Any ideas?
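In case it helps narrow this down, here is a minimal Python sketch (not part of Nutch or Tika) for comparing the Content-Type the server sends in its HTTP header with the one the page declares in its META tag. The names MetaContentType, meta_content_type, mime_only and compare are hypothetical helpers made up for this illustration, and decoding the body as iso-8859-1 is an assumption taken from the META tag above.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class MetaContentType(HTMLParser):
    """Records the CONTENT value of the first <meta http-equiv="Content-Type"> tag."""

    def __init__(self):
        super().__init__()
        self.declared = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <META HTTP-EQUIV=...> matches too.
        if tag != "meta" or self.declared is not None:
            return
        attr = dict(attrs)
        if (attr.get("http-equiv") or "").lower() == "content-type":
            self.declared = attr.get("content")


def meta_content_type(html_text):
    """Return the raw Content-Type declared in the page's META tag, or None."""
    parser = MetaContentType()
    parser.feed(html_text)
    return parser.declared


def mime_only(content_type):
    """Drop parameters such as charset and normalise case."""
    if not content_type:
        return None
    return content_type.split(";", 1)[0].strip().lower()


def compare(url):
    """Fetch url (needs network access) and return (header mimetype, META mimetype)."""
    with urlopen(url) as resp:
        header = resp.headers.get("Content-Type")
        # Assumption: the body is iso-8859-1, as the META tag claims.
        body = resp.read().decode("iso-8859-1", errors="replace")
    return mime_only(header), mime_only(meta_content_type(body))
```

Calling compare() with the failing page URL returns the two values side by side. If the header value is missing or is not text/html, that might explain why the type is not being detected from the response, though this is only a guess and says nothing about how Nutch itself does its detection.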


--
Arthur Yarwood
