I'm using Nutch 1.10 and one site (http://www.choice-hifi.com) is
failing to parse. Nutch doesn't seem to correctly detect that the pages
are mimetype text/html. It then passes each page to the Tika parser, which
also fails, treating it as application/octet-stream, and no further
processing of the page occurs.
Some logs from parsing one particular HTML page on this site:
2016-02-14 21:48:21,571 INFO parse.ParseSegment - ParseSegment:
starting at 2016-02-14 21:48:21
2016-02-14 21:48:21,572 INFO parse.ParseSegment - ParseSegment:
segment: /home/arthur/nutch/crawl/segments/20160214214815
2016-02-14 21:48:21,797 WARN util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes
where applicable
2016-02-14 21:48:22,404 INFO parse.ParserFactory - The parsing plugins:
[org.apache.nutch.parse.tika.TikaParser] are enabled via the
plugin.includes system property, and all claim to support the content
type application/octet-stream, but they are not mapped to it in the
parse-plugins.xml file
2016-02-14 21:48:23,241 ERROR tika.TikaParser - Can't retrieve Tika
parser for mime-type application/octet-stream
2016-02-14 21:48:23,248 WARN parse.ParseSegment - Error parsing:
http://www.choice-hifi.com/Hi-Fi-Exchange-FREE-ADS-Used-Second-Hand-HiFi-Equipment/page/34:
failed(2,0): Can't retrieve Tika parser for mime-type
application/octet-stream
2016-02-14 21:48:23,250 INFO crawl.SignatureFactory - Using Signature
impl: org.apache.nutch.crawl.MD5Signature
2016-02-14 21:48:23,272 INFO parse.ParseSegment - Parsed
(25ms):http://www.choice-hifi.com/Hi-Fi-Exchange-FREE-ADS-Used-Second-Hand-HiFi-Equipment/page/34
2016-02-14 21:48:24,084 INFO parse.ParseSegment - ParseSegment:
finished at 2016-02-14 21:48:24, elapsed: 00:00:02
I can't work out what's going wrong here. The page in question is a
little old-school HTML-wise, but it does have:
<META HTTP-EQUIV="Content-Type" CONTENT="text/html;
charset=iso-8859-1">
My mimetype/Tika config is pretty stock, with the exception that I have a
mimetype-filter to only accept text/html. Any ideas?
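
A minimal sketch, in Python, of reading the declared content type out of a META tag like the one quoted above. The class name, function name and sample markup are illustrative assumptions, not anything from Nutch or Tika, and it only shows what the page itself declares; it says nothing about how Nutch or Tika actually detect the type.

```python
from html.parser import HTMLParser


class MetaContentTypeParser(HTMLParser):
    """Records the CONTENT value of the first META HTTP-EQUIV="Content-Type" tag."""

    def __init__(self):
        super().__init__()
        self.declared = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so old-style
        # upper-case markup such as <META HTTP-EQUIV=...> is matched too.
        if tag != "meta" or self.declared is not None:
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("http-equiv", "").lower() == "content-type":
            self.declared = attr_map.get("content", "")


def declared_content_type(html_text):
    """Return (mime_type, charset) as declared in the markup, or (None, None)."""
    parser = MetaContentTypeParser()
    parser.feed(html_text)
    if parser.declared is None:
        return None, None
    # The value looks like "text/html; charset=iso-8859-1".
    parts = [p.strip() for p in parser.declared.split(";")]
    mime = parts[0].lower()
    charset = None
    for part in parts[1:]:
        key, _, value = part.partition("=")
        if key.strip().lower() == "charset":
            charset = value.strip().lower()
    return mime, charset


# Hypothetical sample markup modelled on the META tag quoted above.
sample = (
    '<HTML><HEAD><META HTTP-EQUIV="Content-Type" '
    'CONTENT="text/html; charset=iso-8859-1"></HEAD></HTML>'
)
print(declared_content_type(sample))  # ('text/html', 'iso-8859-1')
```

Running this against a saved copy of the failing page would confirm whether the declaration is readable at all, which helps separate a markup problem from a detection problem.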
--
Arthur Yarwood