Hi,

It complains about not finding a Tika parser for the content type. Did you modify parse-plugins.xml? I can run it with a vanilla 1.4, but there it fails because of PDFBox. I can parse it successfully with trunk. 1.5 is not going to work, not because it cannot find the TikaParser for PDFs but because PDFBox cannot handle this document.
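For comparison, here is a rough sketch of how a mapping for application/pdf usually looks in conf/parse-plugins.xml. The element layout is written from memory of the stock file, so treat the exact ids as assumptions and check them against the copy that ships with your Nutch version rather than pasting this in as-is.

```xml
<!-- Sketch only: compare with the stock conf/parse-plugins.xml before use. -->
<parse-plugins>

  <!-- Route PDF content to the Tika parser plugin. -->
  <mimeType name="application/pdf">
    <plugin id="parse-tika" />
  </mimeType>

  <!-- Catch-all for content types that have no explicit mapping. -->
  <mimeType name="*">
    <plugin id="parse-tika" />
  </mimeType>

  <!-- Each plugin id used above needs an alias that points at the
       extension implementing the parser. -->
  <aliases>
    <alias name="parse-tika"
           extension-id="org.apache.nutch.parse.tika.TikaParser" />
  </aliases>

</parse-plugins>
```

If your file has lost the mapping or the alias for parse-tika, that would be consistent with the "not mapped to it in the parse-plugins.xml file" message in your log.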
Cheers,

-----Original message-----
> From:kiran chitturi <[email protected]>
> Sent: Mon 15-Oct-2012 21:58
> To: [email protected]
> Subject: nutch - Status: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content
>
> Hi,
>
> I am trying to parse pdf files using nutch and its failing everytime with
> the status 'Status: failed(2,200): org.apache.nutch.parse.ParseException:
> Unable to successfully parse content' in both nutch 1.5 and 2.x series when
> i do the command 'sh bin/nutch parsechecker
> http://scholar.lib.vt.edu/ejournals/JTE/v23n2/pdf/katsioloudis.pdf'.
>
> The hadoop.log looks like this
>
> > 2012-10-15 15:43:32,323 INFO http.Http - http.proxy.host = null
> > 2012-10-15 15:43:32,323 INFO http.Http - http.proxy.port = 8080
> > 2012-10-15 15:43:32,323 INFO http.Http - http.timeout = 10000
> > 2012-10-15 15:43:32,323 INFO http.Http - http.content.limit = -1
> > 2012-10-15 15:43:32,323 INFO http.Http - http.agent = My Nutch Spider/Nutch-2.2-SNAPSHOT
> > 2012-10-15 15:43:32,323 INFO http.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
> > 2012-10-15 15:43:32,323 INFO http.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> > 2012-10-15 15:43:36,851 INFO parse.ParserChecker - parsing: http://scholar.lib.vt.edu/ejournals/JTE/v23n2/pdf/katsioloudis.pdf
> > 2012-10-15 15:43:36,851 INFO parse.ParserChecker - contentType: application/pdf
> > 2012-10-15 15:43:36,858 INFO crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
> > 2012-10-15 15:43:36,904 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/pdf, but they are not mapped to it in the parse-plugins.xml file
> > 2012-10-15 15:43:36,967 ERROR tika.TikaParser - Can't retrieve Tika parser for mime-type application/pdf
> > 2012-10-15 15:43:36,969 WARN parse.ParseUtil - Unable to successfully parse content http://scholar.lib.vt.edu/ejournals/JTE/v23n2/pdf/katsioloudis.pdf of type application/pdf
> >
> The config file nutch-site.xml is as below:
>
> > <?xml version="1.0"?>
> > <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> > <!-- Put site-specific property overrides in this file. -->
> > <configuration>
> >   <property>
> >     <name>http.agent.name</name>
> >     <value>My Nutch Spider</value>
> >   </property>
> >
> >   <property>
> >     <name>plugin.folders</name>
> >     <value>/Users/kiranch/Documents/workspace/nutch-2.x/runtime/local/plugins
> >     </value>
> >   </property>
> >
> >   <property>
> >     <name>plugin.includes</name>
> >     <value>
> >       protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|scoring-opic|urlnormalizer-(pass|regex|basic)
> >     </value>
> >   </property>
> >   <!-- Used only if plugin parse-metatags is enabled. -->
> >   <property>
> >     <name>metatags.names</name>
> >     <value>*</value>
> >     <description> Names of the metatags to extract, separated by;.
> >     Use '*' to extract all metatags. Prefixes the names with 'metatag.'
> >     in the parse-metadata. For instance to index description and keywords,
> >     you need to activate the plugin index-metadata and set the value of the
> >     parameter 'index.parse.md' to 'metatag.description;metatag.keywords'.
> >     </description>
> >   </property>
> >   <property>
> >     <name>index.parse.md</name>
> >     <value>
> >       dc.creator,dc.bibliographiccitation,dcterms.issued,content-type,dcterms.bibliographiccitation,dc.format,dc.type,dc.language,dc.contributor,originalcharencoding,dc.publisher,dc.title,charencodingforconversion
> >     </value>
> >     <description>
> >     Comma-separated list of keys to be taken from the parse metadata to
> >     generate fields.
> >     Can be used e.g. for 'description' or 'keywords' provided that these
> >     values are generated
> >     by a parser (see parse-metatags plugin)
> >     </description>
> >   </property>
> >   <property>
> >     <name>http.content.limit</name>
> >     <value>-1</value>
> >   </property>
> > </configuration>
> >
>
> Are there any configuration settings that i need to do to work with pdf
> files ? I have parsed them before and crawled but i am not sure which is
> causing the error now.
>
> Can someone please point the cause of the errors above ?
>
> Many Thanks,
> --
> Kiran Chitturi
>

