RE: nutch - Status: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content

Markus Jelsma Mon, 15 Oct 2012 14:16:31 -0700

Hi,

It complains about not finding a Tika parser for the content type. Did you 
modify parse-plugins.xml? I can run it with a vanilla 1.4, but it fails because 
of PDFBox. I can parse it successfully with trunk. 1.5 is not going to work, 
not because it cannot find the TikaParser for PDFs but because PDFBox cannot 
handle this file.
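
For reference, a mapping in conf/parse-plugins.xml that routes PDFs to Tika 
could look roughly like the sketch below. It assumes the stock layout of that 
file, with parse-tika aliased to the TikaParser class named in your log. 
Compare it against your own copy rather than pasting it in as-is:

```xml
<parse-plugins>
  <!-- Assumption: map application/pdf explicitly to the parse-tika plugin.
       A stock file may instead rely on a catch-all mimeType name="*" entry. -->
  <mimeType name="application/pdf">
    <plugin id="parse-tika" />
  </mimeType>

  <!-- The alias ties the plugin id to the parser implementation class. -->
  <aliases>
    <alias name="parse-tika"
           extension-id="org.apache.nutch.parse.tika.TikaParser" />
  </aliases>
</parse-plugins>
```

Note that such a mapping only addresses the "not mapped to it" warning. It 
will not help if PDFBox itself cannot handle the file.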


Cheers,
 
 
-----Original message-----
> From:kiran chitturi <[email protected]>
> Sent: Mon 15-Oct-2012 21:58
> To: [email protected]
> Subject: nutch - Status: failed(2,200): 
> org.apache.nutch.parse.ParseException: Unable to successfully parse content
> 
> Hi,
> 
> I am trying to parse PDF files using Nutch and it is failing every time with
> the status 'Status: failed(2,200): org.apache.nutch.parse.ParseException:
> Unable to successfully parse content' in both Nutch 1.5 and the 2.x series when
> I run the command 'sh bin/nutch parsechecker
> http://scholar.lib.vt.edu/ejournals/JTE/v23n2/pdf/katsioloudis.pdf'.
> 
> The hadoop.log looks like this
> 
> >
> > 2012-10-15 15:43:32,323 INFO  http.Http - http.proxy.host = null
> > 2012-10-15 15:43:32,323 INFO  http.Http - http.proxy.port = 8080
> > 2012-10-15 15:43:32,323 INFO  http.Http - http.timeout = 10000
> > 2012-10-15 15:43:32,323 INFO  http.Http - http.content.limit = -1
> > 2012-10-15 15:43:32,323 INFO  http.Http - http.agent = My Nutch
> > Spider/Nutch-2.2-SNAPSHOT
> > 2012-10-15 15:43:32,323 INFO  http.Http - http.accept.language =
> > en-us,en-gb,en;q=0.7,*;q=0.3
> > 2012-10-15 15:43:32,323 INFO  http.Http - http.accept =
> > text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> > 2012-10-15 15:43:36,851 INFO  parse.ParserChecker - parsing:
> > http://scholar.lib.vt.edu/ejournals/JTE/v23n2/pdf/katsioloudis.pdf
> > 2012-10-15 15:43:36,851 INFO  parse.ParserChecker - contentType:
> > application/pdf
> > 2012-10-15 15:43:36,858 INFO  crawl.SignatureFactory - Using Signature
> > impl: org.apache.nutch.crawl.MD5Signature
> > 2012-10-15 15:43:36,904 INFO  parse.ParserFactory - The parsing plugins:
> > [org.apache.nutch.parse.tika.TikaParser] are enabled via the
> > plugin.includes system property, and all claim to support the content type
> > application/pdf, but they are not mapped to it  in the parse-plugins.xml
> > file
> > 2012-10-15 15:43:36,967 ERROR tika.TikaParser - Can't retrieve Tika parser
> > for mime-type application/pdf
> > 2012-10-15 15:43:36,969 WARN  parse.ParseUtil - Unable to successfully
> > parse content
> > http://scholar.lib.vt.edu/ejournals/JTE/v23n2/pdf/katsioloudis.pdf of
> > type application/pdf
> 
> 
> The config file nutch-site.xml is as below:
> 
> > <?xml version="1.0"?>
> > <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> > <!-- Put site-specific property overrides in this file. -->
> > <configuration>
> > <property>
> >  <name>http.agent.name</name>
> >  <value>My Nutch Spider</value>
> > </property>
> >
> > <property>
> > <name>plugin.folders</name>
> > <value>/Users/kiranch/Documents/workspace/nutch-2.x/runtime/local/plugins
> > </value>
> > </property>
> >
> > <property>
> > <name>plugin.includes</name>
> > <value>
> > protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|scoring-opic|urlnormalizer-(pass|regex|basic)
> > </value>
> > </property>
> > <!-- Used only if plugin parse-metatags is enabled. -->
> > <property>
> > <name>metatags.names</name>
> > <value>*</value>
> > <description> Names of the metatags to extract, separated by ';'.
> >   Use '*' to extract all metatags. Prefixes the names with 'metatag.'
> >   in the parse-metadata. For instance to index description and keywords,
> >   you need to activate the plugin index-metadata and set the value of the
> >   parameter 'index.parse.md' to 'metatag.description;metatag.keywords'.
> > </description>
> > </property>
> > <property>
> >   <name>index.parse.md</name>
> >   <value>
> > dc.creator,dc.bibliographiccitation,dcterms.issued,content-type,dcterms.bibliographiccitation,dc.format,dc.type,dc.language,dc.contributor,originalcharencoding,dc.publisher,dc.title,charencodingforconversion
> > </value>
> >   <description>
> >   Comma-separated list of keys to be taken from the parse metadata to
> > generate fields.
> >   Can be used e.g. for 'description' or 'keywords' provided that these
> > values are generated
> >   by a parser (see parse-metatags plugin)
> >   </description>
> > </property>
> > <property>
> > <name>http.content.limit</name>
> > <value>-1</value>
> > </property>
> > </configuration>
> >
> Are there any configuration settings that I need to change to work with PDF
> files? I have parsed and crawled them before, but I am not sure what is
> causing the error now.
> 
> Can someone please point out the cause of the errors above?
> 
> Many Thanks,
> -- 
> Kiran Chitturi
>
