Re: nutch - Status: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content

Lewis John Mcgibbney Tue, 16 Oct 2012 03:12:34 -0700

Hi Kiran,

If you apply the patch to your 2.x branch, then make sure that 'ant
runtime' is executed afterwards. Please also make sure that the tika 1.1
dependency _does_not_ exist in your runtime /lib directory, as it may
conflict with the patched build and give unexpected results.
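
A minimal sketch of one way to check for a leftover Tika 1.1 jar after
rebuilding. The directory runtime/local/lib, the NUTCH_LIB_DIR variable, the
find_tika_jars helper and the tika-*-<version>.jar naming are all
assumptions for illustration, so adjust them to match your checkout:

```shell
#!/bin/sh
# Hypothetical helper: list Tika jars of a given version in a lib directory.
# Usage: find_tika_jars LIB_DIR VERSION
find_tika_jars() {
    lib_dir=$1
    version=$2
    # Assumes jars are named like tika-core-1.1.jar or tika-parsers-1.1.jar.
    find "$lib_dir" -maxdepth 1 -type f -name "tika-*-${version}.jar" 2>/dev/null
}

# Assumed location of the local runtime lib directory after 'ant runtime'.
LIB_DIR=${NUTCH_LIB_DIR:-runtime/local/lib}

stale=$(find_tika_jars "$LIB_DIR" 1.1)
if [ -n "$stale" ]; then
    echo "Tika 1.1 jars still present in $LIB_DIR:"
    echo "$stale"
else
    echo "No Tika 1.1 jars found in $LIB_DIR"
fi
```

If anything is listed, remove those jars and re-run parsechecker.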


If you could update the ticket it would be excellent.

Thanks

Lewis

On Tue, Oct 16, 2012 at 8:04 AM, Markus Jelsma
<[email protected]> wrote:
> No, it doesn't work because of the old PDFBox version you are using. You need 
> Tika 1.2 or higher.
>
>
>
> -----Original message-----
>> From:kiran chitturi <[email protected]>
>> Sent: Tue 16-Oct-2012 01:32
>> To: [email protected]
>> Subject: Re: nutch - Status: failed(2,200): 
>> org.apache.nutch.parse.ParseException: Unable to successfully parse content
>>
>> When I tried the command 'sh bin/nutch parsechecker
>> http://scholar.lib.vt.edu/ejournals/JTE/v23n2/pdf/katsioloudis.pdf', the
>> log (hadoop.log) says:
>>
>> parse.ParserFactory - The parsing plugins:
>> > [org.apache.nutch.parse.tika.TikaParser] are enabled via the
>> > plugin.includes system property, and all claim to support the content type
>> > application/pdf, but they are not mapped to it  in the parse-plugins.xml
>> > file
>> > 2012-10-15 19:04:23,733 WARN  pdfparser.PDFParser - Parsing Error,
>> > Skipping Object
>> > java.io.IOException: expected='endstream' actual=''
>> > org.apache.pdfbox.io.PushBackInputStream@215983b7
>> >         at
>> > org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:530)
>> >         at
>> > org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:566)
>> >         at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:187)
>> >         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1090)
>> >         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1055)
>> >         at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:123)
>> >         at
>> > org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:96)
>> >         at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
>> >         at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
>> >         at
>> > java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>> >         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>> >         at
>> > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>> >         at
>> > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>> >         at java.lang.Thread.run(Thread.java:680)
>> > 2012-10-15 19:04:23,734 WARN  pdfparser.XrefTrailerResolver - Did not
>> > found XRef object at specified startxref position 0
>> > 2012-10-15 19:04:23,933 INFO  crawl.SignatureFactory - Using Signature
>> > impl: org.apache.nutch.crawl.MD5Signature
>> > 2012-10-15 19:04:23,944 INFO  parse.ParserChecker - parsing:
>> > http://scholar.lib.vt.edu/ejournals/JTE/v23n2/pdf/katsioloudis.pdf
>>
>>
>> Does this have anything to do with the content limit, or is this another kind
>> of error?
>>
>> Thanks for the help.
>>
>> Regards,
>> Kiran.
>>
>> On Mon, Oct 15, 2012 at 5:20 PM, Markus Jelsma
>> <[email protected]> wrote:
>>
>> > Hi,
>> >
>> > It complains about not finding a Tika parser for the content type. Did you
>> > modify parse-plugins.xml? I can run it with a vanilla 1.4, but it fails
>> > because of PDFBox. I can parse it successfully with trunk. 1.5 is not going
>> > to work, not because it cannot find the TikaParser for PDFs but because
>> > PDFBox cannot handle it.
>> >
>> > Cheers,
>> >
>> >
>> > -----Original message-----
>> > > From:kiran chitturi <[email protected]>
>> > > Sent: Mon 15-Oct-2012 21:58
>> > > To: [email protected]
>> > > Subject: nutch - Status: failed(2,200):
>> > org.apache.nutch.parse.ParseException: Unable to successfully parse content
>> > >
>> > > Hi,
>> > >
>> > > I am trying to parse PDF files using Nutch and it's failing every time with
>> > > the status 'Status: failed(2,200): org.apache.nutch.parse.ParseException:
>> > > Unable to successfully parse content' in both Nutch 1.5 and the 2.x series
>> > > when I run the command 'sh bin/nutch parsechecker
>> > > http://scholar.lib.vt.edu/ejournals/JTE/v23n2/pdf/katsioloudis.pdf'.
>> > >
>> > > The hadoop.log looks like this
>> > >
>> > > >
>> > > > 2012-10-15 15:43:32,323 INFO  http.Http - http.proxy.host = null
>> > > > 2012-10-15 15:43:32,323 INFO  http.Http - http.proxy.port = 8080
>> > > > 2012-10-15 15:43:32,323 INFO  http.Http - http.timeout = 10000
>> > > > 2012-10-15 15:43:32,323 INFO  http.Http - http.content.limit = -1
>> > > > 2012-10-15 15:43:32,323 INFO  http.Http - http.agent = My Nutch
>> > > > Spider/Nutch-2.2-SNAPSHOT
>> > > > 2012-10-15 15:43:32,323 INFO  http.Http - http.accept.language =
>> > > > en-us,en-gb,en;q=0.7,*;q=0.3
>> > > > 2012-10-15 15:43:32,323 INFO  http.Http - http.accept =
>> > > > text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
>> > > > 2012-10-15 15:43:36,851 INFO  parse.ParserChecker - parsing:
>> > > > http://scholar.lib.vt.edu/ejournals/JTE/v23n2/pdf/katsioloudis.pdf
>> > > > 2012-10-15 15:43:36,851 INFO  parse.ParserChecker - contentType:
>> > > > application/pdf
>> > > > 2012-10-15 15:43:36,858 INFO  crawl.SignatureFactory - Using Signature
>> > > > impl: org.apache.nutch.crawl.MD5Signature
>> > > > 2012-10-15 15:43:36,904 INFO  parse.ParserFactory - The parsing
>> > plugins:
>> > > > [org.apache.nutch.parse.tika.TikaParser] are enabled via the
>> > > > plugin.includes system property, and all claim to support the content
>> > type
>> > > > application/pdf, but they are not mapped to it  in the
>> > parse-plugins.xml
>> > > > file
>> > > > 2012-10-15 15:43:36,967 ERROR tika.TikaParser - Can't retrieve Tika
>> > parser
>> > > > for mime-type application/pdf
>> > > > 2012-10-15 15:43:36,969 WARN  parse.ParseUtil - Unable to successfully
>> > > > parse content
>> > > > http://scholar.lib.vt.edu/ejournals/JTE/v23n2/pdf/katsioloudis.pdf of
>> > > > type application/pdf
>> > >
>> > >
>> > > The config file nutch-site.xml is as below:
>> > >
>> > >  <?xml version="1.0"?>
>> > > >
>> > > > <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>> > > > <!-- Put site-specific property overrides in this file. -->
>> > > > <configuration>
>> > > > <property>
>> > > >  <name>http.agent.name</name>
>> > > >  <value>My Nutch Spider</value>
>> > > > </property>
>> > > >
>> > > > <property>
>> > > > <name>plugin.folders</name>
>> > > >
>> > <value>/Users/kiranch/Documents/workspace/nutch-2.x/runtime/local/plugins
>> > > > </value>
>> > > > </property>
>> > > >
>> > > > <property>
>> > > > <name>plugin.includes</name>
>> > > > <value>
>> > > >
>> > protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|scoring-opic|urlnormalizer-(pass|regex|basic)
>> > > > </value>
>> > > > </property>
>> > > > <!-- Used only if plugin parse-metatags is enabled. -->
>> > > > <property>
>> > > > <name>metatags.names</name>
>> > > > <value>*</value>
>> > > > <description> Names of the metatags to extract, separated by;.
>> > > >   Use '*' to extract all metatags. Prefixes the names with 'metatag.'
>> > > >   in the parse-metadata. For instance to index description and
>> > keywords,
>> > > >   you need to activate the plugin index-metadata and set the value of
>> > the
>> > > >   parameter 'index.parse.md' to
>> > 'metatag.description;metatag.keywords'.
>> > > > </description>
>> > > > </property>
>> > > > <property>
>> > > >   <name>index.parse.md</name>
>> > > >   <value>
>> > > >
>> > dc.creator,dc.bibliographiccitation,dcterms.issued,content-type,dcterms.bibliographiccitation,dc.format,dc.type,dc.language,dc.contributor,originalcharencoding,dc.publisher,dc.title,charencodingforconversion
>> > > > </value>
>> > > >   <description>
>> > > >   Comma-separated list of keys to be taken from the parse metadata to
>> > > > generate fields.
>> > > >   Can be used e.g. for 'description' or 'keywords' provided that these
>> > > > values are generated
>> > > >   by a parser (see parse-metatags plugin)
>> > > >   </description>
>> > > > </property>
>> > > > <property>
>> > > > <name>http.content.limit</name>
>> > > > <value>-1</value>
>> > > > </property>
>> > > > </configuration>
>> > > >
>> > > Are there any configuration settings that I need to change to work with PDF
>> > > files? I have parsed and crawled them before, but I am not sure what is
>> > > causing the error now.
>> > >
>> > > Can someone please point out the cause of the errors above?
>> > >
>> > > Many Thanks,
>> > > --
>> > > Kiran Chitturi
>> > >
>> >
>>
>>
>>
>> --
>> Kiran Chitturi
>>



-- 
Lewis
