Hi Kiran,

If you apply the patch to your 2.x branch, make sure that 'ant runtime' is
executed. Please also make sure that the Tika 1.1 dependency does _not_ exist
in your runtime/lib directory, as it may conflict with the expected results.
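In case it helps, here is a minimal sketch of that check as a shell function
(the function name and the runtime/lib layout are my assumptions; adjust the
path for your checkout):

```shell
# Sketch, not Nutch tooling: flag a runtime lib directory that contains
# more than one tika-core jar, since two different versions on the
# classpath can shadow each other. The directory layout is an assumption.
check_tika_jars() {
    lib_dir="$1"
    count=$(ls "$lib_dir" 2>/dev/null | grep -c 'tika-core-.*\.jar')
    if [ "$count" -gt 1 ]; then
        echo "CONFLICT: $count tika-core jars in $lib_dir"
    else
        echo "OK: $count tika-core jar(s) in $lib_dir"
    fi
}

# e.g. check_tika_jars runtime/local/lib
```

If it reports a conflict, delete the old tika-core-1.1 jar and re-run
'ant runtime' so only one Tika version is deployed.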
If you could update the ticket it would be excellent.

Thanks,
Lewis

On Tue, Oct 16, 2012 at 8:04 AM, Markus Jelsma <[email protected]> wrote:
> No, it doesn't work because of the old PDFBox version you are using.
> You need Tika 1.2 or higher.
>
> -----Original message-----
>> From: kiran chitturi <[email protected]>
>> Sent: Tue 16-Oct-2012 01:32
>> To: [email protected]
>> Subject: Re: nutch - Status: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content
>>
>> When I tried the command 'sh bin/nutch parsechecker
>> http://scholar.lib.vt.edu/ejournals/JTE/v23n2/pdf/katsioloudis.pdf',
>> the log (hadoop.log) says:
>>
>>> parse.ParserFactory - The parsing plugins:
>>> [org.apache.nutch.parse.tika.TikaParser] are enabled via the
>>> plugin.includes system property, and all claim to support the content
>>> type application/pdf, but they are not mapped to it in the
>>> parse-plugins.xml file
>>> 2012-10-15 19:04:23,733 WARN pdfparser.PDFParser - Parsing Error, Skipping Object
>>> java.io.IOException: expected='endstream' actual=''
>>> org.apache.pdfbox.io.PushBackInputStream@215983b7
>>>         at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:530)
>>>         at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:566)
>>>         at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:187)
>>>         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1090)
>>>         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1055)
>>>         at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:123)
>>>         at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:96)
>>>         at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
>>>         at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
>>>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>         at java.lang.Thread.run(Thread.java:680)
>>> 2012-10-15 19:04:23,734 WARN pdfparser.XrefTrailerResolver - Did not found XRef object at specified startxref position 0
>>> 2012-10-15 19:04:23,933 INFO crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
>>> 2012-10-15 19:04:23,944 INFO parse.ParserChecker - parsing: http://scholar.lib.vt.edu/ejournals/JTE/v23n2/pdf/katsioloudis.pdf
>>
>> Does this have anything to do with the content limit, or is it some
>> other kind of error?
>>
>> Thanks for the help.
>>
>> Regards,
>> Kiran.
>>
>> On Mon, Oct 15, 2012 at 5:20 PM, Markus Jelsma <[email protected]> wrote:
>>> Hi,
>>>
>>> It complains about not finding a Tika parser for the content type; did
>>> you modify parse-plugins.xml? I can run it with a vanilla 1.4 but it
>>> fails because of PDFBox. I can parse it successfully with trunk. 1.5 is
>>> not going to work, not because it cannot find the TikaParser for PDFs
>>> but because PDFBox cannot handle it.
>>>
>>> Cheers,
>>>
>>> -----Original message-----
>>>> From: kiran chitturi <[email protected]>
>>>> Sent: Mon 15-Oct-2012 21:58
>>>> To: [email protected]
>>>> Subject: nutch - Status: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content
>>>>
>>>> Hi,
>>>>
>>>> I am trying to parse pdf files using nutch and it is failing every time
>>>> with the status 'Status: failed(2,200): org.apache.nutch.parse.ParseException:
>>>> Unable to successfully parse content' in both the nutch 1.5 and 2.x series
>>>> when I run the command 'sh bin/nutch parsechecker
>>>> http://scholar.lib.vt.edu/ejournals/JTE/v23n2/pdf/katsioloudis.pdf'.
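[Editor's note: the "not mapped to it in the parse-plugins.xml file" warning
quoted above usually points at a missing mimeType entry in that file. A sketch
of such a mapping follows; the 'parse-tika' alias is an assumption and must
match an <alias> definition elsewhere in the same parse-plugins.xml:]

```xml
<!-- Sketch of a parse-plugins.xml entry mapping PDFs to the Tika parser.
     The alias "parse-tika" must match an <alias> entry in the same file,
     e.g. one whose extension-id is org.apache.nutch.parse.tika.TikaParser. -->
<mimeType name="application/pdf">
  <plugin id="parse-tika" />
</mimeType>
```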
>>>>
>>>> The hadoop.log looks like this:
>>>>
>>>>> 2012-10-15 15:43:32,323 INFO http.Http - http.proxy.host = null
>>>>> 2012-10-15 15:43:32,323 INFO http.Http - http.proxy.port = 8080
>>>>> 2012-10-15 15:43:32,323 INFO http.Http - http.timeout = 10000
>>>>> 2012-10-15 15:43:32,323 INFO http.Http - http.content.limit = -1
>>>>> 2012-10-15 15:43:32,323 INFO http.Http - http.agent = My Nutch Spider/Nutch-2.2-SNAPSHOT
>>>>> 2012-10-15 15:43:32,323 INFO http.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
>>>>> 2012-10-15 15:43:32,323 INFO http.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
>>>>> 2012-10-15 15:43:36,851 INFO parse.ParserChecker - parsing: http://scholar.lib.vt.edu/ejournals/JTE/v23n2/pdf/katsioloudis.pdf
>>>>> 2012-10-15 15:43:36,851 INFO parse.ParserChecker - contentType: application/pdf
>>>>> 2012-10-15 15:43:36,858 INFO crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
>>>>> 2012-10-15 15:43:36,904 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/pdf, but they are not mapped to it in the parse-plugins.xml file
>>>>> 2012-10-15 15:43:36,967 ERROR tika.TikaParser - Can't retrieve Tika parser for mime-type application/pdf
>>>>> 2012-10-15 15:43:36,969 WARN parse.ParseUtil - Unable to successfully parse content http://scholar.lib.vt.edu/ejournals/JTE/v23n2/pdf/katsioloudis.pdf of type application/pdf
>>>>
>>>> The config file nutch-site.xml is as below:
>>>>
>>>>> <?xml version="1.0"?>
>>>>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>>>>> <!-- Put site-specific property overrides in this file. -->
>>>>> <configuration>
>>>>>   <property>
>>>>>     <name>http.agent.name</name>
>>>>>     <value>My Nutch Spider</value>
>>>>>   </property>
>>>>>
>>>>>   <property>
>>>>>     <name>plugin.folders</name>
>>>>>     <value>/Users/kiranch/Documents/workspace/nutch-2.x/runtime/local/plugins</value>
>>>>>   </property>
>>>>>
>>>>>   <property>
>>>>>     <name>plugin.includes</name>
>>>>>     <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>>>>>   </property>
>>>>>
>>>>>   <!-- Used only if plugin parse-metatags is enabled. -->
>>>>>   <property>
>>>>>     <name>metatags.names</name>
>>>>>     <value>*</value>
>>>>>     <description>Names of the metatags to extract, separated by ';'.
>>>>>     Use '*' to extract all metatags. Prefixes the names with 'metatag.'
>>>>>     in the parse-metadata. For instance, to index description and
>>>>>     keywords, you need to activate the plugin index-metadata and set the
>>>>>     value of the parameter 'index.parse.md' to
>>>>>     'metatag.description;metatag.keywords'.
>>>>>     </description>
>>>>>   </property>
>>>>>
>>>>>   <property>
>>>>>     <name>index.parse.md</name>
>>>>>     <value>dc.creator,dc.bibliographiccitation,dcterms.issued,content-type,dcterms.bibliographiccitation,dc.format,dc.type,dc.language,dc.contributor,originalcharencoding,dc.publisher,dc.title,charencodingforconversion</value>
>>>>>     <description>Comma-separated list of keys to be taken from the parse
>>>>>     metadata to generate fields. Can be used e.g. for 'description' or
>>>>>     'keywords', provided that these values are generated by a parser
>>>>>     (see the parse-metatags plugin).
>>>>>     </description>
>>>>>   </property>
>>>>>
>>>>>   <property>
>>>>>     <name>http.content.limit</name>
>>>>>     <value>-1</value>
>>>>>   </property>
>>>>> </configuration>
>>>>
>>>> Are there any configuration settings that I need for working with pdf
>>>> files? I have parsed and crawled them before, but I am not sure what is
>>>> causing the error now.
>>>>
>>>> Can someone please point out the cause of the errors above?
>>>>
>>>> Many Thanks,
>>>> --
>>>> Kiran Chitturi
>>
>> --
>> Kiran Chitturi

--
Lewis
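[Editor's note: the thread above actually contains two distinct failures —
a configuration problem ("Can't retrieve Tika parser") and a parser bug
(PDFBox's "expected='endstream'"). A small sketch to tell them apart when
reading hadoop.log; the match strings are taken from the logs quoted above,
and the function name and log path are assumptions:]

```shell
# Sketch: classify the two failure modes discussed in this thread by
# grepping hadoop.log for their distinguishing signatures.
classify_parse_failure() {
    log="$1"
    if grep -q "Can't retrieve Tika parser" "$log"; then
        # TikaParser could not be resolved for the content type at all
        echo "config: content type not mapped to a parser in parse-plugins.xml"
    elif grep -q "expected='endstream'" "$log"; then
        # The parser ran but PDFBox failed on the document itself
        echo "parser: PDFBox failed on the PDF; try Tika 1.2 or higher"
    else
        echo "unknown: no known signature found"
    fi
}

# e.g. classify_parse_failure logs/hadoop.log
```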

