Hi Elisabeth,

Can you please check the size of the PDF files you are trying to parse and set the http.content.limit property in nutch-site.xml accordingly?
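For reference, an entry along these lines in nutch-site.xml should raise the limit; the 10 MB value below is only an illustrative choice, not a recommendation:

```xml
<!-- nutch-site.xml: raise the maximum size of downloaded content.
     The default http.content.limit is 65536 bytes (64 kB);
     a negative value disables truncation entirely. -->
<property>
  <name>http.content.limit</name>
  <value>10485760</value> <!-- 10 MB; pick a value larger than your biggest document -->
</property>
```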
Anything over the default limit will be truncated (or skipped in some cases). Please get back to us on this one.

On Tue, Aug 30, 2011 at 8:27 PM, Elisabeth Adler <[email protected]> wrote:

> Actually, I don't think Tika is the issue. If I add manually downloaded
> PDFs to Nutch's test cases, the files are parsed correctly. I think it is
> more likely something with Nutch not being able to download the files
> correctly.
> Any pointers?
> Thanks,
> Elisabeth
>
> On 30.08.2011 19:41, Markus Jelsma wrote:
>
>> Hi,
>>
>> Can you report your issues to the Tika mailing list? You're more likely
>> to get help there.
>>
>> Cheers
>>
>>> Hi,
>>>
>>> I am using Nutch 1.3 to crawl our intranet page. I have turned on the
>>> tika-plugin (see [1]) to parse PDFs and MS Office documents, and
>>> included the mime types in parse-plugins.xml.
>>>
>>> On crawling, the URLs of my files are correctly retrieved, but on
>>> parsing the files, I get the following errors:
>>>
>>> [Error1]: Error parsing:
>>> http://.../sample-site/news/cactus/2011-06-22_ClientSupportWeeklyReport.pdf:
>>> failed(2,0): null
>>>
>>> [Error2]: Error parsing:
>>> http://../sample-site/news/cactus/Operations-Meeting-Minutes-2011-wk02.doc:
>>> failed(2,0): Your file contains 127 sectors, but the initial DIFAT array
>>> at index 0 referenced block # 208. This isn't allowed and your file is
>>> corrupt
>>>
>>> [Error3]: Error parsing:
>>> http://../sample-site/news/cactus/work_log_lisi.xls: failed(2,0): Your
>>> file contains 127 sectors, but the initial DIFAT array at index 0
>>> referenced block # 241. This isn't allowed and your file is corrupt
>>>
>>> Full stack traces for the errors are below. When entering the URLs in
>>> a browser, the files can be opened without problems. Also, I used the
>>> files in the Nutch test cases, and they could be opened and read
>>> correctly by Nutch, so it does not seem to be a problem with the files
>>> themselves. The commands I use to fetch and parse are also below [2].
>>>
>>> Did anyone encounter any of these problems so far? Any pointers are very
>>> much appreciated!
>>> Thanks a lot,
>>> Elisabeth
>>>
>>> [1] nutch-site.xml
>>> <property>
>>>   <name>plugin.includes</name>
>>>   <value>parse-(html|tika|js|zip)|...</value>
>>> </property>
>>>
>>> [Error1]:
>>> 2011-08-30 18:13:11,783 ERROR tika.TikaParser - Error parsing
>>> http://.../sample-site/news/cactus/2011-06-22_ClientSupportWeeklyReport.pdf
>>> java.lang.NullPointerException
>>>     at org.apache.pdfbox.pdmodel.PDPageNode.getCount(PDPageNode.java:109)
>>>     at org.apache.pdfbox.pdmodel.PDDocument.getNumberOfPages(PDDocument.java:946)
>>>     at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:107)
>>>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:88)
>>>     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
>>>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
>>>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
>>>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>     at java.lang.Thread.run(Thread.java:662)
>>>
>>> [Error2]:
>>> 2011-08-30 18:13:11,864 ERROR tika.TikaParser - Error parsing
>>> http://.../sample-site/news/cactus/Operations-Meeting-Minutes-2011-wk02.doc
>>> java.io.IOException: Your file contains 127 sectors, but the initial
>>> DIFAT array at index 0 referenced block # 208.
>>> This isn't allowed and your file is corrupt
>>>     at org.apache.poi.poifs.storage.BlockAllocationTableReader.<init>(BlockAllocationTableReader.java:113)
>>>     at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:166)
>>>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:160)
>>>     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
>>>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
>>>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
>>>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>     at java.lang.Thread.run(Thread.java:662)
>>>
>>> [Error3]:
>>> 2011-08-30 18:13:11,902 ERROR tika.TikaParser - Error parsing
>>> http://.../sample-site/news/cactus/work_log_lisi.xls
>>> java.io.IOException: Your file contains 127 sectors, but the initial
>>> DIFAT array at index 0 referenced block # 241.
>>> This isn't allowed and your file is corrupt
>>>     at org.apache.poi.poifs.storage.BlockAllocationTableReader.<init>(BlockAllocationTableReader.java:113)
>>>     at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:166)
>>>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:160)
>>>     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
>>>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
>>>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
>>>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>     at java.lang.Thread.run(Thread.java:662)
>>>
>>> [2]
>>> ./bin/nutch inject crawl/crawldb urls >> crawl.log
>>> ./bin/nutch generate crawl/crawldb crawl/segments >> crawl.log
>>> s1=`ls -d crawl/segments/2* | tail -1`
>>> ./bin/nutch fetch $s1 -noParsing >> crawl.log
>>> ./bin/nutch parse $s1 >> crawl.log

--
Lewis
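A quick way to test the truncation theory is to compare the manually downloaded copies against the default 64 kB limit. A rough sketch (the file name in the example comment is just a placeholder for one of the downloaded documents):

```shell
# Check whether a local file exceeds Nutch's default http.content.limit
# (65536 bytes); anything larger would be truncated during the fetch.
LIMIT=65536

check_size() {
  # GNU stat uses -c%s; BSD/macOS stat uses -f%z
  size=$(stat -c%s "$1" 2>/dev/null || stat -f%z "$1")
  if [ "$size" -gt "$LIMIT" ]; then
    echo "$1: $size bytes - over the limit, would be truncated"
  else
    echo "$1: $size bytes - within the limit"
  fi
}

# Example (placeholder name for a manually downloaded copy):
# check_size ClientSupportWeeklyReport.pdf
```

A truncated OLE2 (.doc/.xls) file would also explain the POIFS "file is corrupt" messages, since the block allocation table at the end of the file would be missing.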

