In that case, check the file size. Nutch truncates fetched content at a configurable limit, so compare the size of the files on the server with the content limit in your configuration.
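
A sketch of how to test this, under some assumptions: the property meant here is http.content.limit from nutch-default.xml, which as far as I know defaults to 65536 bytes, and anything longer is cut off during the fetch. The "127 sectors" in your .doc and .xls errors would fit a file cut at exactly that size, assuming the usual 512-byte sectors plus a 512-byte header (127 * 512 + 512 = 65536). An override in nutch-site.xml could look like this, where a negative value should disable truncation:

```xml
<!-- nutch-site.xml: overrides the default download limit from nutch-default.xml -->
<property>
  <name>http.content.limit</name>
  <!-- -1 = no truncation; a sufficiently large byte count would also do -->
  <value>-1</value>
</property>
```

If the documents are fetched via file:// rather than http://, the corresponding property would be file.content.limit. Content that is already in a segment stays truncated, so the segment has to be fetched again after the change.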
> Actually, I don't think tika is the issue. If I add manually downloaded
> PDFs to Nutch's test cases, the files are parsed correctly. I think it
> is more likely something with Nutch not being able to download the files
> correctly.
> Any pointers?
> thanks,
> Elisabeth
>
> On 30.08.2011 19:41, Markus Jelsma wrote:
> > Hi,
> >
> > Can you report your issues to the Tika mailing list? You're more likely
> > to get help there.
> >
> > Cheers
> >
> >> Hi,
> >>
> >> I am using Nutch 1.3 to crawl our intranet page. I have turned on the
> >> tika-plugin (see [1]) to parse pdfs and MS Office documents, and
> >> included the mime types in the parse-plugins.xml.
> >>
> >> On crawling, the URLs of my files are correctly retrieved, but on
> >> parsing the files, I get the following errors:
> >> [Error1]: Error parsing:
> >> http://.../sample-site/news/cactus/2011-06-22_ClientSupportWeeklyReport.pdf:
> >> failed(2,0): null
> >>
> >> [Error2]: Error parsing:
> >> http://../sample-site/news/cactus/Operations-Meeting-Minutes-2011-wk02.doc:
> >> failed(2,0): Your file contains 127 sectors, but the initial DIFAT
> >> array at index 0 referenced block # 208. This isn't allowed and your
> >> file is corrupt
> >> [Error3]: Error parsing:
> >> http://../sample-site/news/cactus/work_log_lisi.xls: failed(2,0): Your
> >> file contains 127 sectors, but the initial DIFAT array at index 0
> >> referenced block # 241. This isn't allowed and your file is corrupt
> >>
> >> Further stack traces to the errors are below. When entering the ULRs in
> >> a browser, the files can be opened without problems. Also, I used the
> >> file in the Nutch test cases, and the files could be opened and read
> >> correctly by Nutch, so it does not seem to be a problem with the files.
> >> Also below on how I parse the files [2].
> >>
> >> Did anyone encounter any of these problems so far? Any pointers are very
> >> much appreciated!
> >> Thanks a lot,
> >> Elisabeth
> >>
> >>
> >> [1] nutch-site.xml
> >> <property><name>plugin.includes</name>
> >> <value>parse-(html|tika|js|zip)|...</value> </property>
> >>
> >> [Error1]:
> >> 2011-08-30 18:13:11,783 ERROR tika.TikaParser - Error parsing
> >> http://.../sample-site/news/cactus/2011-06-22_ClientSupportWeeklyReport.pdf
> >> java.lang.NullPointerException
> >>     at org.apache.pdfbox.pdmodel.PDPageNode.getCount(PDPageNode.java:109)
> >>     at org.apache.pdfbox.pdmodel.PDDocument.getNumberOfPages(PDDocument.java:946)
> >>     at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:107)
> >>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:88)
> >>     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
> >>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
> >>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
> >>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >>     at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >>     at java.lang.Thread.run(Thread.java:662)
> >>
> >> [Error2]:
> >> 2011-08-30 18:13:11,864 ERROR tika.TikaParser - Error parsing
> >> http://.../sample-site/news/cactus/Operations-Meeting-Minutes-2011-wk02.doc
> >> java.io.IOException: Your file contains 127 sectors, but the initial
> >> DIFAT array at index 0 referenced block # 208. This isn't allowed and
> >> your file is corrupt
> >>     at org.apache.poi.poifs.storage.BlockAllocationTableReader.<init>(BlockAllocationTableReader.java:113)
> >>     at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:166)
> >>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:160)
> >>     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
> >>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
> >>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
> >>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >>     at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >>     at java.lang.Thread.run(Thread.java:662)
> >>
> >> [Error3]:
> >> 2011-08-30 18:13:11,902 ERROR tika.TikaParser - Error parsing
> >> http://.../sample-site/news/cactus/work_log_lisi.xls
> >> java.io.IOException: Your file contains 127 sectors, but the initial
> >> DIFAT array at index 0 referenced block # 241. This isn't allowed and
> >> your file is corrupt
> >>     at org.apache.poi.poifs.storage.BlockAllocationTableReader.<init>(BlockAllocationTableReader.java:113)
> >>     at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:166)
> >>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:160)
> >>     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
> >>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
> >>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
> >>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >>     at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >>     at java.lang.Thread.run(Thread.java:662)
> >>
> >> [2]
> >> ./bin/nutch inject crawl/crawldb urls>> crawl.log
> >> ./bin/nutch generate crawl/crawldb crawl/segments>> crawl.log
> >> s1=`ls -d crawl/segments/2* | tail -1`>> crawl.log
> >> ./bin/nutch fetch $s1 -noParsing>> crawl.log
> >> ./bin/nutch parse $s1>> crawl.log

