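In that case, check the file size. Nutch imposes configurable limits on the
size of downloaded content (http.content.limit for HTTP), and content beyond
the limit is truncated. Compare the size of your files with the limit in your
configuration.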
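
A minimal sketch of the override in nutch-site.xml, assuming the files are
fetched over HTTP and that truncation at the default http.content.limit of
65536 bytes is the cause. That would fit the "127 sectors" in the POI errors:
assuming 512-byte sectors after a 512-byte header, 127 x 512 + 512 = 65536.
The value below is only an illustration; as far as I know, a negative value
disables truncation altogether.

```xml
<!-- nutch-site.xml: raise the HTTP download limit (illustrative value, 10 MB) -->
<property>
  <name>http.content.limit</name>
  <!-- assumption: pick a value larger than your biggest PDF/Office file -->
  <value>10485760</value>
</property>
```

The segment has to be fetched again after the change, since the content
already stored in it would still be truncated.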

> Actually, I don't think Tika is the issue. If I add manually downloaded
> PDFs to Nutch's test cases, the files are parsed correctly. I think it
> is more likely something with Nutch not being able to download the files
> correctly.
> Any pointers?
> thanks,
> Elisabeth
> 
> On 30.08.2011 19:41, Markus Jelsma wrote:
> > Hi,
> > 
> > Can you report your issues to the Tika mailing list? You're more likely
> > to get help there.
> > 
> > Cheers
> > 
> >> Hi,
> >> 
> >> I am using Nutch 1.3 to crawl our intranet page. I have turned on the
> >> tika-plugin (see [1]) to parse PDFs and MS Office documents, and
> >> included the MIME types in the parse-plugins.xml.
> >> 
> >> On crawling, the URLs of my files are correctly retrieved, but on
> >> parsing the files, I get the following errors:
> >> [Error1]: Error parsing:
> >> http://.../sample-site/news/cactus/2011-06-22_ClientSupportWeeklyReport.pdf:
> >> failed(2,0): null
> >> 
> >> [Error2]: Error parsing:
> >> http://../sample-site/news/cactus/Operations-Meeting-Minutes-2011-wk02.doc:
> >> failed(2,0): Your file contains 127 sectors, but the initial DIFAT
> >> array at index 0 referenced block # 208. This isn't allowed and your
> >> file is corrupt
> >> 
> >> [Error3]: Error parsing:
> >> http://../sample-site/news/cactus/work_log_lisi.xls: failed(2,0): Your
> >> file contains 127 sectors, but the initial DIFAT array at index 0
> >> referenced block # 241. This isn't allowed and your file is corrupt
> >> 
> >> Full stack traces for the errors are below. When I enter the URLs in
> >> a browser, the files open without problems. I also used the files in
> >> the Nutch test cases, where Nutch opened and read them correctly, so
> >> the files themselves do not seem to be the problem. How I parse the
> >> files is also shown below [2].
> >> 
> >> Did anyone encounter any of these problems so far? Any pointers are very
> >> much appreciated!
> >> Thanks a lot,
> >> Elisabeth
> >> 
> >> 
> >> [1] nutch-site.xml
> >> <property><name>plugin.includes</name>
> >> <value>parse-(html|tika|js|zip)|...</value>  </property>
> >> 
> >> [Error1]:
> >> 2011-08-30 18:13:11,783 ERROR tika.TikaParser - Error parsing
> >> http://.../sample-site/news/cactus/2011-06-22_ClientSupportWeeklyReport.pdf
> >> java.lang.NullPointerException
> >>           at org.apache.pdfbox.pdmodel.PDPageNode.getCount(PDPageNode.java:109)
> >>           at org.apache.pdfbox.pdmodel.PDDocument.getNumberOfPages(PDDocument.java:946)
> >>           at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:107)
> >>           at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:88)
> >>           at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
> >>           at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
> >>           at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
> >>           at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >>           at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >>           at java.lang.Thread.run(Thread.java:662)
> >> 
> >> [Error2]:
> >> 2011-08-30 18:13:11,864 ERROR tika.TikaParser - Error parsing
> >> http://.../sample-site/news/cactus/Operations-Meeting-Minutes-2011-wk02.doc
> >> java.io.IOException: Your file contains 127 sectors, but the initial
> >> DIFAT array at index 0 referenced block # 208. This isn't allowed and
> >> your file is corrupt
> >>           at org.apache.poi.poifs.storage.BlockAllocationTableReader.<init>(BlockAllocationTableReader.java:113)
> >>           at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:166)
> >>           at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:160)
> >>           at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
> >>           at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
> >>           at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
> >>           at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >>           at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >>           at java.lang.Thread.run(Thread.java:662)
> >> 
> >> [Error3]:
> >> 2011-08-30 18:13:11,902 ERROR tika.TikaParser - Error parsing
> >> http://.../sample-site/news/cactus/work_log_lisi.xls
> >> java.io.IOException: Your file contains 127 sectors, but the initial
> >> DIFAT array at index 0 referenced block # 241. This isn't allowed and
> >> your file is corrupt
> >>           at org.apache.poi.poifs.storage.BlockAllocationTableReader.<init>(BlockAllocationTableReader.java:113)
> >>           at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:166)
> >>           at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:160)
> >>           at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
> >>           at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
> >>           at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
> >>           at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >>           at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >>           at java.lang.Thread.run(Thread.java:662)
> >> 
> >> [2]
> >> ./bin/nutch inject crawl/crawldb urls >> crawl.log
> >> ./bin/nutch generate crawl/crawldb crawl/segments >> crawl.log
> >> s1=`ls -d crawl/segments/2* | tail -1` >> crawl.log
> >> ./bin/nutch fetch $s1 -noParsing >> crawl.log
> >> ./bin/nutch parse $s1 >> crawl.log
