Hi Elisabeth,

Can you please check the size of the PDF files you are trying to parse and
set the http.content.limit property accordingly in nutch-site.xml?

Anything over the default limit will be truncated (or skipped in some cases).
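
For reference, something along these lines in nutch-site.xml should do it
(the default limit is only about 64 KB; as far as I recall a negative value
disables truncation entirely, otherwise pick a byte count larger than your
biggest document):

<property>
  <name>http.content.limit</name>
  <!-- -1 should disable the limit; alternatively e.g. 10485760 for a ~10 MB cap -->
  <value>-1</value>
</property>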

Please get back to us on this one.

On Tue, Aug 30, 2011 at 8:27 PM, Elisabeth Adler
<[email protected]> wrote:

> Actually, I don't think Tika is the issue. If I add manually downloaded
> PDFs to Nutch's test cases, the files are parsed correctly. I think it is
> more likely that Nutch is not downloading the files correctly.
> Any pointers?
> Thanks,
> Elisabeth
>
> On 30.08.2011 19:41, Markus Jelsma wrote:
>
>> Hi,
>>
>> Can you report your issues to the Tika mailing list? You're more likely
>> to get help there.
>>
>> Cheers
>>
>>> Hi,
>>>
>>> I am using Nutch 1.3 to crawl our intranet page. I have turned on the
>>> Tika plugin (see [1]) to parse PDFs and MS Office documents, and
>>> included the MIME types in parse-plugins.xml.
>>>
>>> During crawling, the URLs of my files are retrieved correctly, but when
>>> the files are parsed I get the following errors:
>>> [Error1]: Error parsing:
>>> http://.../sample-site/news/cactus/2011-06-22_ClientSupportWeeklyReport.pdf:
>>> failed(2,0): null
>>> [Error2]: Error parsing:
>>> http://../sample-site/news/cactus/Operations-Meeting-Minutes-2011-wk02.doc:
>>> failed(2,0): Your file contains 127 sectors, but the initial DIFAT array
>>> at index 0 referenced block # 208. This isn't allowed and your file is
>>> corrupt
>>> [Error3]: Error parsing:
>>> http://../sample-site/news/cactus/work_log_lisi.xls: failed(2,0): Your
>>> file contains 127 sectors, but the initial DIFAT array at index 0
>>> referenced block # 241. This isn't allowed and your file is corrupt
>>>
>>> Full stack traces for the errors are below. When entering the URLs in
>>> a browser, the files can be opened without problems. I also used the
>>> files in the Nutch test cases, and they could be opened and read
>>> correctly by Nutch, so it does not seem to be a problem with the files
>>> themselves. The commands I use to crawl and parse the files are also
>>> shown below [2].
>>>
>>> Has anyone encountered any of these problems so far? Any pointers are
>>> very much appreciated!
>>> Thanks a lot,
>>> Elisabeth
>>>
>>>
>>> [1] nutch-site.xml
>>> <property>
>>>   <name>plugin.includes</name>
>>>   <value>parse-(html|tika|js|zip)|...</value>
>>> </property>
>>>
>>> [Error1]:
>>> 2011-08-30 18:13:11,783 ERROR tika.TikaParser - Error parsing
>>> http://.../sample-site/news/cactus/2011-06-22_ClientSupportWeeklyReport.pdf
>>> java.lang.NullPointerException
>>>         at org.apache.pdfbox.pdmodel.PDPageNode.getCount(PDPageNode.java:109)
>>>         at org.apache.pdfbox.pdmodel.PDDocument.getNumberOfPages(PDDocument.java:946)
>>>         at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:107)
>>>         at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:88)
>>>         at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
>>>         at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
>>>         at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
>>>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>         at java.lang.Thread.run(Thread.java:662)
>>>
>>> [Error2]:
>>> 2011-08-30 18:13:11,864 ERROR tika.TikaParser - Error parsing
>>> http://.../sample-site/news/cactus/Operations-Meeting-Minutes-2011-wk02.doc
>>> java.io.IOException: Your file contains 127 sectors, but the initial
>>> DIFAT array at index 0 referenced block # 208. This isn't allowed and
>>> your file is corrupt
>>>         at org.apache.poi.poifs.storage.BlockAllocationTableReader.<init>(BlockAllocationTableReader.java:113)
>>>         at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:166)
>>>         at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:160)
>>>         at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
>>>         at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
>>>         at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
>>>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>         at java.lang.Thread.run(Thread.java:662)
>>>
>>> [Error3]:
>>> 2011-08-30 18:13:11,902 ERROR tika.TikaParser - Error parsing
>>> http://.../sample-site/news/cactus/work_log_lisi.xls
>>> java.io.IOException: Your file contains 127 sectors, but the initial
>>> DIFAT array at index 0 referenced block # 241. This isn't allowed and
>>> your file is corrupt
>>>         at org.apache.poi.poifs.storage.BlockAllocationTableReader.<init>(BlockAllocationTableReader.java:113)
>>>         at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:166)
>>>         at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:160)
>>>         at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
>>>         at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
>>>         at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
>>>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>         at java.lang.Thread.run(Thread.java:662)
>>>
>>> [2]
>>> ./bin/nutch inject crawl/crawldb urls >> crawl.log
>>> ./bin/nutch generate crawl/crawldb crawl/segments >> crawl.log
>>> s1=`ls -d crawl/segments/2* | tail -1` >> crawl.log
>>> ./bin/nutch fetch $s1 -noParsing >> crawl.log
>>> ./bin/nutch parse $s1 >> crawl.log
>>>
>>


-- 
Lewis
