Hi all,
Thanks for the help!
The culprit was that I was setting file.content.limit instead of http.content.limit.
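
For the record, the property that applies to HTTP fetches looks like this in nutch-site.xml (a sketch; -1 disables truncation):

<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all.
  </description>
</property>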

On 31.08.2011 08:22, Elisabeth Adler wrote:
The size of the PDF is 528 KB (the .doc is 108 KB and the .xls is 123 KB), and I set the limit in the config to -1:
<property>
  <name>file.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be
  truncated; otherwise, no truncation at all.
  </description>
</property>

Is there any setting to make Nutch persist the file before parsing it, so I can check that it was actually downloaded in full?
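
In the meantime, one thing I might try is dumping the raw fetched content of the newest segment with the readseg tool, to check what was actually stored (a sketch; segdump is just an output directory name I picked):

# dump only the Content records of the most recent segment
s1=`ls -d crawl/segments/2* | tail -1`
./bin/nutch readseg -dump $s1 segdump -nofetch -nogenerate -noparse -noparsedata -noparsetext
less segdump/dump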


On 30.08.2011 21:42, lewis john mcgibbney wrote:
Hi Elisabeth,

Can you please check the size of the PDF files you are trying to parse and
set the http.content.limit property accordingly in nutch-site.xml?
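
If you don't want to download them by hand, a HEAD request reports the size; for example with curl (assuming curl is available where you run Nutch):

# HEAD request only; Content-Length is the file size in bytes
curl -sI http://.../sample-site/news/cactus/2011-06-22_ClientSupportWeeklyReport.pdf | grep -i content-length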

Anything over the default limit (65536 bytes) will be truncated (or skipped in some cases).

Please get back to us on this one.

On Tue, Aug 30, 2011 at 8:27 PM, Elisabeth Adler
<[email protected]> wrote:

Actually, I don't think Tika is the issue. If I add manually downloaded
PDFs to Nutch's test cases, the files are parsed correctly. It seems more
likely that Nutch is not downloading the files correctly.
Any pointers?
Thanks,
Elisabeth

On 30.08.2011 19:41, Markus Jelsma wrote:

Hi,

Can you report your issues to the Tika mailing list? You're more likely to
get help there.

Cheers

Hi,
I am using Nutch 1.3 to crawl our intranet page. I have turned on the
tika-plugin (see [1]) to parse PDFs and MS Office documents, and
included the MIME types in parse-plugins.xml, as sketched below.
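
The mappings are along these lines (a sketch of what I added; parse-tika is aliased to the Tika parser plugin elsewhere in the same file):

<mimeType name="application/pdf">
  <plugin id="parse-tika" />
</mimeType>
<mimeType name="application/msword">
  <plugin id="parse-tika" />
</mimeType>
<mimeType name="application/vnd.ms-excel">
  <plugin id="parse-tika" />
</mimeType>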

On crawling, the URLs of my files are correctly retrieved, but on
parsing the files, I get the following errors:
[Error1]: Error parsing:
http://.../sample-site/news/cactus/2011-06-22_ClientSupportWeeklyReport.pdf:
failed(2,0): null
[Error2]: Error parsing:
http://.../sample-site/news/cactus/Operations-Meeting-Minutes-2011-wk02.doc:
failed(2,0): Your file contains 127 sectors, but the initial DIFAT array
at index 0 referenced block # 208. This isn't allowed and your file is
corrupt
[Error3]: Error parsing:
http://.../sample-site/news/cactus/work_log_lisi.xls: failed(2,0): Your
file contains 127 sectors, but the initial DIFAT array at index 0
referenced block # 241. This isn't allowed and your file is corrupt

Further stack traces for the errors are below. When entering the URLs in
a browser, the files can be opened without problems. Also, I added the
files to the Nutch test cases, and they could be opened and read
correctly there, so it does not seem to be a problem with the files
themselves. The commands I use to fetch and parse the files are also
below [2].

Did anyone encounter any of these problems so far? Any pointers are very
much appreciated!
Thanks a lot,
Elisabeth


[1] nutch-site.xml
<property>
  <name>plugin.includes</name>
  <value>parse-(html|tika|js|zip)|...</value>
</property>

[Error1]:
2011-08-30 18:13:11,783 ERROR tika.TikaParser - Error parsing
http://.../sample-site/news/cactus/2011-06-22_ClientSupportWeeklyReport.pdf
java.lang.NullPointerException
        at org.apache.pdfbox.pdmodel.PDPageNode.getCount(PDPageNode.java:109)
        at org.apache.pdfbox.pdmodel.PDDocument.getNumberOfPages(PDDocument.java:946)
        at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:107)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:88)
        at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.lang.Thread.run(Thread.java:662)

[Error2]:
2011-08-30 18:13:11,864 ERROR tika.TikaParser - Error parsing
http://.../sample-site/news/cactus/Operations-Meeting-Minutes-2011-wk02.doc
java.io.IOException: Your file contains 127 sectors, but the initial
DIFAT array at index 0 referenced block # 208. This isn't allowed and
your file is corrupt
        at org.apache.poi.poifs.storage.BlockAllocationTableReader.<init>(BlockAllocationTableReader.java:113)
        at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:166)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:160)
        at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.lang.Thread.run(Thread.java:662)

[Error3]:
2011-08-30 18:13:11,902 ERROR tika.TikaParser - Error parsing
http://.../sample-site/news/cactus/work_log_lisi.xls
java.io.IOException: Your file contains 127 sectors, but the initial
DIFAT array at index 0 referenced block # 241. This isn't allowed and
your file is corrupt
        at org.apache.poi.poifs.storage.BlockAllocationTableReader.<init>(BlockAllocationTableReader.java:113)
        at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:166)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:160)
        at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.lang.Thread.run(Thread.java:662)

[2]
./bin/nutch inject crawl/crawldb urls >> crawl.log
./bin/nutch generate crawl/crawldb crawl/segments >> crawl.log
s1=`ls -d crawl/segments/2* | tail -1`
./bin/nutch fetch $s1 -noParsing >> crawl.log
./bin/nutch parse $s1 >> crawl.log
