Ah, you are using an older version. Newer Nutch releases mention both limits in the description to avoid confusion.
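For reference, the override in nutch-site.xml would look something like the snippet below (the description wording is approximated from the Nutch 1.x defaults; a value of -1 disables truncation entirely):

<!-- nutch-site.xml: raise or disable the HTTP download limit.
     http.content.limit applies to http:// fetches; file.content.limit
     only applies to the file:// protocol, which is why setting it had
     no effect here. -->
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes. If this value is nonnegative (>=0), content
  longer than it will be truncated; otherwise, no truncation at all.
  </description>
</property>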
Cheers

On Wednesday 31 August 2011 15:11:29 Elisabeth Adler wrote:
> Hi all,
> thanks for the help!
> The culprit was that I was setting the file.content.limit instead of the
> http.content.limit.
>
> On 31.08.2011 08:22, Elisabeth Adler wrote:
> > The size of the PDF is 528kb (the .doc is 108kb and the .xls is 123kb),
> > and I set the limit in the config to -1:
> >
> > <property>
> >   <name>file.content.limit</name>
> >   <value>-1</value>
> >   <description>The length limit for downloaded content, in bytes.
> >   If this value is nonnegative (>=0), content longer than it will be
> >   truncated; otherwise, no truncation at all.
> >   </description>
> > </property>
> >
> > Is there any setting where I can force Nutch to somehow persist the
> > file before parsing it, so I can make sure it's actually there?
> >
> > On 30.08.2011 21:42, lewis john mcgibbney wrote:
> >> Hi Elisabeth,
> >>
> >> Can you please check the size of the PDF files you are trying to parse
> >> and set the http.content.limit property accordingly in nutch-site.xml?
> >>
> >> Anything over the default limit will be truncated (or skipped in some
> >> cases).
> >>
> >> Please get back to us on this one.
> >>
> >> On Tue, Aug 30, 2011 at 8:27 PM, Elisabeth Adler
> >> <[email protected]> wrote:
> >>> Actually, I don't think Tika is the issue. If I add manually downloaded
> >>> PDFs to Nutch's test cases, the files are parsed correctly. I think it
> >>> is more likely that Nutch is not able to download the files correctly.
> >>> Any pointers?
> >>> Thanks,
> >>> Elisabeth
> >>>
> >>> On 30.08.2011 19:41, Markus Jelsma wrote:
> >>>> Hi,
> >>>>
> >>>> Can you report your issues to the Tika mailing list? You're more
> >>>> likely to get help there.
> >>>>
> >>>> Cheers
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> I am using Nutch 1.3 to crawl our intranet page. I have turned on the
> >>>>> Tika plugin (see [1]) to parse PDFs and MS Office documents, and
> >>>>> included the MIME types in parse-plugins.xml.
> >>>>>
> >>>>> On crawling, the URLs of my files are correctly retrieved, but on
> >>>>> parsing the files I get the following errors:
> >>>>>
> >>>>> [Error1]: Error parsing:
> >>>>> http://.../sample-site/news/cactus/2011-06-22_ClientSupportWeeklyReport.pdf:
> >>>>> failed(2,0): null
> >>>>>
> >>>>> [Error2]: Error parsing:
> >>>>> http://.../sample-site/news/cactus/Operations-Meeting-Minutes-2011-wk02.doc:
> >>>>> failed(2,0): Your file contains 127 sectors, but the initial DIFAT
> >>>>> array at index 0 referenced block # 208. This isn't allowed and
> >>>>> your file is corrupt
> >>>>>
> >>>>> [Error3]: Error parsing:
> >>>>> http://.../sample-site/news/cactus/work_log_lisi.xls: failed(2,0):
> >>>>> Your file contains 127 sectors, but the initial DIFAT array at index
> >>>>> 0 referenced block # 241. This isn't allowed and your file is
> >>>>> corrupt
> >>>>>
> >>>>> Full stack traces for the errors are below. When entering the URLs
> >>>>> in a browser, the files can be opened without problems. Also, I used
> >>>>> the files in the Nutch test cases, and they could be opened and read
> >>>>> correctly by Nutch, so it does not seem to be a problem with the
> >>>>> files themselves. The commands I use to fetch and parse are also
> >>>>> below [2].
> >>>>>
> >>>>> Did anyone encounter any of these problems so far? Any pointers are
> >>>>> very much appreciated!
> >>>>> Thanks a lot,
> >>>>> Elisabeth
> >>>>>
> >>>>>
> >>>>> [1] nutch-site.xml
> >>>>> <property>
> >>>>>   <name>plugin.includes</name>
> >>>>>   <value>parse-(html|tika|js|zip)|...</value>
> >>>>> </property>
> >>>>>
> >>>>> [Error1]:
> >>>>> 2011-08-30 18:13:11,783 ERROR tika.TikaParser - Error parsing
> >>>>> http://.../sample-site/news/cactus/2011-06-22_ClientSupportWeeklyReport.pdf
> >>>>> java.lang.NullPointerException
> >>>>>   at org.apache.pdfbox.pdmodel.PDPageNode.getCount(PDPageNode.java:109)
> >>>>>   at org.apache.pdfbox.pdmodel.PDDocument.getNumberOfPages(PDDocument.java:946)
> >>>>>   at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:107)
> >>>>>   at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:88)
> >>>>>   at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
> >>>>>   at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
> >>>>>   at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
> >>>>>   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >>>>>   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >>>>>   at java.lang.Thread.run(Thread.java:662)
> >>>>>
> >>>>> [Error2]:
> >>>>> 2011-08-30 18:13:11,864 ERROR tika.TikaParser - Error parsing
> >>>>> http://.../sample-site/news/cactus/Operations-Meeting-Minutes-2011-wk02.doc
> >>>>> java.io.IOException: Your file contains 127 sectors, but the initial
> >>>>> DIFAT array at index 0 referenced block # 208. This isn't allowed and
> >>>>> your file is corrupt
> >>>>>   at org.apache.poi.poifs.storage.BlockAllocationTableReader.<init>(BlockAllocationTableReader.java:113)
> >>>>>   at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:166)
> >>>>>   at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:160)
> >>>>>   at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
> >>>>>   at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
> >>>>>   at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
> >>>>>   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >>>>>   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >>>>>   at java.lang.Thread.run(Thread.java:662)
> >>>>>
> >>>>> [Error3]:
> >>>>> 2011-08-30 18:13:11,902 ERROR tika.TikaParser - Error parsing
> >>>>> http://.../sample-site/news/cactus/work_log_lisi.xls
> >>>>> java.io.IOException: Your file contains 127 sectors, but the initial
> >>>>> DIFAT array at index 0 referenced block # 241.
> >>>>> This isn't allowed and
> >>>>> your file is corrupt
> >>>>>   at org.apache.poi.poifs.storage.BlockAllocationTableReader.<init>(BlockAllocationTableReader.java:113)
> >>>>>   at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:166)
> >>>>>   at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:160)
> >>>>>   at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
> >>>>>   at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
> >>>>>   at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
> >>>>>   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >>>>>   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >>>>>   at java.lang.Thread.run(Thread.java:662)
> >>>>>
> >>>>> [2]
> >>>>> ./bin/nutch inject crawl/crawldb urls >> crawl.log
> >>>>> ./bin/nutch generate crawl/crawldb crawl/segments >> crawl.log
> >>>>> s1=`ls -d crawl/segments/2* | tail -1`
> >>>>> ./bin/nutch fetch $s1 -noParsing >> crawl.log
> >>>>> ./bin/nutch parse $s1 >> crawl.log

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
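For anyone replaying the commands quoted in [2], the same cycle written as a small standalone script might look like the sketch below. It assumes a Nutch 1.3 local install with the same crawl/ directory layout; the updatedb step at the end is an addition not present in the quoted commands, everything else mirrors them.

#!/bin/sh
# Sketch of the crawl cycle from [2], assuming a Nutch 1.3 local install.
./bin/nutch inject crawl/crawldb urls >> crawl.log 2>&1
./bin/nutch generate crawl/crawldb crawl/segments >> crawl.log 2>&1

# Segment directories are named by timestamp, so a lexical sort picks
# the newest one.
s1=`ls -d crawl/segments/2* | tail -1`

# Fetch without parsing, then parse as a separate step, as in [2].
./bin/nutch fetch $s1 -noParsing >> crawl.log 2>&1
./bin/nutch parse $s1 >> crawl.log 2>&1

# Not in the original commands: fold the fetched and parsed segment back
# into the crawldb so the next generate round sees it.
./bin/nutch updatedb crawl/crawldb $s1 >> crawl.log 2>&1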

