Ah, you are using an older version. Newer Nutch releases mention both limits in the description to avoid confusion.
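For reference, the override in nutch-site.xml would look something like the snippet below (the description wording is approximated from the Nutch 1.x defaults; a value of -1 disables truncation entirely):

<!-- nutch-site.xml: raise or disable the HTTP download limit.
     http.content.limit applies to http:// fetches; file.content.limit
     only applies to the file:// protocol, which is why setting it had
     no effect here. -->
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes. If this value is nonnegative (>=0), content
  longer than it will be truncated; otherwise, no truncation at all.
  </description>
</property>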
Cheers

On Wednesday 31 August 2011 15:11:29 Elisabeth Adler wrote:
> Hi all,
> thanks for the help!
> The culprit was that I was setting the file.content.limit instead of the
> http.content.limit.
>
> On 31.08.2011 08:22, Elisabeth Adler wrote:
> > The size of the PDF is 528kb (the .doc is 108kb and the .xls is 123kb),
> > and I set the limit in the config to -1:
> >
> > <property>
> >   <name>file.content.limit</name>
> >   <value>-1</value>
> >   <description>The length limit for downloaded content, in bytes.
> >   If this value is nonnegative (>=0), content longer than it will be
> >   truncated; otherwise, no truncation at all.
> >   </description>
> > </property>
> >
> > Is there any setting where I can force Nutch to somehow persist the
> > file before parsing it, so I can make sure it's actually there?
> >
> > On 30.08.2011 21:42, lewis john mcgibbney wrote:
> >> Hi Elisabeth,
> >>
> >> Can you please check the size of the PDF files you are trying to parse
> >> and set the http.content.limit property accordingly in nutch-site.xml?
> >>
> >> Anything over the default limit will be truncated (or skipped in some
> >> cases).
> >>
> >> Please get back to us on this one.
> >>
> >> On Tue, Aug 30, 2011 at 8:27 PM, Elisabeth Adler
> >> <[email protected]> wrote:
> >>> Actually, I don't think Tika is the issue. If I add manually downloaded
> >>> PDFs to Nutch's test cases, the files are parsed correctly. I think it
> >>> is more likely that Nutch is not able to download the files correctly.
> >>> Any pointers?
> >>> Thanks,
> >>> Elisabeth
> >>>
> >>> On 30.08.2011 19:41, Markus Jelsma wrote:
> >>>> Hi,
> >>>>
> >>>> Can you report your issues to the Tika mailing list? You're more
> >>>> likely to get help there.
> >>>>
> >>>> Cheers
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> I am using Nutch 1.3 to crawl our intranet page. I have turned on the
> >>>>> Tika plugin (see [1]) to parse PDFs and MS Office documents, and
> >>>>> included the MIME types in parse-plugins.xml.
> >>>>>
> >>>>> On crawling, the URLs of my files are correctly retrieved, but on
> >>>>> parsing the files I get the following errors:
> >>>>>
> >>>>> [Error1]: Error parsing:
> >>>>> http://.../sample-site/news/cactus/2011-06-22_ClientSupportWeeklyReport.pdf:
> >>>>> failed(2,0): null
> >>>>>
> >>>>> [Error2]: Error parsing:
> >>>>> http://.../sample-site/news/cactus/Operations-Meeting-Minutes-2011-wk02.doc:
> >>>>> failed(2,0): Your file contains 127 sectors, but the initial DIFAT
> >>>>> array at index 0 referenced block # 208. This isn't allowed and
> >>>>> your file is corrupt
> >>>>>
> >>>>> [Error3]: Error parsing:
> >>>>> http://.../sample-site/news/cactus/work_log_lisi.xls: failed(2,0):
> >>>>> Your file contains 127 sectors, but the initial DIFAT array at index
> >>>>> 0 referenced block # 241. This isn't allowed and your file is
> >>>>> corrupt
> >>>>>
> >>>>> Full stack traces for the errors are below. When entering the URLs
> >>>>> in a browser, the files can be opened without problems. Also, I used
> >>>>> the files in the Nutch test cases, and they could be opened and read
> >>>>> correctly by Nutch, so it does not seem to be a problem with the
> >>>>> files themselves. The commands I use to fetch and parse are also
> >>>>> below [2].
> >>>>>
> >>>>> Did anyone encounter any of these problems so far? Any pointers are
> >>>>> very much appreciated!
> >>>>> Thanks a lot,
> >>>>> Elisabeth
> >>>>>
> >>>>>
> >>>>> [1] nutch-site.xml
> >>>>> <property>
> >>>>>   <name>plugin.includes</name>
> >>>>>   <value>parse-(html|tika|js|zip)|...</value>
> >>>>> </property>
> >>>>>
> >>>>> [Error1]:
> >>>>> 2011-08-30 18:13:11,783 ERROR tika.TikaParser - Error parsing
> >>>>> http://.../sample-site/news/cactus/2011-06-22_ClientSupportWeeklyReport.pdf
> >>>>> java.lang.NullPointerException
> >>>>>   at org.apache.pdfbox.pdmodel.PDPageNode.getCount(PDPageNode.java:109)
> >>>>>   at org.apache.pdfbox.pdmodel.PDDocument.getNumberOfPages(PDDocument.java:946)
> >>>>>   at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:107)
> >>>>>   at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:88)
> >>>>>   at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
> >>>>>   at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
> >>>>>   at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
> >>>>>   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >>>>>   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >>>>>   at java.lang.Thread.run(Thread.java:662)
> >>>>>
> >>>>> [Error2]:
> >>>>> 2011-08-30 18:13:11,864 ERROR tika.TikaParser - Error parsing
> >>>>> http://.../sample-site/news/cactus/Operations-Meeting-Minutes-2011-wk02.doc
> >>>>> java.io.IOException: Your file contains 127 sectors, but the initial
> >>>>> DIFAT array at index 0 referenced block # 208. This isn't allowed and
> >>>>> your file is corrupt
> >>>>>   at org.apache.poi.poifs.storage.BlockAllocationTableReader.<init>(BlockAllocationTableReader.java:113)
> >>>>>   at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:166)
> >>>>>   at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:160)
> >>>>>   at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
> >>>>>   at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
> >>>>>   at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
> >>>>>   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >>>>>   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >>>>>   at java.lang.Thread.run(Thread.java:662)
> >>>>>
> >>>>> [Error3]:
> >>>>> 2011-08-30 18:13:11,902 ERROR tika.TikaParser - Error parsing
> >>>>> http://.../sample-site/news/cactus/work_log_lisi.xls
> >>>>> java.io.IOException: Your file contains 127 sectors, but the initial
> >>>>> DIFAT array at index 0 referenced block # 241.
> >>>>> This isn't allowed and
> >>>>> your file is corrupt
> >>>>>   at org.apache.poi.poifs.storage.BlockAllocationTableReader.<init>(BlockAllocationTableReader.java:113)
> >>>>>   at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:166)
> >>>>>   at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:160)
> >>>>>   at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
> >>>>>   at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
> >>>>>   at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
> >>>>>   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >>>>>   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >>>>>   at java.lang.Thread.run(Thread.java:662)
> >>>>>
> >>>>> [2]
> >>>>> ./bin/nutch inject crawl/crawldb urls >> crawl.log
> >>>>> ./bin/nutch generate crawl/crawldb crawl/segments >> crawl.log
> >>>>> s1=`ls -d crawl/segments/2* | tail -1`
> >>>>> ./bin/nutch fetch $s1 -noParsing >> crawl.log
> >>>>> ./bin/nutch parse $s1 >> crawl.log

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
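For anyone replaying the commands quoted in [2], the same cycle written as a small standalone script might look like the sketch below. It assumes a Nutch 1.3 local install with the same crawl/ directory layout; the updatedb step at the end is an addition not present in the quoted commands, everything else mirrors them.

#!/bin/sh
# Sketch of the crawl cycle from [2], assuming a Nutch 1.3 local install.
./bin/nutch inject crawl/crawldb urls >> crawl.log 2>&1
./bin/nutch generate crawl/crawldb crawl/segments >> crawl.log 2>&1

# Segment directories are named by timestamp, so a lexical sort picks
# the newest one.
s1=`ls -d crawl/segments/2* | tail -1`

# Fetch without parsing, then parse as a separate step, as in [2].
./bin/nutch fetch $s1 -noParsing >> crawl.log 2>&1
./bin/nutch parse $s1 >> crawl.log 2>&1

# Not in the original commands: fold the fetched and parsed segment back
# into the crawldb so the next generate round sees it.
./bin/nutch updatedb crawl/crawldb $s1 >> crawl.log 2>&1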

