I have been dealing with exactly the same issue, and I wonder what happens to PDFs that exceed the file size limit: are they truncated (and partially parsed?) or ignored entirely? I seem to be seeing parsing problems in PDFs since setting a file size limit. Setting the limit to -1 did cause consistent choke errors on large pages/files, so setting a hard limit seemed the only option.
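For reference, this is roughly how I have the limit set in conf/nutch-site.xml (a minimal sketch; the 10485760 value is just an illustrative 10 MB, not a recommendation, the Nutch default is 65536 bytes, and -1 removes the limit entirely):

  <property>
    <!-- Maximum number of bytes to fetch per document; content beyond
         this limit is truncated. Set to -1 to remove the limit. -->
    <name>http.content.limit</name>
    <value>10485760</value>
  </property>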
Thanks,
Piet

On Tue, May 22, 2012 at 11:44 AM, Lewis John Mcgibbney <[email protected]> wrote:
> Yes, well then you should either set this property to -1 (which is a
> safeguard to ensure that you definitely crawl and parse all of your
> PDFs) or, as a safeguard, to a responsible value that reflects the size
> of the PDFs or other documents which you expect to obtain during your
> crawl. The first option has the downside that on occasion the parser
> can choke on rather large files...
>
> On Tue, May 22, 2012 at 10:36 AM, Tolga <[email protected]> wrote:
> > What is that value's unit? Kilobytes? My PDF file is 4.7 MB.
> >
> > On 5/22/12 12:34 PM, Lewis John Mcgibbney wrote:
> >> Yes, I know.
> >>
> >> If your PDFs are larger than this, then they will either be truncated
> >> or may not be crawled. Please look thoroughly at your log output...
> >> you may wish to use the http.verbose and fetcher.verbose properties as
> >> well.
> >>
> >> On Tue, May 22, 2012 at 10:31 AM, Tolga <[email protected]> wrote:
> >>> The value is 65536.
> >>>
> >>> On 5/22/12 12:14 PM, Lewis John Mcgibbney wrote:
> >>>> Try your http.content.limit and also make sure that you haven't
> >>>> changed anything within the Tika mimeType mappings.
> >>>>
> >>>> On Tue, May 22, 2012 at 9:06 AM, Tolga <[email protected]> wrote:
> >>>>> Sorry, I forgot to also add my original problem: PDF files are not
> >>>>> crawled. I even modified -topN to be 10.
> >>>>>
> >>>>> -------- Original Message --------
> >>>>> Subject: PDF not crawled/indexed
> >>>>> Date: Tue, 22 May 2012 10:48:15 +0300
> >>>>> From: Tolga <[email protected]>
> >>>>> To: [email protected]
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> I am crawling my website with this command:
> >>>>>
> >>>>> bin/nutch crawl urls -dir crawl-$(date +%FT%H-%M-%S) -solr http://localhost:8983/solr/ -depth 20 -topN 5
> >>>>>
> >>>>> Is it a good idea to modify the directory name? Should I always delete
> >>>>> indexes prior to crawling and stick to the same directory name?
> >>>>>
> >>>>> Regards,
>
> --
> Lewis
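For anyone else hitting this: the verbose properties Lewis mentions above can be enabled in conf/nutch-site.xml, a sketch of what I would try to see whether the PDFs are being truncated at fetch time (the property names are the standard Nutch ones; turning them on is my own suggestion, not something from this thread):

  <property>
    <!-- Extra logging from the HTTP protocol plugin. -->
    <name>http.verbose</name>
    <value>true</value>
  </property>
  <property>
    <!-- Extra logging from the fetcher itself. -->
    <name>fetcher.verbose</name>
    <value>true</value>
  </property>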

