Yes, I know. If your PDFs are larger than this, they will either be truncated or not crawled at all. Please look thoroughly at your log output... you may wish to set the http.verbose and fetcher.verbose properties as well.
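Something like the following in conf/nutch-site.xml should raise the cap (a sketch only, untested against your setup; -1 disables the limit entirely, or substitute a byte count larger than your biggest PDF):

<property>
  <name>http.content.limit</name>
  <!-- default is 65536 bytes; -1 = no limit -->
  <value>-1</value>
</property>
<property>
  <name>http.verbose</name>
  <value>true</value>
</property>
<property>
  <name>fetcher.verbose</name>
  <value>true</value>
</property>

With the verbose properties on, logs/hadoop.log should tell you whether each PDF fetch was truncated or skipped.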
On Tue, May 22, 2012 at 10:31 AM, Tolga <[email protected]> wrote:
> The value is 65536
>
> On 5/22/12 12:14 PM, Lewis John Mcgibbney wrote:
>>
>> try your http.content.limit and also make sure that you haven't
>> changed anything within the tika mimeType mappings.
>>
>> On Tue, May 22, 2012 at 9:06 AM, Tolga <[email protected]> wrote:
>>>
>>> Sorry, I forgot to also add my original problem. PDF files are not
>>> crawled. I even modified -topN to be 10.
>>>
>>>
>>> -------- Original Message --------
>>> Subject: PDF not crawled/indexed
>>> Date: Tue, 22 May 2012 10:48:15 +0300
>>> From: Tolga <[email protected]>
>>> To: [email protected]
>>>
>>>
>>> Hi,
>>>
>>> I am crawling my website with this command:
>>>
>>> bin/nutch crawl urls -dir crawl-$(date +%FT%H-%M-%S) -solr
>>> http://localhost:8983/solr/ -depth 20 -topN 5
>>>
>>> Is it a good idea to modify the directory name? Should I always delete
>>> indexes prior to crawling and stick to the same directory name?
>>>
>>> Regards,
>>>

--
Lewis

