Sorry, I should have been more explicit about the exact file location:
http://svn.apache.org/repos/asf/nutch/trunk/conf/parse-plugins.xml
hth

On Tue, May 22, 2012 at 10:19 AM, Tolga <[email protected]> wrote:
> By, tika mimeType settings, do you mean protocol-http?
>
> On 5/22/12 12:14 PM, Lewis John Mcgibbney wrote:
>>
>> try your http.content.limit and also make sure that you haven't
>> changed anything within the tika mimeType mappings.
>>
>> On Tue, May 22, 2012 at 9:06 AM, Tolga<[email protected]> wrote:
>>>
>>> Sorry, I forgot to also add my original problem. PDF files are not
>>> crawled.
>>> I even modified -topN to be 10.
>>>
>>>
>>> -------- Original Message --------
>>> Subject: PDF not crawled/indexed
>>> Date: Tue, 22 May 2012 10:48:15 +0300
>>> From: Tolga<[email protected]>
>>> To: [email protected]
>>>
>>>
>>>
>>> Hi,
>>>
>>> I am crawling my website with this command:
>>>
>>> bin/nutch crawl urls -dir crawl-$(date +%FT%H-%M-%S) -solr
>>> http://localhost:8983/solr/ -depth 20 -topN 5
>>>
>>> Is it a good idea to modify the directory name? Should I always delete
>>> indexes prior to crawling and stick to the same directory name?
>>>
>>> Regards,
>>>
>>
>>
>

--
Lewis

