Hmm, okay. I never touched that file.
On 5/22/12 12:26 PM, Lewis John Mcgibbney wrote:
Sorry, I should have been more explicit about the exact file location:
http://svn.apache.org/repos/asf/nutch/trunk/conf/parse-plugins.xml
hth
On Tue, May 22, 2012 at 10:19 AM, Tolga<[email protected]> wrote:
By tika mimeType settings, do you mean protocol-http?
On 5/22/12 12:14 PM, Lewis John Mcgibbney wrote:
Try adjusting your http.content.limit, and also make sure that you haven't
changed anything within the Tika mimeType mappings.
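As a rough sketch of the first suggestion, assuming the default limit of
65536 bytes is truncating your PDFs before they reach the parser, an override
in conf/nutch-site.xml might look like the following. The value -1 is only an
illustration; a sufficiently large positive byte count would work as well.

```xml
<!-- conf/nutch-site.xml: place inside the existing <configuration> element.
     Values here override the defaults shipped in conf/nutch-default.xml. -->
<property>
  <name>http.content.limit</name>
  <!-- Illustrative value: a negative number disables truncation entirely;
       the shipped default is 65536 bytes. -->
  <value>-1</value>
</property>
```

After changing it, re-run the crawl so the PDFs are fetched again in full.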
On Tue, May 22, 2012 at 9:06 AM, Tolga<[email protected]> wrote:
Sorry, I forgot to also add my original problem. PDF files are not
crawled.
I even modified -topN to be 10.
-------- Original Message --------
Subject: PDF not crawled/indexed
Date: Tue, 22 May 2012 10:48:15 +0300
From: Tolga<[email protected]>
To: [email protected]
Hi,
I am crawling my website with this command:
bin/nutch crawl urls -dir crawl-$(date +%FT%H-%M-%S) -solr http://localhost:8983/solr/ -depth 20 -topN 5
Is it a good idea to modify the directory name? Should I always delete
indexes prior to crawling and stick to the same directory name?
Regards,