Hello - see inline.
Markus
 
-----Original message-----
> From:Ajmal Rahman <[email protected]>
> Sent: Tuesday 16th August 2016 15:55
> To: [email protected]
> Subject: Query on Single Crawl script to Crawl website (Nutch) and Index 
> results (Solr)
> 
> Dear Team,
> 
> I have a query. I'm not sure if this is the right place to ask. But here it 
> goes:
> 
> I have to crawl and index my website.
> 
> These are the steps I have been asked to follow.
> 
> 1. Delete the crawl folders (apache-nutch-1.10\crawl)
> 2. Remove the existing indexes:
>    Solr-Admin -> Skyweb -> Documents -> Document Type (xml) and execute :
> 3. Go to Solr-Admin -> Core Admin -> Click on 'Reload' and then 'Optimize'
> 4. Run the crawl job using the following command:
>    bin/crawl -i -D solr.server.url=http://IP:8080/solr/website/ urls/ crawl/ 5
> 
> I did some research and felt that doing these tasks manually is overwork and 
> the script should take care of all the above tasks.
> 
> So my queries/concerns are:
> 
> Doesn't the above script take care of the entire process? Do I still need to 
> delete the crawl folders and clear the existing indexes manually?

The bin/crawl script does not delete any directory or clear an existing index.

> 
> What is the relevance of the Admin tasks - 'Reload' and 'Optimize'?

I don't know how they are relevant to you. Reload just reloads the Solr 
configuration, and optimize merges all Lucene segments into one. Optimizing 
is usually very bad advice, though: unless you really know what you are doing, 
don't do it.

> 
> Can I cron schedule the crawl script to run weekly and will it take care 
> of the entire process?

Sure.
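
As a rough sketch of what a weekly schedule could look like, the snippet below 
builds a crontab line around the command from your mail and prints it. The 
install path, the Sunday 02:00 slot, and the log file are all assumptions of 
mine, so adjust them to your setup.

```shell
#!/bin/sh
# Hypothetical install location and log file; neither comes from the thread.
NUTCH_HOME="${NUTCH_HOME:-/opt/apache-nutch-1.10}"
LOG_FILE="${LOG_FILE:-/tmp/nutch-weekly-crawl.log}"

# Crontab fields: minute hour day-of-month month day-of-week.
# "0 2 * * 0" means every Sunday at 02:00 (an arbitrary choice).
CRON_LINE="0 2 * * 0 cd $NUTCH_HOME && bin/crawl -i -D solr.server.url=http://IP:8080/solr/website/ urls/ crawl/ 5 >> $LOG_FILE 2>&1"

# Print the line so it can be reviewed before it is installed.
printf '%s\n' "$CRON_LINE"

# To append it to the current user's crontab, something like this should work:
#   (crontab -l 2>/dev/null; printf '%s\n' "$CRON_LINE") | crontab -
```

The cd matters because cron starts jobs in the user's home directory, while 
bin/crawl, urls/ and crawl/ are relative paths.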

> 
> How else can I automate the crawling and indexing to run periodically?

Well, simply wrap a script around it that deletes all records from the index 
and removes your crawl directory before calling the bin/crawl script. 
That said, I can hardly come up with any reason why someone would crawl and 
index, remove everything again, and then crawl and index once more.
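
If you do want such a wrapper, a minimal sketch could look like the one below. 
It assumes the core URL, seed directory, crawl directory and round count from 
your command, and it assumes the core accepts Solr's standard XML 
delete-by-query on its /update handler. The names run, recrawl and DRY_RUN are 
my own inventions. It only prints the commands by default; set DRY_RUN=0 to 
really execute them.

```shell
#!/bin/sh
# Hypothetical wrapper: wipe the index and the crawl directory, then recrawl.
# Defaults are assumptions taken from the command quoted in the question.
set -eu

SOLR_URL="${SOLR_URL:-http://IP:8080/solr/website}"
SEED_DIR="${SEED_DIR:-urls}"
CRAWL_DIR="${CRAWL_DIR:-crawl}"
ROUNDS="${ROUNDS:-5}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

recrawl() {
    # 1. Delete every document from the index and commit.
    run curl -fsS "$SOLR_URL/update?commit=true" \
        -H 'Content-Type: text/xml' \
        --data-binary '<delete><query>*:*</query></delete>'
    # 2. Remove the old crawl data (crawldb, linkdb, segments).
    run rm -rf "$CRAWL_DIR"
    # 3. Crawl and index again with the original command.
    run bin/crawl -i -D "solr.server.url=$SOLR_URL/" \
        "$SEED_DIR/" "$CRAWL_DIR/" "$ROUNDS"
}

recrawl
```

Because of set -e and curl's -f flag, a failed delete request stops the script 
before the crawl directory is removed, so a Solr outage does not cost you the 
existing crawl data.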

> 
> 
> Regards,
> 
> Mohammed Ajmal Rahman
> Tata Consultancy Services
> Mailto: [email protected]
> Website: http://www.tcs.com
