Sent from my Nokia phone
-----Original Message-----
From: remi tassing
Sent: 2012/04/24 06:57:04
Subject: Re: Good workflow for a regular re-indexing job

Have you read this?
http://wiki.apache.org/nutch/NutchTutorial/
You can put all commands in a shell script
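
As a rough illustration of such a script, here is a minimal sketch of the split workflow from the question: crawl first, check the result, and only then clear and rebuild the index. It is not taken from the tutorial. The names NUTCH, SOLR_URL and CRAWL_DIR, the function names, and the directory check are all assumptions for illustration. The solrindex argument order differs between Nutch 1.x releases (older ones take the linkdb as a plain positional argument, without -linkdb), so check the usage output of bin/nutch solrindex for your version before relying on it.

```shell
#!/bin/sh
# Sketch only: crawl, verify, and only then replace the Solr index.
# NUTCH, SOLR_URL and CRAWL_DIR are placeholders; adjust them for your install.
NUTCH="${NUTCH:-bin/nutch}"
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"
CRAWL_DIR="${CRAWL_DIR:-crawl}"

# Step 1: crawl WITHOUT -solr, so the live index is not touched yet.
run_crawl() {
    "$NUTCH" crawl urls -dir "$CRAWL_DIR" -depth 5 -topN 500
}

# Step 2: a simple sanity check (an assumption, not an official test):
# the crawl directory must contain a crawldb, a linkdb and at least one segment.
crawl_looks_ok() {
    [ -d "$CRAWL_DIR/crawldb" ] && [ -d "$CRAWL_DIR/linkdb" ] &&
        [ -n "$(ls -A "$CRAWL_DIR/segments" 2>/dev/null)" ]
}

# Step 3: delete every document from the Solr index and commit.
clear_index() {
    curl -sf "$SOLR_URL/update?commit=true" -H 'Content-Type: text/xml' \
        --data-binary '<delete><query>*:*</query></delete>'
}

# Step 4: push the crawl data to Solr.
# Older 1.x releases expect the linkdb path without the -linkdb flag.
build_index() {
    "$NUTCH" solrindex "$SOLR_URL" "$CRAWL_DIR/crawldb" \
        -linkdb "$CRAWL_DIR/linkdb" "$CRAWL_DIR"/segments/*
}

# Run the steps in order; stop before touching the index if anything looks wrong.
reindex() {
    run_crawl || { echo "crawl failed; index left untouched" >&2; return 1; }
    crawl_looks_ok || { echo "crawl output incomplete; index left untouched" >&2; return 1; }
    clear_index && build_index
}
```

Adding a final line that calls reindex runs the whole sequence, for example from cron. Note that the directory check would also pass on data left over from an earlier run, so it may be worth crawling into a fresh directory each time, or inspecting the output of bin/nutch readdb with -stats as an additional check.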

Remi

On Monday, April 23, 2012, Ian Piper wrote:

> Hi all,
>
> I have set up a process for crawling a client's website using nutch and
> then creating a Solr index. I have run into a workflow problem and would
> appreciate some guidance - preferably a tutorial of some sort.
>
> My current workflow is:
>
> 1. Clear out existing index
> 2. Run the crawl to create the database
> 3. Move the database to Solr to make a new index
>
> This (predictably) causes a problem if the crawl fails, as the index no
> longer exists and searching therefore fails. I am doing the crawl and index
> in one go using this command:
>
> bin/nutch crawl urls -solr http://[domain]/solr/ -depth 5 -topN 500
>
> I should probably split up the crawl and index processes:
>
> 1. Run the crawl
> 2. Check that it has run correctly and created the database
> 3. Clear out the old index
> 4. Create the new index from the database.
>
> However I can't find the information on the right syntax. Also, what is a
> good way to check whether the crawl has successfully run so that I can move
> on to removing the old index and creating the new one?
>
> Any guidance much appreciated.
>
>
> Ian.
> *-- *
> *Dr Ian Piper*
> Tellura Information Services - the web, document and information people
> Registered in England and Wales: 5076715, VAT Number: 874 2060 29
> http://www.tellura.co.uk/
> Creator of monickr: http://monickr.com
> 01926 813736 | 07973 156616
> *-- *
>
>
>
