Sent from my Nokia phone
-----Original Message-----
From: remi tassing
Sent: 2012/04/24 06:57:04
Subject: Re: Good workflow for a regular re-indexing job

Have you read this? http://wiki.apache.org/nutch/NutchTutorial/

You can put all commands in a shell script

Remi

On Monday, April 23, 2012, Ian Piper wrote:

> Hi all,
>
> I have set up a process for crawling a client's website using nutch and
> then creating a Solr index. I have run into a workflow problem and would
> appreciate some guidance - preferably a tutorial of some sort.
>
> My current workflow is:
>
> 1. Clear out existing index
> 2. Run the crawl to create the database
> 3. Move the database to Solr to make a new index
>
> This (predictably) causes a problem if the crawl fails, as the index no
> longer exists and searching therefore fails. I am doing the crawl and index
> in one go using this command:
>
> bin/nutch crawl urls -solr http://[domain]/solr/ -depth 5 -topN 500
>
> I should probably split up the crawl and index processes:
>
> 1. Run the crawl
> 2. Check that it has run correctly and created the database
> 3. Clear out the old index
> 4. Create the new index from the database.
>
> However I can't find the information on the right syntax. Also, what is a
> good way to check whether the crawl has successfully run so that I can move
> on to removing the old index and creating the new one?
>
> Any guidance much appreciated.
>
>
> Ian.
> --
> Dr Ian Piper
> Tellura Information Services - the web, document and information people
> Registered in England and Wales: 5076715, VAT Number: 874 2060 29
> http://www.tellura.co.uk/
> Creator of monickr: http://monickr.com
> 01926 813736 | 07973 156616
> --
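
As a rough illustration of the "put all commands in a shell script" suggestion, here is a minimal sketch of the four-step workflow Ian outlines: crawl without -solr, check the output, and only then clear and rebuild the index. Several things here are assumptions, not from the thread: the crawl directory name "crawl", the placeholder Solr URL, the function names crawl_looks_ok and reindex, and the idea that a crawldb plus at least one segment is a good enough sign of a successful crawl. The solrindex argument form differs between Nutch 1.x releases (linkdb as a positional argument in some, as -linkdb in others), so run bin/nutch solrindex with no arguments to see the usage for your version.

```shell
#!/bin/sh
# Sketch only: crawl first, replace the Solr index only if the crawl looks sane.
# Assumed (hypothetical) settings; override them via the environment.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"
CRAWL_DIR="${CRAWL_DIR:-crawl}"

# Succeeds only if the crawl directory has a crawldb and at least one segment.
# This is a heuristic, not an official Nutch health check.
crawl_looks_ok() {
    dir="$1"
    [ -d "$dir/crawldb" ] || return 1
    [ -d "$dir/segments" ] || return 1
    [ -n "$(ls -A "$dir/segments" 2>/dev/null)" ] || return 1
    return 0
}

reindex() {
    # 1. Run the crawl without -solr, so the live index is left untouched.
    bin/nutch crawl urls -dir "$CRAWL_DIR" -depth 5 -topN 500 || return 1

    # 2. Check that the crawl produced a database before touching the index.
    if ! crawl_looks_ok "$CRAWL_DIR"; then
        echo "Crawl output missing or empty; keeping the old index." >&2
        return 1
    fi

    # 3. Clear out the old index with a Solr delete-by-query.
    curl -sf "$SOLR_URL/update?commit=true" \
        -H 'Content-Type: text/xml' \
        --data-binary '<delete><query>*:*</query></delete>' || return 1

    # 4. Create the new index from the crawl data.
    #    Positional linkdb form shown; newer 1.x releases expect
    #    "-linkdb $CRAWL_DIR/linkdb" instead.
    bin/nutch solrindex "$SOLR_URL" "$CRAWL_DIR/crawldb" \
        "$CRAWL_DIR/linkdb" "$CRAWL_DIR"/segments/*
}
```

To use it, add a final line calling reindex and run the script from the Nutch install directory (for example from cron). Note that searches still return nothing between steps 3 and 4; the gain is that a failed crawl no longer wipes the index.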

