Hi, I think the best starting point would be: http://wiki.apache.org/nutch/Nutch_0.9_Crawl_Script_Tutorial You can modify the order of some of the steps.
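A rough sketch of such a loop is below. It is based on the commands quoted from the tutorial, with a few assumptions not stated in the thread: that your bin/nutch provides the standard inject/generate/fetch/parse/updatedb/invertlinks/solrindex commands (Nutch 1.x), that Solr is at http://localhost:8983/solr/, and that `generate` exits non-zero once no URLs are left to fetch, which is how the stock crawl script detects that the whole domain has been crawled. Paths and the -topN batch size are illustrative.

```shell
#!/bin/bash
# Sketch of a batched crawl-and-index loop for Nutch (assumptions above).

NUTCH=${NUTCH:-bin/nutch}
SOLR_URL=${SOLR_URL:-http://localhost:8983/solr/}
MAX_ROUNDS=${MAX_ROUNDS:-50}   # safety cap so the loop always terminates

crawl_loop() {
    for ((round = 1; round <= MAX_ROUNDS; round++)); do
        # Select up to 10000 of the highest-scoring unfetched URLs.
        if ! $NUTCH generate crawl/crawldb crawl/segments -topN 10000; then
            # generate failing is taken to mean: nothing left to fetch.
            echo "no more URLs to fetch; domain crawl complete"
            return 0
        fi
        # The newest segment is the one generate just created.
        segment=$(ls -d crawl/segments/2* | tail -1)

        $NUTCH fetch "$segment"
        $NUTCH parse "$segment"
        $NUTCH updatedb crawl/crawldb "$segment"

        # Index this batch into Solr now, instead of waiting for the
        # whole crawl to finish.
        $NUTCH invertlinks crawl/linkdb "$segment"
        $NUTCH solrindex "$SOLR_URL" crawl/crawldb crawl/linkdb "$segment"
    done
}

# Only run against a real Nutch installation.
if [ -x "$NUTCH" ]; then
    $NUTCH inject crawl/crawldb urls   # seed the crawldb once
    crawl_loop
fi
```

The key design point is indexing inside the loop: each round's segment is pushed to Solr as soon as it is parsed, so documents become searchable batch by batch rather than only after the entire crawl ends.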
On Thu, Aug 9, 2012 at 1:26 AM, aabbcc <[email protected]> wrote:
> Hi,
>
> my problem is that I have a domain (e.g. http://*.apache.org) and I want to
> crawl every document and page on this website and index them with Solr.
> I was able to do it using the basic Nutch crawl command:
>
> bin/nutch crawl urls -solr http://localhost:8983/solr/
>
> but the indexing part comes at the end of the process, so I have to wait
> for the whole crawl to end before I can access my data.
> I would like to create a script that cyclically crawls a certain number of
> pages (for example 10000) and then indexes them.
> In the Nutch tutorial wiki I found this:
>
> bin/nutch generate crawl/crawldb crawl/segments -topN 1000
> s2=`ls -d crawl/segments/2* | tail -1`
> echo $s2
>
> bin/nutch fetch $s2
> bin/nutch parse $s2
> bin/nutch updatedb crawl/crawldb $s2
>
> but I don't know how to make it stop when it has crawled the entire
> domain.
>
> Thanks for your help.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Nutch-script-to-crawl-a-whole-domain-tp3999975.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

