Hi, my problem is that I have a domain (e.g. http://*.apache.org) and I want to crawl every document and page on this website and index them with Solr. I was able to do it using the basic crawl command in Nutch:
  bin/nutch crawl urls -solr http://localhost:8983/solr/

but the indexing step only happens at the end of the process, so I have to wait for the whole crawl to finish before I can access my data. I would like to write a script that cyclically crawls a certain number of pages (for example 10000) and then indexes them. In the Nutch tutorial wiki I found this:

  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  s2=`ls -d crawl/segments/2* | tail -1`
  echo $s2
  bin/nutch fetch $s2
  bin/nutch parse $s2
  bin/nutch updatedb crawl/crawldb $s2

but I don't know how to make it stop once it has crawled the entire domain. Thanks for your help.

--
View this message in context: http://lucene.472066.n3.nabble.com/Nutch-script-to-crawl-a-whole-domain-tp3999975.html
Sent from the Nutch - User mailing list archive at Nabble.com.