Hi,
my problem is that I have a domain (e.g. http://*.apache.org) and I want to
crawl every document and page on this website and index them with Solr.
I was able to do this using the basic Nutch crawl command:
bin/nutch crawl urls -solr http://localhost:8983/solr/
but the indexing step only happens at the end of the process, so I have to
wait for the whole crawl to finish before I can access my data.
I would like to create a script that cyclically crawls a certain number of
pages (for example 10000) and then indexes them.
In the Nutch tutorial wiki I found this:
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
s2=`ls -d crawl/segments/2* | tail -1`
echo $s2
bin/nutch fetch $s2
bin/nutch parse $s2
bin/nutch updatedb crawl/crawldb $s2
but I don't know how to make it stop once it has crawled the entire
domain.
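A sketch of the loop I have in mind, in case it helps clarify the question: it assumes that `bin/nutch generate` exits non-zero once the crawldb has no more unfetched URLs (which seems to be what Nutch's own crawl script relies on), and that calling solrindex after each round is valid for my Nutch version (the exact solrindex arguments may differ between releases):

```shell
#!/bin/sh
# Iterative crawl sketch. Assumptions: "bin/nutch generate" returns a
# non-zero exit code when no URLs are left to fetch, and the solrindex
# argument order matches your Nutch version (check "bin/nutch solrindex").
NUTCH=${NUTCH:-bin/nutch}
SOLR_URL=${SOLR_URL:-http://localhost:8983/solr/}
TOPN=${TOPN:-10000}

crawl_round() {
  # generate a segment of at most TOPN URLs; fail (stop) when none remain
  "$NUTCH" generate crawl/crawldb crawl/segments -topN "$TOPN" || return 1
  segment=$(ls -d crawl/segments/2* | tail -1)
  "$NUTCH" fetch "$segment" &&
  "$NUTCH" parse "$segment" &&
  "$NUTCH" updatedb crawl/crawldb "$segment" &&
  "$NUTCH" solrindex "$SOLR_URL" crawl/crawldb "$segment"
}

main() {
  while crawl_round; do
    echo "round complete; continuing"
  done
  echo "crawl finished: no more URLs to fetch"
}

main "$@"
```

Is something along these lines the right approach, or does generate behave differently when the domain is exhausted?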
Thanks for your help.
--
View this message in context:
http://lucene.472066.n3.nabble.com/Nutch-script-to-crawl-a-whole-domain-tp3999975.html
Sent from the Nutch - User mailing list archive at Nabble.com.