Hi,
my problem is that I have a domain (e.g. http://*.apache.org) and I want to
crawl every document and page on this website and index them with Solr.
I was able to do this using the basic Nutch crawl command:
bin/nutch crawl urls -solr http://localhost:8983/solr/
but the indexing step only happens at the end of the process, so I have to
wait for the whole crawl to finish before I can access my data.
I would like to create a script that cyclically crawls a certain number of
pages (for example 10000) and then indexes them.
In the Nutch tutorial wiki I found this:
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
s2=`ls -d crawl/segments/2* | tail -1`
echo $s2
bin/nutch fetch $s2
bin/nutch parse $s2
bin/nutch updatedb crawl/crawldb $s2
but I don't know how to make it stop once it has crawled the entire
domain.
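A sketch of the loop I have in mind, in case it helps clarify the question: it assumes that `bin/nutch generate` exits non-zero once the crawldb has no more unfetched URLs (which seems to be what Nutch's own crawl script relies on), and that calling solrindex after each round is valid for my Nutch version (the exact solrindex arguments may differ between releases):

```shell
#!/bin/sh
# Iterative crawl sketch. Assumptions: "bin/nutch generate" returns a
# non-zero exit code when no URLs are left to fetch, and the solrindex
# argument order matches your Nutch version (check "bin/nutch solrindex").
NUTCH=${NUTCH:-bin/nutch}
SOLR_URL=${SOLR_URL:-http://localhost:8983/solr/}
TOPN=${TOPN:-10000}

crawl_round() {
  # generate a segment of at most TOPN URLs; fail (stop) when none remain
  "$NUTCH" generate crawl/crawldb crawl/segments -topN "$TOPN" || return 1
  segment=$(ls -d crawl/segments/2* | tail -1)
  "$NUTCH" fetch "$segment" &&
  "$NUTCH" parse "$segment" &&
  "$NUTCH" updatedb crawl/crawldb "$segment" &&
  "$NUTCH" solrindex "$SOLR_URL" crawl/crawldb "$segment"
}

main() {
  while crawl_round; do
    echo "round complete; continuing"
  done
  echo "crawl finished: no more URLs to fetch"
}

main "$@"
```

Is something along these lines the right approach, or does generate behave differently when the domain is exhausted?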
Thanks for your help.
--
View this message in context:
http://lucene.472066.n3.nabble.com/Nutch-script-to-crawl-a-whole-domain-tp3999975.html
Sent from the Nutch - User mailing list archive at Nabble.com.