Hi, I think the best starting point would be:
http://wiki.apache.org/nutch/Nutch_0.9_Crawl_Script_Tutorial
You can modify the order of those steps.
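
To make that concrete, here is a minimal sketch of such a loop. The paths (`crawl/crawldb`, `crawl/segments`, `crawl/linkdb`) and the Solr URL are assumptions taken from your mail and the tutorial; adjust them to your setup. It relies on `generate` exiting non-zero when no more URLs are due for fetching, which is how the loop ends once the whole domain is crawled:

```shell
#!/bin/bash
# Sketch of an incremental crawl-and-index loop (paths and the Solr URL
# are assumptions; adjust them to your installation). Each pass fetches
# at most 10000 pages and pushes them to Solr before the next round.
CRAWLDB=crawl/crawldb
SEGMENTS=crawl/segments
LINKDB=crawl/linkdb
SOLR=http://localhost:8983/solr/

while true; do
  # generate should exit non-zero (and create no segment) once there is
  # nothing left to fetch; if your Nutch version behaves differently,
  # check for a newly created segment instead
  bin/nutch generate $CRAWLDB $SEGMENTS -topN 10000 || break

  # pick the segment that generate just created (latest timestamped dir)
  SEG=`ls -d $SEGMENTS/2* | tail -1`

  bin/nutch fetch $SEG
  bin/nutch parse $SEG
  bin/nutch updatedb $CRAWLDB $SEG
  bin/nutch invertlinks $LINKDB -dir $SEGMENTS
  bin/nutch solrindex $SOLR $CRAWLDB $LINKDB $SEG
done
```

This way each batch of pages shows up in Solr as soon as its round finishes, instead of only after the entire crawl completes.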

On Thu, Aug 9, 2012 at 1:26 AM, aabbcc <[email protected]> wrote:

> Hi,
>
> my problem is that I have a domain (e.g. http://*.apache.org) and I want to
> crawl every document and page on this website and index them with Solr.
> I was able to do it using the basic command to crawl with nutch:
>
>     bin/nutch crawl urls -solr http://localhost:8983/solr/
>
> but the indexing part comes at the end of the process, so I have to wait
> for the whole crawl to end before I can access my data.
> I would like to create a script that cyclically crawls a certain number of
> pages (for example 10000) and then indexes them.
> In the nutch tutorial wiki I found this:
>
>     bin/nutch generate crawl/crawldb crawl/segments -topN 1000
>     s2=`ls -d crawl/segments/2* | tail -1`
>     echo $s2
>
>     bin/nutch fetch $s2
>     bin/nutch parse $s2
>     bin/nutch updatedb crawl/crawldb $s2
>
> but I don't know how to make it stop once it has crawled the entire
> domain.
>
> Thanks for your help.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Nutch-script-to-crawl-a-whole-domain-tp3999975.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
