The version of Nutch in trunk has a useful crawl script in the bin directory
which does all the typical steps of a crawl and sends the docs to Solr for
indexing at the end of each fetching round. The script is also more robust
and works in both local and deployed mode.
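
Invocation is roughly along these lines (a sketch only; check the script's
usage message for the exact arguments in the revision you are running):

    # seed dir, crawl dir, Solr URL, number of generate/fetch/parse rounds
    bin/crawl urls crawl http://localhost:8983/solr/ 10

Since it indexes after every round, documents show up in Solr as the crawl
progresses rather than only once everything has finished.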

HTH

Julien

On 9 August 2012 00:26, aabbcc <[email protected]> wrote:

> Hi,
>
> my problem is that I have a domain (e.g. http://*.apache.org) and I want to
> crawl every document and page on this website and index them with Solr.
> I was able to do it using the basic crawl command in Nutch:
>
>     bin/nutch crawl urls -solr http://localhost:8983/solr/
>
> but the indexing part comes at the end of the process, so I have to wait
> for the whole crawl to end before I can access my data.
> I would like to create a script that cyclically crawls a certain number of
> pages (for example 10000) and then indexes them.
> In the Nutch tutorial wiki I found this:
>
>     bin/nutch generate crawl/crawldb crawl/segments -topN 1000
>     s2=`ls -d crawl/segments/2* | tail -1`
>     echo $s2
>
>     bin/nutch fetch $s2
>     bin/nutch parse $s2
>     bin/nutch updatedb crawl/crawldb $s2
>
> but I don't know how to tell it to stop once it has crawled the entire
> domain.
>
> Thanks for your help.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Nutch-script-to-crawl-a-whole-domain-tp3999975.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
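
Regarding stopping once the whole domain is exhausted: if you want to keep the
manual steps instead, a rough (untested) sketch is to loop until generate
reports nothing left to fetch. The paths below assume the tutorial layout, and
the exact solrindex arguments vary a little between Nutch versions:

    #!/bin/bash
    # Sketch: keep crawling until generate produces no new segment.
    while true; do
      bin/nutch generate crawl/crawldb crawl/segments -topN 10000
      # generate exits with a non-zero status when no URLs are left to fetch
      if [ $? -ne 0 ]; then
        echo "Nothing left to fetch, stopping."
        break
      fi
      segment=`ls -d crawl/segments/2* | tail -1`
      bin/nutch fetch $segment
      bin/nutch parse $segment
      bin/nutch updatedb crawl/crawldb $segment
      bin/nutch invertlinks crawl/linkdb -dir crawl/segments
      bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb \
        -linkdb crawl/linkdb $segment
    done

The crawl script mentioned above automates essentially this loop, which is why
I'd suggest using it rather than rolling your own.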



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
