If you use the individual commands instead of the crawl command, you can tell Nutch to work on only a specific segment.
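For example, one crawl cycle broken into individual steps might look like the sketch below (assumptions: a Nutch 1.x layout with crawldb/linkdb/segments under ./crawl, and -topN 5 as in your crawl command; RUN=echo makes it print the plan instead of executing, so you can check it before running for real with RUN=):

```shell
# Sketch of one crawl cycle using individual Nutch commands instead of
# "bin/nutch crawl". Set RUN= (empty) to actually execute the steps.
RUN=${RUN-echo}

crawl_cycle() {
    $RUN bin/nutch inject crawl/crawldb urls
    $RUN bin/nutch generate crawl/crawldb crawl/segments -topN 5
    # pick the newest segment (the one generate just created) and work
    # on it only -- segment names are timestamps, so sorting works
    segment=$(ls -d crawl/segments/20* 2>/dev/null | sort | tail -1)
    $RUN bin/nutch fetch "$segment"
    $RUN bin/nutch parse "$segment"
    $RUN bin/nutch updatedb crawl/crawldb "$segment"
    # invertlinks must run before solrindex, which reads the linkdb
    $RUN bin/nutch invertlinks crawl/linkdb "$segment"
    $RUN bin/nutch solrindex http://localhost:8983/solr/ \
        crawl/crawldb crawl/linkdb "$segment"
}
```

Since solrindex is called with just the one new segment, only that segment's documents are sent to Solr, which avoids the full reindex you are seeing.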
2011/12/30 Magnús Skúlason <[email protected]>

> Hi,
>
> I am using Nutch to crawl a set of web sites and index them to Solr,
> using the default crawl command:
>
> bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5
>
> I decided to use the default command since the sites I crawl are
> relatively few (< 1000).
>
> I have noticed that after each crawl, Nutch reindexes every segment to
> Solr, not only the ones fetched/parsed during the last crawl. Is this
> normal behaviour? If so, is there any way to turn it off, i.e. can I
> add a parameter to the command to tell Nutch to only reindex new
> content?
>
> If not, what would be the easiest way to modify this behaviour?
>
> One solution that comes to mind would be:
>
> bin/nutch crawl urls -depth 3 -topN 5
> find crawl/segments/ -maxdepth 1 -mmin -300 -type d -name '20*' -exec \
>     runtime/local/bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl crawl/linkdb {} \;
>
> i.e. skip indexing in the crawl command and call the Solr indexing
> only on segments changed in the last X minutes (here 300, the
> estimated duration of my crawl). Would this produce the desired result?
> If I do this, will I have to invert links before calling the solrindex
> command, or does the crawl command take care of that?
>
> An additional question: how can I get a list of fetched URLs from a
> segment?
>
> best regards,
> Magnus
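On the last question: one way to list the URLs in a segment is to dump it with readseg and pull out the URL lines. A sketch (readseg flags per Nutch 1.x; the segment path and output directory are placeholders):

```shell
# Dump only the crawl_fetch part of a segment to a text file, suppressing
# the other segment directories (content, parse data, etc.).
dump_segment() {   # $1 = segment dir, $2 = output dir for the text dump
    bin/nutch readseg -dump "$1" "$2" \
        -nocontent -nogenerate -noparse -noparsedata -noparsetext
}

# readseg writes a file named "dump" into the output dir; each record
# starts with a "URL:: <url>" line, so extract and de-duplicate those.
extract_urls() {   # $1 = the "dump" file produced by dump_segment
    grep '^URL::' "$1" | awk '{print $2}' | sort -u
}
```

Usage would be something like:

    dump_segment crawl/segments/20111230120000 segdump
    extract_urls segdump/dump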

