Hi,

I am using Nutch to crawl a set of web sites and index them into Solr, using the default crawl command:

    bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5
I decided to use the default command since the sites I crawl are relatively few (< 1000). I have noticed that after each crawl, Nutch reindexes every segment to Solr, not only the ones fetched / parsed during the last crawl. Is this normal behaviour? If so, is there any way to turn this off, i.e. can I add a parameter to the command to tell Nutch to only reindex new content? If not, what would be the easiest way to modify this behaviour?

One solution that comes to mind would be:

    bin/nutch crawl urls -depth 3 -topN 5
    find crawl/segments/ -maxdepth 1 -mmin -300 -type d -name '20*' \
        -exec runtime/local/bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl crawl/linkdb {} \;

i.e. skip indexing in the crawl command and call the Solr indexing only on segments changed in the last X minutes (here 300, the estimated duration of my crawl). Would this produce the desired results? If I do this, will I have to invert links before calling the solrindex command, or does the crawl command take care of that?

An additional question: how can I get a list of fetched URLs from a segment?

best regards,
Magnus
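P.S. For concreteness, here is a dry-run sketch of the segment-selection step I have in mind, tested against mock segment directories. The `crawl/crawldb` and `crawl/linkdb` paths and the 300-minute window are assumptions for my setup, and the nutch commands are only printed, not executed:

```shell
#!/bin/sh
# Dry-run sketch of the two-step approach (assumes the standard Nutch 1.x
# crawl/ directory layout; the mock dirs below only exercise the find filter).
set -e

# Mock crawl/segments layout with one fresh and one stale segment:
tmp=$(mktemp -d)
mkdir -p "$tmp/segments/20120101120000" "$tmp/segments/20120101150000"
# Backdate one segment so it falls outside the 300-minute window:
touch -t 202001010000 "$tmp/segments/20120101120000"

# Select segment directories modified in the last 300 minutes
# (300 being my estimated crawl time):
recent=$(find "$tmp/segments/" -maxdepth 1 -mmin -300 -type d -name '20*')
echo "selected: $recent"

# Print (not run) the per-segment commands I would chain after the crawl;
# the crawldb/linkdb paths here are guesses for my layout:
for seg in $recent; do
  echo "bin/nutch invertlinks crawl/linkdb $seg"
  echo "bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb $seg"
done
```

Running this selects only the freshly created mock segment, which is the behaviour I am hoping the real pipeline would have.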

