If you use the individual commands instead of the crawl command, you can tell Nutch to work on only a specific segment.
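For example, one crawl cycle broken into individual steps might look like the sketch below (assumptions: a Nutch 1.x layout with crawldb/linkdb/segments under ./crawl, and -topN 5 as in your crawl command; RUN=echo makes it print the plan instead of executing, so you can check it before running for real with RUN=):

```shell
# Sketch of one crawl cycle using individual Nutch commands instead of
# "bin/nutch crawl". Set RUN= (empty) to actually execute the steps.
RUN=${RUN-echo}

crawl_cycle() {
    $RUN bin/nutch inject crawl/crawldb urls
    $RUN bin/nutch generate crawl/crawldb crawl/segments -topN 5
    # pick the newest segment (the one generate just created) and work
    # on it only -- segment names are timestamps, so sorting works
    segment=$(ls -d crawl/segments/20* 2>/dev/null | sort | tail -1)
    $RUN bin/nutch fetch "$segment"
    $RUN bin/nutch parse "$segment"
    $RUN bin/nutch updatedb crawl/crawldb "$segment"
    # invertlinks must run before solrindex, which reads the linkdb
    $RUN bin/nutch invertlinks crawl/linkdb "$segment"
    $RUN bin/nutch solrindex http://localhost:8983/solr/ \
        crawl/crawldb crawl/linkdb "$segment"
}
```

Since solrindex is called with just the one new segment, only that segment's documents are sent to Solr, which avoids the full reindex you are seeing.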
2011/12/30 Magnús Skúlason <[email protected]>

> Hi,
>
> I am using Nutch to crawl a set of web sites and index them to Solr,
> using the default crawl command:
>
> bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5
>
> I decided to use the default command since the sites I crawl are
> relatively few (< 1000).
>
> I have noticed that after each crawl, Nutch reindexes every segment to
> Solr, not only the ones fetched/parsed during the last crawl. Is this
> normal behaviour? If so, is there any way to turn it off, i.e. can I
> add a parameter to the command to tell Nutch to only reindex new
> content?
>
> If not, what would be the easiest way to modify this behaviour?
>
> One solution that comes to mind would be:
>
> bin/nutch crawl urls -depth 3 -topN 5
> find crawl/segments/ -maxdepth 1 -mmin -300 -type d -name '20*' -exec \
>     runtime/local/bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl crawl/linkdb {} \;
>
> i.e. skip indexing in the crawl command and call the Solr indexing
> only on segments changed in the last X minutes (here 300, the
> estimated duration of my crawl). Would this produce the desired result?
> If I do this, will I have to invert links before calling the solrindex
> command, or does the crawl command take care of that?
>
> An additional question: how can I get a list of fetched URLs from a
> segment?
>
> best regards,
> Magnus
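On the last question: one way to list the URLs in a segment is to dump it with readseg and pull out the URL lines. A sketch (readseg flags per Nutch 1.x; the segment path and output directory are placeholders):

```shell
# Dump only the crawl_fetch part of a segment to a text file, suppressing
# the other segment directories (content, parse data, etc.).
dump_segment() {   # $1 = segment dir, $2 = output dir for the text dump
    bin/nutch readseg -dump "$1" "$2" \
        -nocontent -nogenerate -noparse -noparsedata -noparsetext
}

# readseg writes a file named "dump" into the output dir; each record
# starts with a "URL:: <url>" line, so extract and de-duplicate those.
extract_urls() {   # $1 = the "dump" file produced by dump_segment
    grep '^URL::' "$1" | awk '{print $2}' | sort -u
}
```

Usage would be something like:

    dump_segment crawl/segments/20111230120000 segdump
    extract_urls segdump/dump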

