Hi, I'm trying to build an "incremental" crawler, using the various Nutch crawl tools (generate + fetch/parse + updatedb etc.). By "incremental" I mean I want crawled pages to show up quickly in the index (instead of waiting till the end of the crawl). So, I'd like to index as soon as I have fetched a segment.
The requirement to invoke updatedb and invertlinks at the end of each fetch+parse phase (before solrindex and before the next generate) can slow down this crawl. Instead, here is what I'm thinking of doing for each segment (after fetch+parse):

1) Invoke updatedb and invertlinks against "local" crawldb and linkdb folders (within the segment).
2) Invoke solrindex using these "local" crawldb and linkdb folders.
3) Repeat steps 1-2 for a few pre-generated segments (I would have pre-generated several mutually exclusive segments before step 1).
4) *Merge* these local crawldbs and linkdbs into the "master" crawldb and linkdb.
5) Proceed to generate the next set of segments from the merged "master" crawldb and linkdb.

Do you see any problem with this approach? More specifically:

a) Is an updatedb (to a local crawldb) followed by a mergedb (to the master crawldb) the same as doing an updatedb directly to the master crawldb?
b) Similarly, is an invertlinks (to a local linkdb) followed by a mergelinkdb (to the master linkdb) the same as doing an invertlinks directly to the master linkdb?

Thanks in advance!

Regards,
Safdar
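
For concreteness, steps 1-4 of the plan above can be sketched as a dry run that only prints the commands involved and executes nothing. This is a rough sketch under stated assumptions, not a tested setup: the paths, the Solr URL, the segment names, and the helper names are all hypothetical, and the argument order (especially for solrindex, which differs between Nutch 1.x releases) should be checked against the usage output of the installed version.

```python
"""Dry run of the per-segment plan: prints Nutch commands, runs nothing."""

# Hypothetical locations; adjust to the actual crawl layout.
NUTCH = "bin/nutch"
SOLR_URL = "http://localhost:8983/solr"
MASTER_CRAWLDB = "crawl/crawldb"
MASTER_LINKDB = "crawl/linkdb"


def local_dbs(segment):
    """Return the assumed 'local' crawldb and linkdb paths inside a segment."""
    return f"{segment}/local_crawldb", f"{segment}/local_linkdb"


def segment_commands(segment):
    """Steps 1-2: update the local dbs for one fetched+parsed segment, then index it."""
    crawldb, linkdb = local_dbs(segment)
    return [
        [NUTCH, "updatedb", crawldb, segment],
        [NUTCH, "invertlinks", linkdb, segment],
        # Assumed form; some older releases take the linkdb positionally instead.
        [NUTCH, "solrindex", SOLR_URL, crawldb, "-linkdb", linkdb, segment],
    ]


def merge_commands(segments, merged_crawldb, merged_linkdb):
    """Step 4: merge the master dbs and every local db into new output dirs."""
    crawldbs = [local_dbs(s)[0] for s in segments]
    linkdbs = [local_dbs(s)[1] for s in segments]
    return [
        [NUTCH, "mergedb", merged_crawldb, MASTER_CRAWLDB] + crawldbs,
        [NUTCH, "mergelinkdb", merged_linkdb, MASTER_LINKDB] + linkdbs,
    ]


def plan(segments, merged_crawldb="crawl/crawldb_merged",
         merged_linkdb="crawl/linkdb_merged"):
    """Steps 1-4 for a batch of pre-generated segments, as a flat command list."""
    commands = []
    for segment in segments:  # step 3: repeat steps 1-2 for each segment
        commands.extend(segment_commands(segment))
    commands.extend(merge_commands(segments, merged_crawldb, merged_linkdb))
    return commands


if __name__ == "__main__":
    # Hypothetical segment names, for illustration only.
    for cmd in plan(["crawl/segments/20120101000001",
                     "crawl/segments/20120101000002"]):
        print(" ".join(cmd))
```

The merge step writes to new output directories rather than overwriting the master dbs in place; swapping the merged output in as the new master before the next generate is left out of the sketch.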

