Hi,

I'm trying to build an "incremental" crawler, using the various Nutch crawl
tools (generate + fetch/parse + updatedb etc.). By "incremental" I mean that
crawled pages should show up in the index quickly, instead of only at the
end of the crawl. So, I'd like to index each segment as soon as I have
fetched it.

Having to run updatedb and invertlinks at the end of each fetch+parse phase
(before solrindex and before the next generate) can slow down this crawl.
Instead, here is what I'm thinking of doing for each segment (after
fetch+parse):
1) Invoke updatedb and invertlinks against "local" crawldb and linkdb
folders (within the segment).
2) Invoke solrindex using these "local" crawldb and linkdb folders.
3) Repeat steps 1-2 for a few pre-generated segments (I would have
pre-generated several mutually exclusive segments before step 1).
4) *Merge* these local crawldbs and linkdbs into the "master" crawldb and
linkdb.
5) Proceed to generate the next set of segments from the merged "master"
crawldb and linkdb.
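
To make the steps above concrete, here is a rough Python sketch that only
builds the command lines for steps 1-4 (it does not run Nutch). The paths,
the "_merged" output names, the Solr URL and the helper names are all
hypothetical, and the argument order (in particular the "-linkdb" form of
solrindex) is an assumption that may differ between Nutch versions, so
please check the usage output of each bin/nutch command before relying on
it.

```python
from pathlib import PurePosixPath

NUTCH = "bin/nutch"  # assumed launcher path, relative to the Nutch home


def segment_commands(segment, solr_url):
    """Steps 1-2: update "local" dbs inside one segment, then index it."""
    seg = PurePosixPath(segment)
    local_crawldb = str(seg / "crawldb")  # hypothetical "local" crawldb
    local_linkdb = str(seg / "linkdb")    # hypothetical "local" linkdb
    return [
        [NUTCH, "updatedb", local_crawldb, str(seg)],
        [NUTCH, "invertlinks", local_linkdb, str(seg)],
        # Assumed argument form; some versions take the linkdb positionally.
        [NUTCH, "solrindex", solr_url, local_crawldb,
         "-linkdb", local_linkdb, str(seg)],
    ]


def merge_commands(master_crawldb, master_linkdb, segments):
    """Step 4: merge the master dbs and all local dbs into new output dirs."""
    segs = [PurePosixPath(s) for s in segments]
    return [
        [NUTCH, "mergedb", master_crawldb + "_merged", master_crawldb]
        + [str(s / "crawldb") for s in segs],
        [NUTCH, "mergelinkdb", master_linkdb + "_merged", master_linkdb]
        + [str(s / "linkdb") for s in segs],
    ]


def plan_commands(master_crawldb, master_linkdb, segments, solr_url):
    """Steps 1-4 in order, for a batch of pre-generated, fetched segments."""
    commands = []
    for segment in segments:  # step 3: repeat steps 1-2 per segment
        commands.extend(segment_commands(segment, solr_url))
    commands.extend(merge_commands(master_crawldb, master_linkdb, segments))
    return commands


if __name__ == "__main__":
    for cmd in plan_commands(
        "crawl/crawldb",
        "crawl/linkdb",
        ["crawl/segments/seg1", "crawl/segments/seg2"],
        "http://localhost:8983/solr",
    ):
        print(" ".join(cmd))
```

The merge steps write to new "_merged" directories instead of overwriting
the master dbs in place; swapping them in as the new master before step 5
(the next generate) is left out of the sketch.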

Do you see any problem with this approach? More specifically:
a) is an updatedb (to a local crawldb) followed by a mergedb (to the master
crawldb) the same as doing an updatedb directly to the master crawldb? And
similarly,
b) is an invertlinks (to a local linkdb) followed by a mergelinkdb (to the
master linkdb) the same as doing an invertlinks directly to the master
linkdb?

Thanks in advance!

Regards,
Safdar
