Hi,

Just checking if anyone could comment on my post below. :)

Thanks in advance.

Safdar

On Mon, Jun 11, 2012 at 8:10 AM, Ali Safdar Kureishy <[email protected]> wrote:

> Hi,
>
> I'm trying to build an "incremental" crawler using the various Nutch
> crawl tools (generate + fetch/parse + updatedb etc.). By "incremental" I
> mean that I want crawled pages to show up quickly in the index (instead of
> waiting until the end of the crawl). So I'd like to index as soon as I have
> fetched a segment.
>
> The requirement to invoke updatedb and invertlinks at the end of each
> fetch+parse phase (before solrindex and before the next generate) can slow
> down this crawl. Instead, here is what I'm thinking of doing for each
> segment (after fetch+parse):
> 1) Invoke updatedb and invertlinks to "local" crawldb and linkdb folders
> (within the segment).
> 2) Invoke solrindex using these "local" crawldb and linkdb folders.
> 3) Do steps 1-2 for a few pre-generated segments (I would have
> pre-generated several mutually exclusive segments before step 1).
> 4) *Merge* these local crawldbs and linkdbs into the "master" crawldb and
> linkdb.
> 5) Proceed to generate the next set of segments from the merged "master"
> crawldb and linkdb.
>
> Do you see any problem with this approach? More specifically:
> a) Is an updatedb (to a local crawldb) followed by a mergedb (to the
> master crawldb) the same as doing an updatedb directly to the master
> crawldb? And similarly,
> b) Is an invertlinks (to a local linkdb) followed by a mergelinkdb (to the
> master linkdb) the same as doing an invertlinks directly to the master
> linkdb?
>
> Thanks in advance!
>
> Regards,
> Safdar
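
To make steps 1-4 of the quoted message concrete, here is a rough dry-run sketch in Python that only builds and prints the command lines; it does not run Nutch. The folder names `local_crawldb` and `local_linkdb`, the `bin/nutch` launcher path, the Solr URL, the segment names and the merged output folders are all hypothetical. The `solrindex` argument order is assumed from Nutch 1.5 and may differ in other versions, so please check it against the usage message of your install. The sketch says nothing about whether the merged result matches a direct update, which is the open question.

```python
"""Dry-run sketch of the per-segment plan: builds commands, runs nothing."""

NUTCH = "bin/nutch"  # assumed launcher path


def segment_commands(segment, solr_url, nutch=NUTCH):
    """Steps 1-2: update local dbs inside the segment, then index from them."""
    local_crawldb = f"{segment}/local_crawldb"  # hypothetical folder name
    local_linkdb = f"{segment}/local_linkdb"    # hypothetical folder name
    return [
        [nutch, "updatedb", local_crawldb, segment],
        [nutch, "invertlinks", local_linkdb, segment],
        # Argument order assumed from Nutch 1.5; verify for your version.
        [nutch, "solrindex", solr_url, local_crawldb,
         "-linkdb", local_linkdb, segment],
    ]


def merge_commands(segments, master_crawldb, master_linkdb,
                   out_crawldb, out_linkdb, nutch=NUTCH):
    """Step 4: merge the master and all local dbs into new output folders."""
    crawldbs = [master_crawldb] + [f"{s}/local_crawldb" for s in segments]
    linkdbs = [master_linkdb] + [f"{s}/local_linkdb" for s in segments]
    return [
        [nutch, "mergedb", out_crawldb] + crawldbs,
        [nutch, "mergelinkdb", out_linkdb] + linkdbs,
    ]


if __name__ == "__main__":
    # Hypothetical pre-generated segments (step 3 repeats steps 1-2 for each).
    segments = ["crawl/segments/seg1", "crawl/segments/seg2"]
    for seg in segments:
        for cmd in segment_commands(seg, "http://localhost:8983/solr"):
            print(" ".join(cmd))
    for cmd in merge_commands(segments, "crawl/crawldb", "crawl/linkdb",
                              "crawl/crawldb_merged", "crawl/linkdb_merged"):
        print(" ".join(cmd))
```

The merge step writes to separate output folders rather than onto the master in place; swapping the merged folders in as the new master before step 5 would be a separate step not shown here.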

