Hi,

Just checking if anyone could comment on my post below. :)

Thanks in advance.

Safdar


On Mon, Jun 11, 2012 at 8:10 AM, Ali Safdar Kureishy wrote:

> Hi,
>
> I'm trying to build an "incremental" crawler, using the various Nutch
> crawl tools (generate + fetch/parse + updatedb etc.). By "incremental" I
> mean I want crawled pages to show up quickly in the index (instead of
> waiting till the end of the crawl). So, I'd like to index as soon as I have
> fetched a segment.
>
> Having to run updatedb and invertlinks at the end of each fetch+parse
> phase (before solrindex and before the next generate) can slow down the
> crawl. Instead, here is what I'm thinking of doing for each segment
> (after fetch+parse):
> 1) Run updatedb and invertlinks against "local" crawldb and linkdb folders
> (within the segment).
> 2) Run solrindex using these "local" crawldb and linkdb folders.
> 3) Repeat steps 1-2 for a few pre-generated segments (I would have
> pre-generated several mutually exclusive segments before step 1).
> 4) *Merge* these local crawldbs and linkdbs into the "master" crawldb and
> linkdb.
> 5) Generate the next set of segments from the merged "master" crawldb and
> linkdb.
>
> Do you see any problem with this approach? More specifically:
> a) Is an updatedb (to a local crawldb) followed by a mergedb (to the
> master crawldb) equivalent to an updatedb run directly against the master
> crawldb? And similarly,
> b) Is an invertlinks (to a local linkdb) followed by a mergelinkdb (to the
> master linkdb) equivalent to an invertlinks run directly against the
> master linkdb?
>
> Thanks in advance!
>
> Regards,
> Safdar
>
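
To make the proposal above concrete, here is a rough dry-run sketch in Python
that only builds and prints the commands for steps 1-4; it executes nothing.
All directory names (`crawl/crawldb_merged`, the per-segment `crawldb` and
`linkdb` folders, the segment names) and the Solr URL are hypothetical. The
argument order is an assumption based on Nutch 1.x usage strings, and the
solrindex form in particular differs between releases, so check the usage
output of `bin/nutch` for your version. The sketch also assumes mergedb and
mergelinkdb write to a new output directory rather than updating the master
in place, and that the master crawldb and linkdb already exist.

```python
# Dry-run sketch: builds the Nutch commands for the proposed workflow and
# prints them. Nothing is executed. Directory names are hypothetical.

NUTCH = "bin/nutch"  # assumed path to the Nutch launcher script


def segment_commands(segment, solr_url):
    """Steps 1-2: update and index one segment against its own local dbs."""
    local_crawldb = f"{segment}/crawldb"  # hypothetical "local" crawldb
    local_linkdb = f"{segment}/linkdb"    # hypothetical "local" linkdb
    return [
        [NUTCH, "updatedb", local_crawldb, segment],
        [NUTCH, "invertlinks", local_linkdb, segment],
        # Assumed argument form; solrindex usage varies across releases.
        [NUTCH, "solrindex", solr_url, local_crawldb,
         "-linkdb", local_linkdb, segment],
    ]


def merge_commands(crawl_dir, segments):
    """Step 4: merge the master and local dbs into new output directories."""
    crawldbs = [f"{crawl_dir}/crawldb"] + [f"{s}/crawldb" for s in segments]
    linkdbs = [f"{crawl_dir}/linkdb"] + [f"{s}/linkdb" for s in segments]
    return [
        [NUTCH, "mergedb", f"{crawl_dir}/crawldb_merged"] + crawldbs,
        [NUTCH, "mergelinkdb", f"{crawl_dir}/linkdb_merged"] + linkdbs,
    ]


def plan(crawl_dir, segments, solr_url):
    """Steps 1-4 for a batch of pre-generated segments, in order."""
    commands = []
    for segment in segments:  # step 3: repeat steps 1-2 per segment
        commands.extend(segment_commands(segment, solr_url))
    commands.extend(merge_commands(crawl_dir, segments))
    return commands


if __name__ == "__main__":
    example_segments = ["crawl/segments/seg1", "crawl/segments/seg2"]
    for cmd in plan("crawl", example_segments, "http://localhost:8983/solr"):
        print(" ".join(cmd))
```

The merged output directories would then have to replace the master crawldb
and linkdb before the next generate (step 5); that swap is left out here.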
