Re: Isn't there redundant/wasteful duplication between nutch crawldb and solr index?

Markus Jelsma Sat, 16 Jul 2011 05:17:47 -0700

Because Nutch is a crawler intended to write to more than one search engine.
Besides, the crawldb is gone, as a flat file, in trunk. Also, Solr is really
slow when it comes to updating millions of records; the crawldb isn't when
split over multiple machines.


> Hello,
> 
> I had this draft lurking for a while now, and before archiving for personal
> reference I wondered if it's accurate, and if you recommend posting it to
> the wiki.
> 
> Nutch maintains a crawldb (and linkdb, for that matter) of the urls it
> crawled, the fetch status, and the date. This data is maintained beyond
> fetch so that pages may be re-crawled after the re-crawling period. At
> the same time Solr maintains an inverted index of all the fetched pages.
> It'd seem more efficient if nutch relied on the index instead of
> maintaining its own crawldb, to avoid storing the same url twice. [BUT THAT'S
> JUST A KEY/ID, NOT WASTE AT ALL, WOULD ALSO END UP THE SAME IN SOLR]
