Because Nutch is a crawler that is meant to write to more than one search engine. Besides, the crawldb as a flat file is gone in trunk. Also, Solr is really slow when it comes to updating millions of records; the crawldb isn't, when split over multiple machines.
> Hello,
>
> I had this draft lurking for a while now, and before archiving it for
> personal reference I wondered if it's accurate, and if you recommend
> posting it to the wiki.
>
> Nutch maintains a crawldb (and linkdb, for that matter) of the URLs it
> crawled, the fetch status, and the date. This data is maintained beyond
> fetch so that pages may be re-crawled after a re-crawling period. At
> the same time Solr maintains an inverted index of all the fetched pages.
> It'd seem more efficient if Nutch relied on the index instead of
> maintaining its own crawldb, to avoid storing the same URL twice. [BUT
> THAT'S JUST A KEY/ID, NOT WASTE AT ALL, WOULD ALSO END UP THE SAME IN SOLR]
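
To make the quoted description concrete, here is a rough sketch of what a crawldb entry amounts to: a record keyed by URL that holds the fetch status and fetch date, which is enough to decide when a page is due for re-crawling. The names `CrawlEntry`, `due_for_recrawl`, the status labels, and the 30-day interval are all hypothetical and chosen for illustration; this is not Nutch's actual data model or API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class CrawlEntry:
    """Hypothetical stand-in for one crawldb record (the URL is the key)."""
    status: str                            # illustrative labels: "fetched" / "unfetched"
    fetched_at: Optional[datetime] = None  # None if the page was never fetched


def due_for_recrawl(entry: CrawlEntry, now: datetime, interval: timedelta) -> bool:
    """A page is due if it was never fetched or its re-crawl interval has elapsed."""
    if entry.fetched_at is None:
        return True
    return now - entry.fetched_at >= interval


# The crawldb pictured as a plain mapping from URL to record.
crawldb = {
    "http://example.com/": CrawlEntry("fetched", datetime(2024, 1, 1)),
    "http://example.com/new": CrawlEntry("fetched", datetime(2024, 1, 25)),
    "http://example.com/pending": CrawlEntry("unfetched"),
}

now = datetime(2024, 2, 1)
interval = timedelta(days=30)  # assumed re-crawl period, for illustration only

due = [url for url, entry in crawldb.items() if due_for_recrawl(entry, now, interval)]
print(due)
```

Note that the URL appears only once, as the key, which is the point of the inline comment above: a search index would need the same key to identify the document anyway.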

