I would not recommend using the "crawl" command for large crawls, because:

1. Tuning the underlying Hadoop jobs is not possible at all.
2. Incremental crawling is also difficult, because you cannot control
   the individual processes/steps separately (see the sketch below).
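For comparison, running the steps yourself looks roughly like this. This
is a minimal sketch for the Nutch 1.x command-line interface; the
crawldb/segments/linkdb paths, the -topN value, and the Solr URL are
placeholders you would adapt to your own setup:

    # inject the seed URLs into the crawl database
    bin/nutch inject crawldb urls

    # one generate/fetch/parse/update cycle; repeat this block
    # for incremental crawling
    bin/nutch generate crawldb segments -topN 1000
    s=`ls -d segments/2* | tail -1`   # pick the newest segment
    bin/nutch fetch $s
    bin/nutch parse $s
    bin/nutch updatedb crawldb $s

    # build the link database and push everything to the index
    bin/nutch invertlinks linkdb -dir segments
    bin/nutch solrindex http://localhost:8983/solr/ crawldb linkdb segments/*

Because each step is a separate invocation, you can re-run only the
cycle you need and pass Hadoop options per job, which is exactly what
the all-in-one crawl command doesn't let you do.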
On Sat, Feb 26, 2011 at 9:58 AM, firespin <[email protected]> wrote:
> I would like to do a large crawl and let Nutch index up to 10-100
> million web pages. I know from http://wiki.apache.org/nutch/NutchTutorial
> that the "nutch crawl" command will do all the steps on its own, but
> the page calls it intranet crawling. The page also says the crawl
> command has limitations, but doesn't say what they are.
> My questions are: can I use the crawl command to index 10-100 million
> pages from many different sites? And what are the limitations of the
> crawl command?

