I would not recommend using the Crawl command for large crawls, because:
1. Tuning Hadoop is not possible at all
2. Incremental crawling is also pretty difficult, because you can't control
the individual processes/steps
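To make the second point concrete: instead of the all-in-one crawl command, you can run each step yourself and repeat the cycle. A rough sketch of one generate/fetch/update round with the Nutch 1.x CLI might look like this (the crawl/ directory layout, the -topN value, and the Solr URL are assumptions you'd adjust to your own setup):

```shell
#!/bin/sh
# One incremental crawl cycle using the individual Nutch 1.x commands
# instead of the all-in-one "bin/nutch crawl".
# CRAWL layout and SOLR URL are assumptions, not fixed defaults.

CRAWL=crawl                        # holds crawldb, linkdb, segments
SOLR=http://localhost:8983/solr    # assumed Solr instance for indexing

# Inject seed URLs into the crawldb (only needed on the first run)
bin/nutch inject $CRAWL/crawldb urls

# Generate a new segment with the top-scoring URLs due for fetching
bin/nutch generate $CRAWL/crawldb $CRAWL/segments -topN 50000
SEGMENT=`ls -d $CRAWL/segments/* | tail -1`

# Fetch and parse the segment, then fold the results back into the crawldb
bin/nutch fetch $SEGMENT
bin/nutch parse $SEGMENT
bin/nutch updatedb $CRAWL/crawldb $SEGMENT

# Rebuild the link database and push the new segment to Solr
bin/nutch invertlinks $CRAWL/linkdb -dir $CRAWL/segments
bin/nutch solrindex $SOLR $CRAWL/crawldb $CRAWL/linkdb $SEGMENT
```

Repeating the generate/fetch/parse/updatedb part of the cycle is what gives you incremental crawling, and each step can be tuned (or scheduled as its own Hadoop job) independently.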

On Sat, Feb 26, 2011 at 9:58 AM, firespin <[email protected]> wrote:

> I would like to do a large crawl and let Nutch run to index up to
> 10-100 million webpages. I know that on
> http://wiki.apache.org/nutch/NutchTutorial the nutch crawl command
> will do all the steps with just that one command, but the page calls it
> intranet crawling. The page also says the crawl command has
> limitations, but doesn't tell what they are.
> My questions are: can I use the crawl command for indexing 10-100
> million pages from many different sites? And what are the
> limitations of the crawl command?
>