Hi Puneet,

On Tue, Feb 14, 2012 at 5:12 AM, Puneet Pandey <[email protected]> wrote:
> I have started using nutch recently.
> As I understand nutch crawling is a cyclic process
> inject->generate->fetch->parse->update

Yes, this is typically what you would execute.

> 1. When does parse start when I use the "crawl" command line. Is it after
> all the urls have been fetched in the segment?

That depends on the settings you specify in nutch-site.xml; by default, when using the crawl command, parsing is done as a separate process after fetching.

> What if I want to parse the content as soon as it has been fetched?

Change your settings in nutch-site.xml to override the defaults, then rebuild the project.

> 2. Is it possible to run two fetches in parallel? Suppose I generate 2
> segments is it possible to run fetch on seg1 and seg2 in parallel?

Yes, this is possible; you would set the number of threads in your fetcher to run this task in parallel.

> 3. Can I limit the number of urls per host per segment in the generate step
> itself?

Yes, please check nutch-default.xml for the generator properties. I don't have the settings off the top of my head, but this is possible.

> Puneet

-- 
*Lewis*
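For reference, the overrides discussed above would all go in conf/nutch-site.xml. A minimal sketch, assuming Nutch 1.x property names as they appear in nutch-default.xml (the values shown here are illustrative, not recommendations; check your version's nutch-default.xml for the exact names and defaults):

```xml
<configuration>

  <!-- 1. Parse each page as soon as it is fetched, instead of running
       a separate parse step after the fetch completes. -->
  <property>
    <name>fetcher.parse</name>
    <value>true</value>
  </property>

  <!-- 2. Number of fetcher threads to run in parallel. -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>20</value>
  </property>

  <!-- 3. Cap the number of urls selected per host (or per domain/ip,
       depending on generate.count.mode) in the generate step. -->
  <property>
    <name>generate.max.count</name>
    <value>100</value>
  </property>
  <property>
    <name>generate.count.mode</name>
    <value>host</value>
  </property>

</configuration>
```

As Lewis notes, after changing these settings you would rebuild the project so the crawl command picks up the overrides.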

