My questions/doubts are inline.

On Tue, Feb 14, 2012 at 4:06 PM, Lewis John Mcgibbney <[email protected]> wrote:
> Hi Puneet,
>
> On Tue, Feb 14, 2012 at 5:12 AM, Puneet Pandey <[email protected]> wrote:
>
> > I have started using Nutch recently.
> > As I understand, Nutch crawling is a cyclic process:
> > inject -> generate -> fetch -> parse -> update
>
> Yes, this is typically what you would execute.
>
> > 1. When does parse start when I use the "crawl" command line? Is it after
> > all the URLs have been fetched in the segment?
>
> It depends on what settings you specify in nutch-site.xml; by default, parsing
> is done as a separate process (after fetching) when using the crawl command.

Suppose I submit 10K URLs in a segment for crawl. Does parsing of the content start as soon as the first URL is available (i.e. fetched), or does parsing start only after all 10K have been fetched? For my use case, I want parsing to start on the URLs as soon as they are available, without waiting for the fetch of the others to complete.

> > What if I want to parse the content as soon as it has been fetched?
>
> Change your settings in nutch-site.xml to override the defaults, then
> rebuild the project.
>
> > 2. Is it possible to run two fetches in parallel? Suppose I generate 2
> > segments; is it possible to run fetch on seg1 and seg2 in parallel?
>
> Yes, this is possible; you would set the number of threads in your fetcher
> to run this task in parallel.

I need to crawl 100K URLs every day. I have a separate process which produces the URLs for me, but it is a somewhat time-consuming process. I do not want to wait for all the URLs to be generated before starting the Nutch crawl. What I want is to start the Nutch fetch process whenever a batch of URLs (say 10K) becomes available. Is it possible to inject batch 2 of 10K URLs while the fetch for batch 1 is still running? If yes, when will Nutch pick up the next batch for crawl?

Also, I do not want to crawl any of the links from the fetched pages. The only URLs that need to be crawled are the ones generated by my process.
How do I ensure this? Is there a config setting with which crawling of links present in fetched pages can be disabled?

> > 3. Can I limit the number of URLs per host per segment in the generate
> > step itself?
>
> Yes, please check out nutch-default.xml for the generator properties. I don't
> have the settings off the top of my head, but this is possible.
>
> > Puneet
>
> --
> *Lewis*
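For question 1, parsing during fetch is controlled by the `fetcher.parse` property. A minimal override for this (a sketch; the property goes inside the `<configuration>` element of nutch-site.xml, and its default value varies between Nutch releases) might look like:

```xml
<!-- nutch-site.xml: parse pages as they are fetched, instead of as a separate step -->
<property>
  <name>fetcher.parse</name>
  <value>true</value>
  <description>If true, the fetcher parses content as it is fetched,
  rather than leaving parsing to a separate "parse" step on the segment.</description>
</property>
```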
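For crawling only the injected URLs and ignoring links discovered in fetched pages, the `db.update.additions.allowed` property can be set to false so that updatedb does not add newly discovered outlinks to the CrawlDB. A sketch, assuming a 1.x-style nutch-site.xml:

```xml
<!-- nutch-site.xml: do not add outlinks discovered during parsing to the CrawlDB -->
<property>
  <name>db.update.additions.allowed</name>
  <value>false</value>
  <description>If false, updatedb only updates URLs already in the CrawlDB;
  newly discovered outlinks are not added, so only injected URLs are ever
  generated and fetched.</description>
</property>
```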
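For question 3, recent 1.x releases expose `generate.max.count` together with `generate.count.mode` for capping fetchlist size per host (older releases used `generate.max.per.host` instead; check your version's nutch-default.xml). A sketch, with 100 as an illustrative limit:

```xml
<!-- nutch-site.xml: cap the number of URLs per host in each generated fetchlist -->
<property>
  <name>generate.count.mode</name>
  <value>host</value>
</property>
<property>
  <name>generate.max.count</name>
  <value>100</value>
  <description>At most 100 URLs per host in a single segment; -1 for unlimited.</description>
</property>
```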

