Re: Build a pipeline using nutch

Markus Jelsma Wed, 15 Feb 2012 14:49:44 -0800

> my questions/doubts are inline
> 
> On Tue, Feb 14, 2012 at 4:06 PM, Lewis John Mcgibbney <[email protected]> wrote:
> > Hi Puneet,
> > 
> > On Tue, Feb 14, 2012 at 5:12 AM, Puneet Pandey <[email protected]> wrote:
> > > I have started using nutch recently.
> > > As I understand nutch crawling is a cyclic process
> > > inject->generate->fetch->parse->update
> > 
> > Yes this is typically what you would execute.
> > 
> > > 1. When does parse start when I use the "crawl" command line. Is it
> > > after all the urls have been fetched in the segment?
> > 
> > Depends on what settings you specify in nutch-site.xml, by default
> > parsing is done as a separate process (after fetching) when using the
> > crawl command.
> 
> Suppose i submitted 10K urls in a segment for crawl. Does the parsing of
> the content start as soon as the first URL is available (i.e. fetched) or
> the parsing starts only after all 10K have been fetched. For my use case i
> want parsing to start on the urls as soon as they are available w/o waiting
> for fetch on others to complete.


Don't use the crawl command; it runs fetching and parsing as separate jobs. You
need to enable fetcher.parse to parse fetched files immediately.
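
A minimal sketch of that override in conf/nutch-site.xml, assuming the
fetcher.parse property as described in nutch-default.xml (it defaults to
false there). With it enabled, the fetch job parses each document as it is
fetched, so no separate parse step should be needed for that segment:

```xml
<configuration>
  <!-- Assumption: overrides the nutch-default.xml value of false. -->
  <property>
    <name>fetcher.parse</name>
    <value>true</value>
    <description>Parse content inside the fetch job instead of in a
    separate parse job.</description>
  </property>
</configuration>
```

This only takes effect when the individual steps (generate, fetch, updatedb)
are run by hand rather than through the crawl command.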

> 
> > > What if I want to the parse
> > > the content as soon as it has been fetched?
> > 
> > Change your settings in nutch-site.xml to override the defaults, then
> > rebuild the project.
> > 
> > > 2. Is it possible to run two fetches in parallel? Suppose I generate 2
> > > segments is it possible to run fetch on seg1 and seg2 in parallel?
> > 
> > Yes this is possible, you would set the number of threads in your fetcher
> > to run this task in parallel.
> > 
> I need to crawl 100K urls everyday. I have separate process which produces
> the urls for me, but it is a bit time taking process. I do not want to wait
> for all the urls to be generated and then start the nutch crawl. What i
> want is to start the nutch fetch process whenever I have received a batch
> of urls (say 10K) available. Is it possible to inject the batch2 of 10K
> urls while fetch for batch1 is still running? If yes, when will nutch pick
> the next batch for crawl.

This is only possible when you use the freegen command, which generates a
segment directly from a text file of URLs instead of from the crawldb. Also,
I'd not recommend running concurrent jobs in local mode.

> 
> Also, I do not want to crawl any of the links from the fetched pages. The
> only urls that need to be crawled are the ones generated by my process. How
> do i ensure this. Is there any config setting with which we can disable
> crawl of links present in fetched pages?

Update the crawldb with additions disabled, for example by passing
-noAdditions to updatedb.
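
One way to disable additions is in configuration. A minimal sketch for
conf/nutch-site.xml, assuming the db.update.additions.allowed property as
listed in nutch-default.xml (default true); check the name against your
Nutch version:

```xml
<configuration>
  <!-- Assumption: with this set to false, updatedb only updates URLs
       already in the crawldb and does not add newly discovered links. -->
  <property>
    <name>db.update.additions.allowed</name>
    <value>false</value>
  </property>
</configuration>
```

With this in place, outlinks found on fetched pages should never become
fetch candidates; only the URLs you supply yourself are crawled.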

> 
> > > 3. Can I limit the number of urls per host per segment in the generate
> > > step itself?
> > 
> > Yes, please check out nutch-default.xml for generator properties, I don't
> > have the settings off my head but this is possible.
> > 
> > > Puneet
> > 
> > --
> > *Lewis*
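
On question 3, a minimal sketch of the generator settings in
conf/nutch-site.xml, assuming the generate.max.count and generate.count.mode
properties from nutch-default.xml. Property names have changed between
releases, so verify them against your version. The value 100 is only a
placeholder:

```xml
<configuration>
  <!-- Assumption: maximum number of URLs per host (or domain) in a single
       fetchlist; -1 means unlimited. 100 is a placeholder value. -->
  <property>
    <name>generate.max.count</name>
    <value>100</value>
  </property>
  <!-- Assumption: how URLs are grouped when counting towards
       generate.max.count. "host" counts per host name. -->
  <property>
    <name>generate.count.mode</name>
    <value>host</value>
  </property>
</configuration>
```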
