Hi Puneet,

On Tue, Feb 14, 2012 at 5:12 AM, Puneet Pandey <[email protected]> wrote:
> I have started using nutch recently.
> As I understand nutch crawling is a cyclic process
> inject->generate->fetch->parse->update

Yes, this is typically what you would execute.

> 1. When does parse start when I use the "crawl" command line. Is it after
> all the urls have been fetched in the segment?

That depends on the settings you specify in nutch-site.xml; by default, when using the crawl command, parsing is done as a separate process after fetching.

> What if I want to parse the content as soon as it has been fetched?

Change your settings in nutch-site.xml to override the defaults, then rebuild the project.

> 2. Is it possible to run two fetches in parallel? Suppose I generate 2
> segments is it possible to run fetch on seg1 and seg2 in parallel?

Yes, this is possible; you would set the number of threads in your fetcher to run this task in parallel.

> 3. Can I limit the number of urls per host per segment in the generate step
> itself?

Yes, please check nutch-default.xml for the generator properties. I don't have the settings off the top of my head, but this is possible.

> Puneet

-- 
*Lewis*
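For reference, the overrides discussed above would all go in conf/nutch-site.xml. A minimal sketch, assuming Nutch 1.x property names as they appear in nutch-default.xml (the values shown here are illustrative, not recommendations; check your version's nutch-default.xml for the exact names and defaults):

```xml
<configuration>

  <!-- 1. Parse each page as soon as it is fetched, instead of running
       a separate parse step after the fetch completes. -->
  <property>
    <name>fetcher.parse</name>
    <value>true</value>
  </property>

  <!-- 2. Number of fetcher threads to run in parallel. -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>20</value>
  </property>

  <!-- 3. Cap the number of urls selected per host (or per domain/ip,
       depending on generate.count.mode) in the generate step. -->
  <property>
    <name>generate.max.count</name>
    <value>100</value>
  </property>
  <property>
    <name>generate.count.mode</name>
    <value>host</value>
  </property>

</configuration>
```

As Lewis notes, after changing these settings you would rebuild the project so the crawl command picks up the overrides.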

