My questions/doubts are inline.

On Tue, Feb 14, 2012 at 4:06 PM, Lewis John Mcgibbney <[email protected]> wrote:
> Hi Puneet,
>
> On Tue, Feb 14, 2012 at 5:12 AM, Puneet Pandey <[email protected]> wrote:
>
> > I have started using Nutch recently.
> > As I understand, Nutch crawling is a cyclic process:
> > inject -> generate -> fetch -> parse -> update
>
> Yes, this is typically what you would execute.
>
> > 1. When does parse start when I use the "crawl" command line? Is it after
> > all the URLs have been fetched in the segment?
>
> It depends on what settings you specify in nutch-site.xml; by default, parsing
> is done as a separate process (after fetching) when using the crawl command.

Suppose I submit 10K URLs in a segment for crawl. Does parsing of the content start as soon as the first URL is available (i.e. fetched), or does parsing start only after all 10K have been fetched? For my use case, I want parsing to start on the URLs as soon as they are available, without waiting for the fetch of the others to complete.

> > What if I want to parse the content as soon as it has been fetched?
>
> Change your settings in nutch-site.xml to override the defaults, then
> rebuild the project.
>
> > 2. Is it possible to run two fetches in parallel? Suppose I generate 2
> > segments; is it possible to run fetch on seg1 and seg2 in parallel?
>
> Yes, this is possible; you would set the number of threads in your fetcher
> to run this task in parallel.

I need to crawl 100K URLs every day. I have a separate process which produces the URLs for me, but it is a somewhat time-consuming process. I do not want to wait for all the URLs to be generated before starting the Nutch crawl. What I want is to start the Nutch fetch process whenever a batch of URLs (say 10K) becomes available. Is it possible to inject batch 2 of 10K URLs while the fetch for batch 1 is still running? If yes, when will Nutch pick up the next batch for crawl?

Also, I do not want to crawl any of the links from the fetched pages. The only URLs that need to be crawled are the ones generated by my process.
How do I ensure this? Is there a config setting with which crawling of links present in fetched pages can be disabled?

> > 3. Can I limit the number of URLs per host per segment in the generate
> > step itself?
>
> Yes, please check out nutch-default.xml for the generator properties. I don't
> have the settings off the top of my head, but this is possible.
>
> > Puneet
>
> --
> *Lewis*
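For question 1, parsing during fetch is controlled by the `fetcher.parse` property. A minimal override for this (a sketch; the property goes inside the `<configuration>` element of nutch-site.xml, and its default value varies between Nutch releases) might look like:

```xml
<!-- nutch-site.xml: parse pages as they are fetched, instead of as a separate step -->
<property>
  <name>fetcher.parse</name>
  <value>true</value>
  <description>If true, the fetcher parses content as it is fetched,
  rather than leaving parsing to a separate "parse" step on the segment.</description>
</property>
```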
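For crawling only the injected URLs and ignoring links discovered in fetched pages, the `db.update.additions.allowed` property can be set to false so that updatedb does not add newly discovered outlinks to the CrawlDB. A sketch, assuming a 1.x-style nutch-site.xml:

```xml
<!-- nutch-site.xml: do not add outlinks discovered during parsing to the CrawlDB -->
<property>
  <name>db.update.additions.allowed</name>
  <value>false</value>
  <description>If false, updatedb only updates URLs already in the CrawlDB;
  newly discovered outlinks are not added, so only injected URLs are ever
  generated and fetched.</description>
</property>
```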
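For question 3, recent 1.x releases expose `generate.max.count` together with `generate.count.mode` for capping fetchlist size per host (older releases used `generate.max.per.host` instead; check your version's nutch-default.xml). A sketch, with 100 as an illustrative limit:

```xml
<!-- nutch-site.xml: cap the number of URLs per host in each generated fetchlist -->
<property>
  <name>generate.count.mode</name>
  <value>host</value>
</property>
<property>
  <name>generate.max.count</name>
  <value>100</value>
  <description>At most 100 URLs per host in a single segment; -1 for unlimited.</description>
</property>
```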

