---------- Forwarded message ----------
From: Puneet Pandey <[email protected]>
Date: Thu, Feb 16, 2012 at 4:34 PM
Subject: Re: Build a pipeline using nutch
To: [email protected]




On Thu, Feb 16, 2012 at 3:50 PM, Markus Jelsma <[email protected]> wrote:

>
> > > I need to crawl 100K urls every day. I have a separate process
> > > which produces the urls for me, but it is a fairly slow process. I
> > > do not want to wait for all the urls to be generated and only then
> > > start the nutch crawl. What I want is to start the nutch fetch
> > > process whenever a batch of urls (say 10K) becomes available. Is it
> > > possible to inject batch2 of 10K urls while the fetch for batch1 is
> > > still running? If yes, when will nutch pick up the next batch for
> > > crawling?
> >
> > markus: This is only possible when you use the freegen command. Also,
> > I'd not recommend running concurrent jobs in local mode.
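If I understand correctly, a freegen-based loop would look roughly like
this (a sketch, not a tested pipeline -- the batch and segment directory
names are placeholders, and the exact options should be checked against
`bin/nutch freegen`'s usage output):

```shell
# Sketch: as each batch of urls arrives, build a segment directly from it
# with freegen (bypassing the crawldb/generate step), then fetch it.
for batch in batches/batch1 batches/batch2; do
  bin/nutch freegen "$batch" crawl/segments      # segment from the url list
  segment=$(ls -d crawl/segments/* | tail -1)    # newest segment
  bin/nutch fetch "$segment"                     # parses too if fetcher.parse=true
done
```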
> >
> > puneetp: I am planning to run the jobs on a hadoop cluster. Suppose I
> > am running two concurrent fetch jobs; will politeness be taken care of
> > across the jobs?
>
> No. Each job is independent.
>
> >
> > Also, is it possible to set up a pipeline where I just keep injecting
> > urls (say in batches of 10K) and generator/freegen/"something else"
> > keeps feeding them to the fetcher, respecting all the politeness?
>
> You can only maintain politeness in a single regular fetcher with host
> or domain queues set up.
>
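For reference, the per-fetcher politeness Markus mentions is driven by
queue settings in nutch-site.xml -- something along these lines (property
names as in the Nutch 1.x defaults; the values here are only
illustrative):

```xml
<!-- Group fetch queues by host so requests to one host are serialized. -->
<property>
  <name>fetcher.queue.mode</name>
  <value>byHost</value>
</property>
<!-- Minimum delay (seconds) between requests to the same queue. -->
<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
</property>
```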

Does this mean that if I want to run concurrent fetchers (while respecting
politeness) I would have to ensure that all the urls of a given domain end
up in one and only one fetcher job?
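If that is the case, one way to guarantee it would be to partition each
incoming url batch by host before handing it to a job, so every url of a
host always lands in the same fetcher job. A minimal sketch (my own
illustration in Python, not Nutch code -- Nutch has its own partitioning,
e.g. the partition.url.mode setting):

```python
from urllib.parse import urlparse

def hash_host(host):
    # Stable hash: Python's built-in hash() is randomized per process,
    # so it cannot be used to route hosts consistently across runs.
    return sum(ord(c) * 31 ** i for i, c in enumerate(host))

def partition_urls(urls, num_jobs):
    """Assign each url to a job by hashing its host, so all urls of one
    host go to exactly one job and per-host politeness holds per job."""
    jobs = [[] for _ in range(num_jobs)]
    for url in urls:
        host = urlparse(url).netloc.lower()
        jobs[hash_host(host) % num_jobs].append(url)
    return jobs

urls = [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.org/x",
]
jobs = partition_urls(urls, 2)
# Both example.com urls necessarily land in the same job.
```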


>
> >
> > On Thu, Feb 16, 2012 at 1:01 AM, Markus Jelsma <[email protected]>
> > wrote:
> > > > my questions/doubts are inline
> > > >
> > > > On Tue, Feb 14, 2012 at 4:06 PM, Lewis John Mcgibbney <
> > > > [email protected]> wrote:
> > > > > Hi Puneet,
> > > > >
> > > > > On Tue, Feb 14, 2012 at 5:12 AM, Puneet Pandey
> > > > > <[email protected]> wrote:
> > > > > > I have started using nutch recently.
> > > > > > As I understand it, nutch crawling is a cyclic process:
> > > > > > inject -> generate -> fetch -> parse -> update
> > > > >
> > > > > Yes, this is typically what you would execute.
> > > > >
> > > > > > 1. When does parsing start when I use the "crawl" command? Is
> > > > > > it after all the urls in the segment have been fetched?
> > > > >
> > > > > Depends on what settings you specify in nutch-site.xml; by
> > > > > default, parsing is done as a separate process (after fetching)
> > > > > when using the crawl command.
> > > >
> > > > Suppose I submitted 10K urls in a segment for crawling. Does
> > > > parsing of the content start as soon as the first url is available
> > > > (i.e. fetched), or does parsing start only after all 10K have been
> > > > fetched? For my use case I want parsing to start on the urls as
> > > > soon as they are available, without waiting for the fetch of the
> > > > others to complete.
> > >
> > > Don't use the crawl command; it runs fetching and parsing as
> > > separate jobs. You need to enable fetcher.parse to parse fetched
> > > files immediately.
> > >
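For completeness, the fetcher.parse switch mentioned above is set in
nutch-site.xml (property name as in the Nutch 1.x defaults; setting it to
true makes the fetcher parse each page as it is fetched instead of in a
separate parse job):

```xml
<property>
  <name>fetcher.parse</name>
  <value>true</value>
  <description>Parse content while fetching, instead of running a
  separate parse job after the fetch completes.</description>
</property>
```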
> > > > > > What if I want to parse the content as soon as it has been
> > > > > > fetched?
> > > > >
> > > > > Change your settings in nutch-site.xml to override the defaults,
> > > > > then rebuild the project.
> > > > >
> > > > > > 2. Is it possible to run two fetches in parallel? Suppose I
> > > > > > generate 2 segments; is it possible to run fetch on seg1 and
> > > > > > seg2 in parallel?
> > > > >
> > > > > Yes, this is possible; you would set the number of threads in
> > > > > your fetcher to run this task in parallel.
> > > >
> > > > I need to crawl 100K urls every day. I have a separate process
> > > > which produces the urls for me, but it is a fairly slow process. I
> > > > do not want to wait for all the urls to be generated and only then
> > > > start the nutch crawl. What I want is to start the nutch fetch
> > > > process whenever a batch of urls (say 10K) becomes available. Is
> > > > it possible to inject batch2 of 10K urls while the fetch for
> > > > batch1 is still running? If yes, when will nutch pick up the next
> > > > batch for crawling?
> > >
> > > This is only possible when you use the freegen command. Also, I'd
> > > not recommend running concurrent jobs in local mode.
> > >
> > > > Also, I do not want to crawl any of the links from the fetched
> > > > pages. The only urls that need to be crawled are the ones
> > > > generated by my process. How do I ensure this? Is there any config
> > > > setting with which we can disable crawling of links present in
> > > > fetched pages?
> > >
> > > Update the crawldb with additions disabled.
> > >
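If it helps, "additions disabled" can be expressed either as a flag on
the updatedb step or as a property in nutch-site.xml (names as in the
Nutch 1.x defaults; worth double-checking against your version):

```xml
<!-- Equivalent to passing -noAdditions to bin/nutch updatedb: only urls
     already in the crawldb are updated; newly discovered links from
     fetched pages are not added. -->
<property>
  <name>db.update.additions.allowed</name>
  <value>false</value>
</property>
```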
> > > > > > 3. Can I limit the number of urls per host per segment in the
> > > > > > generate step itself?
> > > > >
> > > > > Yes, please check out nutch-default.xml for generator
> > > > > properties; I don't have the settings off the top of my head,
> > > > > but this is possible.
> > > > >
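For anyone searching later: the generator properties Lewis refers to
appear to be generate.max.count and generate.count.mode in the Nutch 1.x
defaults (the value 100 below is just an example):

```xml
<!-- Cap the number of urls per host (or domain/ip, depending on
     generate.count.mode) that generate puts into one segment. -->
<property>
  <name>generate.max.count</name>
  <value>100</value>
</property>
<property>
  <name>generate.count.mode</name>
  <value>host</value>
</property>
```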
> > > > > > Puneet
> > > > >
> > > > > --
> > > > > *Lewis*
>
