In my use case there will be a lot of URLs for the same host. Nutch will do the scheduling for me, respecting all the politeness rules. Also, I can plug in my parser to post-process the received data. Yes, I could write my own Java HttpClient fetcher, but then to make it fast I would have to make it distributed... exactly what Nutch offers.
Puneet

2012/2/16 Magnús Skúlason <[email protected]>:
> As it sounds to me, it's not obvious that you would want to use Nutch to
> deliver this functionality. What is it that you hope to get out of
> Nutch?
>
> Why not just write a simple Java process using HttpClient to fetch the
> pages from your other process? Or even wget them? And extract the
> content.
>
> best regards,
> Magnus
>
> On Wed, Feb 15, 2012 at 7:40 PM, Markus Jelsma <[email protected]> wrote:
> > Feb 14, 2012 at 4:06 PM, Lewis John Mcgibbney <[email protected]> wrote:
> >> > Hi Puneet,
> >> >
> >> > On Tue, Feb 14, 2012 at 5:12 AM, Puneet Pandey <[email protected]> wrote:
> >> > > I have started using Nutch recently.
> >> > > As I understand it, Nutch crawling is a cyclic process:
> >> > > inject -> generate -> fetch -> parse -> update
> >> >
> >> > Yes, this is typically what you would execute.
> >> >
> >> > > 1. When does parse start when I use the "crawl" command line? Is it
> >> > > after all the URLs have been fetched in the segment?
> >> >
> >> > Depends on what settings you specify in nutch-site.xml; by default
> >> > parsing is done as a separate process (after fetching) when using the
> >> > crawl command.
> >>
> >> Suppose I submitted 10K URLs in a segment for crawl. Does the parsing of
> >> the content start as soon as the first URL is available (i.e. fetched), or
> >> does parsing start only after all 10K have been fetched? For my use case I
> >> want parsing to start on the URLs as soon as they are available, without
> >> waiting for the fetch of the others to complete.
> >
> > Don't use the crawl command; it has fetching and parsing as separate jobs. You
> > need to enable fetcher.parse to parse fetched files immediately.
> >
> >> > > What if I want to parse
> >> > > the content as soon as it has been fetched?
> >> >
> >> > Change your settings in nutch-site.xml to override the defaults, then
> >> > rebuild the project.
> >> >
> >> > > 2. Is it possible to run two fetches in parallel? Suppose I generate 2
> >> > > segments; is it possible to run fetch on seg1 and seg2 in parallel?
> >> >
> >> > Yes, this is possible; you would set the number of threads in your fetcher
> >> > to run this task in parallel.
> >> >
> >> I need to crawl 100K URLs every day. I have a separate process which produces
> >> the URLs for me, but it is a bit of a time-consuming process. I do not want
> >> to wait for all the URLs to be generated and then start the Nutch crawl.
> >> What I want is to start the Nutch fetch process whenever I have a batch of
> >> URLs (say 10K) available. Is it possible to inject batch 2 of 10K URLs
> >> while the fetch for batch 1 is still running? If yes, when will Nutch pick
> >> up the next batch for crawl?
> >
> > This is only possible when you use the freegen command. Also, I'd not
> > recommend running concurrent jobs in local mode.
> >
> >> Also, I do not want to crawl any of the links from the fetched pages. The
> >> only URLs that need to be crawled are the ones generated by my process. How
> >> do I ensure this? Is there any config setting with which we can disable
> >> crawling of links present in fetched pages?
> >
> > Update the crawldb with additions disabled.
> >
> >> > > 3. Can I limit the number of URLs per host per segment in the generate
> >> > > step itself?
> >> >
> >> > Yes, please check out nutch-default.xml for generator properties. I don't
> >> > have the settings off the top of my head, but this is possible.
> >> >
> >> > > Puneet
> >> >
> >> > --
> >> > Lewis
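For anyone finding this thread later: the settings discussed above all map to properties that can be overridden in conf/nutch-site.xml. The sketch below is only illustrative — the property names come from the nutch-default.xml of the Nutch 1.x line, and the values (e.g. the per-host limit of 1000) are placeholders, not recommendations; check the nutch-default.xml shipped with your version before copying anything:

```xml
<!-- Sketch of a nutch-site.xml overriding the defaults discussed in this
     thread. Values are illustrative; verify names against your version's
     nutch-default.xml. -->
<configuration>
  <!-- Parse each page immediately after it is fetched, in the same job,
       instead of running a separate parse step per segment. -->
  <property>
    <name>fetcher.parse</name>
    <value>true</value>
  </property>
  <!-- Do not add outlinks discovered in fetched pages to the crawldb:
       only the URLs you inject yourself ever get scheduled. -->
  <property>
    <name>db.update.additions.allowed</name>
    <value>false</value>
  </property>
  <!-- Cap the generator at 1000 URLs per host in a single segment. -->
  <property>
    <name>generate.max.count</name>
    <value>1000</value>
  </property>
  <!-- Count the limit above per host (alternatives: domain, ip). -->
  <property>
    <name>generate.count.mode</name>
    <value>host</value>
  </property>
</configuration>
```

With additions disabled in the crawldb update, the freegen-based batch flow Markus describes stays restricted to exactly the URL lists your external process produces.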

