Hi,

Just a related question: does it make a big difference to fetch and parse directly, rather than fetch everything first and then parse? I was under the impression that they yield the same end result...
Remi

On Wednesday, February 15, 2012, Markus Jelsma <[email protected]> wrote:
>
>> my questions/doubts are inline
>>
>> On Tue, Feb 14, 2012 at 4:06 PM, Lewis John Mcgibbney
>> <[email protected]> wrote:
>> > Hi Puneet,
>> >
>> > On Tue, Feb 14, 2012 at 5:12 AM, Puneet Pandey <[email protected]>
>> > wrote:
>> > > I have started using Nutch recently.
>> > > As I understand it, Nutch crawling is a cyclic process:
>> > > inject -> generate -> fetch -> parse -> update
>> >
>> > Yes, this is typically what you would execute.
>> >
>> > > 1. When does parse start when I use the "crawl" command line? Is it
>> > > after all the URLs have been fetched in the segment?
>> >
>> > It depends on what settings you specify in nutch-site.xml; by default,
>> > parsing is done as a separate process (after fetching) when using the
>> > crawl command.
>>
>> Suppose I submitted 10K URLs in a segment for crawl. Does the parsing
>> of the content start as soon as the first URL is available (i.e.
>> fetched), or does parsing start only after all 10K have been fetched?
>> For my use case I want parsing to start on the URLs as soon as they are
>> available, without waiting for the fetch on the others to complete.
>
> Don't use the crawl command; it has fetching and parsing as separate
> jobs. You need to enable fetcher.parse to parse fetched files
> immediately.
>
>> > > What if I want to parse
>> > > the content as soon as it has been fetched?
>> >
>> > Change your settings in nutch-site.xml to override the defaults, then
>> > rebuild the project.
>> >
>> > > 2. Is it possible to run two fetches in parallel? Suppose I
>> > > generate 2 segments; is it possible to run fetch on seg1 and seg2
>> > > in parallel?
>> >
>> > Yes, this is possible; you would set the number of threads in your
>> > fetcher to run this task in parallel.
>>
>> I need to crawl 100K URLs every day. I have a separate process which
>> produces the URLs for me, but it is a somewhat time-consuming process.
>> I do not want to wait for all the URLs to be generated before starting
>> the Nutch crawl. What I want is to start the Nutch fetch process
>> whenever I have received a batch of URLs (say 10K). Is it possible to
>> inject batch 2 of 10K URLs while the fetch for batch 1 is still
>> running? If yes, when will Nutch pick up the next batch for crawl?
>
> This is only possible when you use the freegen command. Also, I'd not
> recommend running concurrent jobs in local mode.
>
>> Also, I do not want to crawl any of the links from the fetched pages.
>> The only URLs that need to be crawled are the ones generated by my
>> process. How do I ensure this? Is there any config setting with which
>> we can disable the crawl of links present in fetched pages?
>
> Update the crawldb with additions disabled.
>
>> > > 3. Can I limit the number of URLs per host per segment in the
>> > > generate step itself?
>> >
>> > Yes, please check nutch-default.xml for the generator properties. I
>> > don't have the settings off the top of my head, but this is possible.
>> >
>> > > Puneet
>> >
>> > --
>> > *Lewis*
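For anyone following the thread: the cycle Puneet describes maps onto individual Nutch 1.x commands that can be run step by step instead of via the monolithic crawl command. A minimal sketch of the usual invocations; the crawl/ and urls/ paths are placeholders for your own layout:

  # Seed the crawldb with the URL list under urls/
  bin/nutch inject crawl/crawldb urls

  # Select up to 10K URLs from the crawldb into a fresh segment
  bin/nutch generate crawl/crawldb crawl/segments -topN 10000
  # Segment names are timestamps, so the last one listed is the newest
  SEGMENT=crawl/segments/$(ls crawl/segments | tail -1)

  # Fetch, parse, and fold the results back into the crawldb
  bin/nutch fetch $SEGMENT
  bin/nutch parse $SEGMENT
  bin/nutch updatedb crawl/crawldb $SEGMENT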
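The fetcher.parse switch Markus mentions is an override in nutch-site.xml; with it set, the separate parse job above goes away because each page is parsed as soon as it is fetched. The property name comes from nutch-default.xml:

  <!-- Parse inside the fetcher so each page is parsed as soon as it
       is fetched, rather than in a separate job after the fetch -->
  <property>
    <name>fetcher.parse</name>
    <value>true</value>
  </property>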
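For the batching question, freegen builds a ready-to-fetch segment directly from a directory of URL files, skipping inject and generate, so a new batch can be prepared while an earlier fetch is still running. A sketch, assuming the batch lives in a hypothetical urls/batch2 directory:

  # Turn the batch-2 URL files straight into a fetchable segment
  # while the fetch of batch 1 runs elsewhere (not in local mode)
  bin/nutch freegen urls/batch2 crawl/segments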

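"Update the crawldb with additions disabled" corresponds to the -noAdditions flag on updatedb (equivalently, setting db.update.additions.allowed to false), which keeps outlinks discovered during parsing from ever becoming new crawl candidates. Same placeholder paths as above:

  # Merge the segment's fetch status into the crawldb, but ignore
  # newly discovered outlinks so only injected URLs get crawled
  bin/nutch updatedb crawl/crawldb $SEGMENT -noAdditions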

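The generator properties Lewis couldn't recall off-hand are, in recent 1.x releases, generate.max.count together with generate.count.mode; assuming those names, capping each generated segment at 50 URLs per host would look like:

  <!-- nutch-site.xml: limit each generated segment to at most
       50 URLs from any single host -->
  <property>
    <name>generate.max.count</name>
    <value>50</value>
  </property>
  <property>
    <name>generate.count.mode</name>
    <value>host</value>
  </property>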