From the sound of it, it's not obvious that Nutch is the right tool to
deliver this functionality. What is it that you hope to get out of
Nutch?

Why not just write a simple Java process that uses HttpClient to fetch
the pages produced by your other process, and extract the content
there? Or even wget them?
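As a sketch of that alternative: the snippet below uses the JDK's built-in java.net.http.HttpClient rather than Apache HttpClient (an assumption about which client you'd pick; the class name and URL are illustrative only):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SimpleFetcher {

    // Fetch a page body over HTTP, following redirects.
    static String fetchPage(String url) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", "simple-fetcher")  // hypothetical agent string
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Example: fetch one page and report its size;
        // content extraction would go here instead of Nutch's parse step.
        String body = fetchPage("https://example.com/");
        System.out.println("Fetched " + body.length() + " chars");
    }
}
```

With a fixed URL list and no link following, this kind of loop covers the use case without a crawl database at all.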

best regards,
Magnus

On Wed, Feb 15, 2012 at 7:40 PM, Markus Jelsma <[email protected]> wrote:
>> On Feb 14, 2012 at 4:06 PM, Lewis John Mcgibbney <[email protected]> wrote:
>> > Hi Puneet,
>> >
>> > On Tue, Feb 14, 2012 at 5:12 AM, Puneet Pandey <[email protected]> wrote:
>> > > I have started using Nutch recently. As I understand it, Nutch
>> > > crawling is a cyclic process:
>> > > inject -> generate -> fetch -> parse -> update
>> >
>> > Yes, this is typically what you would execute.
>> >
>> > > 1. When does parse start when I use the "crawl" command line. Is it
>> > > after all the urls have been fetched in the segment?
>> >
>> > It depends on the settings you specify in nutch-site.xml; by default,
>> > parsing is done as a separate process (after fetching) when using the
>> > crawl command.
>>
>> Suppose I submit 10K URLs in a segment for crawling. Does parsing of
>> the content start as soon as the first URL is available (i.e. fetched), or
>> only after all 10K have been fetched? For my use case I want parsing to
>> start on each URL as soon as it is available, without waiting for the
>> fetch of the others to complete.
>
> Don't use the crawl command; it runs fetching and parsing as separate jobs. You
> need to enable fetcher.parse to parse fetched pages immediately.
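For reference, enabling that in nutch-site.xml would look roughly like this (fetcher.parse is the real property name; the description text is paraphrased, so check nutch-default.xml for the exact wording):

```xml
<property>
  <name>fetcher.parse</name>
  <value>true</value>
  <description>If true, the fetcher parses content immediately after
  fetching, instead of leaving parsing to a separate job.</description>
</property>
```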
>
>>
>> > > What if I want to parse
>> > > the content as soon as it has been fetched?
>> >
>> > Change your settings in nutch-site.xml to override the defaults, then
>> > rebuild the project.
>> >
>> > > 2. Is it possible to run two fetches in parallel? Suppose I generate 2
>> > > segments is it possible to run fetch on seg1 and seg2 in parallel?
>> >
>> > Yes, this is possible; you would set the number of fetcher threads
>> > to run this task in parallel.
>> >
>> I need to crawl 100K URLs every day. I have a separate process which
>> produces the URLs for me, but it is a rather time-consuming process. I do
>> not want to wait for all the URLs to be generated before starting the
>> Nutch crawl. What I want is to start the Nutch fetch process whenever a
>> batch of URLs (say 10K) becomes available. Is it possible to inject
>> batch 2 of 10K URLs while the fetch for batch 1 is still running? If yes,
>> when will Nutch pick up the next batch for crawling?
>
> This is only possible when you use the freegen command. Also, I'd not
> recommend running concurrent jobs in local mode.
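For context, freegen generates a fetchable segment directly from a directory of text files of URLs, bypassing the crawldb's generate step. A rough invocation might look like this (the paths and segment name are hypothetical):

```sh
# Build a segment straight from a directory of URL list files
bin/nutch freegen urls/batch2 crawl/segments

# Then fetch that newly created segment
bin/nutch fetch crawl/segments/<new-segment>
```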
>
>>
>> Also, I do not want to crawl any of the links from the fetched pages. The
>> only URLs that need to be crawled are the ones generated by my process. How
>> do I ensure this? Is there a config setting with which we can disable
>> crawling of links found in fetched pages?
>
> Update the crawldb with additions disabled.
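One way to disable additions is the db.update.additions.allowed property in nutch-site.xml (a real property, default true; whether it fits your exact workflow is for you to verify):

```xml
<property>
  <name>db.update.additions.allowed</name>
  <value>false</value>
  <description>If false, updatedb will not add newly discovered URLs
  to the crawldb; only URLs already in the crawldb are updated.</description>
</property>
```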
>
>>
>> > > 3. Can I limit the number of urls per host per segment in the generate
>> >
>> > step
>> >
>> > > itself?
>> >
>> > Yes, please check out nutch-default.xml for the generator properties. I
>> > don't have the setting names off the top of my head, but this is possible.
>> >
>> > > Puneet
>> >
>> > --
>> > Lewis
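On question 3: the generator properties referred to above are generate.max.count and generate.count.mode; in nutch-site.xml a per-host limit could look like this (the value 100 is just an example, and -1 means unlimited):

```xml
<property>
  <name>generate.max.count</name>
  <value>100</value>
  <description>Maximum number of URLs per host (see generate.count.mode)
  to include in a single generated segment; -1 for no limit.</description>
</property>
<property>
  <name>generate.count.mode</name>
  <value>host</value>
  <description>Count URLs by host or by domain when applying
  generate.max.count.</description>
</property>
```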
