---------- Forwarded message ----------
From: Puneet Pandey <[email protected]>
Date: Thu, Feb 16, 2012 at 4:34 PM
Subject: Re: Build a pipeline using nutch
To: [email protected]
On Thu, Feb 16, 2012 at 3:50 PM, Markus Jelsma <[email protected]> wrote:

> > I need to crawl 100K urls every day. I have a separate process which
> > produces the urls for me, but it is a time-consuming process. I do not
> > want to wait for all the urls to be generated before starting the nutch
> > crawl. What I want is to start the nutch fetch process whenever a batch
> > of urls (say 10K) becomes available. Is it possible to inject batch 2 of
> > 10K urls while the fetch for batch 1 is still running? If yes, when will
> > nutch pick up the next batch for crawling?
>
> This is only possible when you use the freegen command. Also, I'd not
> recommend running concurrent jobs in local mode.
>
> > I am planning to run the jobs on a hadoop cluster. Suppose I am running
> > two concurrent fetch jobs: will politeness be taken care of across the
> > jobs?
>
> No. Each job is independent.
>
> > Also, is it possible to set up a pipeline where I just keep injecting
> > urls (say in batches of 10K) and generator/freegen/"something else"
> > keeps feeding them to the fetcher while respecting all the politeness
> > rules?
>
> You can only maintain politeness in a single regular fetcher with host or
> domain queues set up.

Does this mean that if I want to run concurrent fetchers (while respecting
politeness) I would have to ensure that all the urls of a domain are part of
one and only one fetcher job?

> On Thu, Feb 16, 2012 at 1:01 AM, Markus Jelsma <[email protected]> wrote:
> > My questions/doubts are inline.
> >
> > On Tue, Feb 14, 2012 at 4:06 PM, Lewis John Mcgibbney
> > <[email protected]> wrote:
> > > Hi Puneet,
> > >
> > > On Tue, Feb 14, 2012 at 5:12 AM, Puneet Pandey <[email protected]> wrote:
> > > > I have started using nutch recently.
> > > > As I understand it, nutch crawling is a cyclic process:
> > > > inject -> generate -> fetch -> parse -> update
> > >
> > > Yes, this is typically what you would execute.
> > >
> > > > 1. When does parsing start when I use the "crawl" command line? Is
> > > > it after all the urls in the segment have been fetched?
> > >
> > > Depends on what settings you specify in nutch-site.xml; by default
> > > parsing is done as a separate process (after fetching) when using the
> > > crawl command.
> >
> > Suppose I submitted 10K urls in a segment for crawling. Does parsing of
> > the content start as soon as the first url is available (i.e. fetched),
> > or does parsing start only after all 10K have been fetched? For my use
> > case I want parsing to start on the urls as soon as they are available,
> > without waiting for the fetch of the others to complete.
>
> Don't use the crawl command; it has fetching and parsing as separate
> jobs. You need to enable fetcher.parse to parse fetched files
> immediately.
>
> > > > What if I want to parse the content as soon as it has been fetched?
> > >
> > > Change your settings in nutch-site.xml to override the defaults, then
> > > rebuild the project.
> > >
> > > > 2. Is it possible to run two fetches in parallel? Suppose I
> > > > generate 2 segments: is it possible to run fetch on seg1 and seg2
> > > > in parallel?
> > >
> > > Yes, this is possible; you would set the number of threads in your
> > > fetcher to run this task in parallel.
> >
> > I need to crawl 100K urls every day. I have a separate process which
> > produces the urls for me, but it is a time-consuming process. I do not
> > want to wait for all the urls to be generated before starting the nutch
> > crawl.
> > What I want is to start the nutch fetch process whenever a batch of
> > urls (say 10K) becomes available. Is it possible to inject batch 2 of
> > 10K urls while the fetch for batch 1 is still running? If yes, when
> > will nutch pick up the next batch for crawling?
>
> This is only possible when you use the freegen command. Also, I'd not
> recommend running concurrent jobs in local mode.
>
> > Also, I do not want to crawl any of the links from the fetched pages.
> > The only urls that need to be crawled are the ones generated by my
> > process. How do I ensure this? Is there any config setting with which
> > we can disable crawling of the links present in fetched pages?
>
> Update the crawldb with additions disabled.
>
> > > > 3. Can I limit the number of urls per host per segment in the
> > > > generate step itself?
> > >
> > > Yes, please check out nutch-default.xml for the generator properties.
> > > I don't have the settings off the top of my head, but this is
> > > possible.
> > >
> > > > Puneet
> > >
> > > --
> > > *Lewis*
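[Editor's note] The batch-driven flow Markus describes (turn each url batch into its own segment with freegen, fetch it, then update the crawldb with additions disabled so outlinks are never scheduled) might be sketched as below. This is only a sketch against Nutch 1.x command-line usage; the directory names are hypothetical placeholders, and it assumes a working Nutch install with `bin/nutch` on the path and `fetcher.parse` enabled so parsing happens inside the fetch job.

```shell
#!/bin/sh
# Hypothetical batch pipeline sketch; paths are placeholders, not Nutch defaults.
URL_BATCH_DIR=batches/batch-0001   # directory holding one plain-text file of urls
SEGMENTS_DIR=crawl/segments
CRAWLDB=crawl/crawldb

# 1. Turn the raw url batch directly into a fetchable segment,
#    skipping the crawldb-driven generate step.
bin/nutch freegen "$URL_BATCH_DIR" "$SEGMENTS_DIR"

# 2. Fetch the newest segment (parsing runs inline if fetcher.parse=true).
SEGMENT=$(ls -d "$SEGMENTS_DIR"/* | tail -1)
bin/nutch fetch "$SEGMENT" -threads 10

# 3. Update the crawldb WITHOUT adding the outlinks discovered while
#    parsing, so only urls from the external process are ever crawled.
bin/nutch updatedb "$CRAWLDB" "$SEGMENT" -noAdditions
```

Each batch gets its own segment, so a new batch can be freegen'd while an earlier segment is still fetching; as noted above, politeness is then only enforced within each fetch job, not across them.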

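[Editor's note] The nutch-site.xml overrides touched on in the thread (parse-during-fetch, and the per-host politeness queues Markus refers to) could look roughly like this in a Nutch 1.x install. The values shown are illustrative, not recommendations; these properties go inside the `<configuration>` element and override nutch-default.xml:

```xml
<!-- Sketch of nutch-site.xml overrides discussed in the thread. -->
<property>
  <name>fetcher.parse</name>
  <value>true</value>
  <description>Parse each page inside the fetcher itself, as soon as it
  is fetched, instead of running parse as a separate job.</description>
</property>
<property>
  <name>fetcher.queue.mode</name>
  <value>byHost</value>
  <description>Key the politeness queues per host (byHost), per domain
  (byDomain) or per IP (byIP).</description>
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
  <description>Seconds to wait between successive requests to the same
  queue; illustrative value.</description>
</property>
```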

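[Editor's note] For Lewis's last point, the generator properties in nutch-default.xml that cap urls per host (or domain) per segment appear to be `generate.max.count` and `generate.count.mode`. A sketch of an override, where the value 100 is an arbitrary example and -1 would mean no limit:

```xml
<property>
  <name>generate.max.count</name>
  <value>100</value>
  <description>Maximum number of urls per generate.count.mode key
  selected into a single segment; -1 disables the limit.</description>
</property>
<property>
  <name>generate.count.mode</name>
  <value>byHost</value>
  <description>Count urls per host (byHost) or per domain (byDomain)
  when applying generate.max.count.</description>
</property>
```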