In my use case there will be a lot of URLs for the same host. Nutch will do the scheduling for me, respecting all the politeness rules. Also, I can plug in my parser to post-process the received data. Yes, I could write my own Java HttpClient fetcher, but then to make it fast I would have to make it distributed... exactly what Nutch offers.
Puneet

2012/2/16 Magnús Skúlason <[email protected]>:
> As it sounds to me, it's not obvious that you would want to use Nutch to
> deliver this functionality. What is it that you hope to get out of
> Nutch?
>
> Why not just write a simple Java process using HttpClient to fetch the
> pages from your other process? Or even wget them? And extract the
> content.
>
> best regards,
> Magnus
>
> On Wed, Feb 15, 2012 at 7:40 PM, Markus Jelsma <[email protected]> wrote:
> > Feb 14, 2012 at 4:06 PM, Lewis John Mcgibbney <[email protected]> wrote:
> >> > Hi Puneet,
> >> >
> >> > On Tue, Feb 14, 2012 at 5:12 AM, Puneet Pandey <[email protected]> wrote:
> >> > > I have started using Nutch recently.
> >> > > As I understand it, Nutch crawling is a cyclic process:
> >> > > inject -> generate -> fetch -> parse -> update
> >> >
> >> > Yes, this is typically what you would execute.
> >> >
> >> > > 1. When does parse start when I use the "crawl" command line? Is it
> >> > > after all the URLs have been fetched in the segment?
> >> >
> >> > Depends on what settings you specify in nutch-site.xml; by default
> >> > parsing is done as a separate process (after fetching) when using the
> >> > crawl command.
> >>
> >> Suppose I submitted 10K URLs in a segment for crawl. Does the parsing of
> >> the content start as soon as the first URL is available (i.e. fetched), or
> >> does parsing start only after all 10K have been fetched? For my use case I
> >> want parsing to start on the URLs as soon as they are available, without
> >> waiting for the fetch of the others to complete.
> >
> > Don't use the crawl command; it has fetching and parsing as separate jobs. You
> > need to enable fetcher.parse to parse fetched files immediately.
> >
> >> > > What if I want to parse
> >> > > the content as soon as it has been fetched?
> >> >
> >> > Change your settings in nutch-site.xml to override the defaults, then
> >> > rebuild the project.
> >> >
> >> > > 2. Is it possible to run two fetches in parallel? Suppose I generate 2
> >> > > segments; is it possible to run fetch on seg1 and seg2 in parallel?
> >> >
> >> > Yes, this is possible; you would set the number of threads in your fetcher
> >> > to run this task in parallel.
> >> >
> >> I need to crawl 100K URLs every day. I have a separate process which produces
> >> the URLs for me, but it is a bit of a time-consuming process. I do not want
> >> to wait for all the URLs to be generated and then start the Nutch crawl.
> >> What I want is to start the Nutch fetch process whenever I have a batch of
> >> URLs (say 10K) available. Is it possible to inject batch 2 of 10K URLs
> >> while the fetch for batch 1 is still running? If yes, when will Nutch pick
> >> up the next batch for crawl?
> >
> > This is only possible when you use the freegen command. Also, I'd not
> > recommend running concurrent jobs in local mode.
> >
> >> Also, I do not want to crawl any of the links from the fetched pages. The
> >> only URLs that need to be crawled are the ones generated by my process. How
> >> do I ensure this? Is there any config setting with which we can disable
> >> crawling of links present in fetched pages?
> >
> > Update the crawldb with additions disabled.
> >
> >> > > 3. Can I limit the number of URLs per host per segment in the generate
> >> > > step itself?
> >> >
> >> > Yes, please check out nutch-default.xml for generator properties. I don't
> >> > have the settings off the top of my head, but this is possible.
> >> >
> >> > > Puneet
> >> >
> >> > --
> >> > Lewis
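For anyone finding this thread later: the settings discussed above all map to properties that can be overridden in conf/nutch-site.xml. The sketch below is only illustrative — the property names come from the nutch-default.xml of the Nutch 1.x line, and the values (e.g. the per-host limit of 1000) are placeholders, not recommendations; check the nutch-default.xml shipped with your version before copying anything:

```xml
<!-- Sketch of a nutch-site.xml overriding the defaults discussed in this
     thread. Values are illustrative; verify names against your version's
     nutch-default.xml. -->
<configuration>
  <!-- Parse each page immediately after it is fetched, in the same job,
       instead of running a separate parse step per segment. -->
  <property>
    <name>fetcher.parse</name>
    <value>true</value>
  </property>
  <!-- Do not add outlinks discovered in fetched pages to the crawldb:
       only the URLs you inject yourself ever get scheduled. -->
  <property>
    <name>db.update.additions.allowed</name>
    <value>false</value>
  </property>
  <!-- Cap the generator at 1000 URLs per host in a single segment. -->
  <property>
    <name>generate.max.count</name>
    <value>1000</value>
  </property>
  <!-- Count the limit above per host (alternatives: domain, ip). -->
  <property>
    <name>generate.count.mode</name>
    <value>host</value>
  </property>
</configuration>
```

With additions disabled in the crawldb update, the freegen-based batch flow Markus describes stays restricted to exactly the URL lists your external process produces.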

