I take it you are updating the database with the crawl data? That step marks all links extracted during the parse phase (depending upon your config) as due for fetching. When you next run generate, those links are assigned to batchIds and Nutch will attempt to fetch them. Please also search the list archives for the definition of the depth parameter.

Lewis
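Roughly, one round of the 2.x cycle looks like the sketch below. Flag names and paths here are from memory rather than verified against 2.1, so check `bin/nutch <command>` usage output for your version:

# One round of the Nutch 2.x crawl cycle (flags from memory; verify
# against your version's usage output).

# 1. Inject seeds into the web table. Needed once; later rounds start
#    from whatever updatedb marked as due, NOT from the seed folder.
bin/nutch inject urls/

# 2. Select up to topN URLs that are due for fetching and stamp them
#    with a fresh batchId (generate prints the id it created).
bin/nutch generate -topN 2000

# 3. Fetch and parse that batch. Parsing extracts outlinks, subject to
#    your URL filters (e.g. conf/regex-urlfilter.txt).
bin/nutch fetch -all
bin/nutch parse -all

# 4. Write fetch/parse results back to the database. This is the step
#    that marks newly discovered outlinks as due, which is why the
#    frontier (and run time) grows every round unless filters confine it.
bin/nutch updatedb

In other words, if you only want the 50 list pages plus their articles, the usual lever is a URL filter such as conf/regex-urlfilter.txt that keeps only outlinks matching your blogs; depth and topN bound a single run, not the database's accumulated frontier.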
On Monday, January 14, 2013, 高睿 <[email protected]> wrote:
> Hi,
>
> I'm customizing Nutch 2.1 to crawl blogs from several authors. Each author's blog has a list page and article pages.
>
> Say I want to crawl the articles in 50 article lists (each with 30 articles). I add the article-list links to feed.txt and specify '-depth 2' and '-topN 2000'. My expectation is that each time I run Nutch, it will crawl all the list pages and the articles in each list. In practice, though, the set of URLs Nutch crawls keeps growing, and each run takes longer and longer (3 hours -> more than 24 hours).
>
> Could someone explain to me what is happening? Does Nutch 2.1 always start crawling from the seed folder and follow the 'depth' parameter? What should I do to meet my requirement?
> Thanks.
>
> Regards,
> Rui

--
Lewis

