I'm not quite sure about your question here. I'm using the Nutch 2.1 default configuration and running the command:

bin/nutch crawl urls -solr http://localhost:8080/solr/collection2 -threads 10 -depth 2 -topN 1000

The 'urls' folder contains the blog index pages (each index page holds a list of article pages). I think the 'parse-html' and 'parse-tika' plugins are currently responsible for parsing the links out of the HTML. Should I clean the outlinks in an additional Parse plugin to prevent Nutch from crawling the outlinks found on the article pages?
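In case it makes my intent clearer, this is roughly the kind of check I imagine such a plugin doing. It is only a sketch: the OutlinkCleaner class, the blog.example.com domain, and the /article/<id> path pattern are all made up for illustration, not taken from Nutch or from the real blogs.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper: keeps only the outlinks that look like article pages. */
public class OutlinkCleaner {

    // Assumption: article URLs share a recognizable path such as .../article/<id>;
    // the pattern would have to be adjusted to whatever the real blogs use.
    private static final Pattern ARTICLE_URL =
        Pattern.compile("^https?://blog\\.example\\.com/.+/article/\\d+$");

    /** Returns a copy of 'outlinks' with every non-article URL dropped. */
    public static List<String> keepArticleLinks(List<String> outlinks) {
        List<String> kept = new ArrayList<String>();
        for (String url : outlinks) {
            if (ARTICLE_URL.matcher(url).matches()) {
                kept.add(url);
            }
        }
        return kept;
    }
}

Or, if this kind of restriction is better expressed as URL filtering than as a parse plugin, I suppose I could instead put accept rules for the list-page and article-page URL patterns into conf/regex-urlfilter.txt and finish with a '-.' line to reject everything else. Does that sound like the right direction?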
At 2013-01-15 13:31:11, "Lewis John Mcgibbney" <[email protected]> wrote:
>I take it you are updating the database with the crawl data? This will mark
>all links extracted during the parse phase (depending upon your config) as
>due for fetching. When you generate, these links will be populated within
>the batchIds and Nutch will attempt to fetch them.
>Please also search our list archives for the definition of the depth
>parameter.
>Lewis
>
>On Monday, January 14, 2013, 高睿 <[email protected]> wrote:
>> Hi,
>>
>> I'm customizing Nutch 2.1 for crawling blogs from several authors. Each
>> author's blog has a list page and article pages.
>>
>> Say I want to crawl the articles in 50 article lists (each with 30
>> articles). I add the article-list links to feed.txt and specify
>> '-depth 2' and '-topN 2000'. My expectation is that each time I run Nutch
>> it will crawl all the list pages and the articles in each list. But in
>> fact the set of URLs Nutch crawls keeps growing, and the crawl takes more
>> and more time (3 hours -> more than 24 hours).
>>
>> Could someone explain to me what is happening? Does Nutch 2.1 always start
>> crawling from the seed folder and follow the 'depth' parameter? What
>> should I do to meet my requirement?
>> Thanks.
>>
>> Regards,
>> Rui
>
>--
>*Lewis*

