You need to give more information: what does hadoop.log say? Try running with the debug log setting. One possible cause is your settings in crawl-urlfilter.txt. Do all of those unique links point to subdomains of cnn.com, or are they links to other websites? If they are outside of cnn.com, they may not be traversed, depending on the entries in crawl-urlfilter.txt. Also, even for pages on the cnn.com domain, the particular path has to match the regex rules in crawl-urlfilter.txt.
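As a rough sketch, a crawl-urlfilter.txt restricted to the cnn.com domain might look like the following (the exact default rules vary by Nutch version, so treat this as illustrative, not as your actual file):

```
# skip image, CSS, and JS URLs by file extension
-\.(gif|GIF|jpg|JPG|png|PNG|css|CSS|js|JS)$

# skip URLs containing characters commonly used in session IDs and query strings
-[?*!@=]

# accept any host within the cnn.com domain
+^http://([a-z0-9]*\.)*cnn.com/

# skip everything else
-.
```

Rules are applied top to bottom, and the first matching `+` (accept) or `-` (reject) pattern wins; a URL such as http://www.cnn.com/world?page=2 would be rejected by the `-[?*!@=]` rule before the `+` rule for cnn.com is ever reached, which is one common reason fewer pages are fetched than expected.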
On Thu, May 20, 2010 at 2:42 AM, Artyom Shvedchikov <[email protected]> wrote:
> Hi Nutch community.
>
> We are trying to solve the following task with the help of Nutch:
> a user gives us a site URL and a number of pages to grab, for example
> http://www.cnn.com/ and 100 pages.
> We start Nutch with settings depth=2 topN=100.
> As a result we receive only 16 pages.
> When we start Nutch with settings depth=2 topN=1000 we still receive only
> 17 pages.
>
> But on the home page of cnn.com there are about 50 unique links.
>
> If anyone can explain how we can make Nutch fetch a determined number of
> pages from a site, we would be very appreciative.
>
> Thanks in advance.
> -------------------------------------------------
> Best wishes, Artyom Shvedchikov

