Hi Artyom,

In that case, I am assuming you checked regex-urlfilter.txt. If I am not
mistaken, for a whole-web crawl Nutch uses that file instead of
crawl-urlfilter.txt. Other things you may want to consider:
1) db.max.outlinks.per.page in nutch-default.xml. It limits the number of
outlinks Nutch follows per page. Try it with the value -1 (unlimited).
2) Make sure the outlinks that you mention are not prohibited by robots.txt
(check www.cnn.com/robots.txt).
3) Check http.content.limit in nutch-default.xml. It limits the amount of
content downloaded from a page, which in turn limits the number of outlinks
found. Try it with the value -1 (unlimited).

If all else fails, debug through the method getOutlinks in
DOMContentUtils.java :-)

Harry

On Thu, May 20, 2010 at 7:06 PM, Artyom Shvedchikov <[email protected]> wrote:

> Hello, thanks for the fast reply.
> We do not use the crawl tool; we use the runbot script from the Nutch wiki
> for whole-web crawling (it runs a generate/fetch/update cycle using the
> depth parameter as the cycle count), so crawl-urlfilter.txt does not apply
> in this case. We also do not use any other plugin for URL filtering, but
> we did set db.ignore.external.links to true to skip external links.
> Our goal is to grab a determined number of pages from only one determined
> site, for example 1000 pages from only cnn.com or its subdomains.
>
> -------------------------------------------------
> Best wishes, Artyom Shvedchikov
>
>
> On Thu, May 20, 2010 at 8:10 AM, Harry Nutch <[email protected]> wrote:
>
>> You need to give more information. What does hadoop.log say? Try running
>> with the debug log setting.
>> One reason could be your settings in crawl-urlfilter.txt. Do all those
>> unique links point to subdomains of cnn.com, or are they links to some
>> other websites? If they are outside of cnn.com, they might not be
>> traversed, depending on the entries in crawl-urlfilter.txt. Also, even
>> for web pages on the cnn.com domain, the particular path needs to match
>> the regex rules present in crawl-urlfilter.txt.
>>
>>
>> On Thu, May 20, 2010 at 2:42 AM, Artyom Shvedchikov <[email protected]> wrote:
>>
>> > Hi Nutch community.
>> >
>> > We are trying to solve the following task with the help of Nutch:
>> > a user gives us a path on a site and a number of pages to grab, for
>> > example http://www.cnn.com/ and 100 pages.
>> > We start Nutch with the settings depth=2, topN=100.
>> > As a result we receive only 16 pages.
>> > When we start Nutch with the settings depth=2, topN=1000, we still
>> > receive only 17 pages.
>> >
>> > But on the home page of cnn.com there are nearly 50 unique links.
>> >
>> > If anyone can explain how we can make Nutch fetch a determined number
>> > of pages from a site, we would be very appreciative.
>> >
>> > Thanks in advance.
>> > -------------------------------------------------
>> > Best wishes, Artyom Shvedchikov
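[Editor's note] For readers landing on this thread later: a minimal regex-urlfilter.txt restricting a whole-web crawl to cnn.com and its subdomains might look like the sketch below. The accept pattern is an assumption for this example, modeled on the stock filter file shipped with Nutch; adjust it to your own domain.

```
# Skip URLs with file suffixes that yield no outlinks (images, archives, etc.)
-\.(gif|GIF|jpg|JPG|png|PNG|css|js|zip|gz|exe|pdf)$

# Skip URLs containing characters that often indicate session IDs or queries
-[?*!@=]

# Accept anything under cnn.com and its subdomains (example pattern)
+^http://([a-z0-9]*\.)*cnn\.com/

# Reject everything else
-.
```

Rules are applied top to bottom; the first matching `+` (accept) or `-` (reject) pattern wins, so the catch-all `-.` must come last.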
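[Editor's note] The three properties Harry mentions are usually overridden in conf/nutch-site.xml rather than by editing nutch-default.xml directly. A sketch, with -1 meaning "no limit" as described above (values are illustrative):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Follow an unlimited number of outlinks per page (-1 = no limit) -->
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
  </property>
  <!-- Download full page content so outlinks are not truncated away -->
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
  <!-- Skip outlinks that point outside the seed host, as Artyom does -->
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>
</configuration>
```

Values set here take precedence over nutch-default.xml, and the file survives Nutch upgrades.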

