Hi Manish,

Evidence of this from the crawldb statistics would be helpful. I have not noticed behavior of this nature; however, it may also be caused by a robots.txt issue or some other restriction. Can you provide some crawldb statistics, please?

Thanks,
Lewis

On Wed, Dec 23, 2015 at 7:11 AM, <[email protected]> wrote:
>
> When crawling, it looks like it crawls more pages from the seed URL than
> from the discovered links.
>
> I am crawling apple.com <http://apple.com/> as the seed (English by
> default), and this contains links for other languages, such as
> apple.com/cn <http://apple.com/cn> for China, and so on for other
> languages.
> What I am observing after 7 cycles is that the English version has 10 times
> more pages than any other language, such as /cn. I was expecting almost
> the same number for each language.
>
> Then I did the reverse: I put apple.com/cn <http://apple.com/cn> in the
> seed and removed apple.com <http://apple.com/>. Now I observe that there
> are more docs from /cn than from the other languages.
>
> I am using Nutch 1.10 and crawling using the crawl script:
> crawl -i -D solr.server.url=http://localhost:8983/solr/ urls/ TestCrawl/ 7
> I observed from the logs that the crawl script uses -topN 50000 by default.
>
> Please suggest.
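
For reference, the overall status counts requested above can usually be printed with `bin/nutch readdb TestCrawl/crawldb -stats`, assuming the crawl directory is TestCrawl/ as in the quoted command. To put numbers on the per-language imbalance, a rough sketch along the following lines might help. It assumes the URLs have already been extracted into a list (for example from a `readdb -dump` output), and the function name, the set of language prefixes, and the sample URLs are all hypothetical, chosen only for illustration.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical set of language path prefixes; adjust to the site being crawled.
LANG_PREFIXES = {"cn", "fr", "de"}


def count_by_language(urls, lang_prefixes=LANG_PREFIXES):
    """Tally URLs by their first path segment when it is a known language prefix.

    URLs whose first segment is not in lang_prefixes (including the site root)
    are counted under "default". URLs are expected to include a scheme.
    """
    counts = Counter()
    for url in urls:
        segments = [s for s in urlparse(url).path.split("/") if s]
        if segments and segments[0] in lang_prefixes:
            counts[segments[0]] += 1
        else:
            counts["default"] += 1
    return counts


# Made-up sample input; real input would come from the crawldb.
sample = [
    "http://apple.com/",
    "http://apple.com/iphone/",
    "http://apple.com/cn/",
    "http://apple.com/cn/iphone/",
]
print(count_by_language(sample))
```

Running the same tally after each cycle would show whether the gap between the seed's language and the others widens over time or is already there after the first fetch.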

