Hi,

When crawling, Nutch appears to fetch far more pages from the seed URL's 
section of the site than from the links it discovers.

I am crawling apple.com <http://apple.com/> as the seed (English by 
default). It contains links to other language versions, such as apple.com/cn 
<http://apple.com/cn> for China, and so on for other languages.
After 7 cycles, I observe that the English section has about 10 times more 
pages than any other language section such as /cn. I was expecting roughly the 
same number for each language.
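
A rough sketch of how such a per-language comparison could be made, assuming 
a plain list of crawled URLs is available (for example, extracted from a 
crawldb dump). The function name, the sample URLs, and the rule that a 
two-letter first path segment marks a language section are all assumptions 
for illustration; that rule would misclassify any two-letter path that is not 
a language.

```python
from collections import Counter
from urllib.parse import urlparse


def count_by_section(urls, default="en"):
    """Tally URLs by language section (hypothetical helper).

    Assumption: a two-letter alphabetic first path segment such as "cn"
    marks a language section; everything else counts as the default.
    """
    counts = Counter()
    for url in urls:
        path = urlparse(url.strip()).path
        segments = [s for s in path.split("/") if s]
        first = segments[0].lower() if segments else ""
        if len(first) == 2 and first.isalpha():
            counts[first] += 1
        else:
            counts[default] += 1
    return counts


if __name__ == "__main__":
    # Illustrative URLs only, not real crawl output.
    sample = [
        "http://apple.com/",
        "http://apple.com/mac/",
        "http://apple.com/cn/",
        "http://apple.com/cn/iphone/",
    ]
    for section, n in sorted(count_by_section(sample).items()):
        print(section, n)
```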

Then I did the reverse: I put apple.com/cn <http://apple.com/cn> in the seed 
list and removed apple.com <http://apple.com/>. Now there are more documents 
from /cn than from any other language.

I am using Nutch 1.10 and crawling with the crawl script:
crawl -i -D solr.server.url=http://localhost:8983/solr/ urls/ TestCrawl/  7
From the logs, I see that the crawl script uses -topN 50000 by default.

Please suggest.

Thanks
Manish
