Re: Nutch Crawls More From Seed Then The Discovered Links

Lewis John Mcgibbney Tue, 29 Dec 2015 13:32:58 -0800

Hi Manish,

I have not noticed behavior of this nature; it may have to do with a
robots.txt issue or some other restriction.
Can you please provide the crawldb statistics as evidence of this?
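
For reference, a minimal sketch of how those numbers could be gathered. It assumes the commands are run from the Nutch runtime directory and that the crawl directory is TestCrawl/, as in the command quoted below. The function names `crawldb_stats` and `count_with_prefix` are hypothetical helpers, not part of Nutch.

```shell
#!/bin/sh

# Print overall CrawlDb statistics (totals per status, e.g. fetched vs. unfetched).
# $1 = crawl directory; the crawldb is assumed to live in "$1/crawldb".
crawldb_stats() {
    bin/nutch readdb "$1/crawldb" -stats
}

# Read one URL per line on stdin and print how many start with the prefix in $1.
# Hypothetical helper for comparing sections such as /cn against the rest.
count_with_prefix() {
    awk -v p="$1" 'index($0, p) == 1 { n++ } END { print n + 0 }'
}
```

For example, `crawldb_stats TestCrawl` prints the status totals, and a plain list of URLs (one per line, however it was extracted from the crawldb) can be piped into `count_with_prefix 'http://apple.com/cn/'` to see how many belong to that section.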
Thanks
Lewis



On Wed, Dec 23, 2015 at 7:11 AM, <[email protected]> wrote:

>
> When crawling, it looks like Nutch crawls more pages from the seed URL than
> from the discovered links.
>
> I am crawling apple.com <http://apple.com/> as the seed (English by
> default), and it contains links to other languages, such as apple.com/cn
> <http://apple.com/cn> for China, and so on for the other languages.
> After 7 cycles, I observe that the English site has 10 times more pages than
> any other language, such as /cn. I was expecting almost the same number for
> each language.
>
> Then I did the reverse: I put apple.com/cn <http://apple.com/cn> in the seed
> list and removed apple.com <http://apple.com/>. Now there are more docs
> from /cn than from the other languages.
>
> I am using Nutch 1.10 and crawling with the crawl script:
> crawl -i -D solr.server.url=http://localhost:8983/solr/ urls/ TestCrawl/ 7
> From the logs, I observed that the crawl script uses -topN 50000 by default.
>
> Please suggest.
>
>
