Hi
I am crawling a URL. I downloaded the page as well, and counted the links in
the page by simply doing:

grep -c href page.html

I got 724 links.
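(Note that grep -c counts matching lines, not individual matches, so if any
line contains more than one href the real count could be a bit higher. A
quick sketch to count occurrences instead, assuming your grep supports -o:

grep -o 'href' page.html | wc -l
)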

So I ran inject/generate/fetch/parse/updatedb once. I believe this first
run will collect all the links on this page, to be crawled on the next run.
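
For reference, this is the standard Nutch 1.x step-by-step sequence I mean
(the crawldb/segments paths are just my local setup):

bin/nutch inject crawldb urls
bin/nutch generate crawldb segments
s=`ls -d segments/2* | tail -1`    # pick the newest segment
bin/nutch fetch $s
bin/nutch parse $s
bin/nutch updatedb crawldb $s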

So I ran the next generate/fetch round.

This is what I see in the fetch reducer on the JobTracker:

20/20 spinwaiting/active, 61 pages, 0 errors, 0.1 0 pages/s, 414 459 kb/s, 1000 URLs in 1 queues > reduce


So why are there 1000 URLs in the queue when the page only has 724 links?
The page does not have any AJAX content.
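
In case it helps diagnose this, I can dump what actually ended up in the
crawldb after the updatedb step; this is just the standard readdb tool,
with my crawldb path assumed:

bin/nutch readdb crawldb -stats                # URL counts by status
bin/nutch readdb crawldb -dump crawldb_dump    # plain-text dump of all known URLs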
