Hi,

1) Link attributes are found in a, area, form, frame, iframe, script, link, and img elements. The attribute is not always named "href"; it can also be "src" or "action". Cf. the property parser.html.outlinks.ignore_tags: excluding img,script,link is a good choice (but not the default).
2) grep is case-sensitive if not told otherwise (option -i). HTML may specify <A HREF="...">

Cheers,
Sebastian

On 07/12/2013 08:52 PM, h b wrote:
> Hi
> I am crawling a url. I downloaded the page as well. I counted the urls in
> the page by simply doing...
>
> grep -c href page.html
>
> I got 724 links
>
> So I run inject/generate/fetch/parse/updatedb once. I believe this first
> run will collect all the links on this page to be crawled on next run.
>
> So I run the next generate/fetch
>
> This is what I see in the fetch reducer on jobtracker
>
> 20/20 spinwaiting/active, 61 pages, 0 errors, 0.1 0 pages/s, 414 459 kb/s,
> 1000 URLs in 1 queues
> reduce
>
>
> So why are there 1000 urls in the queue, when the page only has 724 links.
> This page does not have any ajax stuff.
>
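
Putting both points together, the count from the original message could be redone roughly as follows. This is only a sketch: the function name count_link_attrs is made up for illustration, it assumes a grep that supports -o (GNU and BSD grep do), and it is a plain text match, not an HTML parse, so it will not agree exactly with what the parser extracts.

```shell
# Hypothetical helper: count href/src/action attributes in a file.
# -i ignores case (HREF, Src, ...), -E enables the alternation,
# -o prints every match on its own line, so wc -l counts matches
# rather than matching lines (which is what grep -c counts).
count_link_attrs() {
  grep -oiE '(href|src|action)[[:space:]]*=' "$1" | wc -l
}
```

It would be called as "count_link_attrs page.html". Note that "grep -c href" counts lines, so a line holding several links is counted only once.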

