Hi,

1) Links are found not only in a elements but also in area, form, frame,
iframe, script, link, and img elements. The attribute is not always named
"href"; it can also be "src" or "action".
Cf. the property parser.html.outlinks.ignore_tags:
excluding img, script, and link is a good choice (but not the default).

2) grep is case-sensitive unless told otherwise (option -i). HTML may use
upper case, e.g. <A HREF="..."
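
To illustrate both points, here is a minimal sketch (not what Nutch does
internally) that counts link-carrying attributes per element using Python's
standard html.parser module. The names LINK_ATTRS, LinkCounter and
count_links are made up for this example, and the set of attributes is
only the three mentioned above. The parser lower-cases tag and attribute
names, so <A HREF=...> and <a href=...> are counted alike. Note also that
grep -c counts matching lines, not matches, so a line containing several
links is counted once.

```python
from collections import Counter
from html.parser import HTMLParser

# Attribute names that may carry a link (assumption: only the three
# named in this mail; a real parser may look at more).
LINK_ATTRS = {"href", "src", "action"}


class LinkCounter(HTMLParser):
    """Counts non-empty link attributes, grouped by element name."""

    def __init__(self):
        super().__init__()
        self.per_tag = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names already lower-cased.
        for name, value in attrs:
            if name in LINK_ATTRS and value:
                self.per_tag[tag] += 1


def count_links(html):
    """Return a Counter mapping element name -> number of link attributes."""
    parser = LinkCounter()
    parser.feed(html)
    parser.close()
    return parser.per_tag


if __name__ == "__main__":
    sample = (
        '<A HREF="/a">x</A><img src="i.png">'
        '<form action="/s"></form><a href="/b">y</a>'
    )
    counts = count_links(sample)
    print(dict(counts), sum(counts.values()))
```

Running count_links on the saved page.html and comparing the per-element
totals with the grep count should show where the two numbers diverge.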

Cheers,
Sebastian

On 07/12/2013 08:52 PM, h b wrote:
> Hi
> I am crawling a url. I downloaded the page as well. I counted the urls in
> the page by simply doing...
> 
> grep -c href page.html
> 
> I got 724 links
> 
> So I run inject/generate/fetch/parse/updatedb once. I believe this first
> run will collect all the links on this page to be crawled on next run.
> 
> So I run the next generate/fetch
> 
> This is what I see in the fetch reducer on jobtracker
> 
> 20/20 spinwaiting/active, 61 pages, 0 errors, 0.1 0 pages/s, 414 459 kb/s,
> 1000 URLs in 1 queues > reduce
> 
> 
> So why are there 1000 urls in the queue when the page only has 724 links?
> This page does not have any ajax stuff.
> 
