Re: URL count in queue

Sebastian Nagel Fri, 12 Jul 2013 13:42:02 -0700

Hi,

1) Link attributes are found in a, area, form, frame, iframe, script, link,
and img elements. The attribute is not always named "href"; it can also be
"src" or "action". Cf. the property parser.html.outlinks.ignore_tags:
excluding img,script,link is a good choice (but not the default).


2) grep is case-sensitive unless told otherwise (option -i), while HTML tag and
attribute names are not, so a page may specify
<A HREF="..."
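
As a rough illustration of both points, here is a small Python sketch that
counts link-carrying attributes per element. It is not Nutch's parser: the
tag/attribute table, the names LinkCounter and count_links, and the
ignore_tags parameter are assumptions made for this example and only mimic
the idea behind parser.html.outlinks.ignore_tags. Keep in mind as well that
grep -c counts matching lines, not individual matches.

```python
from collections import Counter
from html.parser import HTMLParser

# Element -> link attribute, following the list in point 1.
# Assumption: illustrative table only, not the one Nutch uses internally.
LINK_ATTRS = {
    "a": "href",
    "area": "href",
    "link": "href",
    "form": "action",
    "frame": "src",
    "iframe": "src",
    "script": "src",
    "img": "src",
}


class LinkCounter(HTMLParser):
    """Count non-empty link attributes per element."""

    def __init__(self, ignore_tags=()):
        super().__init__()
        self.ignore_tags = {t.lower() for t in ignore_tags}
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names,
        # so <A HREF="..."> is handled like <a href="...">.
        if tag in self.ignore_tags:
            return
        wanted = LINK_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                self.counts[tag] += 1


def count_links(html, ignore_tags=()):
    """Return a Counter mapping element name to number of links found."""
    parser = LinkCounter(ignore_tags)
    parser.feed(html)
    parser.close()
    return parser.counts


if __name__ == "__main__":
    sample = """
    <A HREF="/one">one</A>
    <a href="/two">two</a>
    <img src="/logo.png">
    <script src="/app.js"></script>
    <form action="/search"></form>
    """
    everything = count_links(sample)
    filtered = count_links(sample, ignore_tags=("img", "script", "link"))
    # Prints "5 3": five links in total, three once img/script/link are ignored.
    print(sum(everything.values()), sum(filtered.values()))
```

On this sample a case-sensitive search for "href" matches only one line,
which shows how far a plain grep count can be from what a parser extracts.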

Cheers,
Sebastian

On 07/12/2013 08:52 PM, h b wrote:
> Hi
> I am crawling a url. I downloaded the page as well. I counted the urls in
> the page by simply doing...
> 
> grep -c href page.html
> 
> I got 724 links
> 
> So I run inject/generate/fetch/parse/updatedb once. I believe this first
> run will collect all the links on this page to be crawled on next run.
> 
> So I run the next generate/fetch
> 
> This is what I see in the fetch reducer on jobtracker
> 
> 20/20 spinwaiting/active, 61 pages, 0 errors, 0.1 0 pages/s, 414 459 kb/s,
> 1000 URLs in 1 queues > reduce
> 
> 
> So why are there 1000 URLs in the queue when the page only has 724 links?
> This page does not have any ajax stuff.
>
