I take it you are updating the database with the crawl data? That step marks all links extracted during the parse phase (depending upon your config) as due for fetching. When you next run generate, those links are assigned to batchIds and Nutch will attempt to fetch them. Please also search the list archives for the definition of the depth parameter.

Lewis
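Roughly, one round of the 2.x cycle looks like the sketch below. Flag names and paths here are from memory rather than verified against 2.1, so check `bin/nutch <command>` usage output for your version:

# One round of the Nutch 2.x crawl cycle (flags from memory; verify
# against your version's usage output).

# 1. Inject seeds into the web table. Needed once; later rounds start
#    from whatever updatedb marked as due, NOT from the seed folder.
bin/nutch inject urls/

# 2. Select up to topN URLs that are due for fetching and stamp them
#    with a fresh batchId (generate prints the id it created).
bin/nutch generate -topN 2000

# 3. Fetch and parse that batch. Parsing extracts outlinks, subject to
#    your URL filters (e.g. conf/regex-urlfilter.txt).
bin/nutch fetch -all
bin/nutch parse -all

# 4. Write fetch/parse results back to the database. This is the step
#    that marks newly discovered outlinks as due, which is why the
#    frontier (and run time) grows every round unless filters confine it.
bin/nutch updatedb

In other words, if you only want the 50 list pages plus their articles, the usual lever is a URL filter such as conf/regex-urlfilter.txt that keeps only outlinks matching your blogs; depth and topN bound a single run, not the database's accumulated frontier.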
On Monday, January 14, 2013, 高睿 <[email protected]> wrote:
> Hi,
>
> I'm customizing Nutch 2.1 to crawl blogs from several authors. Each author's blog has a list page and article pages.
>
> Say I want to crawl the articles in 50 article lists (each with 30 articles). I add the article-list links to feed.txt and specify '-depth 2' and '-topN 2000'. My expectation is that each time I run Nutch, it will crawl all the list pages and the articles in each list. In practice, though, the set of URLs Nutch crawls keeps growing, and each run takes longer and longer (3 hours -> more than 24 hours).
>
> Could someone explain to me what is happening? Does Nutch 2.1 always start crawling from the seed folder and follow the 'depth' parameter? What should I do to meet my requirement?
> Thanks.
>
> Regards,
> Rui

--
Lewis

