I'm not quite sure about your question here. I'm using the Nutch 2.1 default configuration and running the command:

bin/nutch crawl urls -solr http://localhost:8080/solr/collection2 -threads 10 -depth 2 -topN 1000

The 'urls' folder contains the blog index pages (each index page holds a list of article pages). I think the 'parse-html' and 'parse-tika' plugins are currently responsible for parsing the links out of the HTML. Should I clean the outlinks in an additional Parse plugin to prevent Nutch from crawling the outlinks found on the article pages?
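In case it makes my intent clearer, this is roughly the kind of check I imagine such a plugin doing. It is only a sketch: the OutlinkCleaner class, the blog.example.com domain, and the /article/<id> path pattern are all made up for illustration, not taken from Nutch or from the real blogs.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper: keeps only the outlinks that look like article pages. */
public class OutlinkCleaner {

    // Assumption: article URLs share a recognizable path such as .../article/<id>;
    // the pattern would have to be adjusted to whatever the real blogs use.
    private static final Pattern ARTICLE_URL =
        Pattern.compile("^https?://blog\\.example\\.com/.+/article/\\d+$");

    /** Returns a copy of 'outlinks' with every non-article URL dropped. */
    public static List<String> keepArticleLinks(List<String> outlinks) {
        List<String> kept = new ArrayList<String>();
        for (String url : outlinks) {
            if (ARTICLE_URL.matcher(url).matches()) {
                kept.add(url);
            }
        }
        return kept;
    }
}

Or, if this kind of restriction is better expressed as URL filtering than as a parse plugin, I suppose I could instead put accept rules for the list-page and article-page URL patterns into conf/regex-urlfilter.txt and finish with a '-.' line to reject everything else. Does that sound like the right direction?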
At 2013-01-15 13:31:11, "Lewis John Mcgibbney" <[email protected]> wrote:
>I take it you are updating the database with the crawl data? This will mark
>all links extracted during the parse phase (depending upon your config) as
>due for fetching. When you generate, these links will be populated within
>the batchIds and Nutch will attempt to fetch them.
>Please also search our list archives for the definition of the depth
>parameter.
>Lewis
>
>On Monday, January 14, 2013, 高睿 <[email protected]> wrote:
>> Hi,
>>
>> I'm customizing Nutch 2.1 for crawling blogs from several authors. Each
>> author's blog has a list page and article pages.
>>
>> Say I want to crawl the articles in 50 article lists (each with 30
>> articles). I add the article-list links to feed.txt and specify
>> '-depth 2' and '-topN 2000'. My expectation is that each time I run Nutch
>> it will crawl all the list pages and the articles in each list. But in
>> fact the set of URLs Nutch crawls keeps growing, and the crawl takes more
>> and more time (3 hours -> more than 24 hours).
>>
>> Could someone explain to me what is happening? Does Nutch 2.1 always start
>> crawling from the seed folder and follow the 'depth' parameter? What
>> should I do to meet my requirement?
>> Thanks.
>>
>> Regards,
>> Rui
>
>--
>*Lewis*

