Yes, that is my case.
Removing all previous data is an option, but the data will be lost.
I want to write a plugin that empties the 'outlinks' for the article pages, so
the crawl terminates at the article urls and no additional links are stored in
the DB.
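
Something like the following is what I have in mind -- a rough sketch only,
assuming the Nutch 2.x ParseFilter extension point
(org.apache.nutch.parse.ParseFilter) with its filter()/getFields() methods as
they appear in the 2.1 source tree; the exact signatures may need checking, and
the ARTICLE_PATTERN regex, package and class names are just placeholders for
however the article urls can be recognized:

// Sketch of a custom ParseFilter for Nutch 2.x that clears the outlinks
// of article pages so the crawl stops at the article URLs.
// Assumes the org.apache.nutch.parse.ParseFilter extension point as in the
// 2.1 source; adjust signatures if they differ in your checkout.
package com.example.nutch.parse;

import java.util.Collection;
import java.util.Collections;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.Outlink;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseFilter;
import org.apache.nutch.storage.WebPage;
import org.w3c.dom.DocumentFragment;

public class ArticleOutlinkFilter implements ParseFilter {

  // Hypothetical pattern: adapt to whatever distinguishes article pages
  // from article-list pages on the blogs being crawled.
  private static final Pattern ARTICLE_PATTERN =
      Pattern.compile(".*/article/.*");

  private Configuration conf;

  @Override
  public Parse filter(String url, WebPage page, Parse parse,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    if (parse != null && ARTICLE_PATTERN.matcher(url).matches()) {
      // Drop every outlink found on an article page so nothing beyond
      // the article itself is scheduled for fetching.
      parse.setOutlinks(new Outlink[0]);
    }
    return parse;
  }

  @Override
  public Collection<WebPage.Field> getFields() {
    // No additional WebPage fields are needed beyond what the parser reads.
    return Collections.<WebPage.Field>emptySet();
  }

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  @Override
  public Configuration getConf() {
    return conf;
  }
}

The plugin would of course still need its own plugin.xml registering the class
against the ParseFilter extension point, and its id added to plugin.includes in
nutch-site.xml so that it runs after parse-html/parse-tika.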



At 2013-01-16 04:24:11, "Sebastian Nagel" <[email protected]> wrote:
>Hi,
>
>did I understand you correctly?
>- feed.txt is placed in the seed url folder and
>- contains URLs of the 50 article lists
>If yes:
> -depth 2
>will crawl these 50 URLs and for each article list all its 30 outlinks,
>in total 50 + 50*30 = 1550 documents.
>
>If you continue crawling, Nutch fetches the outlinks of the 1500 docs fetched
>in the second cycle, and then the links found again, and so on: it will
>continue to crawl the whole web. To limit the crawl to exactly the 1550 docs,
>either remove all previously crawled data to start again from scratch
>or have a look at the plugin "scoring-depth" (it's new and,
>unfortunately, not yet adapted to 2.x, see
>https://issues.apache.org/jira/browse/NUTCH-1331 and
>https://issues.apache.org/jira/browse/NUTCH-1508).
>
>The option name -depth does not mean a "limitation of a certain linkage depth"
>(that's the meaning in "scoring-depth") but the number of crawl cycles or rounds.
>If a crawl is started from scratch the results are identical in most cases.
>
>Sebastian
>
>On 01/15/2013 06:53 PM, 高睿 wrote:
>> I'm not quite sure about your question here. I'm using the Nutch 2.1 default
>> configuration and run the command:
>> bin/nutch crawl urls -solr http://localhost:8080/solr/collection2 -threads 10 -depth 2 -topN 1000
>> The 'urls' folder includes the blog index pages (each index page includes a
>> list of article pages).
>> I think the plugins 'parse-html' and 'parse-tika' are currently responsible
>> for parsing the links from the html. Should I clean the outlinks in an
>> additional parse plugin in order to prevent Nutch from crawling the outlinks
>> on the article pages?
>>
>> At 2013-01-15 13:31:11, "Lewis John Mcgibbney" <[email protected]> wrote:
>>> I take it you are updating the database with the crawl data? This will mark
>>> all links extracted during the parse phase (depending upon your config) as
>>> due for fetching. When you generate, these links will be populated within
>>> the batchIds and Nutch will attempt to fetch them.
>>> Please also search our list archives for the definition of the depth
>>> parameter.
>>> Lewis
>>>
>>> On Monday, January 14, 2013, 高睿 <[email protected]> wrote:
>>>> Hi,
>>>>
>>>> I'm customizing Nutch 2.1 for crawling blogs from several authors. Each
>>>> author's blog has a list page and article pages.
>>>>
>>>> Say I want to crawl the articles in 50 article lists (each with 30
>>>> articles). I add the article list links to feed.txt and specify
>>>> '-depth 2' and '-topN 2000'. My expectation is that each time I run Nutch,
>>>> it will crawl all the list pages and the articles in each list. But
>>>> actually, the number of urls that Nutch crawls seems to grow more and
>>>> more, and the crawl takes more and more time (3 hours -> more than 24
>>>> hours).
>>>>
>>>> Could someone explain to me what happens? Does Nutch 2.1 always start
>>>> crawling from the seed folder and follow the 'depth' parameter? What
>>>> should I do to meet my requirement?
>>>> Thanks.
>>>>
>>>> Regards,
>>>> Rui
>>>>
>>>
>>> --
>>> *Lewis*
