still not so clear to me

2006-03-07 Thread Richard Braman
Can someone confirm this:

You start a crawldb from a list of urls, and you generate a fetch list,
which is akin to seeding your crawldb. When you fetch, it fetches just
those seed urls.
When you do your next round of generate/fetch/update, the fetch list
will have the links found while parsing the pages at the original urls.
Then on the next round, it will fetch the links found during the
previous fetch.

So with each round of fetching, Nutch goes deeper and deeper into the
web, fetching only urls it hasn't previously fetched.
The generate command generates a fetch list based first on the seed
urls, then on the links found on those pages (on each subsequent
iteration), then on the links on those pages, and so on, until the
entire domain is crawled, if you limit the domains with a filter.
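For concreteness, that cycle looks roughly like this on the command
line (a sketch, assuming the 0.8-style crawldb tools; the urls/ seed
directory and the segments/ path are illustrative):

    bin/nutch inject crawldb urls          # seed the crawldb from url lists

    # one round; repeat for each level of depth
    bin/nutch generate crawldb segments    # write a fetch list into a new segment
    s=`ls -d segments/* | tail -1`         # the segment just created
    bin/nutch fetch $s                     # fetch the pages on the fetch list
    bin/nutch updatedb crawldb $s          # merge fetched pages and new links back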

Richard Braman
mailto:[EMAIL PROTECTED]
561.748.4002 (voice) 

http://www.taxcodesoftware.org/
Free Open Source Tax Software

 


Re: still not so clear to me

2006-03-07 Thread Andrzej Bialecki

Richard Braman wrote:

> Can someone confirm this:
>
> You start a crawldb from a list of urls, and you generate a fetch
> list, which is akin to seeding your crawldb. [...]
>
> So with each round of fetching, Nutch goes deeper and deeper into the
> web, fetching only urls it hasn't previously fetched. [...]


Yes.

--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: still not so clear to me

2006-03-07 Thread Doug Cutting



Richard Braman wrote:

> Can someone confirm this:
>
> You start a crawldb from a list of urls, and you generate a fetch
> list, which is akin to seeding your crawldb. [...]
>
> The generate command generates a fetch list based first on the seed
> urls, then on the links found on those pages, and so on, until the
> entire domain is crawled, if you limit the domains with a filter.


This all sounds right to me.

Some clarifications:

- urls are filtered before adding them to the crawldb, so the db only 
ever contains urls that pass the filter.
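This filtering is also what lets you limit the crawl to a domain, as
Richard suggests. A minimal sketch of the regex filter configuration
(assuming the regex urlfilter plugin is enabled; the domain is
illustrative):

    # conf/regex-urlfilter.txt
    # accept anything under the seed domain
    +^http://([a-z0-9]*\.)*taxcodesoftware\.org/
    # reject everything else
    -.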


- the db contains both urls that have been fetched and those that have 
not been fetched.  When you find a new link to a url that is already in 
the db, it does not add a new entry, but rather just updates the 
existing entry's score.
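A quick way to observe this (command sketch; the crawldb path is
illustrative) is to print the db statistics, which report entry counts
broken down by status, fetched and unfetched alike:

    bin/nutch readdb crawldb -stats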


- higher-scoring pages are generated in preference to lower-scoring 
pages when the -topN option is used.  So a page discovered in the first 
round might not be fetched until the fourth round, when enough other 
links have been found to that page to warrant fetching it.  Thus, when 
-topN is specified, crawling is not strictly breadth-first.
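For example (a sketch; the number is illustrative), to restrict each
round to the 1000 best-scoring unfetched pages:

    bin/nutch generate crawldb segments -topN 1000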


Doug