Hi Kris,

The last parameter of bin/crawl defines the number of rounds (or cycles).
In each cycle the following steps are performed:
 - generate a list of URLs to be fetched
 - fetch this list
 - parse documents and extract outlinks
 - write these outlink URLs to CrawlDb
   (as new entries if they are not yet known)
 - update LinkDb
 - index content of this cycle
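For reference, one round of bin/crawl corresponds roughly to the following
sequence of bin/nutch commands (paths are examples; the actual script also
takes care of segment naming, deduplication, time limits, etc.):

  bin/nutch generate  crawl/crawldb crawl/segments -topN 50000
  bin/nutch fetch     crawl/segments/<segment>
  bin/nutch parse     crawl/segments/<segment>
  bin/nutch updatedb  crawl/crawldb crawl/segments/<segment>
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  bin/nutch index     crawl/crawldb -linkdb crawl/linkdb crawl/segments/<segment>

Have a look at the bin/crawl script itself to see the exact invocations and
options used by your Nutch version.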

> I assumed that increasing the integer would increase the link depth
> that would get crawled and I would retrieve more content.

Yes, the number of cycles is the same as the link depth under the following
conditions:
- all unfetched URLs in the CrawlDb fit into the fetch list
  (see the -topN parameter of bin/nutch generate; for bin/crawl this
   is 50,000 by default)
- and the fetch list is entirely fetched within the configured time limit
- no transient errors (e.g., a network timeout) that cause a page to
  be refetched at a later time (in one of the following rounds)
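To make the first condition concrete, a toy example with made-up numbers:
say -topN is 1,000, you start from 100 seeds, and every fetched page yields
about 50 previously unknown outlinks:

  round 1: fetch the 100 seeds (depth 0)  -> ~5,000 new depth-1 URLs in CrawlDb
  round 2: fetch 1,000 of them (depth 1)  -> ~4,000 depth-1 URLs still unfetched
  round 3: the fetch list now mixes leftover depth-1 URLs with depth-2 URLs
  ...

As soon as one round leaves unfetched URLs behind, later rounds spend part of
their fetch list on them, so after 5 rounds the crawl may have reached far
less than link depth 5.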

See the plugin scoring-depth if you need to set an exact limit on the link 
depth.
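If memory serves, enabling it means adding scoring-depth to plugin.includes
and setting the maximum depth via the scoring.depth.max property, e.g. in
nutch-site.xml (please verify the property name and default against
conf/nutch-default.xml of your Nutch version):

  <property>
    <name>plugin.includes</name>
    <value>...|scoring-depth|...</value>
  </property>
  <property>
    <name>scoring.depth.max</name>
    <value>5</value>
  </property>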


Best,
Sebastian


On 08/09/2016 01:38 PM, Musshorn, Kris T CTR USARMY RDECOM ARL (US) wrote:
> CLASSIFICATION: UNCLASSIFIED
> 
> What effect does the integer parameter at the end of the call to bin/crawl 
> represent?
> 
> ./bin/crawl -i -D solr.server.url=http://localhost:8983/solr/ARLInside urls/ 
> crawlARLInside 5
> 
> When I run this call I don't get back all of the results I would expect.
> 
> I assumed that increasing the integer would increase the link depth that 
> would get crawled and I would retrieve more content.
> 
> Someone correct my thinking on this please.
> 
> Thanks,
> Kris
> 
> ~~~~~~~~~~~~~~~~~~~~~~~~~~
> Kris T. Musshorn
> FileMaker Developer - Contractor - Catapult Technology Inc.      
> US Army Research Lab 
> Aberdeen Proving Ground 
> Application Management & Development Branch 
> 410-278-7251
> [email protected]
> ~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> 
