Hi Kris,

the last parameter of bin/crawl defines the number of rounds (or cycles).
In each cycle the following steps are performed:
- generate a list of URLs to be fetched
- fetch this list
- parse documents and extract outlinks
- write these outlink URLs to CrawlDb (as new entries if they are not yet known)
- update LinkDb
- index content of this cycle
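A toy sketch of one such round, using an in-memory dict as the CrawlDb and a made-up link graph (plain Python for illustration only, not Nutch code; URLs and names are hypothetical):

```python
# Toy model of one crawl round. The "CrawlDb" is a dict: url -> fetched flag.
# OUTLINKS stands in for the outlinks extracted at parse time (made-up data).
OUTLINKS = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/c"],
}

def crawl_round(crawldb, topn=50000):
    # 1. generate: a list of up to topN not-yet-fetched URLs
    fetch_list = [u for u, fetched in crawldb.items() if not fetched][:topn]
    for url in fetch_list:
        # 2./3. fetch and parse, extracting outlinks
        crawldb[url] = True
        for out in OUTLINKS.get(url, []):
            # 4. update CrawlDb: unknown outlinks become new unfetched entries
            crawldb.setdefault(out, False)
    # (steps 5 and 6, LinkDb update and indexing, are omitted in this sketch)
    return fetch_list

crawldb = {"http://example.com/": False}  # one seed URL
crawl_round(crawldb)
print(sorted(u for u, f in crawldb.items() if not f))
# -> ['http://example.com/a', 'http://example.com/b']
```

After one round the seed is fetched and its two outlinks sit in the CrawlDb as unfetched entries, waiting for the generate step of the next round.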
> I assumed that increasing the integer would increase the link depth
> that would get crawled and I would retrieve more content.

Yes, the number of cycles is the same as the link depth under the following conditions:
- all unfetched URLs in the CrawlDb fit into the fetch list (see parameter -topN of bin/nutch generate; for bin/crawl this is 50,000 by default)
- the fetch list is entirely fetched within the configured time limit
- there are no transient errors (e.g., a network timeout) which cause a page to be refetched at a later time (in one of the following rounds)

See the plugin scoring-depth if you need to set an exact limit on the link depth.

Best,
Sebastian

On 08/09/2016 01:38 PM, Musshorn, Kris T CTR USARMY RDECOM ARL (US) wrote:
> CLASSIFICATION: UNCLASSIFIED
>
> What effect does the integer parameter at the end of the call to bin/crawl
> represent?
>
> ./bin/crawl -i -D solr.server.url=http://localhost:8983/solr/ARLInside urls/
> crawlARLInside 5
>
> When I run this call I don't get back all of the results I would expect.
>
> I assumed that increasing the integer would increase the link depth that
> would get crawled and I would retrieve more content.
>
> Someone correct my thinking on this please.
>
> Thanks,
> Kris
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~
> Kris T. Musshorn
> FileMaker Developer - Contractor - Catapult Technology Inc.
> US Army Research Lab
> Aberdeen Proving Ground
> Application Management & Development Branch
> 410-278-7251
> [email protected]
> ~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> CLASSIFICATION: UNCLASSIFIED
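P.S. To make the first condition concrete: once the number of unfetched URLs outgrows topN, a round no longer advances the crawl by a full link level. A toy simulation (plain Python, not Nutch code; the fan-out of 10 links per page is made up, and it assumes shallower pages are generated first):

```python
def depth_reached(rounds, topn, fanout=10):
    # frontier maps link depth -> count of unfetched URLs at that depth;
    # each fetched page is assumed to link to `fanout` new pages.
    frontier = {0: 1}  # one seed URL at depth 0
    deepest = -1
    for _ in range(rounds):
        # generate: take up to topn unfetched URLs, shallowest first
        budget, fetched = topn, {}
        for d in sorted(frontier):
            if budget == 0:
                break
            take = min(frontier[d], budget)
            budget -= take
            fetched[d] = take
        # fetch/parse/updatedb: outlinks appear one level deeper,
        # but are only eligible for generation in the *next* round
        for d, take in fetched.items():
            frontier[d] -= take
            frontier[d + 1] = frontier.get(d + 1, 0) + take * fanout
            deepest = max(deepest, d)
        frontier = {d: c for d, c in frontier.items() if c > 0}
    return deepest

print(depth_reached(rounds=5, topn=50000))  # -> 4: each round fetches one full level
print(depth_reached(rounds=5, topn=100))    # -> 3: the capped fetch list lags behind
```

With a large enough topN, 5 rounds fetch everything up to 4 hops from the seed (depth keeps pace with the rounds); with topn=100 the frontier outgrows the fetch list and 5 rounds only reach depth 3, which matches the "fewer results than expected" symptom.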

