Hi Scott,

the number of cycles/rounds (the crawl "depth") is roughly equivalent to the
number of hops (links followed) needed to reach a document starting from one of
the seeds. It is unrelated to the depth of the document in the server's file
system hierarchy. If there is a link from
 http://www.bizjournals.com/triangle/
to e.g.
 http://www.bizjournals.com/triangle/blog/techflash/story.html
the latter document is crawled in the second round.
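
To make the difference between link hops and path depth concrete, here is a
minimal sketch in Python. It is not Nutch code: the link graph is hypothetical
(a single link from the seed to the story), the helper names crawl_rounds and
path_depth are made up for illustration, and a real crawl is also shaped by
topN, scoring and filters, which is why "roughly" applies.

```python
from collections import deque
from urllib.parse import urlparse


def crawl_rounds(seeds, links):
    """Breadth-first walk: round 1 fetches the seeds, round n+1 fetches
    pages first linked from a page fetched in round n."""
    rounds = {url: 1 for url in seeds}
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        for target in links.get(url, []):
            if target not in rounds:
                rounds[target] = rounds[url] + 1
                queue.append(target)
    return rounds


def path_depth(url):
    """Number of non-empty path segments in the URL."""
    return len([part for part in urlparse(url).path.split("/") if part])


seed = "http://www.bizjournals.com/triangle/"
story = "http://www.bizjournals.com/triangle/blog/techflash/story.html"

# Hypothetical link graph: the seed page links directly to the story.
links = {seed: [story]}

rounds = crawl_rounds([seed], links)
print(rounds[story], path_depth(story))
```

The story sits four path segments deep but is only one hop from the seed, so
it is fetched in round 2.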

The easiest way to limit the crawl by directory depth is to use regex URL
filters.
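
As a rough sketch of such a filter, the Python below emulates the
"first matching rule wins" behaviour of +/- prefixed regex rules, as used in
conf/regex-urlfilter.txt. The two rules are only an assumption about what you
might want (reject anything more than two directories below /triangle/, accept
the rest of /triangle/); the function name accepts is made up, and the matching
is a simplification of what the Nutch plugin does.

```python
import re

# Hypothetical rules, in the order they would appear in the filter file.
# "-" rejects, "+" accepts; the first rule whose pattern matches decides.
RULES = [
    # three or more directories below /triangle/ -> too deep, reject
    ("-", r"^http://www\.bizjournals\.com/triangle/(?:[^/]+/){3,}"),
    # everything else below /triangle/ -> accept
    ("+", r"^http://www\.bizjournals\.com/triangle/"),
]


def accepts(url, rules=RULES):
    """Return True if the first matching rule is a '+' rule.
    URLs matching no rule are rejected."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False


for url in (
    "http://www.bizjournals.com/triangle/prnewswire/press_releases/a.html",
    "http://www.bizjournals.com/triangle/blog/techflash/story.html",
    "http://www.bizjournals.com/triangle/a/b/c/d.html",
    "http://www.example.com/",
):
    print(accepts(url), url)
```

In the filter file itself each rule would be one line, the sign immediately
followed by the pattern. Test the patterns against a sample of your URLs
before starting a long crawl.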

Sebastian

On 04/07/2015 04:04 PM, Scott Lundgren wrote:
> Is Nutch’s Rounds/Crawl Depth relative to the URLs in seed.txt?
> 
> For example if my seed.txt is http://www.bizjournals.com/triangle/ and I want 
> to make sure that I’m crawling 
> http://www.bizjournals.com/triangle/prnewswire/press_releases/.* and 
> http://www.bizjournals.com/triangle/blog/techflash/.* does my rounds need to 
> be set to 2 (i.e.: everything under /prnewswire/press_releases/ is crawled ) 
> or 3 (/triangle/prnewswire/press_releases/)?
> 
> Scott Lundgren
