Hi,

that's only possible if there are links from the page
  https://internalsite/inside/
to its "subdirectories", e.g.
  <a href="./subdir/"> ... </a>

If there are no outgoing links to other hosts or to
"directories" other than "inside/", you are done.
If not, the URL filters have to be configured appropriately.
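As a sketch (assuming the default regex-urlfilter plugin is
enabled), conf/regex-urlfilter.txt could restrict the crawl like
this -- the hostname "internalsite" is taken from your seed:

```
# accept everything below the seed "directory"
+^https://internalsite/inside/

# reject all other URLs (replace the default "+." accept-all rule)
-.
```

Note that rules are applied in order and the first matching
pattern wins, so the accept rule must come before the catch-all
reject.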

While file systems have directories and are able to list
all contained files and subdirectories, that's not the
case on the web: there are only documents (HTML pages and
other doc formats), and the crawler can reach them only by
following links.

Sebastian

On 08/03/2016 02:27 PM, Musshorn, Kris T CTR USARMY RDECOM ARL (US) wrote:
> CLASSIFICATION: UNCLASSIFIED
> 
> I have /urls/seed.txt setup to crawl https://internalsite/inside/
> 
> I want nutch to crawl https://internalsite/inside/ and all directories under 
> that.
> 
> How do I set this up without having to name every sub dir that I want to 
> crawl in seed.txt? 
> 
> Thanks,
> Kris
> 
> ~~~~~~~~~~~~~~~~~~~~~~~~~~
> Kris T. Musshorn
> FileMaker Developer - Contractor - Catapult Technology Inc.      
> US Army Research Lab 
> Aberdeen Proving Ground 
> Application Management & Development Branch 
> 410-278-7251
> [email protected]
> ~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> 
> 
> CLASSIFICATION: UNCLASSIFIED
> 
