Re: Problem with regex url filter

Bayu Widyasanyata Mon, 05 May 2014 15:43:26 -0700

On Mon, May 5, 2014 at 10:34 PM, Paul Rogers <[email protected]> wrote:


> My question is how do I get nutch to crawl all the files on a web site not
> just the "root" url?
>

Hi,

Nutch acts as a crawler, much the same way we use any Internet browser.
Neither Nutch nor we can browse or crawl pages that have no referring
page (a page that links to them).
So, you should have a page that links to index1.html.
The file index.html is crawled automatically since it should be your
DirectoryIndex page.
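
To illustrate why an unlinked page is never reached, here is a minimal
sketch of link-following discovery. It is a toy model, not Nutch's actual
code: the SITE dictionary, the example.test URLs, and the crawl() helper
are all assumptions made up for this example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical in-memory site: URL -> HTML body (assumption for illustration).
SITE = {
    "http://example.test/": '<a href="about.html">About</a>',
    "http://example.test/about.html": '<a href="/">Home</a>',
    # index1.html exists on the server, but no page links to it.
    "http://example.test/index1.html": "<p>No page links here</p>",
}


def crawl(seed, pages):
    """Breadth-first walk that only visits URLs found through links."""
    seen = {seed}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        parser = LinkParser()
        parser.feed(pages.get(url, ""))
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if target in pages and target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


reached = crawl("http://example.test/", SITE)
print(sorted(reached))
```

In this sketch index1.html is never visited, because the crawl starts at
the root URL and only follows links. Adding a link to index1.html on any
reachable page, or adding its URL to the seed list, makes it reachable.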

Otherwise you can still fetch it through "directory crawling" (instead
of browser crawling), or you can disable the directory index page setting
(such as DirectoryIndex on Apache), so that clients (Nutch) can browse
your entire directories.

Thanks.-


-- 
wassalam,
[bayu]
