Re: Problem with regex url filter

Paul Rogers Mon, 05 May 2014 16:06:06 -0700

Hi Bayu

Many thanks for the response.


> Otherwise you can still also fetch it through "directory crawling" (instead
> of browser crawling)

By that do you mean using file:// as opposed to http:// crawling?

Thanks

P


On 5 May 2014 17:42, Bayu Widyasanyata <[email protected]> wrote:

> On Mon, May 5, 2014 at 10:34 PM, Paul Rogers <[email protected]>
> wrote:
>
> > My question is how do I get nutch to crawl all the files on a web site
> > not just the "root" url?
> >
>
> Hi,
>
> Nutch acts as a crawler, much the same way we use any Internet browser.
> Neither Nutch nor we can browse or crawl pages that don't have a referring
> page (a page that links to them).
> So, you should have a page that has a link to index1.html.
> The file index.html is crawled automatically since it should be your
> DirectoryIndex page.
>
> Otherwise you can still fetch it through "directory crawling" (instead
> of browser crawling), or you can disable the directory index page setting
> (such as DirectoryIndex on Apache), so that clients (Nutch) can browse your
> entire directories.
>
> Thanks.-
>
>
> --
> wassalam,
> [bayu]
>
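
To illustrate the point quoted above, that a crawler only reaches pages some
other page links to, here is a minimal Python sketch. It is not Nutch code
and does not use any Nutch API. The in-memory site, the example.com URLs, and
the names LinkCollector and discover are all hypothetical, made up for this
illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href targets of <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def discover(pages, seed):
    """Return every URL reachable from seed by following links.

    pages is a hypothetical in-memory site: a dict mapping URL -> HTML.
    """
    seen, queue = set(), [seed]
    while queue:
        url = queue.pop(0)
        if url in seen or url not in pages:
            continue  # already visited, or not part of the toy site
        seen.add(url)
        parser = LinkCollector(url)
        parser.feed(pages[url])
        queue.extend(parser.links)
    return seen


# Toy site (assumption): index1.html exists but nothing links to it.
site = {
    "http://example.com/index.html": '<a href="about.html">About</a>',
    "http://example.com/about.html": "<p>No links here.</p>",
    "http://example.com/index1.html": "<p>Nothing links to this page.</p>",
}
reachable = discover(site, "http://example.com/index.html")
```

In this toy setup, index1.html is never discovered from the seed. It would
only be reached if a crawled page linked to it, or if its URL were supplied
to the crawler as an additional seed.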
