Hi Paul,

Apologies for the late reply; I had other tasks to finish first.

The common practice: if your website mainly provides information (e.g. a
blog, product information, a company profile, etc.), you should
*enable* DirectoryIndex as described here [0].

But if you have a particular directory that will be shown as a directory
listing and you don't want it crawled and indexed, you can disallow it by
configuring Nutch's regex-urlfilter.txt file.

e.g.:

-^http://yoursite.com/directory/directory-disallow/*
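To check what a rule like this would exclude before running a crawl, here is a minimal Python sketch of how I understand the filter to behave. It assumes rules are tried top to bottom, the first pattern found anywhere in the URL decides (`-` rejects, `+` accepts), and a URL matching no rule is rejected. The `RULES` list and the `accepts` helper are hypothetical names for illustration, not part of Nutch, and the dot in the host name is escaped here so it only matches a literal dot.

```python
import re

# Hypothetical stand-in for the rules in regex-urlfilter.txt:
# (sign, pattern) pairs, evaluated in file order.
RULES = [
    ("-", r"^http://yoursite\.com/directory/directory-disallow/"),
    ("+", r"."),  # catch-all: accept anything not rejected above
]

def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern is found in url is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    # Assumption: a URL that matches no rule is ignored.
    return False

if __name__ == "__main__":
    for url in (
        "http://yoursite.com/directory/index.html",
        "http://yoursite.com/directory/directory-disallow/",
        "http://yoursite.com/directory/directory-disallow/file.html",
    ):
        print(url, "->", "accepted" if accepts(url) else "rejected")
```

Note that under these assumptions the rule rejects the listing URL and everything below it, so those pages would not be fetched at all, not merely left out of the index.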

Thanks.


On Fri, May 9, 2014 at 1:06 AM, Paul Rogers <[email protected]> wrote:

> Hi Bayu
>
> Many thanks for that.  Disabling the directory index page and enabling a
> directory has fixed the issue.  I now get three documents indexed.  The
> directory listing, index.html and index1.html
>
> Is there any way to stop nutch from indexing (rather than crawling) the
> directory listing itself?
>
> Thanks
>
> Paul
>
>
> On 5 May 2014 18:57, Bayu Widyasanyata <[email protected]> wrote:
>
> > On Tue, May 6, 2014 at 6:05 AM, Paul Rogers <[email protected]>
> > wrote:
> >
> > > By that do you mean using file:// as opposed to http:// crawling?
> >
> >
> > Yup.
> >
> >
> https://wiki.apache.org/nutch/FAQ#Nutch_crawling_parent_directories_for_file_protocol
> >
> >
> > --
> > wassalam,
> > [bayu]
> >
>



-- 
wassalam,
[bayu]
