Hi Paul,

Apologies for the late reply; I had other tasks to finish first.

If your website is an ordinary informational site (e.g. a blog, product information, a company profile, etc.), the common practice is to *enable* DirectoryIndex, as described here [0].

However, if there is a particular directory that will be shown as a directory listing and you don't want it crawled and indexed, you can exclude it by configuring Nutch's regex-urlfilter.txt file, e.g.:

-^http://yoursite.com/directory/directory-disallow/*

Thanks.

On Fri, May 9, 2014 at 1:06 AM, Paul Rogers <[email protected]> wrote:

> Hi Bayu
>
> Many thanks for that. Disabling the directory index page and enabling a
> directory has fixed the issue. I now get three documents indexed. The
> directory listing, index.html and index1.html
>
> Is there any way to stop nutch from indexing (rather than crawling) the
> directory listing itself?
>
> Thanks
>
> Paul
>
>
> On 5 May 2014 18:57, Bayu Widyasanyata <[email protected]> wrote:
>
> > On Tue, May 6, 2014 at 6:05 AM, Paul Rogers <[email protected]>
> > wrote:
> >
> > > By that do you mean using file:// as opposed to http:// crawling?
> > >
> >
> > Yupe.
> >
> >
> https://wiki.apache.org/nutch/FAQ#Nutch_crawling_parent_directories_for_file_protocol
> >
> >
> > --
> > wassalam,
> > [bayu]
> >
>

--
wassalam,
[bayu]
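
P.S. To illustrate how a rule like the one suggested for regex-urlfilter.txt is evaluated, here is a rough sketch in Python. It is not Nutch code: Nutch applies Java regular expressions, and the helper names `load_rules` and `accepts` are made up for this example. It assumes the usual behaviour of the regex URL filter: rules are read top to bottom, the first pattern found in the URL decides (`+` keeps it, `-` drops it), and a URL matching no rule is dropped.

```python
import re


def load_rules(text):
    """Parse regex-urlfilter style lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First matching rule decides; a URL that matches no rule is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


# Hypothetical rule file: the exclusion must come before the catch-all "+."
RULES = load_rules("""
# skip the directory that is served as a listing
-^http://yoursite.com/directory/directory-disallow/*
# accept everything else
+.
""")

for url in (
    "http://yoursite.com/index.html",
    "http://yoursite.com/directory/directory-disallow/",
    "http://yoursite.com/directory/directory-disallow/file.html",
):
    print(url, "->", "keep" if accepts(RULES, url) else "skip")
```

Two things worth keeping in mind. Because the first match wins, the `-` line has to sit above the final `+.` line in the file. Also, the trailing `/*` means "zero or more slashes", so the pattern acts as a prefix match and would also exclude a sibling such as directory-disallow2; ending the pattern with a single `/` would be stricter.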

