Hi Bayu

Many thanks for that.

What I'm trying to do is crawl the documents in the directory but not have Nutch submit the directory listing to Solr for indexing. For example, if I have a directory with four PDF documents in it, Nutch crawls it and Solr indexes five documents (the directory listing and the four PDF documents).

I can see the logic: Nutch is crawling URLs, so http://mysite/my-directory/ (the directory listing) and http://mysite/my-directory/pdfdoc.pdf are both valid URLs.

What I think I need is a regex filter that excludes directories (and their listings) but includes any files in them.

Thanks

P

On 19 May 2014 09:31, Bayu Widyasanyata <[email protected]> wrote:

> Hi Paul,
>
> Apologies for the late reply; I had other tasks to finish.
>
> The common practice, if your website is a typical site providing
> information (e.g. blog, product info, company profile, etc.), is to
> *enable* DirectoryIndex as described here [0].
>
> But if you have a particular directory which will be shown as a directory
> listing and you don't want it crawled and indexed, you can disallow it by
> configuring the Nutch regex-urlfilter.txt file.
>
> e.g.:
>
> -^http://yoursite.com/directory/directory-disallow/*
>
> Thanks.-
>
> On Fri, May 9, 2014 at 1:06 AM, Paul Rogers <[email protected]>
> wrote:
>
> > Hi Bayu
> >
> > Many thanks for that. Disabling the directory index page and enabling a
> > directory has fixed the issue. I now get three documents indexed: the
> > directory listing, index.html and index1.html.
> >
> > Is there any way to stop Nutch from indexing (rather than crawling) the
> > directory listing itself?
> >
> > Thanks
> >
> > Paul
> >
> > On 5 May 2014 18:57, Bayu Widyasanyata <[email protected]> wrote:
> >
> > > On Tue, May 6, 2014 at 6:05 AM, Paul Rogers <[email protected]>
> > > wrote:
> > >
> > > > By that do you mean using file:// as opposed to http:// crawling?
> > >
> > > Yupe.
> > >
> > > https://wiki.apache.org/nutch/FAQ#Nutch_crawling_parent_directories_for_file_protocol
> > >
> > > --
> > > wassalam,
> > > [bayu]
>
> --
> wassalam,
> [bayu]
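
On the regex filter asked about above (exclude directory listings, keep the files inside them), here is a rough sketch of how such a rule list would classify the two example URLs. It is plain Python that only imitates the "first matching rule wins" style of regex-urlfilter.txt; the RULES list and the accepts() helper are made up for illustration and are not part of Nutch. It also assumes that directory listing URLs always end in a slash, which may not hold on every server. One caveat to keep in mind: as far as I understand, the URL filters also decide what gets fetched, so rejecting the listing there may stop Nutch from discovering the files linked from it. The pattern may therefore need to be applied at indexing time rather than in the crawl filter.

```python
import re

# Hypothetical rule list in the style of regex-urlfilter.txt:
# "-" rejects, "+" accepts, and the first rule that matches decides.
RULES = [
    ("-", re.compile(r"/$")),  # reject URLs ending in a slash (assumed to be directory listings)
    ("+", re.compile(r".")),   # accept anything else
]


def accepts(url: str) -> bool:
    """Return True if the first matching rule is an accept rule."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched: treat the URL as rejected


if __name__ == "__main__":
    for url in (
        "http://mysite/my-directory/",
        "http://mysite/my-directory/pdfdoc.pdf",
    ):
        print(url, "->", "keep" if accepts(url) else "skip")
```

With these rules the listing URL is skipped and the PDF URL is kept, which is the split described in the message above.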

