When I perform a crawl, one of the documents returned by Nutch is the index of documents, e.g. for a crawl of:

    https://my.domain.name/inside/test/

the content of the first document is:

    Index of /inside/test
    Index of /inside/test
    Parent Directory
    test_css.css
    test_css.html
    test_css1.html
    test_css2.html
    test_css3.html
    test_css4.css
    test_css4.html
    test_css5.cfm
    test_css6.cfm

How do I prevent this from happening?

regex-urlfilter.txt has the following:

    # skip URLs
    -^https://my.domain.name/inside/test$
    # accept URLs
    +^https://my.domain.name/inside/test/*

--
View this message in context: http://lucene.472066.n3.nabble.com/Nutch-returns-index-as-document-tp4080323.html
Sent from the Nutch - User mailing list archive at Nabble.com.
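For reference, here is a rough way to check which of the two rules quoted above each URL would hit. It is only a sketch in Python, not Nutch code. It assumes the filter reads the rules top to bottom, stops at the first pattern found anywhere in the URL, and that `+` means accept and `-` means skip. The helper name `first_matching_rule` is made up for this illustration, and Python's `re` stands in for the Java regex engine, which should behave the same for these simple patterns.

```python
import re

# Rules copied from the regex-urlfilter.txt quoted above, as (sign, pattern).
RULES = [
    ("-", r"^https://my.domain.name/inside/test$"),
    ("+", r"^https://my.domain.name/inside/test/*"),
]


def first_matching_rule(url, rules=RULES):
    """Return (sign, pattern) for the first rule found in url, or None.

    Assumption: rules are tried in file order and the first hit decides.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign, pattern
    return None


if __name__ == "__main__":
    for url in (
        "https://my.domain.name/inside/test",    # no trailing slash
        "https://my.domain.name/inside/test/",   # the URL that was crawled
        "https://my.domain.name/inside/test/test_css.html",
    ):
        print(url, "->", first_matching_rule(url))
```

Under those assumptions, the skip rule only hits the URL without the trailing slash, because of the `$` straight after `test`. The crawled URL ends in `/`, so it falls through to the accept rule. Note also that `/*` means "zero or more slashes", so the accept rule matches anything that starts with `https://my.domain.name/inside/test`.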

