When I perform a crawl, one of the documents returned by Nutch is the
directory listing itself (the "Index of" page) rather than an actual
document. For example,

for a crawl of:
https://my.domain.name/inside/test/

the content of the first document is:
Index of /inside/test Index of /inside/test Parent Directory test_css.css
test_css.html test_css1.html test_css2.html test_css3.html test_css4.css
test_css4.html test_css5.cfm test_css6.cfm

How do I prevent the listing page from being returned as a document?

regex-urlfilter.txt has the following:
# skip URLs
-^https://my.domain.name/inside/test$

# accept URLs
+^https://my.domain.name/inside/test/*
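
To see how these two rules treat the crawled URL, here is a minimal Java
sketch. It assumes that the regex URL filter applies rules in file order,
that the first matching rule decides, and that a rule matches when the
pattern is found anywhere in the URL (java.util.regex find semantics). The
class and method names (UrlFilterSketch, accepts) are made up for
illustration and are not part of Nutch.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class UrlFilterSketch {
    // Rules in file order: key is the regex, value is true for '+' (accept)
    // and false for '-' (skip). Copied from the regex-urlfilter.txt above.
    static final Map<String, Boolean> RULES = new LinkedHashMap<>();
    static {
        RULES.put("^https://my.domain.name/inside/test$", false);
        RULES.put("^https://my.domain.name/inside/test/*", true);
    }

    // Assumption: first matching rule wins; a URL matching no rule is rejected.
    static boolean accepts(String url) {
        for (Map.Entry<String, Boolean> rule : RULES.entrySet()) {
            if (Pattern.compile(rule.getKey()).matcher(url).find()) {
                return rule.getValue();
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] urls = {
            "https://my.domain.name/inside/test",   // no trailing slash
            "https://my.domain.name/inside/test/",  // the crawled URL
            "https://my.domain.name/inside/test/test_css.html"
        };
        for (String url : urls) {
            System.out.println((accepts(url) ? "accept " : "skip   ") + url);
        }
    }
}
```

Under those assumptions the skip rule only matches the URL without the
trailing slash, so the crawled URL https://my.domain.name/inside/test/
falls through to the accept rule.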




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Nutch-returns-index-as-document-tp4080323.html
Sent from the Nutch - User mailing list archive at Nabble.com.
