Re: Nutch returns index as document

Sebastian Nagel Thu, 25 Jul 2013 13:50:02 -0700

Hi,

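Regexes in regex-urlfilter.txt must follow Java regex syntax, see
http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html

In your rules, "/*" means "zero or more slashes" (it is not a wildcard),
the skip rule does not cover the URL with a trailing slash, and the
unescaped dots match any character.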


I think your intention was:

# skip .../test and .../test/
-^https://my\.domain\.name/inside/test/?$
# allow paths below .../test/
+^https://my\.domain\.name/inside/test/.+

Finally, seed URLs are filtered as well: with these rules you cannot use
  https://my.domain.name/inside/test/
as a seed URL.
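
To check how the two rule sets behave, here is a small standalone sketch
using java.util.regex. It is not Nutch code: the class UrlFilterSketch and
the method accepts() are made up for illustration. It assumes that rules
are tried top to bottom, that the first regex found in the URL decides,
and that a URL matching no rule is skipped.

```java
import java.util.regex.Pattern;

public class UrlFilterSketch {

    // Rules as {sign, regex}: "+" accepts, "-" skips.
    // The rules quoted from the original message.
    static final String[][] ORIGINAL = {
        {"-", "^https://my.domain.name/inside/test$"},
        {"+", "^https://my.domain.name/inside/test/*"},
    };

    // The rules suggested in this reply.
    static final String[][] SUGGESTED = {
        {"-", "^https://my\\.domain\\.name/inside/test/?$"},
        {"+", "^https://my\\.domain\\.name/inside/test/.+"},
    };

    // Assumption: first matching rule wins, no match means "skip".
    // find() is used, so a regex only needs to match part of the URL.
    static boolean accepts(String[][] rules, String url) {
        for (String[] rule : rules) {
            if (Pattern.compile(rule[1]).matcher(url).find()) {
                return rule[0].equals("+");
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] urls = {
            "https://my.domain.name/inside/test",
            "https://my.domain.name/inside/test/",
            "https://my.domain.name/inside/test/test_css.html",
        };
        for (String url : urls) {
            System.out.println(url
                + "  original=" + accepts(ORIGINAL, url)
                + "  suggested=" + accepts(SUGGESTED, url));
        }
    }
}
```

Under these assumptions the original rules still accept .../test/ (the
index page), while the suggested rules skip both .../test and .../test/
and accept only the paths below. That is also why .../test/ is rejected
when given as a seed.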

Sebastian


On 07/25/2013 02:49 PM, stone2dbone wrote:
> When I perform a crawl, one of the documents returned by Nutch is the index
> of documents. e.g.
> 
> for a crawl of:
> https://my.domain.name/inside/test/
> 
> the content of the first document is:
> Index of /inside/test Index of /inside/test Parent Directory test_css.css
> test_css.html test_css1.html test_css2.html test_css3.html test_css4.css
> test_css4.html test_css5.cfm test_css6.cfm
> 
> How do I prevent this from happening?
> 
> regex-urlfilter.txt has the following:
> # skip URLs
> -^https://my.domain.name/inside/test$
> 
> # accept URLs
> +^https://my.domain.name/inside/test/*
> 
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Nutch-returns-index-as-document-tp4080323.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
