Re: Can't get any regex to work in regex-urlfilters.txt

Sebastian Nagel Tue, 21 Nov 2017 08:29:52 -0800

Hi,

these are invalid expressions, because the leading * is a quantifier
with nothing before it to repeat:
> +*html
> +*html$
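
A quick way to check whether a pattern is syntactically valid is to try
compiling it. The sketch below uses Python's re module as a stand-in,
which is an assumption on my side: Nutch itself uses Java regular
expressions, but both reject these two patterns. The helper name
compiles() is made up for illustration.

```python
import re

def compiles(pattern):
    """Return True if the regular expression compiles, False otherwise."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# The leading "+" of each rule is the include marker, not part of the regex,
# so it is stripped before compiling.
for rule in ["+*html", "+*html$", "+html", "+.*(html)$", r"+\.html$"]:
    print(rule, "valid" if compiles(rule[1:]) else "invalid")
```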


these should work:
> +html
> +.*(html)$

but a simpler and stricter expression, which accepts only URLs ending
in ".html", would be
+\.html$

Of course, if your seed URL does not match the regular expression, it's excluded,
and with a single seed that gives "Total urls rejected by filters: 1" and nothing
to fetch. That's the case for, e.g.:
 http://nutch.apache.org/
 http://example.com/index.php
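
To illustrate why such seeds are dropped, here is a minimal sketch of how
a list of +/- rules could be evaluated. It assumes that the first matching
rule decides, that the pattern may match anywhere in the URL, and that a
URL matching no rule is rejected. It is written in Python rather than the
Java that Nutch uses, the function accepts() is hypothetical, and the
third URL is an invented example of a seed that would pass.

```python
import re

def accepts(rules, url):
    """First matching rule decides; a URL matching no rule is rejected."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False

# Only the last-line rule from this thread, not the full default rule file.
rules = [r"+\.html$"]

for url in [
    "http://nutch.apache.org/",
    "http://example.com/index.php",
    "http://example.com/index.html",  # hypothetical seed that matches
]:
    print(url, "accepted" if accepts(rules, url) else "rejected")
```

With "+." as the only rule every non-empty URL is accepted, which is why
crawling works with that line in place.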

It's better to verify whether the URL filter configuration works as expected
beforehand:
 cat .../seeds.txt | $NUTCH_HOME/bin/nutch filterchecker -allCombined


If you want to keep only HTML pages, have a look at the plugins
  urlfilter-suffix
    to filter out URLs with undesired file extensions (.pdf, .xlsx, etc.)
  mimetype-filter
    to index selectively by MIME type


Best,
Sebastian


On 11/21/2017 03:45 PM, Sol Lederman wrote:
> In my regex-urlfilters.txt I have the default filters that come with nutch.
> If I have +. as the very last line of the file crawling works fine.
> 
> If I change that line to anything else then I get "Total urls rejected by
> filters: 1" and no urls are fetched.
> 
> I've tried a bunch of different entries in the last line:
> 
> +html
> +*html
> +*html$
> +.*(html)$
> 
> What am I missing?
> 
> Thanks.
> 
> Sol
>
