Hi,

these are invalid expressions:

> +*html
> +*html$

these should work:

> +html
> +.*(html)$

but the simpler expression would be

  +\.html$

Of course, if your seed URL does not match the regular expression
it's excluded. That's the case for, e.g.:

  http://nutch.apache.org/
  http://example.com/index.php

It's better to verify whether the URL filter configuration works
as expected beforehand:

  cat .../seeds.txt | $NUTCH_HOME/bin/nutch filterchecker -allCombined

If you want to keep only HTML pages, have a look at the plugins

- urlfilter-suffix to filter away URLs with undesired file extensions
  (.pdf, .xlsx, etc.)
- mimetype-filter to index selectively by MIME type

Best,
Sebastian

On 11/21/2017 03:45 PM, Sol Lederman wrote:
> In my regex-urlfilters.txt I have the default filters that come with nutch.
> If I have +. as the very last line of the file crawling works fine.
>
> If I change that line to anything else then I get "Total urls rejected by
> filters: 1" and no urls are fetched.
>
> I've tried a bunch of different entries in the last line:
>
> +html
> +*html
> +*html$
> +.*(html)$
>
> What am I missing?
>
> Thanks.
>
> Sol
>
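For illustration only, here is a rough Python sketch of the two points
in the reply: which of the tried lines are valid regular expressions,
and why a seed URL is rejected when the last rule does not match it.
It is not Nutch code. Python's re module stands in for the Java regex
engine that Nutch uses (the patterns in this thread behave the same in
both), and the helper names compile_rule, is_valid and accepts are made
up for this example. The sketch assumes that a rule matches when its
regex is found anywhere in the URL, that the first matching rule
decides, and that a URL matching no rule is rejected. The third seed
URL is a hypothetical addition to show an accepted case.

```python
import re

def compile_rule(line):
    """Split a filter line such as '+\\.html$' into (accept, compiled regex)."""
    sign, expr = line[0], line[1:]
    if sign not in ("+", "-"):
        raise ValueError(f"rule must start with + or -: {line!r}")
    return sign == "+", re.compile(expr)

def is_valid(line):
    """True if the part after the +/- sign compiles as a regular expression."""
    try:
        compile_rule(line)
        return True
    except re.error:
        return False

def accepts(url, rule_lines):
    """Assumed semantics: first rule found in the URL decides, no match rejects."""
    for accept, pattern in map(compile_rule, rule_lines):
        if pattern.search(url):  # substring search, not a full match
            return accept
    return False

# The lines tried in the original question, plus the simpler suggestion.
for line in ["+html", "+*html", "+*html$", "+.*(html)$", r"+\.html$"]:
    print(f"{line:12} valid={is_valid(line)}")

# The two seed URLs from the reply, plus one hypothetical URL ending in .html.
seeds = [
    "http://nutch.apache.org/",
    "http://example.com/index.php",
    "http://example.com/index.html",
]
for url in seeds:
    print(url, "accepted" if accepts(url, [r"+\.html$"]) else "rejected")
```

A leading * has nothing to repeat, which is why +*html and +*html$ fail
to compile. The filterchecker command above remains the authoritative
check against the real configuration.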

