Hi Sol,

> doesn't "+html" work as well regardless of what is in seeds.txt? I should
> be able to have http://foo.bar in seeds.txt and "+html" for the regex
> filter, yes?

URL filters are also applied to the seed list by default. That's why the
Injector logs

  Total urls rejected by filters: 1

> All I get back is "-http://foo.bar"

That means that this URL is rejected. Accepted URLs are marked by a
leading "+".

> What am I missing?

You may
- disable URL filters for the injector (-noFilter)
- or make sure that all seeds are accepted by the configured URL filters,
  e.g. add a rule:
    +http://foo\.bar/?$

Best,
Sebastian

On 11/21/2017 09:09 PM, Sol Lederman wrote:
> Sebastian,
>
> Thanks for the engagement and for the quick reply. I still can't get it to
> work. Here's something I don't understand. I assume that the dot in "+."
> means to match any character so it matches any URL. That's great. Why
> doesn't "+html" work as well regardless of what is in seeds.txt? I should
> be able to have http://foo.bar in seeds.txt and "+html" for the regex
> filter, yes? Or, are you saying that my regex filter has to look something
> like "http://foo.bar/.*html"?
>
> In any case, I've tried a variety of regex patterns, with and without the
> domain name in them, and none of them work. And, yes, the site in question
> does have files at the top level ending in ".html". And, yes, the default
> nutch.apache.org case crawls fine.
>
> I also did do the filterchecker test. All I get back is "-http://foo.bar"
> and a return code of 0. I get the same behavior for the working
> nutch.apache.org seed URL.
>
> What am I missing?
>
> Thanks again.
>
> Sol
>
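
To illustrate why a lone "+html" rule rejects the seed, here is a minimal sketch in Python. It is not Nutch's own code. It assumes the regex URL filter works roughly like this: rules are tried top to bottom, the first rule whose pattern is found anywhere in the URL decides, and a URL that matches no rule is rejected. The names parse_rules and check are hypothetical, and Python's re module stands in for Java regular expressions.

```python
import re


def parse_rules(text):
    """Parse lines such as '+html' or '-\\.gif$' into (accept, pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def check(url, rules):
    """Return the URL prefixed with '+' (accepted) or '-' (rejected)."""
    for accept, pattern in rules:
        # Assumption: the pattern may match anywhere in the URL.
        if pattern.search(url):
            return ("+" if accept else "-") + url
    # Assumption: a URL matched by no rule is rejected.
    return "-" + url


only_html = parse_rules("+html")
seed_and_html = parse_rules(r"""
# accept the seed itself, then anything containing "html"
+http://foo\.bar/?$
+html
""")

for rules in (only_html, seed_and_html):
    for url in ("http://foo.bar", "http://foo.bar/index.html"):
        print(check(url, rules))
```

Under these assumptions, "+html" alone accepts http://foo.bar/index.html but not the seed http://foo.bar, because the seed URL does not contain "html". With the seed filtered out at injection, nothing is fetched, so the .html pages are never discovered. Adding a rule for the seed in front of "+html" lets both through.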

