Re: Can't get any regex to work in regex-urlfilters.txt

Sebastian Nagel Tue, 21 Nov 2017 13:06:06 -0800

Hi Sol,

> doesn't "+html" work as well regardless of what is in seeds.txt? I should
> be able to have http://foo.bar in seeds.txt and "+html" for the regex
> filter, yes?


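URL filters are also applied to the seed list by default. The rule "+html"
only accepts URLs that contain the string "html", and the seed
http://foo.bar does not, so the seed itself is rejected. That's
why the Injector logs
 Total urls rejected by filters: 1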


> All I get back is "-http://foo.bar"

That means that this URL is rejected. Accepted URLs are marked by a leading "+".

> What am I missing?

You may
 - disable URL filters for the injector  (-noFilter)
 - or make sure that all seeds are accepted by the configured URL filters,
   add a rule:
    +http://foo\.bar/?$
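
To illustrate the second option, here is a minimal Python sketch of how I
understand the rule evaluation: rules are tried top to bottom, the first
pattern found anywhere in the URL decides, and a URL that matches no rule is
rejected. This is not Nutch code. The helper names parse_rules and check are
made up for the example, and Python's re module only approximates the Java
regex engine Nutch uses.

```python
import re


def parse_rules(text):
    """Parse '+regex' / '-regex' lines; skip blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line[0] not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        # True means "accept", False means "reject".
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def check(url, rules):
    """Return the URL prefixed with '+' (accepted) or '-' (rejected)."""
    for accept, pattern in rules:
        # Partial match: the pattern may occur anywhere in the URL.
        if pattern.search(url):
            return ("+" if accept else "-") + url
    # Assumption: a URL that matches no rule at all is rejected.
    return "-" + url


only_html = parse_rules("+html")
with_seed_rule = parse_rules(r"+http://foo\.bar/?$" + "\n" + "+html")

print(check("http://foo.bar", only_html))
print(check("http://foo.bar", with_seed_rule))
print(check("http://foo.bar/index.html", only_html))
```

With only "+html" the seed comes back with a leading "-", while the page
ending in ".html" is accepted. Adding the seed rule in front flips the seed
to "+".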

Best,
Sebastian

On 11/21/2017 09:09 PM, Sol Lederman wrote:
> Sebastian,
> 
> Thanks for the engagement and for the quick reply. I still can't get it to
> work. Here's something I don't understand. I assume that the dot in "+."
> means to match any character so it matches any URL. That's great. Why
> doesn't "+html" work as well regardless of what is in seeds.txt? I should
> be able to have http://foo.bar in seeds.txt and "+html" for the regex
> filter, yes? Or, are you saying that my regex filter has to look something
> like "http://foo.bar/.*html"?
> 
> In any case, I've tried a variety of regex patterns, with and without the
> domain name in them, and none of them work. And, yes, the site in question
> does have files at the top level ending in ".html". And, yes, the default
> nutch.apache.org case crawls fine.
> 
> I also did do the filterchecker test. All I get back is "-http://foo.bar"
> and a return code of 0. I get the same behavior for the working
> nutch.apache.org seed URL.
> 
> What am I missing?
> 
> Thanks again.
> 
> Sol
>
