Re: Can't get any regex to work in regex-urlfilters.txt

Sol Lederman Tue, 21 Nov 2017 12:09:53 -0800

Sebastian,

Thanks for the engagement and for the quick reply. I still can't get it to
work. Here's something I don't understand. I assume that the dot in "+."
means to match any character so it matches any URL. That's great. Why
doesn't "+html" work as well regardless of what is in seeds.txt? I should
be able to have http://foo.bar in seeds.txt and "+html" for the regex
filter, yes? Or, are you saying that my regex filter has to look something
like "http://foo.bar/.*html"?
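
To make the question concrete, here is a minimal Python sketch of how I understand such a filter to behave. It is not Nutch's actual code. It assumes rules are tried top to bottom, the first rule whose pattern is found anywhere in the URL decides, and a URL that matches no rule is rejected. The names parse_rules and check are hypothetical.

```python
import re

def parse_rules(text):
    """Parse lines such as '+html' or '-\\.gif$' into (accept, pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def check(url, rules):
    """Return '+url' if accepted, '-url' if rejected (assumed semantics)."""
    for accept, pattern in rules:
        # Assumption: the pattern may match anywhere in the URL (substring search).
        if pattern.search(url):
            return ("+" if accept else "-") + url
    # Assumption: a URL that matches no rule is rejected.
    return "-" + url

print(check("http://foo.bar", parse_rules("+.")))
print(check("http://foo.bar", parse_rules("+html")))
print(check("http://foo.bar/index.html", parse_rules("+html")))
```

If those assumptions hold, "+html" on its own rejects the seed http://foo.bar because the seed URL itself contains no "html", while a page such as http://foo.bar/index.html would pass.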


In any case, I've tried a variety of regex patterns, with and without the
domain name in them, and none of them work. And, yes, the site in question
does have files at the top level ending in ".html". And, yes, the default
nutch.apache.org case crawls fine.

I also ran the filterchecker test. All I get back is "-http://foo.bar"
and a return code of 0. I get the same behavior for the working
nutch.apache.org seed URL.

What am I missing?

Thanks again.

Sol
