Thanks for all the positive suggestions, folks. I will have a go at these, and hopefully they will help me narrow down the problem.


Ian.
--

On 31 Jul 2012, at 20:22, AC Nutch <[email protected]> wrote:

A couple of things I could think of are:

(1) Make sure those regex excludes aren't below a "catch-all" include. For example, if you had "+." above them in the regex-urlfilter file, it is my understanding that Nutch applies the first rule that matches, so those URLs would be accepted and indexed before your excludes were ever checked.

(2) I know everyone keeps saying this but make sure the regexes are correct. One thing I noticed is that your dots are not escaped. I would try making it more general and narrow it down, or use an online regex validation tool. If you're feeling lazy try the following:

-^http://.*\.elaweb\.org\.uk/resources/topic\..*

It's a little more general and easier to not screw up ;-) If that's not acceptable for your purposes, let us know; I'm sure someone could help with the specific regexes.
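
To make both points concrete, here is a rough sketch in Python of first-match-wins filtering. It is only a model of how I understand regex-urlfilter to behave, not Nutch's actual code, and Nutch uses Java regexes rather than Python's (these simple patterns should behave the same in both). The function name first_match, the two rule lists and the sample URLs are made up for illustration, since the real URLs are not shown in this thread.

```python
import re

def first_match(rules, url):
    """Return True (include) or False (exclude) for url.

    rules is a list of (sign, pattern) pairs in file order. The first
    pattern found in the URL decides. If nothing matches, the URL is
    treated as excluded (an assumption about the filter's default).
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False

# The escaped exclude pattern suggested above, and a catch-all include.
EXCLUDE = r"^http://.*\.elaweb\.org\.uk/resources/topic\..*"
CATCH_ALL = r"."

# Hypothetical URLs, for illustration only.
topic_url = "http://www.elaweb.org.uk/resources/topic.aspx?id=1"
other_url = "http://www.elaweb.org.uk/about.aspx"

exclude_first = [("-", EXCLUDE), ("+", CATCH_ALL)]
catch_all_first = [("+", CATCH_ALL), ("-", EXCLUDE)]

print(first_match(exclude_first, topic_url))    # False: excluded as intended
print(first_match(catch_all_first, topic_url))  # True: catch-all wins, exclude never runs
print(first_match(exclude_first, other_url))    # True: other pages still pass
```

Note that the pattern above only covers the topic URLs; a second line of the same shape would be needed for the type URLs.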



On Mon, Jul 30, 2012 at 12:24 PM, Ian Piper <[email protected]> wrote:
Hi all,

I have been trying to get to the bottom of this problem for ages and cannot resolve it - you're my last hope, Obi-Wan...

I have a job that crawls over a client's site. I want to exclude urls that look like this:


and



To achieve this I thought I could put this into conf/regex-urlfilter.txt:

[...]
-^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/type.aspx.*
-^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/topic.aspx.*
[...]

Yet when I next run the crawl I see things like this:

-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=37
[...]
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=36
[...]

and the corresponding pages seem to appear in the final Solr index. So clearly they are not being excluded.

Is anyone able to explain what I have missed? Any guidance much appreciated.

Thanks,


Ian.
-- 
Dr Ian Piper
Tellura Information Services - the web, document and information people
Registered in England and Wales: 5076715, VAT Number: 874 2060 29
Creator of monickr: http://monickr.com
01926 813736 | 07973 156616
-- 





