Hi all,

I have been trying to get to the bottom of this problem for ages and cannot resolve it - you're my last hope, Obi-Wan...

I have a job that crawls a client's site. I want to exclude URLs that look like this:


and



To achieve this I thought I could put this into conf/regex-urlfilter.txt:

[...]
-^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/type.aspx.*
-^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/topic.aspx.*
[...]
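
For anyone wanting to check the two patterns outside Nutch, here is a minimal Python sketch. It makes some assumptions: the filter applies rules top to bottom and the first matching pattern decides (as the comments in the stock regex-urlfilter.txt describe), a URL matching no rule is ignored, and patterns are matched with "find" semantics. The sample URL and the "+." rule are hypothetical, made up for illustration; neither is taken from my real crawl or config. The function name is_accepted is likewise just an illustrative helper, not part of Nutch.

```python
import re

# The two exclusion rules, copied verbatim from the config excerpt above.
# The leading sign decides the outcome: "+" accepts, "-" rejects.
RULES = [
    r"-^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/type.aspx.*",
    r"-^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/topic.aspx.*",
]


def is_accepted(rules, url):
    """Return True if the first matching rule is a '+' rule.

    Assumption: first match wins, and a URL that matches no rule
    at all is ignored (treated as rejected).
    """
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):  # assumed "find" semantics
            return sign == "+"
    return False


# Hypothetical URL, invented for this check only.
SAMPLE = "http://www.elaweb.org.uk/resources/type.aspx?id=1"

# With only the two exclusion rules, the sample URL is rejected.
print(is_accepted(RULES, SAMPLE))           # False

# With a hypothetical catch-all accept rule placed first, it is accepted,
# because the "+." rule matches before the exclusions are reached.
print(is_accepted(["+."] + RULES, SAMPLE))  # True
```

If the assumptions hold, the patterns themselves do match a URL of that shape, so the position of these lines relative to any accept rule elsewhere in the file may be worth checking.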

Yet when I next run the crawl I see things like this:

-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=37
[...]
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=36
[...]

and the corresponding pages seem to appear in the final Solr index. So clearly they are not being excluded.

Is anyone able to explain what I have missed? Any guidance much appreciated.

Thanks,


Ian.
-- 
Dr Ian Piper
Tellura Information Services - the web, document and information people
Registered in England and Wales: 5076715, VAT Number: 874 2060 29
Creator of monickr: http://monickr.com
01926 813736 | 07973 156616