Why won't my crawl ignore these URLs?

Hi all,

I have been trying to get to the bottom of this problem for ages and cannot resolve it - you're my last hope, Obi-Wan...

I have a job that crawls a client's site. I want to exclude URLs that look like this:

and

To achieve this I thought I could put this into conf/regex-urlfilter.txt:

[...]

-^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/type.aspx.*

-^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/topic.aspx.*

[...]
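A quick way to sanity-check the two patterns outside the crawler is a minimal sketch like the one below. It assumes the filter treats each rule as an unanchored "find" over the URL and that Python's `re` behaves the same as the Java regex engine for these particular patterns. The sample URLs are hypothetical, since the real ones are not shown above, and `is_excluded` is an illustrative helper, not part of any crawler API.

```python
import re

# The two exclusion rules quoted above, without the leading "-" action marker.
EXCLUDE_PATTERNS = [
    r"^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/type.aspx.*",
    r"^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/topic.aspx.*",
]


def is_excluded(url):
    """Return True if any exclusion pattern is found in the URL (assumed find-style matching)."""
    return any(re.search(pattern, url) for pattern in EXCLUDE_PATTERNS)


# Hypothetical sample URLs; replace these with URLs copied from the fetcher log.
SAMPLE_URLS = [
    "http://www.elaweb.org.uk/resources/type.aspx?id=1",
    "http://www.elaweb.org.uk/resources/topic.aspx?id=2",
    "http://www.elaweb.org.uk/resources/Type.aspx?id=1",   # different letter case
    "https://www.elaweb.org.uk/resources/type.aspx?id=1",  # https instead of http
    "http://elaweb.org.uk/resources/type.aspx?id=1",       # no "www." prefix
]

for url in SAMPLE_URLS:
    print(url, "excluded" if is_excluded(url) else "NOT excluded")
```

If a URL copied from the log comes out as "NOT excluded", the pattern itself is not matching it (the patterns are case-sensitive and require both `http://` and `www.`). If it comes out as "excluded", the patterns are probably fine and the cause lies elsewhere. The sketch does not model the rest of the filter file, such as the order in which the rules appear.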

Yet when I next run the crawl I see things like this:

-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=37

[...]

-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=36

[...]

and the corresponding pages seem to appear in the final Solr index. So clearly they are not being excluded.

Is anyone able to explain what I have missed? Any guidance much appreciated.

Thanks,

Ian.

Dr Ian Piper

Tellura Information Services - the web, document and information people

Registered in England and Wales: 5076715, VAT Number: 874 2060 29

Creator of monickr: http://monickr.com

01926 813736 | 07973 156616
