Hi all,

I have been trying to get to the bottom of this problem for ages and cannot resolve it - you're my last hope, Obi-Wan...

I have a job that crawls over a client's site. I want to exclude urls that look like this: and

To achieve this I thought I could put this into conf/regex-urlfilter.txt:

    [...]
    -^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/type.aspx.*
    -^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/topic.aspx.*
    [...]

Yet when I next run the crawl I see things like this:

    -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=37
    [...]
    -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=36
    [...]

and the corresponding pages seem to appear in the final Solr index. So clearly they are not being excluded.

Is anyone able to explain what I have missed? Any guidance much appreciated.

Thanks,

Ian.
--
Dr Ian Piper
Tellura Information Services - the web, document and information people
Registered in England and Wales: 5076715, VAT Number: 874 2060 29
Creator of monickr: http://monickr.com
01926 813736 | 07973 156616
--
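One way to check rules like the two above outside of a crawl is to replay them against sample URLs. The following is a minimal sketch in Python, not Nutch's own code: it assumes the filter treats each line as `+` or `-` followed by a regex, searches for the pattern anywhere in the URL, lets the first matching rule decide, and rejects a URL that matches nothing (as described in the comments of the stock regex-urlfilter.txt). The helper names `load_rules` and `accepts`, the trailing `+.` catch-all, and the sample URLs with their query strings are assumptions for illustration. Nutch uses Java regular expressions, but these particular patterns should behave the same in Python's `re`.

```python
import re

def load_rules(text):
    """Parse regex-urlfilter style lines into (include, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """First matching rule wins; a URL that matches no rule is rejected."""
    for include, pattern in rules:
        if pattern.search(url):
            return include
    return False

# The two exclusions from the post, followed by an assumed catch-all accept rule.
EXCLUDES_FIRST = load_rules(r"""
-^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/type.aspx.*
-^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/topic.aspx.*
+.
""")

# Same rules, but with the catch-all placed before the exclusions.
CATCH_ALL_FIRST = load_rules(r"""
+.
-^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/type.aspx.*
-^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/topic.aspx.*
""")

# Hypothetical sample URLs; the real ones were not shown in the post.
SAMPLES = [
    "http://www.elaweb.org.uk/resources/type.aspx?id=1",
    "http://www.elaweb.org.uk/resources/topic.aspx?id=2",
    "https://www.elaweb.org.uk/resources/type.aspx?id=1",
    "http://www.elaweb.org.uk/about.aspx",
]

if __name__ == "__main__":
    for url in SAMPLES:
        print(url, accepts(EXCLUDES_FIRST, url), accepts(CATCH_ALL_FIRST, url))
```

Under these assumptions the exclusions only take effect when they appear before any broader accept rule, and only for URLs that start with exactly `http://`. A check like this does not show which filter file or plugin configuration a given crawl actually loads, so it can rule the patterns in or out but cannot by itself explain the behaviour reported above.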
- Why won't my crawl ignore these urls? Ian Piper
- RE: Why won't my crawl ignore these urls? Markus Jelsma
- Re: Why won't my crawl ignore these urls? AC Nutch
- Re: Why won't my crawl ignore these urls? Ian Piper
- Re: Why won't my crawl ignore these urls? [SOLV... Ian Piper
- Re: Why won't my crawl ignore these urls? [... Alejandro Caceres


