Why don't you test your regex to see whether it really matches the urls you want to exclude? It seems to me that your regex does not match the type of urls you specified.
Alex.

-----Original Message-----
From: Ian Piper <[email protected]>
To: user <[email protected]>
Sent: Mon, Jul 30, 2012 1:52 pm
Subject: Re: Why won't my crawl ignore these urls?

Hi again,

Regarding disabling filters. I just checked in my nutch-default.xml and nutch-site.xml files. There is no reference to crawl.generate in either, which seems (http://wiki.apache.org/nutch/bin/nutch_generate) to suggest that urls should be filtered.

Ian.
--

On 30 Jul 2012, at 19:06, Markus Jelsma wrote:

> Hi,
>
> Either your regex is wrong, you haven't updated the CrawlDB with the new filters and/or you disabled filtering in the Generator.
>
> Cheers
>
> -----Original message-----
>> From:Ian Piper <[email protected]>
>> Sent: Mon 30-Jul-2012 20:01
>> To: [email protected]
>> Subject: Why won't my crawl ignore these urls?
>>
>> Hi all,
>>
>> I have been trying to get to the bottom of this problem for ages and cannot resolve it - you're my last hope, Obi-Wan...
>>
>> I have a job that crawls over a client's site. I want to exclude urls that look like this:
>>
>> http://[clientsite.net]/resources/type.aspx?type=[whatever]
>>
>> and
>>
>> http://[clientsite.net]/resources/topic.aspx?topic=[whatever]
>>
>> To achieve this I thought I could put this into conf/regex-urlfilter.txt:
>>
>> [...]
>> -^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/type.aspx.*
>> -^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/topic.aspx.*
>> [...]
>>
>> Yet when I next run the crawl I see things like this:
>>
>> fetching http://[clientsite.net]/resources/topic.aspx?topic=10
>> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=37
>> [...]
>> fetching http://[clientsite.net]/resources/type.aspx?type=2
>> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=36
>> [...]
>>
>> and the corresponding pages seem to appear in the final Solr index. So clearly they are not being excluded.
>>
>> Is anyone able to explain what I have missed? Any guidance much appreciated.
>>
>> Thanks,
>>
>> Ian.
>> --
>> Dr Ian Piper
>> Tellura Information Services - the web, document and information people
>> Registered in England and Wales: 5076715, VAT Number: 874 2060 29
>> http://www.tellura.co.uk/
>> Creator of monickr: http://monickr.com
>> 01926 813736 | 07973 156616
>> --
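
One way to run the test suggested at the top of this thread, outside Nutch, is a few lines of Python. This is only a rough sketch: Python's re module stands in for the Java regex engine that Nutch uses (the two should agree on patterns this simple), the two patterns are copied from the quoted regex-urlfilter.txt without their leading "-", and the sample urls and the is_excluded helper are made up for illustration, since the real host is masked as [clientsite.net] in the quoted log.

```python
import re

# Exclusion patterns from the quoted conf/regex-urlfilter.txt, with the
# leading "-" (the exclusion marker) removed.
RULES = [
    r"^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/type.aspx.*",
    r"^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/topic.aspx.*",
]


def is_excluded(url, rules=RULES):
    """Return True if any exclusion pattern matches the url (hypothetical helper)."""
    return any(re.search(rule, url) for rule in rules)


# Hypothetical sample urls: the real host is masked in the thread.
SAMPLES = [
    "http://www.elaweb.org.uk/resources/type.aspx?type=2",
    "http://www.elaweb.org.uk/resources/topic.aspx?topic=10",
    "http://elaweb.org.uk/resources/topic.aspx?topic=10",       # no "www."
    "https://www.elaweb.org.uk/resources/topic.aspx?topic=10",  # https
    "http://www.elaweb.org.uk/resources/index.aspx",            # other page
]

for url in SAMPLES:
    print("excluded" if is_excluded(url) else "KEPT    ", url)
```

Substitute the exact urls from the fetch log for the samples. If they come out as KEPT, the regex is the problem (note that both patterns require a literal "www" in the host and the http scheme). If they come out as excluded, the other causes Markus lists are still candidates: a CrawlDB that has not been updated with the new filters, or filtering disabled in the Generator.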

