Hi,

Thanks for the thoughts.
> Either your regex is wrong,

It looks like the other regex structures in the file, which seem to work.

> you haven't updated the CrawlDB with the new filters

Possibly, except that I have no idea how one would do that. I am running the crawl and index update in one go, like this:

/usr/share/nutch/runtime/local/bin/nutch crawl urls -solr [clientsite.net]/solr/ -depth 5 -topN 500

How would I ensure that the CrawlDB is updated with the new filters, if that doesn't do it?

> and/or you disabled filtering in the Generator.

I don't think I have done this - most of the filters in the file are being honoured, so presumably I haven't disabled them all. Again, it's not obvious how one would disable filtering, so I suppose it's possible - can you point me at where I would check that?

Thanks,


Ian.
--

On 30 Jul 2012, at 19:06, Markus Jelsma wrote:

> Hi,
>
> Either your regex is wrong, you haven't updated the CrawlDB with the new
> filters and/or you disabled filtering in the Generator.
>
> Cheers
>
>
> -----Original message-----
>> From: Ian Piper <[email protected]>
>> Sent: Mon 30-Jul-2012 20:01
>> To: [email protected]
>> Subject: Why won't my crawl ignore these urls?
>>
>> Hi all,
>>
>> I have been trying to get to the bottom of this problem for ages and cannot
>> resolve it - you're my last hope, Obi-Wan...
>>
>> I have a job that crawls over a client's site. I want to exclude urls that
>> look like this:
>>
>> http://[clientsite.net]/resources/type.aspx?type=[whatever]
>>
>> and
>>
>> http://[clientsite.net]/resources/topic.aspx?topic=[whatever]
>>
>> To achieve this I thought I could put this into conf/regex-urlfilter.txt:
>>
>> [...]
>> -^http://([a-z0-9\-A-Z]*\.)*[clientsite.net]/resources/type.aspx.*
>> -^http://([a-z0-9\-A-Z]*\.)*[clientsite.net]/resources/topic.aspx.*
>> [...]
>>
>> Yet when I next run the crawl I see things like this:
>>
>> fetching http://[clientsite.net]/resources/topic.aspx?topic=10
>> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=37
>> [...]
>> fetching http://[clientsite.net]/resources/type.aspx?type=2
>> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=36
>> [...]
>>
>> and the corresponding pages seem to appear in the final Solr index. So
>> clearly they are not being excluded.
>>
>> Is anyone able to explain what I have missed? Any guidance much appreciated.
>>
>> Thanks,
>>
>>
>> Ian.
>> --
>> Dr Ian Piper
>> Tellura Information Services - the web, document and information people
>> Registered in England and Wales: 5076715, VAT Number: 874 2060 29
>> http://www.tellura.co.uk/
>> Creator of monickr: http://monickr.com
>> 01926 813736 | 07973 156616
>> --
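[For readers hitting the same problem: in Nutch 1.x the rules in conf/regex-urlfilter.txt are applied as URLs enter or pass through the CrawlDB (inject, generate, updatedb), but a one-shot `crawl` run does not retroactively remove URLs that an earlier, unfiltered run already stored there. A rough sketch of re-filtering an existing CrawlDB is below; the `crawl/crawldb` path is hypothetical, and it assumes a 1.x-era `mergedb` command with its `-filter` option.]

```shell
# Sketch: re-apply the current URL filters to an existing CrawlDB.
# Assumes Nutch 1.x; "crawl/crawldb" is a hypothetical path.
cd /usr/share/nutch/runtime/local

# mergedb with -filter writes a new CrawlDB, dropping every URL
# rejected by the configured filters (e.g. conf/regex-urlfilter.txt).
bin/nutch mergedb crawl/crawldb-filtered crawl/crawldb -filter

# Swap the filtered copy in place of the old CrawlDB.
mv crawl/crawldb crawl/crawldb-old
mv crawl/crawldb-filtered crawl/crawldb
```

[On the Generator question: in 1.x-era Nutch, filtering in the generate step is only skipped if it is invoked with an explicit `-noFilter` flag, so unchanged defaults would point back at stale CrawlDB entries rather than a disabled Generator.]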
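[A quick sanity check on the regex itself, independent of Nutch: regex-urlfilter.txt rules use extended-regex syntax, so the pattern body can be tried with grep -E. Note that `[clientsite.net]` as literally written is a character class in regex syntax; it is clearly an anonymisation placeholder here, so the sketch below substitutes a hypothetical literal domain with the dots escaped.]

```shell
# Hypothetical domain stands in for the anonymised [clientsite.net].
url="http://clientsite.net/resources/type.aspx?type=2"
pattern='^http://([a-zA-Z0-9-]*\.)*clientsite\.net/resources/type\.aspx'

# If grep matches, the corresponding "-pattern" line in
# regex-urlfilter.txt would exclude this URL.
if echo "$url" | grep -qE "$pattern"; then
  echo "filtered"      # the rule matches; URL would be excluded
else
  echo "not filtered"  # the rule misses; URL would slip through
fi
```

[If the pattern matches here but the URLs are still fetched, that supports the stale-CrawlDB explanation rather than a regex mistake.]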

