Thanks, I'll take another look at the regular expressions. Odd, though - I simply adapted examples that I found in various pieces of documentation.
Ian.
--
On 30 Jul 2012, at 22:38, [email protected] wrote:

> Why don't you test your regex, to see if it really catches the URLs you want
> to eliminate? It seems to me that your regex does not eliminate the type of
> URLs you specified.
>
> Alex.
>
>
> -----Original Message-----
> From: Ian Piper <[email protected]>
> To: user <[email protected]>
> Sent: Mon, Jul 30, 2012 1:52 pm
> Subject: Re: Why won't my crawl ignore these urls?
>
> Hi again,
>
> Regarding disabling filters: I just checked my nutch-default.xml and
> nutch-site.xml files. There is no reference to crawl.generate in either,
> which seems (http://wiki.apache.org/nutch/bin/nutch_generate) to suggest
> that URLs should be filtered.
>
> Ian.
> --
> On 30 Jul 2012, at 19:06, Markus Jelsma wrote:
>
>> Hi,
>>
>> Either your regex is wrong, you haven't updated the CrawlDB with the new
>> filters, and/or you disabled filtering in the Generator.
>>
>> Cheers
>>
>>
>> -----Original message-----
>>> From: Ian Piper <[email protected]>
>>> Sent: Mon 30-Jul-2012 20:01
>>> To: [email protected]
>>> Subject: Why won't my crawl ignore these urls?
>>>
>>> Hi all,
>>>
>>> I have been trying to get to the bottom of this problem for ages and
>>> cannot resolve it - you're my last hope, Obi-Wan...
>>>
>>> I have a job that crawls a client's site. I want to exclude URLs that
>>> look like this:
>>>
>>> http://[clientsite.net]/resources/type.aspx?type=[whatever]
>>>
>>> and
>>>
>>> http://[clientsite.net]/resources/topic.aspx?topic=[whatever]
>>>
>>> To achieve this I thought I could put this into conf/regex-urlfilter.txt:
>>>
>>> [...]
>>> -^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/type.aspx.*
>>> -^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/topic.aspx.*
>>> [...]
>>>
>>> Yet when I next run the crawl I see things like this:
>>>
>>> fetching http://[clientsite.net]/resources/topic.aspx?topic=10
>>> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=37
>>> [...]
>>> fetching http://[clientsite.net]/resources/type.aspx?type=2
>>> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=36
>>> [...]
>>>
>>> and the corresponding pages still appear in the final Solr index, so
>>> clearly they are not being excluded.
>>>
>>> Is anyone able to explain what I have missed? Any guidance much
>>> appreciated.
>>>
>>> Thanks,
>>>
>>> Ian.
>>> --
>>> Dr Ian Piper
>>> Tellura Information Services - the web, document and information people
>>> Registered in England and Wales: 5076715, VAT Number: 874 2060 29
>>> http://www.tellura.co.uk/
>>> Creator of monickr: http://monickr.com
>>> 01926 813736 | 07973 156616
>>> --
>
> --
> Dr Ian Piper
> Tellura Information Services - the web, document and information people
> Registered in England and Wales: 5076715, VAT Number: 874 2060 29
> http://www.tellura.co.uk/
> Creator of monickr: http://monickr.com
> 01926 813736 | 07973 156616
> --
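
A quick way to act on Alex's suggestion is to run the two exclusion patterns
against sample URLs in isolation, before involving Nutch at all. The sketch
below does that in plain Java; the class name and the test URLs are made up
for illustration, and it assumes the host behind the [clientsite.net]
redaction is www.elaweb.org.uk, as the patterns in the thread imply.

    // RegexFilterTest.java - standalone check of the two exclusion
    // patterns from conf/regex-urlfilter.txt. The leading "-" in that
    // file is Nutch's exclude marker, not regex syntax, so it is
    // dropped here.
    import java.util.regex.Pattern;

    public class RegexFilterTest {
        public static void main(String[] args) {
            String[] patterns = {
                "^http://([a-z0-9\\-A-Z]*\\.)*www.elaweb.org.uk/resources/type.aspx.*",
                "^http://([a-z0-9\\-A-Z]*\\.)*www.elaweb.org.uk/resources/topic.aspx.*"
            };
            // Hypothetical URLs of the shape the filters should block,
            // assuming www.elaweb.org.uk is the redacted host.
            String[] urls = {
                "http://www.elaweb.org.uk/resources/type.aspx?type=2",
                "http://www.elaweb.org.uk/resources/topic.aspx?topic=10",
                "http://elaweb.org.uk/resources/type.aspx?type=2" // no www.
            };
            for (String url : urls) {
                boolean blocked = false;
                for (String p : patterns) {
                    // As far as I recall, Nutch's regex-urlfilter uses
                    // find()-style matching, so the ^ anchor matters.
                    if (Pattern.compile(p).matcher(url).find()) {
                        blocked = true;
                        break;
                    }
                }
                System.out.println((blocked ? "BLOCKED  " : "ACCEPTED ") + url);
            }
        }
    }

Compile and run it with javac RegexFilterTest.java && java RegexFilterTest.
If the first two lines print BLOCKED, the patterns themselves are fine and
the likelier culprits are the other two causes Markus lists: CrawlDB entries
injected before the filter change, or filtering disabled at generate time.
The third URL shows one way the patterns can silently fail: they require a
literal www. in the host, so a bare elaweb.org.uk URL sails through. One
more thing worth checking in conf/regex-urlfilter.txt itself: rules apply
top-down and the first match wins, so the two exclude lines must appear
before any catch-all accept rule such as +.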

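For a check that exercises Nutch's actual filter chain (with nutch-site.xml
and the enabled plugins in play), Nutch 1.x also ships a checker tool that
reads URLs on stdin. The class name and flag below are quoted from memory,
so treat them as an assumption and verify against your release:

    echo "http://www.elaweb.org.uk/resources/type.aspx?type=2" \
      | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined

If memory serves, it echoes each URL back prefixed with + (accepted) or
- (rejected), which tells you directly whether the configured filters would
drop the URL.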
