A couple of things I can think of:

(1) Make sure those regex excludes aren't below a "catch-all" include. If you had "+." above them in conf/regex-urlfilter.txt, for example, my understanding is that Nutch applies the first rule that matches, so those URLs would still be fetched and indexed.

(2) I know everyone keeps saying this, but make sure the regexes are correct. One thing I noticed is that your dots are not escaped, so each one matches any character rather than a literal dot. I would start with a more general pattern and narrow it down, or use an online regex validation tool. If you're feeling lazy, try the following:

-^http://.*\.elaweb\.org\.uk/resources/type\..*
-^http://.*\.elaweb\.org\.uk/resources/topic\..*

It's a little more general and easier to not screw up ;-)

If that's not acceptable for your purposes, let us know; I'm sure someone could help with the specific regexes.

On Mon, Jul 30, 2012 at 12:24 PM, Ian Piper <[email protected]> wrote:

> Hi all,
>
> I have been trying to get to the bottom of this problem for ages and
> cannot resolve it - you're my last hope, Obi-Wan...
>
> I have a job that crawls over a client's site. I want to exclude urls that
> look like this:
>
> http://[clientsite.net]/resources/type.aspx?type=[whatever]
>
> and
>
> http://[clientsite.net]/resources/topic.aspx?topic=[whatever]
>
> To achieve this I thought I could put this into conf/regex-urlfilter.txt:
>
> [...]
> -^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/type.aspx.*
> -^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/topic.aspx.*
> [...]
>
> Yet when I next run the crawl I see things like this:
>
> fetching http://[clientsite.net]/resources/topic.aspx?topic=10
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=37
> [...]
> fetching http://[clientsite.net]/resources/type.aspx?type=2
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=36
> [...]
>
> and the corresponding pages seem to appear in the final Solr index. So
> clearly they are not being excluded.
>
> Is anyone able to explain what I have missed? Any guidance much
> appreciated.
>
> Thanks,
>
> Ian.
> --
> Dr Ian Piper
> Tellura Information Services - the web, document and information people
> Registered in England and Wales: 5076715, VAT Number: 874 2060 29
> http://www.tellura.co.uk/
> Creator of monickr: http://monickr.com
> 01926 813736 | 07973 156616
> --
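
To check point (1) offline, here is a rough Python sketch of how I understand the filter file to be read: rules are tried top to bottom, the first one that matches decides, and a URL that matches nothing is rejected. This is only an approximation and not Nutch's own code. The helper names (parse_rules, accepts) are made up for illustration, the about.aspx URL is a hypothetical example, and Python's re module stands in for Java's regex engine, which should behave the same for patterns this simple.

```python
import re


def parse_rules(text):
    """Parse regex-urlfilter style lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Assumed semantics: the first matching rule decides; no match means reject."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


# Excludes listed before the catch-all include.
EXCLUDES_FIRST = r"""
-^http://.*\.elaweb\.org\.uk/resources/type\..*
-^http://.*\.elaweb\.org\.uk/resources/topic\..*
+.
"""

# Same rules, but the catch-all include comes first and shadows the excludes.
CATCH_ALL_FIRST = r"""
+.
-^http://.*\.elaweb\.org\.uk/resources/type\..*
-^http://.*\.elaweb\.org\.uk/resources/topic\..*
"""

urls = [
    "http://www.elaweb.org.uk/resources/type.aspx?type=2",
    "http://www.elaweb.org.uk/resources/topic.aspx?topic=10",
    "http://www.elaweb.org.uk/about.aspx",  # hypothetical page that should be kept
]

for name, text in [("excludes first", EXCLUDES_FIRST),
                   ("catch-all first", CATCH_ALL_FIRST)]:
    rules = parse_rules(text)
    for url in urls:
        verdict = "ACCEPT" if accepts(rules, url) else "REJECT"
        print(f"{name:16} {verdict}  {url}")
```

With the excludes first, the two /resources/ URLs are rejected and the other page is accepted. With "+." first, all three are accepted, which would match the symptom described in the original message.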

