Re: Why won't my crawl ignore these urls?

AC Nutch Tue, 31 Jul 2012 12:23:25 -0700

A couple of things I could think of are:

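(1) Make sure those regex excludes aren't below a "catch-all" include. The
rules in regex-urlfilter.txt are tried from top to bottom and the first one
that matches decides, so if you had "+." right above those lines, for
example, it is my understanding that Nutch will still fetch and index them.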
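
To make point (1) concrete, here is a rough Python sketch of first-match-wins
filtering. It is not Nutch code: parse_rules and accepts are made-up helper
names, the sample URL is assumed from the host in the patterns, and Python's
re.search only stands in for the Java regex matching Nutch actually uses.

```python
import re


def parse_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blanks and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first rule whose pattern matches the URL."""
    for include, pattern in rules:
        if pattern.search(url):
            return include
    return False  # assumption: a URL that matches no rule is dropped


# Hypothetical sample URL built from the host used in the patterns.
URL = "http://www.elaweb.org.uk/resources/type.aspx?type=2"
EXCLUDE = r"-^http://.*\.elaweb\.org\.uk/resources/type\..*"

catch_all_first = parse_rules("+.\n" + EXCLUDE)  # "+." sits above the exclude
exclude_first = parse_rules(EXCLUDE + "\n+.")    # exclude sits above "+."

print("catch-all first:", accepts(catch_all_first, URL))
print("exclude first:  ", accepts(exclude_first, URL))
```

With the catch-all first, the exclude line is never reached for that URL, so
the order of the lines matters as much as the patterns themselves.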


(2) I know everyone keeps saying this, but make sure the regexes are
correct. One thing I noticed is that your dots are not escaped. I would
start with a more general pattern and then narrow it down, or use an online
regex validation tool. If you're feeling lazy, try the following:

-^http://.*\.elaweb\.org\.uk/resources/type\..*
-^http://.*\.elaweb\.org\.uk/resources/topic\..*

It's a little more general and harder to screw up ;-) If that's not
acceptable for your purposes, let us know. I'm sure someone could help with
the specific regexes.
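
If you'd rather check patterns locally than in an online tool, a small Python
sketch like this one works as a first pass. The sample URLs are made up for
illustration (only the host comes from the patterns in this thread), the
matches helper is hypothetical, and Python's re module is close to but not
identical to the Java regex engine Nutch uses, so treat the results as a
sanity check only.

```python
import re

# The "type" pattern from the original message (dots unescaped).
ORIGINAL = re.compile(
    r"^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/type.aspx.*"
)
# The more general pattern suggested in this reply (dots escaped).
SUGGESTED = re.compile(r"^http://.*\.elaweb\.org\.uk/resources/type\..*")

# Hypothetical sample URLs.
SAMPLES = [
    "http://www.elaweb.org.uk/resources/type.aspx?type=2",  # should be excluded
    "http://www.elaweb.org.uk/resources/typeXaspx",         # no literal dot
    "http://www.elaweb.org.uk/resources/other.aspx",        # unrelated page
]


def matches(pattern, url):
    """True if the pattern is found in the URL (the '^' anchors it to the start)."""
    return pattern.search(url) is not None


for url in SAMPLES:
    print(url)
    print("  original :", matches(ORIGINAL, url))
    print("  suggested:", matches(SUGGESTED, url))
```

An unescaped dot matches any character, so it tends to make a pattern match
more than intended rather than less. If a pattern already matches your URLs
in a check like this, the position of the rule in the file is worth a second
look.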



On Mon, Jul 30, 2012 at 12:24 PM, Ian Piper <[email protected]> wrote:

> Hi all,
>
> I have been trying to get to the bottom of this problem for ages and
> cannot resolve it - you're my last hope, Obi-Wan...
>
> I have a job that crawls over a client's site. I want to exclude urls that
> look like this:
>
> http://[clientsite.net]/resources/type.aspx?type=[whatever]
>
> and
>
> http://[clientsite.net]/resources/topic.aspx?topic=[whatever]
>
>
> To achieve this I thought I could put this into conf/regex-urlfilter.txt:
>
> [...]
> -^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/type.aspx.*
> -^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/topic.aspx.*
> [...]
>
> Yet when I next run the crawl I see things like this:
>
> fetching http://[clientsite.net]/resources/topic.aspx?topic=10
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=37
> [...]
> fetching http://[clientsite.net]/resources/type.aspx?type=2
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=36
> [...]
>
> and the corresponding pages seem to appear in the final Solr index. So
> clearly they are not being excluded.
>
> Is anyone able to explain what I have missed? Any guidance much
> appreciated.
>
> Thanks,
>
>
> Ian.
> *-- *
> *Dr Ian Piper*
> Tellura Information Services - the web, document and information people
> Registered in England and Wales: 5076715, VAT Number: 874 2060 29
> http://www.tellura.co.uk/
> Creator of monickr: http://monickr.com
> 01926 813736 | 07973 156616
> *-- *
>
>
>
