Re: Why won't my crawl ignore these urls?

alxsss Mon, 30 Jul 2012 14:38:49 -0700

Why don't you test your regex to see whether it really matches the urls you
want to eliminate? It seems to me that your regex does not match the type of
urls you specified.
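
For what it's worth, here is a minimal sketch of such a test. It takes the two exclusion rules quoted further down in this thread and runs them against a few sample urls. Several things here are assumptions: Python's `re` module stands in for the Java regex engine that Nutch uses (for these particular patterns I would expect the same results), the rules are applied with "find" semantics, and the sample urls are hypothetical, since the real host is redacted as [clientsite.net] in the original post.

```python
import re

# The two exclusion rules quoted from conf/regex-urlfilter.txt in this thread,
# without the leading '-' that marks them as exclusions.
RULES = [
    r"^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/type.aspx.*",
    r"^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/topic.aspx.*",
]


def is_excluded(url):
    """Return True if at least one exclusion rule matches the url."""
    return any(re.search(rule, url) for rule in RULES)


# Hypothetical sample urls (assumption): the real host is redacted in the
# thread, so these only illustrate which url shapes the rules can catch.
SAMPLE_URLS = [
    "http://www.elaweb.org.uk/resources/topic.aspx?topic=10",
    "http://www.elaweb.org.uk/resources/type.aspx?type=2",
    "http://elaweb.org.uk/resources/topic.aspx?topic=10",    # no "www."
    "https://www.elaweb.org.uk/resources/type.aspx?type=2",  # https scheme
]

for url in SAMPLE_URLS:
    print("excluded" if is_excluded(url) else "kept    ", url)
```

With these inputs the first two urls are reported as excluded and the last two as kept, because the rules require both the http:// scheme and a literal "www" in the host name. If the urls in the fetch log differ from the rules in either respect, that alone could explain why they still get fetched.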


Alex.



-----Original Message-----
From: Ian Piper <[email protected]>
To: user <[email protected]>
Sent: Mon, Jul 30, 2012 1:52 pm
Subject: Re: Why won't my crawl ignore these urls?


Hi again,

Regarding disabling filters: I just checked my nutch-default.xml and
nutch-site.xml files. Neither contains any reference to crawl.generate, which
seems (http://wiki.apache.org/nutch/bin/nutch_generate) to suggest that urls
should be filtered.


Ian.
--
On 30 Jul 2012, at 19:06, Markus Jelsma wrote:

> Hi,
> 
> Either your regex is wrong, you haven't updated the CrawlDB with the new
> filters and/or you disabled filtering in the Generator.
> 
> Cheers
> 
> 
> 
> -----Original message-----
>> From:Ian Piper <[email protected]>
>> Sent: Mon 30-Jul-2012 20:01
>> To: [email protected]
>> Subject: Why won't my crawl ignore these urls?
>> 
>> Hi all,
>> 
>> I have been trying to get to the bottom of this problem for ages and cannot
>> resolve it - you're my last hope, Obi-Wan...
>> 
>> I have a job that crawls over a client's site. I want to exclude urls that
>> look like this:
>> 
>> http://[clientsite.net]/resources/type.aspx?type=[whatever]
>> 
>> and
>> 
>> http://[clientsite.net]/resources/topic.aspx?topic=[whatever]
>> 
>> 
>> To achieve this I thought I could put this into conf/regex-urlfilter.txt:
>> 
>> [...]
>> -^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/type.aspx.*
>> -^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/topic.aspx.*
>> [...]
>> 
>> Yet when I next run the crawl I see things like this:
>> 
>> fetching http://[clientsite.net]/resources/topic.aspx?topic=10
>> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=37
>> [...]
>> fetching http://[clientsite.net]/resources/type.aspx?type=2
>> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=36
>> [...]
>> 
>> and the corresponding pages seem to appear in the final Solr index. So
>> clearly they are not being excluded.
>> 
>> Is anyone able to explain what I have missed? Any guidance much appreciated.
>> 
>> Thanks,
>> 
>> 
>> Ian.
>> -- 
>> Dr Ian Piper
>> Tellura Information Services - the web, document and information people
>> Registered in England and Wales: 5076715, VAT Number: 874 2060 29
>> http://www.tellura.co.uk/
>> Creator of monickr: http://monickr.com
>> 01926 813736 | 07973 156616
>> -- 
>> 
>> 

