Thanks,

I'll take another look at the regular expressions. Odd though - I simply 
elaborated examples that I found in various pieces of documentation.


Ian.
--

On 30 Jul 2012, at 22:38, [email protected] wrote:

> Why don't you test your regex, to see if it really matches the URLs you want 
> to eliminate? It seems to me that your regex does not match the type of 
> URLs you specified.
> 
> Alex.
> 
> 
> 
> -----Original Message-----
> From: Ian Piper <[email protected]>
> To: user <[email protected]>
> Sent: Mon, Jul 30, 2012 1:52 pm
> Subject: Re: Why won't my crawl ignore these urls?
> 
> 
> Hi again,
> 
> Regarding disabling filters: I just checked my nutch-default.xml and 
> nutch-site.xml files. There is no reference to crawl.generate in either, 
> which (per http://wiki.apache.org/nutch/bin/nutch_generate) seems to 
> suggest that URLs should be filtered.
> 
> 
> Ian.
> --
> On 30 Jul 2012, at 19:06, Markus Jelsma wrote:
> 
>> Hi,
>> 
>> Either your regex is wrong, you haven't updated the CrawlDB with the new 
>> filters and/or you disabled filtering in the Generator.
>> 
>> Cheers
>> 
>> 
>> 
>> -----Original message-----
>>> From:Ian Piper <[email protected]>
>>> Sent: Mon 30-Jul-2012 20:01
>>> To: [email protected]
>>> Subject: Why won't my crawl ignore these urls?
>>> 
>>> Hi all,
>>> 
>>> I have been trying to get to the bottom of this problem for ages and cannot 
>>> resolve it - you're my last hope, Obi-Wan...
>>> 
>>> I have a job that crawls over a client's site. I want to exclude urls that 
>>> look like this:
>>> 
>>> http://[clientsite.net]/resources/type.aspx?type=[whatever]
>>> 
>>> and
>>> 
>>> http://[clientsite.net]/resources/topic.aspx?topic=[whatever]
>>> 
>>> 
>>> To achieve this I thought I could put this into conf/regex-urlfilter.txt:
>>> 
>>> [...]
>>> -^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/type.aspx.*
>>> -^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/topic.aspx.*
>>> [...]
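As a quick sanity check of the two rules quoted above, the expressions can be run against sample URLs outside of Nutch. A minimal sketch in Python (the hostnames are assumptions, since the live ones are redacted as [clientsite.net]; Nutch itself evaluates regex-urlfilter.txt with Java's regex engine and applies the first matching rule, so an earlier accept rule such as +. could also let these URLs through):

```python
import re

# Exclusion patterns from conf/regex-urlfilter.txt; the leading '-' in that
# file marks a "deny" rule and is not part of the regex, so it is dropped here.
patterns = [
    r"^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/type.aspx.*",
    r"^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/topic.aspx.*",
]

# Sample URLs of the kind the crawl should skip (hostnames are guesses).
urls = [
    "http://www.elaweb.org.uk/resources/topic.aspx?topic=10",
    "http://elaweb.org.uk/resources/type.aspx?type=2",
]

for url in urls:
    matched = any(re.match(p, url) for p in patterns)
    print(url, "->", "filtered" if matched else "NOT filtered")

# The www-prefixed URL matches, but the bare-host variant does not: both
# patterns require a literal "www" before elaweb.org.uk, so if the site is
# crawled as http://elaweb.org.uk/... the deny rules never fire.
```

If both kinds of URL print "filtered" for the real hostnames, the rules themselves are fine and the problem lies elsewhere (a stale CrawlDB or filtering disabled during generate, as Markus suggests).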
>>> 
>>> Yet when I next run the crawl I see things like this:
>>> 
>>> fetching http://[clientsite.net]/resources/topic.aspx?topic=10
>>> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=37
>>> [...]
>>> fetching http://[clientsite.net]/resources/type.aspx?type=2
>>> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=36
>>> [...]
>>> 
>>> and the corresponding pages seem to appear in the final Solr index. So 
>>> clearly they are not being excluded.
>>> 
>>> Is anyone able to explain what I have missed? Any guidance much appreciated.
>>> 
>>> Thanks,
>>> 
>>> 
>>> Ian.
>>> -- 
>>> Dr Ian Piper
>>> Tellura Information Services - the web, document and information people
>>> Registered in England and Wales: 5076715, VAT Number: 874 2060 29
>>> http://www.tellura.co.uk/
>>> Creator of monickr: http://monickr.com
>>> 01926 813736 | 07973 156616
>>> -- 
>>> 
>>> 
> 
