Hi,

Thanks for the thoughts.

> Either your regex is wrong,

It has the same structure as the other regex patterns in the file, and those 
do seem to work, so I don't think the regex itself is wrong.
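
If it helps, I could try testing the pattern directly. I believe bin/nutch 
can run a class by name, and that there is a URLFilterChecker which reads 
URLs from stdin and prints + or - for each one - the class name and the 
-allCombined flag are my assumption for this version, so apologies if I have 
them wrong:

echo "http://[clientsite.net]/resources/topic.aspx?topic=10" | \
  /usr/share/nutch/runtime/local/bin/nutch \
  org.apache.nutch.net.URLFilterChecker -allCombined

Would that be a reliable way to confirm whether the regex itself is at fault?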

> you haven't updated the CrawlDB with the new filters

Possibly, except that I have no idea how one would do that. I am running the 
crawl and index update in one go, like this:

/usr/share/nutch/runtime/local/bin/nutch crawl urls -solr [clientsite.net]/solr/ -depth 5 -topN 500

How would I ensure that the CrawlDB is updated with new filters if that doesn't 
do it?
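
One thing I wasn't sure about: my understanding is that the regex filters are 
only applied to URLs as they pass through the Generator/Fetcher, so anything 
already sitting in an existing CrawlDB would stay there. If that is the 
problem, is re-filtering the CrawlDB with mergedb the right fix? Something 
like the following, where the crawl/crawldb path is just an example from my 
layout and the -filter option is my assumption:

/usr/share/nutch/runtime/local/bin/nutch mergedb \
  crawl/crawldb-filtered crawl/crawldb -filter

Or should I simply be starting the crawl from a fresh directory?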

> and/or you disabled filtering in the Generator.


I don't think I have done this - most of the filters in the file are being 
honoured, so presumably I haven't disabled them all. Again, it's not obvious 
how one would disable filtering, so I suppose it's possible - can you point me 
at where I would check that?
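
For what it's worth, the only Generator switch I could find is the 
generate.filter property, which I understand defaults to true, so I was going 
to look for an override like this (the property name is my assumption from 
the docs, and I'm assuming conf/ sits next to bin/ under runtime/local):

grep -A 3 'generate.filter' \
  /usr/share/nutch/runtime/local/conf/nutch-default.xml \
  /usr/share/nutch/runtime/local/conf/nutch-site.xml

Is that the right property to be looking at, or could filtering be switched 
off somewhere else?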

Thanks,


Ian.
--


On 30 Jul 2012, at 19:06, Markus Jelsma wrote:

> Hi,
> 
> Either your regex is wrong, you haven't updated the CrawlDB with the new 
> filters and/or you disabled filtering in the Generator.
> 
> Cheers
> 
> 
> 
> -----Original message-----
>> From: Ian Piper <[email protected]>
>> Sent: Mon 30-Jul-2012 20:01
>> To: [email protected]
>> Subject: Why won't my crawl ignore these urls?
>> 
>> Hi all,
>> 
>> I have been trying to get to the bottom of this problem for ages and cannot 
>> resolve it - you're my last hope, Obi-Wan...
>> 
>> I have a job that crawls over a client's site. I want to exclude urls that 
>> look like this:
>> 
>> http://[clientsite.net]/resources/type.aspx?type=[whatever]
>> 
>> and
>> 
>> http://[clientsite.net]/resources/topic.aspx?topic=[whatever]
>> 
>> 
>> To achieve this I thought I could put this into conf/regex-urlfilter.txt:
>> 
>> [...]
>> -^http://([a-z0-9\-A-Z]*\.)*[clientsite.net]/resources/type.aspx.*
>> -^http://([a-z0-9\-A-Z]*\.)*[clientsite.net]/resources/topic.aspx.*
>> [...]
>> 
>> Yet when I next run the crawl I see things like this:
>> 
>> fetching http://[clientsite.net]/resources/topic.aspx?topic=10
>> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=37
>> [...]
>> fetching http://[clientsite.net]/resources/type.aspx?type=2
>> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=36
>> [...]
>> 
>> and the corresponding pages seem to appear in the final Solr index. So 
>> clearly they are not being excluded.
>> 
>> Is anyone able to explain what I have missed? Any guidance much appreciated.
>> 
>> Thanks,
>> 
>> 
>> Ian.
>> -- 
>> Dr Ian Piper
>> Tellura Information Services - the web, document and information people
>> Registered in England and Wales: 5076715, VAT Number: 874 2060 29
>> http://www.tellura.co.uk/
>> Creator of monickr: http://monickr.com
>> 01926 813736 | 07973 156616
>> -- 
>> 
>> 

-- 
Dr Ian Piper
Tellura Information Services - the web, document and information people
Registered in England and Wales: 5076715, VAT Number: 874 2060 29
http://www.tellura.co.uk/
Creator of monickr: http://monickr.com
01926 813736 | 07973 156616