crawl-urlfilter.txt is used by the one-stop crawl command, while
regex-urlfilter.txt is for general-purpose use. As far as I know,
crawl-urlfilter.txt is no longer used in current Nutch versions.
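For what it's worth, the difference between the two pattern styles in this
thread can be checked outside Nutch with plain regexes (a quick sketch; Nutch
uses Java regexes, but these particular patterns behave the same in Python).
The bare-host pattern matches only the exact host, while the ([a-z0-9]*\.)*
prefix also accepts subdomains such as www:

```python
import re

# Bare-host pattern, as in Eric's crawl-urlfilter.txt (the dots are left
# unescaped, as in the original; an unescaped dot matches any character,
# which is usually harmless here but strictly should be \.)
plain = re.compile(r"^http://ecasebriefs.com/blog/law/")

# Subdomain-tolerant pattern from the earlier message in the thread
subdomains = re.compile(r"^http://([a-z0-9]*\.)*ecasebriefs.com/blog/law/")

urls = [
    "http://ecasebriefs.com/blog/law/some-case",      # bare host
    "http://www.ecasebriefs.com/blog/law/some-case",  # www subdomain
    "http://example.com/blog/law/",                   # off-site
]

for url in urls:
    # Nutch applies its "+" patterns as a prefix match; re.match anchors
    # at the start of the string the same way.
    print(url, bool(plain.match(url)), bool(subdomains.match(url)))
```

The second URL is the interesting case: the bare-host pattern rejects the
www variant, so links to www.ecasebriefs.com would be filtered out unless
the subdomain-tolerant form is used.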

Best Regards
Alexander Aristov


On 5 November 2010 21:34, Eddie Drapkin <[email protected]> wrote:

>  Are you sure that you're using the crawl-urlfilter.txt file and not the
> regex-urlfilter.txt?  I'm not sure which one is the default (the script I
> use to generate configuration for nutch creates them both as copies, fwiw).
>
> Thanks,
> Eddie
>
>
> On 11/5/2010 12:46 PM, Eric Martin wrote:
>
>> I adjusted the crawl-urlfilter.txt file to read:
>>
>> # accept hosts in MY.DOMAIN.NAME
>> +^http://ecasebriefs.com/blog/law/
>> +^http://lawnix.com/cases/cases-index/
>> +^http://oyez.org/
>> +^http://4lawnotes.com/
>> +^http://docstoc.com/documents/education/law-school/case-briefs
>> +^http://lawschoolcasebriefs.com/
>> +^http://dictionary.findlaw.com/
>> # skip everything else
>> -.
>>
>> But I am still fetching extrinsic URLs. Does Nutch fetch the extrinsic
>> URLs regardless of the urlfilter and THEN ignore them during the crawl?
>>
>> Eric (law student)
>>
>>
>> -----Original Message-----
>> From: Eric Martin [mailto:[email protected]]
>> Sent: Thursday, November 04, 2010 5:26 PM
>> To: [email protected]
>> Subject: RE: Forcing to crawl just FQDN
>>
>> So,
>>
>> +^http://([a-z0-9]*\.)*ecasebriefs.com/blog/law/
>> +^http://([a-z0-9]*\.)*lawnix.com/cases/cases-index/
>>
>> Will those crawl only the URLs from that particular domain and directory?
>>
>> Thank you.
>>
>> -----Original Message-----
>> From: Edward Drapkin [mailto:[email protected]]
>> Sent: Thursday, November 04, 2010 5:13 PM
>> To: [email protected]
>> Subject: Re: Forcing to crawl just FQDN
>>
>> On 11/4/2010 7:10 PM, Eric Martin wrote:
>>
>>> Hello,
>>>
>>>
>>>
>>> Thanks for all the help so far. Between Google here and the Solr mailing
>>> list, I have learned a tremendous amount.
>>>
>>>
>>>
>>> I was wondering how I can crawl just the FQDNs in the seed list? So, if I
>>> was crawling xyz.com, it would only crawl the URLs with xyz.com and skip
>>> those URLs leading off-site. If I knew how to phrase it better, I am sure
>>> I could find more info on Google.
>>>
>>>
>>>
>>> Eric
>>>
>>>
>> Use the urlfilter files in the configuration directory.
>>
>> Thanks,
>> Eddie
>>
>>
>
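For anyone landing on this thread later, here is a minimal sketch of a filter
file that keeps a crawl on a single host, assuming the regex-urlfilter.txt
syntax discussed above (xyz.com is a placeholder; note the dots are
backslash-escaped so they match literally, which the examples earlier in the
thread omit):

```
# accept pages on xyz.com only (no subdomains, no other hosts)
+^http://xyz\.com/
# skip everything else
-.
```

Patterns are checked top to bottom; the first "+" or "-" pattern that matches
decides whether a URL is kept, and the final "-." rejects anything the
earlier lines did not accept.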
