I adjusted the crawl-urlfilter.txt file to read:

# accept hosts in MY.DOMAIN.NAME
+^http://ecasebriefs.com/blog/law/
+^http://lawnix.com/cases/cases-index/
+^http://oyez.org/
+^http://4lawnotes.com/
+^http://docstoc.com/documents/education/law-school/case-briefs
+^http://lawschoolcasebriefs.com/
+^http://dictionary.findlaw.com/
# skip everything else
-.
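(Aside, not from the original thread: Nutch's regex URL filter evaluates the rules top to bottom, and the first rule whose pattern matches decides whether the URL is kept; a `+` rule accepts, a `-` rule rejects, and an unmatched URL is rejected. A minimal Python sketch of that logic, using an illustrative subset of the rules above with the dots escaped:)

```python
import re

# Rules in file order: (sign, regex). First matching rule wins,
# mirroring Nutch's regex urlfilter behavior.
rules = [
    ("+", re.compile(r"^http://ecasebriefs\.com/blog/law/")),
    ("+", re.compile(r"^http://lawnix\.com/cases/cases-index/")),
    ("-", re.compile(r".")),  # skip everything else
]

def accepted(url):
    """Return True if the first rule matching `url` is a '+' rule."""
    for sign, pattern in rules:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched: reject by default

print(accepted("http://ecasebriefs.com/blog/law/foo"))   # True
print(accepted("http://www.ecasebriefs.com/blog/law/"))  # False: the www host
                                                         # fails the ^http://ecasebriefs anchor
```

Note that, with this logic, the filter above can only reject *more* URLs (e.g. `www.` subdomains), never let extra ones through.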

But I am still fetching extrinsic URLs. Does Nutch fetch the extrinsic URLs
regardless of the urlfilter and THEN ignore them during the crawl?

Eric (law student)


-----Original Message-----
From: Eric Martin [mailto:[email protected]] 
Sent: Thursday, November 04, 2010 5:26 PM
To: [email protected]
Subject: RE: Forcing to crawl just FQDN

So,

+^http://([a-z0-9]*\.)*ecasebriefs.com/blog/law/
+^http://([a-z0-9]*\.)*lawnix.com/cases/cases-index/

will only crawl the URLs from that particular domain and directory?

Thank you.
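(Not part of the original message: the `([a-z0-9]*\.)*` prefix in those patterns is what allows any chain of subdomains before the domain. Python's re module uses comparable syntax for these constructs, so the behavior can be sanity-checked locally; dots are escaped here, which the thread's patterns leave unescaped:)

```python
import re

# Subdomain-aware version of the pattern from the reply (dots escaped).
pattern = re.compile(r"^http://([a-z0-9]*\.)*ecasebriefs\.com/blog/law/")

for url in [
    "http://ecasebriefs.com/blog/law/contracts/",  # bare domain: matches
    "http://www.ecasebriefs.com/blog/law/",        # www subdomain: matches
    "http://example.com/blog/law/",                # other host: no match
]:
    print(url, bool(pattern.search(url)))
```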

-----Original Message-----
From: Edward Drapkin [mailto:[email protected]] 
Sent: Thursday, November 04, 2010 5:13 PM
To: [email protected]
Subject: Re: Forcing to crawl just FQDN

On 11/4/2010 7:10 PM, Eric Martin wrote:
> Hello,
>
>
> Thanks for all the help so far. Between Google here and the Solr mailing
> list, I have learned a tremendous amount.
>
> I was wondering how I can crawl just the FQDN in the seed list? So, if I
> was crawling xyz.com, it would only crawl the URLs with xyz.com and skip
> those URLs leading off-site. If I knew how to better phrase it, I am sure
> I could find more info on Google.
>
> Eric
>
Use the urlfilter files in the configuration directory.

Thanks,
Eddie
