I adjusted the crawl-urlfilter.txt file to read:

# accept hosts in MY.DOMAIN.NAME
+^http://ecasebriefs.com/blog/law/
+^http://lawnix.com/cases/cases-index/
+^http://oyez.org/
+^http://4lawnotes.com/
+^http://docstoc.com/documents/education/law-school/case-briefs
+^http://lawschoolcasebriefs.com/
+^http://dictionary.findlaw.com/

# skip everything else
-.
But I am still fetching extrinsic URLs. Does Nutch fetch the extrinsic URLs regardless of the urlfilter and THEN ignore them during the crawl?

Eric (law student)

-----Original Message-----
From: Eric Martin [mailto:[email protected]]
Sent: Thursday, November 04, 2010 5:26 PM
To: [email protected]
Subject: RE: Forcing to crawl just FQDN

So,

+^http://([a-z0-9]*\.)*ecasebriefs.com/blog/law/
+^http://([a-z0-9]*\.)*lawnix.com/cases/cases-index/

will only crawl the URLs from that particular domain and directory? Thank you.

-----Original Message-----
From: Edward Drapkin [mailto:[email protected]]
Sent: Thursday, November 04, 2010 5:13 PM
To: [email protected]
Subject: Re: Forcing to crawl just FQDN

On 11/4/2010 7:10 PM, Eric Martin wrote:
> Hello,
>
> Thanks for all the help so far. Between Google here and the Solr mailing
> list, I have learned a tremendous amount.
>
> I was wondering how I can search just the FQDN in the seed list? So, if I
> was crawling xyz.com, it would only crawl the URLs with xyz.com and skip
> those URLs leading off site. If I knew how to better phrase it, I am sure
> I could find more info on Google.
>
> Eric

Use the urlfilter files in the configuration directory.

Thanks,
Eddie
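[Editor's note: the filter rules discussed in this thread are evaluated top to bottom, with the first matching pattern deciding accept (+) or reject (-). The sketch below is a hypothetical Python mini-model of that first-match behavior, not Nutch code; the rule list mirrors two of the patterns from the thread, with the literal dots escaped as a regex would normally require.]

```python
import re

# Rules as (sign, pattern) pairs, tried in order; first match wins.
# These mirror the subdomain-capturing patterns suggested in the thread.
RULES = [
    ("+", re.compile(r"^http://([a-z0-9]*\.)*ecasebriefs\.com/blog/law/")),
    ("+", re.compile(r"^http://([a-z0-9]*\.)*lawnix\.com/cases/cases-index/")),
    ("-", re.compile(r".")),  # "skip everything else"
]

def passes(url):
    """Return True if the first matching rule accepts the URL."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched: rejected

# The ([a-z0-9]*\.)* group also matches zero subdomain labels,
# so both the bare domain and www.-prefixed URLs pass:
assert passes("http://ecasebriefs.com/blog/law/contracts")
assert passes("http://www.lawnix.com/cases/cases-index/")
# Off-site URLs fall through to the final "-." rule:
assert not passes("http://example.com/some-page")
```

Note that a catch-all "-." as the last rule only affects which discovered links are kept; it does not by itself explain links that were already queued before the filter was changed.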

