crawl-urlfilter is for the one-stop "crawl" command, while regex-urlfilter is for general-purpose use. As far as I know, crawl-urlfilter is not used anymore in current Nutch versions.
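So rules like the ones below should go into regex-urlfilter.txt rather than crawl-urlfilter.txt. A minimal sketch, assuming a recent Nutch where regex-urlfilter.txt is the active filter, using two of the sites from the thread as an example:

# accept these sites, with or without a subdomain prefix such as www.
+^http://([a-z0-9]*\.)*ecasebriefs.com/blog/law/
+^http://([a-z0-9]*\.)*lawnix.com/cases/cases-index/
# skip everything else
-.

Rules are checked top to bottom and the first matching +/- line decides whether a URL is kept, so the final "-." has to stay last.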
Best Regards
Alexander Aristov

On 5 November 2010 21:34, Eddie Drapkin <[email protected]> wrote:

> Are you sure that you're using the crawl-urlfilter.txt file and not the
> regex-urlfilter.txt? I'm not sure which one is the default (the script I
> use to generate configuration for Nutch creates them both as copies, fwiw).
>
> Thanks,
> Eddie
>
>
> On 11/5/2010 12:46 PM, Eric Martin wrote:
>
>> I adjusted the crawl-urlfilter.txt file to read:
>>
>> # accept hosts in MY.DOMAIN.NAME
>> +^http://ecasebriefs.com/blog/law/
>> +^http://lawnix.com/cases/cases-index/
>> +^http://oyez.org/
>> +^http://4lawnotes.com/
>> +^http://docstoc.com/documents/education/law-school/case-briefs
>> +^http://lawschoolcasebriefs.com/
>> +^http://dictionary.findlaw.com/
>> # skip everything else
>> -.
>>
>> But I am still fetching extrinsic URLs. Does Nutch fetch the extrinsic
>> URLs regardless of the urlfilter and THEN ignore them during the crawl?
>>
>> Eric (law student)
>>
>>
>> -----Original Message-----
>> From: Eric Martin [mailto:[email protected]]
>> Sent: Thursday, November 04, 2010 5:26 PM
>> To: [email protected]
>> Subject: RE: Forcing to crawl just FQDN
>>
>> So,
>>
>> +^http://([a-z0-9]*\.)*ecasebriefs.com/blog/law/
>> +^http://([a-z0-9]*\.)*lawnix.com/cases/cases-index/
>>
>> Will only crawl the URLs from that particular domain and directory?
>>
>> Thank you.
>>
>> -----Original Message-----
>> From: Edward Drapkin [mailto:[email protected]]
>> Sent: Thursday, November 04, 2010 5:13 PM
>> To: [email protected]
>> Subject: Re: Forcing to crawl just FQDN
>>
>> On 11/4/2010 7:10 PM, Eric Martin wrote:
>>
>>> Hello,
>>>
>>> Thanks for all the help so far. Between Google here and the Solr mailing
>>> list, I have learned a tremendous amount.
>>>
>>> I was wondering how I can crawl just the FQDNs in the seed list? So, if I
>>> was crawling xyz.com it would only crawl the URLs with xyz.com and skip
>>> those URLs leading off site. If I knew how to better phrase it, I am sure
>>> I could find more info on Google.
>>>
>>> Eric
>>>
>> Use the urlfilter files in the configuration directory.
>>
>> Thanks,
>> Eddie
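To make the difference between the two pattern styles in the thread concrete, here is a small, self-contained Java check. This is an illustration only, using plain java.util.regex rather than Nutch itself; the class name and sample URLs are made up, and the two patterns are copied from the messages above (unescaped dots included).

import java.util.regex.Pattern;

public class UrlFilterDemo {
    public static void main(String[] args) {
        // Plain pattern, as in the crawl-urlfilter.txt excerpt above: bare host only.
        String plain    = "^http://ecasebriefs.com/blog/law/";
        // Subdomain-aware pattern from Eric's follow-up question.
        String withSubs = "^http://([a-z0-9]*\\.)*ecasebriefs.com/blog/law/";

        String[] urls = {
            "http://ecasebriefs.com/blog/law/some-case/",      // bare domain
            "http://www.ecasebriefs.com/blog/law/some-case/",  // www. form of the same page
            "http://example.com/other-site/"                   // off-site link
        };

        for (String url : urls) {
            System.out.printf("%-47s plain=%-5b withSubdomains=%b%n",
                    url,
                    Pattern.compile(plain).matcher(url).find(),
                    Pattern.compile(withSubs).matcher(url).find());
        }
    }
}

The bare pattern rejects the www. form of the page, which then falls through to the final "-." rule and is dropped, while the ([a-z0-9]*\.)* version accepts both; neither pattern accepts the off-site URL.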

