They are intranet URLs, so I went with a generic description. They are not available outside.

I start with http://Mydomain.com/guidance/wiki/index.php/sylebook
I think +^http://Mydomain\.com/guidance/ will work for me. Thank you so much for such a detailed explanation.

Thanks again
Raj

-----Original Message-----
From: [email protected] [mailto:[email protected]]
Sent: Monday, August 23, 2010 2:07 AM
To: [email protected]
Subject: Re: Tellling Nutch to skip certain Url

I can't identify your URLs.
"http://mysite<http://mysite/> . Mydomain.com/guidance/wiki/index.php/sylebook." ??
"http://mysite<http://mysite/> . Mydomain.com/guidance/........" ????

Which URL do you start with? Is it
http://Mydomain.com/guidance/
or
http://Mydomain.com/guidance/wiki/index.php/sylebook
???

Whatever. If the starting URL starts with http://Mydomain.com/guidance/:

-----------------------------
Open conf/nutch-site.xml and insert between <configuration> and </configuration>:

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts will be ignored....</description>
</property>

<property>
  <name>urlfilter.regex.file</name>
  <value>crawl-urlfilter.txt</value>
</property>
-------------------------------

Then open conf/crawl-urlfilter.txt and change the default line:

+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

TO

+^http://([a-z0-9]*\.)*Mydomain\.com/guidance/

(will crawl all found links like these examples
http://Mydomain.com/guidance/
http://www.Mydomain.com/guidance/
http://xyz.Mydomain.com/guidance/
http://abcdi.xyzdo.Mydomain.com/guidance/
http://www.uvwb9.6abc.x4yz.Mydomain.com/guidance/
and so on, including all crazy subdomains)

OR CHANGE IT LIKE THIS

+^http://Mydomain\.com/guidance/

(will only crawl found links starting with http://Mydomain.com/guidance/)

OR CHANGE IT LIKE THIS

+^http://www\.Mydomain\.com/guidance/

(will only crawl found links starting with http://www.Mydomain.com/guidance/)

OR CHANGE IT LIKE THIS

+^http://([a-z0-9]*\.*)Mydomain\.com/guidance/

(will crawl all found links like these examples
http://Mydomain.com/guidance/
http://www.Mydomain.com/guidance/
http://xyz.Mydomain.com/guidance/
BUT NOT:
http://abcdi.xyzdo.Mydomain.com/guidance/
http://www.uvwb9.6abc.x4yz.Mydomain.com/guidance/
BE AWARE: this pattern also matches things like http://abcMydomain.com/guidance/
(but that's an external link, which we've excluded by changing conf/nutch-site.xml, so it shouldn't be crawled.)

OR ADD 2 LINES (OR MORE IF YOU HAVE OTHER COMBINATIONS):

+^http://Mydomain\.com/guidance/
+^http://www\.Mydomain\.com/guidance/

(will only crawl found links starting with http://Mydomain.com/guidance/ OR http://www.Mydomain.com/guidance/)

Then check that the last lines of this file say:

# skip everything else
-.

----------------------------------------------
If you use the so-called runbot script (Author: Susam Pal) you must change conf/regex-urlfilter.txt (that's what a tutorial said. I don't know why.):

Find the lines:

# accept anything else
+.

Change them to:

# accept anything else
# +.

Then add your line(s) as explained above. I added just one here. Don't forget the skip thing "-." !!!

Example:

+^http://Mydomain\.com/guidance/

# skip everything else
-.

# accept anything else
# +.

------------------------------------------
Because the configuration of Nutch is somewhat confusing to me (too many files), I always change both files to play it safe: conf/regex-urlfilter.txt and conf/crawl-urlfilter.txt.

Just FYI: what we inserted are regular expressions, hence the many backslashes ("\") in front of the dots. Same thing here: I like to play it safe, even though they are missing in the default line

+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

I think that this line would also match http://MYXDOMAIN.NAME/ or http://MY.DOMAINXNAME/, because an unescaped "." matches any character.

On 23.08.2010 05:41, Nemani, Raj wrote:
> All,
>
> I am currently using Nutch to crawl an intranet site. I start the
> crawl with one seed URL as shown below.
>
> http://mysite<http://mysite/> .
> Mydomain.com/guidance/wiki/index.php/sylebook.
>
> What I would like to do is to tell Nutch to skip all URLs that do
> not conform to the following pattern
>
> http://mysite<http://mysite/> . Mydomain.com/guidance/........
>
> Can anyone please help me with this issue?
>
> I appreciate your help
>
> Thanks
>
> Raj
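
The filter rules discussed in this thread can be sanity-checked offline before starting a crawl. Below is a minimal sketch in Python, assuming that the regex URL filter tries the rules from top to bottom, lets the first matching rule decide ("+" accepts, "-" rejects), and drops a URL that no rule matches. Python's re module stands in for the Java regex engine that Nutch uses, and the names parse_rules, accepts and the rule-set constants are made up for this illustration; they are not part of Nutch.

```python
import re

# Hypothetical rule sets, built from the patterns discussed in the thread.
STRICT = r"""
+^http://Mydomain\.com/guidance/
# skip everything else
-.
"""

ANY_SUBDOMAIN = r"""
+^http://([a-z0-9]*\.)*Mydomain\.com/guidance/
-.
"""

LOOSE = r"""
+^http://([a-z0-9]*\.*)Mydomain\.com/guidance/
-.
"""

# The shipped default line, with its unescaped dots left as they are.
DEFAULT_LINE = r"""
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
-.
"""

RULE_SETS = {
    "STRICT": STRICT,
    "ANY_SUBDOMAIN": ANY_SUBDOMAIN,
    "LOOSE": LOOSE,
}

EXAMPLE_URLS = [
    "http://Mydomain.com/guidance/wiki/index.php/sylebook",
    "http://www.Mydomain.com/guidance/",
    "http://abcdi.xyzdo.Mydomain.com/guidance/",
    "http://abcMydomain.com/guidance/",
    "http://Mydomain.com/other/",
]


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs.

    Blank lines and lines starting with '#' are ignored.
    """
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Assumed semantics: the first rule whose pattern is found in the URL decides."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumption: a URL that matches no rule is dropped


if __name__ == "__main__":
    for name, rules_text in RULE_SETS.items():
        rules = parse_rules(rules_text)
        for url in EXAMPLE_URLS:
            verdict = "crawl" if accepts(rules, url) else "skip"
            print(f"{name:14} {verdict:5} {url}")
```

Running it prints, for each rule set, which of the example URLs would pass. It only exercises the patterns; it does not reproduce db.ignore.external.links or any other Nutch setting.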

