RE: Telling Nutch to skip certain Url

Nemani, Raj Mon, 23 Aug 2010 07:37:41 -0700

They are intranet URLs, so I went with a generic description.  They are
not available outside.
I start with http://Mydomain.com/guidance/wiki/index.php/sylebook


I think +^http://Mydomain\.com/guidance/ will work for me.

Thank you so much for such a detailed explanation.

Thanks again
Raj

-----Original Message-----
From: [email protected] [mailto:[email protected]] 
Sent: Monday, August 23, 2010 2:07 AM
To: [email protected]
Subject: Re: Telling Nutch to skip certain Url

I can't identify your urls.

"http://mysite<http://mysite/>  .
Mydomain.com/guidance/wiki/index.php/sylebook." ??

"http://mysite<http://mysite/>  . 
Mydomain.com/guidance/........" ????

What's the URL you start with? Is it
http://Mydomain.com/guidance/
or
http://Mydomain.com/guidance/wiki/index.php/sylebook ???

Either way, if the starting URL begins with
http://Mydomain.com/guidance/:

-----------------------------
Open conf/nutch-site.xml and insert between <configuration> and
</configuration>:

<property>
   <name>db.ignore.external.links</name>
   <value>true</value>
   <description>If true, outlinks leading from a page to external hosts
   will be ignored....</description>
</property>
<property>
   <name>urlfilter.regex.file</name>
   <value>crawl-urlfilter.txt</value>
</property>
-------------------------------

Then open conf/crawl-urlfilter.txt and change the default line:

+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

TO
+^http://([a-z0-9]*\.)*Mydomain\.com/guidance/

(will crawl all found links like these examples
http://Mydomain.com/guidance/ http://www.Mydomain.com/guidance/
http://xyz.Mydomain.com/guidance/
http://abcdi.xyzdo.Mydomain.com/guidance/
http://www.uvwb9.6abc.x4yz.Mydomain.com/guidance/
and so on, including all crazy subdomains)
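
The subdomain pattern can be checked outside of Nutch with a small Python sketch. It assumes that Python's re.search treats this simple pattern the same way the Java regex engine used by Nutch does. The helper name is_accepted and the URL http://other.example.org/guidance/ are made up for illustration; the other URLs are the examples listed above.

```python
import re

# Pattern copied from conf/crawl-urlfilter.txt, without the leading "+".
SUBDOMAIN_PATTERN = re.compile(r"^http://([a-z0-9]*\.)*Mydomain\.com/guidance/")


def is_accepted(url):
    """Return True if the pattern matches at the start of the URL."""
    return SUBDOMAIN_PATTERN.search(url) is not None


urls = [
    "http://Mydomain.com/guidance/",
    "http://www.Mydomain.com/guidance/",
    "http://abcdi.xyzdo.Mydomain.com/guidance/",
    "http://www.uvwb9.6abc.x4yz.Mydomain.com/guidance/",
    "http://other.example.org/guidance/",  # hypothetical outside URL
]

for url in urls:
    print(url, is_accepted(url))
```

Every Mydomain.com URL in the list is accepted regardless of how many subdomain labels it has; the outside URL is not.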

OR CHANGE IT LIKE THIS
+^http://Mydomain\.com/guidance/

(will only crawl all found links starting with
http://Mydomain.com/guidance/)

OR CHANGE IT LIKE THIS
+^http://www\.Mydomain\.com/guidance/

(will only crawl all found links starting with
http://www.Mydomain.com/guidance/)

OR CHANGE IT LIKE THIS
+^http://([a-z0-9]*\.*)Mydomain\.com/guidance/

(will crawl all found links like these examples:
http://Mydomain.com/guidance/ http://www.Mydomain.com/guidance/
http://xyz.Mydomain.com/guidance/
BUT NOT:
http://abcdi.xyzdo.Mydomain.com/guidance/
http://www.uvwb9.6abc.x4yz.Mydomain.com/guidance/

BE AWARE:
This pattern also matches things like
http://abcMydomain.com/guidance/ because the dot before Mydomain is
optional here. That's an external link, though, which we've already
excluded by changing conf/nutch-site.xml, so it shouldn't be crawled.)
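
To see the difference made by moving the star inside the parentheses, here is a minimal Python sketch of the same check. It again assumes Python's re.search behaves like Nutch's Java regex matching for a pattern this simple; matches_single_label is a made-up helper name.

```python
import re

# Variant with "\.*" inside the group: at most one lowercase label,
# followed by zero or more dots, directly before "Mydomain".
SINGLE_LABEL_PATTERN = re.compile(r"^http://([a-z0-9]*\.*)Mydomain\.com/guidance/")


def matches_single_label(url):
    """Return True if the variant pattern matches at the start of the URL."""
    return SINGLE_LABEL_PATTERN.search(url) is not None


urls = [
    "http://Mydomain.com/guidance/",
    "http://www.Mydomain.com/guidance/",
    "http://xyz.Mydomain.com/guidance/",
    "http://abcdi.xyzdo.Mydomain.com/guidance/",
    "http://abcMydomain.com/guidance/",
]

for url in urls:
    print(url, matches_single_label(url))
```

The two-label host abcdi.xyzdo.Mydomain.com is rejected, while abcMydomain.com slips through because the pattern does not require a dot between the label and Mydomain.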

OR ADD 2 LINES (OR MORE IF YOU HAVE OTHER COMBINATIONS):
+^http://Mydomain\.com/guidance/
+^http://www\.Mydomain\.com/guidance/

(will only crawl all found links starting with
http://Mydomain.com/guidance/ OR
http://www.Mydomain.com/guidance/)

Then check if the last lines of this file say:
# skip everything else
-.

----------------------------------------------
If you use the so-called runbot script (author: Susam Pal) you must
change conf/regex-urlfilter.txt (that's what a tutorial said; I don't
know why):

Find the lines:
# accept anything else
+.

Change them to:
# accept anything else
# +.

Then add your line(s) as explained above. I added just one here. Don't
forget the skip line "-." !!!

Example:
+^http://Mydomain\.com/guidance/
# skip everything else
-.

# accept anything else
# +.
------------------------------------------
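
The reason the "-." line matters can be sketched in Python. This is a simplified model, not Nutch's actual code: it assumes rules are tried top to bottom, the first rule whose pattern matches decides, and a URL that matches no rule is dropped. parse_rules and decide are hypothetical helper names.

```python
import re

# The example filter file from above, as a raw string so "\." stays intact.
FILTER_FILE = r"""
+^http://Mydomain\.com/guidance/
# skip everything else
-.

# accept anything else
# +.
"""


def parse_rules(text):
    """Turn filter lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def decide(rules, url):
    """Return True if the first matching rule accepts the URL."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumption: no matching rule means the URL is dropped


rules = parse_rules(FILTER_FILE)
for url in (
    "http://Mydomain.com/guidance/wiki/index.php/sylebook",
    "http://www.Mydomain.com/guidance/",
):
    print(url, decide(rules, url))
```

In this model the commented-out "+." never becomes a rule, so the www URL is caught by "-." and rejected.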

Because the configuration of Nutch is somewhat confusing for me (too
many files), I always change both files to play it safe:
conf/regex-urlfilter.txt and conf/crawl-urlfilter.txt.

Just FYI:
What we inserted are regular expressions, hence the backslashes ("\")
before the dots. An unescaped dot matches any character, so I like to
play it safe and escape the dots, even though the backslashes are
missing in the default line
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

I think that this line would match:
http://MYXDOMAIN.NAME/, too.
Or http://MY.DOMAINXNAME/
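
A short Python sketch can confirm that suspicion. It compares the default line with a version where the dots are escaped, assuming Python's re.search matches these patterns the same way Nutch's Java regex does. DEFAULT_LINE and ESCAPED_LINE are names chosen for illustration.

```python
import re

# Default line as shipped: the dots in MY.DOMAIN.NAME are unescaped.
DEFAULT_LINE = re.compile(r"^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/")
# Same line with the dots escaped, so they match only a literal ".".
ESCAPED_LINE = re.compile(r"^http://([a-z0-9]*\.)*MY\.DOMAIN\.NAME/")

urls = (
    "http://MY.DOMAIN.NAME/",
    "http://MYXDOMAIN.NAME/",
    "http://MY.DOMAINXNAME/",
)

for url in urls:
    default_hit = DEFAULT_LINE.search(url) is not None
    escaped_hit = ESCAPED_LINE.search(url) is not None
    print(url, "default:", default_hit, "escaped:", escaped_hit)
```

The default line matches all three URLs; the escaped line matches only http://MY.DOMAIN.NAME/.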


On 23.08.2010 05:41, Nemani, Raj wrote:
> All,
>
>
>
> I am currently using Nutch to crawl an intranet site.  I start the 
> crawl with one seed url as shown below.
>
>
>
> http://mysite<http://mysite/>  .
> Mydomain.com/guidance/wiki/index.php/sylebook.
>
>
>
> What I would like to do is to tell Nutch to skip all that URLS that do
> not conform to the following the pattern
>
>
>
> http://mysite<http://mysite/>  . Mydomain.com/guidance/........
>
>
>
> Can anyone please help me with this issue?
>
> I appreciate your help
>
>
>
> Thanks
>
> Raj
