Re: Telling Nutch to skip certain Url

Volli Sun, 22 Aug 2010 23:11:10 -0700

I can't identify your URLs.

"http://mysite<http://mysite/>  .
Mydomain.com/guidance/wiki/index.php/sylebook." ??

"http://mysite<http://mysite/> .Mydomain.com/guidance/........" ????


What's the URL you start with? Is it
http://Mydomain.com/guidance/
or
http://Mydomain.com/guidance/wiki/index.php/sylebook ???

Either way, if the starting URL begins with
http://Mydomain.com/guidance/:

-----------------------------

Open conf/nutch-site.xml and insert between <configuration> and </configuration>:


<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts will be ignored....</description>
</property>
<property>
  <name>urlfilter.regex.file</name>
  <value>crawl-urlfilter.txt</value>
</property>
-------------------------------

Then open conf/crawl-urlfilter.txt and change the default line:

+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

TO
+^http://([a-z0-9]*\.)*Mydomain\.com/guidance/

(will crawl all found links like these examples
http://Mydomain.com/guidance/
http://www.Mydomain.com/guidance/
http://xyz.Mydomain.com/guidance/
http://abcdi.xyzdo.Mydomain.com/guidance/
http://www.uvwb9.6abc.x4yz.Mydomain.com/guidance/
and so on, including all crazy subdomains)
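
To check such a pattern before starting a crawl, here is a minimal sketch in Python. It assumes that Python's re.search behaves like Nutch's Java regex matching for a simple pattern like this one, and the last URL in the list is a made-up counterexample that is not from the thread.

```python
import re

# The rule from above without the leading "+" (the "+" only marks it as an include rule).
PATTERN = re.compile(r"^http://([a-z0-9]*\.)*Mydomain\.com/guidance/")

urls = [
    "http://Mydomain.com/guidance/",
    "http://www.Mydomain.com/guidance/",
    "http://abcdi.xyzdo.Mydomain.com/guidance/",
    "http://www.uvwb9.6abc.x4yz.Mydomain.com/guidance/",
    "http://Mydomain.com/other/",  # hypothetical URL outside /guidance/
]

# Map each URL to True (pattern matches) or False (no match).
results = {url: bool(PATTERN.search(url)) for url in urls}

for url, matched in results.items():
    print(("match     " if matched else "no match  ") + url)
```

Every URL under /guidance/ should be reported as a match, whatever the subdomain depth, and the /other/ URL should not.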

OR CHANGE IT LIKE THIS
+^http://Mydomain\.com/guidance/

(will only crawl all found links starting with
http://Mydomain.com/guidance/)

OR CHANGE IT LIKE THIS
+^http://www\.Mydomain\.com/guidance/

(will only crawl all found links starting with
http://www.Mydomain.com/guidance/)

OR CHANGE IT LIKE THIS
+^http://([a-z0-9]*\.*)Mydomain\.com/guidance/

(will crawl all found links like these examples
http://Mydomain.com/guidance/
http://www.Mydomain.com/guidance/
http://xyz.Mydomain.com/guidance/
BUT NOT:
http://abcdi.xyzdo.Mydomain.com/guidance/
http://www.uvwb9.6abc.x4yz.Mydomain.com/guidance/
BE AWARE: the pattern also matches things like
http://abcMydomain.com/guidance/
but that's an external link, which we've excluded by changing conf/nutch-site.xml, so it shouldn't be crawled.)
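
The difference to the first variant is easy to miss, so here is a small sketch that runs this pattern against the example URLs. Python's re module is used as a stand-in for Nutch's Java regex engine, which is an assumption on my side but should hold for a pattern this simple.

```python
import re

# Single optional prefix: letters/digits followed by any number of dots.
SINGLE_LEVEL = re.compile(r"^http://([a-z0-9]*\.*)Mydomain\.com/guidance/")

def matches(url):
    """Return True if the pattern matches the URL from its beginning."""
    return SINGLE_LEVEL.search(url) is not None

for url in [
    "http://Mydomain.com/guidance/",
    "http://www.Mydomain.com/guidance/",
    "http://abcdi.xyzdo.Mydomain.com/guidance/",
    "http://abcMydomain.com/guidance/",
]:
    print(matches(url), url)
```

The multi-level subdomain is rejected because the group is not repeated, while abcMydomain.com slips through because the dot inside the group is optional.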


OR ADD 2 LINES (OR MORE IF YOU HAVE OTHER COMBINATIONS):
+^http://Mydomain\.com/guidance/
+^http://www\.Mydomain\.com/guidance/

(will only crawl all found links starting with
http://Mydomain.com/guidance/
OR
http://www.Mydomain.com/guidance/)

Then check if the last lines of this file say:
# skip everything else
-.

----------------------------------------------

If you use the so-called runbot script (author: Susam Pal), you must change conf/regex-urlfilter.txt (that's what a tutorial said; I don't know why):


Find the lines:
# accept anything else
+.

Change them to:
# accept anything else
# +.

Then add your line(s) as explained above. I added just one here. Don't forget the skip rule "-." !!!


Example:
+^http://Mydomain\.com/guidance/
# skip everything else
-.

# accept anything else
# +.
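
Why the "-." line matters: as far as I understand the filter, the rules are tried from top to bottom, the first pattern that matches decides, and a URL that matches nothing is dropped. The sketch below imitates that behaviour in Python. It is not Nutch code, the helper names parse_rules and accepts are made up for illustration, and the test URLs are hypothetical.

```python
import re

# Contents of the example filter file from above.
RULES_TEXT = r"""
+^http://Mydomain\.com/guidance/
# skip everything else
-.

# accept anything else
# +.
"""

def parse_rules(text):
    """Turn filter-file text into a list of (include, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """First matching rule decides; a URL matching no rule is rejected."""
    for include, pattern in rules:
        if pattern.search(url):
            return include
    return False

rules = parse_rules(RULES_TEXT)
print(accepts(rules, "http://Mydomain.com/guidance/wiki/index.php"))
print(accepts(rules, "http://Mydomain.com/other/"))
```

Because the include line comes first, URLs under /guidance/ are accepted before the catch-all "-." gets a chance to reject them.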
------------------------------------------

Because the configuration of Nutch is somewhat confusing to me (too many files), I always change both files to play it safe:

conf/regex-urlfilter.txt and conf/crawl-urlfilter.txt.

Only FYI:

What we inserted are regular expressions, hence the many backslashes ("\") that escape the dots. Same thing here: I like to play it safe, even though they are missing in the default line

+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

I think that this line would also match
http://MYXDOMAIN.NAME/
or http://MY.DOMAINXNAME/
because an unescaped "." matches any character.
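
A quick way to see the effect of the missing backslashes is to compare the default line with an escaped version of it. This is only a sketch in Python, assuming its regex behaviour equals Nutch's for these patterns; the escaped variant is my own rewrite of the default line, not something shipped with Nutch.

```python
import re

# Default line as shipped: the dots in the host name are not escaped.
unescaped = re.compile(r"^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/")
# Same line with the dots escaped, so they only match a literal ".".
escaped = re.compile(r"^http://([a-z0-9]*\.)*MY\.DOMAIN\.NAME/")

for url in [
    "http://MY.DOMAIN.NAME/",
    "http://MYXDOMAIN.NAME/",
    "http://MY.DOMAINXNAME/",
]:
    print(url,
          "unescaped:", bool(unescaped.search(url)),
          "escaped:", bool(escaped.search(url)))
```

The unescaped pattern accepts all three URLs, the escaped one only the first.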


On 23.08.2010 05:41, Nemani, Raj wrote:

All,



I am currently using Nutch to crawl an intranet site.  I start the crawl
with one seed url as shown below.



http://mysite<http://mysite/>  .
Mydomain.com/guidance/wiki/index.php/sylebook.



What I would like to do is to tell Nutch to skip all URLs that do
not conform to the following pattern



http://mysite<http://mysite/>  . Mydomain.com/guidance/........



Can anyone please help me with this issue?

I appreciate your help



Thanks

Raj
