I can't make out your URLs.
"http://mysite<http://mysite/> .
Mydomain.com/guidance/wiki/index.php/sylebook." ??
"http://mysite<http://mysite/> .
Mydomain.com/guidance/........" ????
Which URL do you start with? Is it
http://Mydomain.com/guidance/
or
http://Mydomain.com/guidance/wiki/index.php/sylebook ???
Either way, if the seed URL starts with
http://Mydomain.com/guidance/:
-----------------------------
Open conf/nutch-site.xml and insert between <configuration>
and </configuration>:
<property>
<name>db.ignore.external.links</name>
<value>true</value>
<description>If true, outlinks leading from a page to
external hosts will be ignored....</description>
</property>
<property>
<name>urlfilter.regex.file</name>
<value>crawl-urlfilter.txt</value>
</property>
-------------------------------
Then open conf/crawl-urlfilter.txt and change the default line:
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
TO
+^http://([a-z0-9]*\.)*Mydomain\.com/guidance/
(will crawl all found links like these examples
http://Mydomain.com/guidance/
http://www.Mydomain.com/guidance/
http://xyz.Mydomain.com/guidance/
http://abcdi.xyzdo.Mydomain.com/guidance/
http://www.uvwb9.6abc.x4yz.Mydomain.com/guidance/
and so on, including all crazy subdomains)
OR CHANGE IT LIKE THIS
+^http://Mydomain\.com/guidance/
(will only crawl all found links starting with
http://Mydomain.com/guidance/)
OR CHANGE IT LIKE THIS
+^http://www\.Mydomain\.com/guidance/
(will only crawl all found links starting with
http://www.Mydomain.com/guidance/)
OR CHANGE IT LIKE THIS
+^http://([a-z0-9]*\.*)Mydomain\.com/guidance/
(will crawl all found links like these examples
http://Mydomain.com/guidance/
http://www.Mydomain.com/guidance/
http://xyz.Mydomain.com/guidance/
BUT NOT:
http://abcdi.xyzdo.Mydomain.com/guidance/
http://www.uvwb9.6abc.x4yz.Mydomain.com/guidance/
BE AWARE:
this pattern also matches things like
http://abcMydomain.com/guidance/ because the dot is optional.
That is an external host, though, and we excluded external
links by changing conf/nutch-site.xml, so it shouldn't
actually get crawled.)
OR ADD 2 LINES (OR MORE IF YOU HAVE OTHER COMBINATIONS):
+^http://Mydomain\.com/guidance/
+^http://www\.Mydomain\.com/guidance/
(will only crawl all found links starting with
http://Mydomain.com/guidance/
OR
http://www.Mydomain.com/guidance/)
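
To double-check which hosts each of the alternatives above lets through, here is a small sketch in Python. It is only an approximation: Nutch evaluates these lines as Java regular expressions, and I'm assuming Python's `re.search` behaves the same way for patterns this simple. The labels in `PATTERNS` and the helper `accepted` are made up for this illustration.

```python
import re

# Patterns copied from the alternatives above, without the leading "+".
# The dictionary keys are just labels invented for this sketch.
PATTERNS = {
    "any subdomain depth": r"^http://([a-z0-9]*\.)*Mydomain\.com/guidance/",
    "bare host only": r"^http://Mydomain\.com/guidance/",
    "www only": r"^http://www\.Mydomain\.com/guidance/",
    "optional label, optional dot": r"^http://([a-z0-9]*\.*)Mydomain\.com/guidance/",
}

URLS = [
    "http://Mydomain.com/guidance/",
    "http://www.Mydomain.com/guidance/",
    "http://xyz.Mydomain.com/guidance/",
    "http://abcdi.xyzdo.Mydomain.com/guidance/",
    "http://abcMydomain.com/guidance/",
]


def accepted(pattern, urls=URLS):
    """Return the URLs in which the pattern is found (unanchored search)."""
    rx = re.compile(pattern)
    return [u for u in urls if rx.search(u)]


if __name__ == "__main__":
    for label, pattern in PATTERNS.items():
        print(label)
        for url in accepted(pattern):
            print("   ", url)
```

Running it lists, per pattern, the test URLs that would pass. It makes the abcMydomain.com case of the fourth pattern easy to see.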
Then check if the last lines of this file say:
# skip everything else
-.
----------------------------------------------
If you use the so-called runbot script (author: Susam Pal),
you must change conf/regex-urlfilter.txt (that's what a
tutorial said; I don't know why):
Find the lines:
# accept anything else
+.
Change them to:
# accept anything else
# +.
Then add your line(s) as explained above. I added just one
here. Don't forget the skip rule "-." !!!
Example:
+^http://Mydomain\.com/guidance/
# skip everything else
-.
# accept anything else
# +.
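
Why the "-." line matters: as far as I know, the regex URL filter goes through the rules top to bottom, the first rule whose pattern is found in the URL decides (+ accept, - reject), and a URL that matches no rule is dropped. A rough Python sketch of that behaviour, not Nutch's actual code (`parse_rules` and `passes` are names invented here):

```python
import re


def parse_rules(text):
    """Turn filter-file text into a list of (accept, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        if line[0] not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def passes(url, rules):
    """First matching rule wins; a URL matching no rule is rejected."""
    for accept, rx in rules:
        if rx.search(url):
            return accept
    return False


# The example file from above.
RULES = parse_rules(r"""
+^http://Mydomain\.com/guidance/
# skip everything else
-.
# accept anything else
# +.
""")

if __name__ == "__main__":
    for url in ("http://Mydomain.com/guidance/wiki/index.php/sylebook",
                "http://Mydomain.com/other/",
                "http://example.org/"):
        print(url, passes(url, RULES))
```

With an active "+." placed above your own line, every URL would already be accepted by that first rule, which is why it gets commented out.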
------------------------------------------
Because the Nutch configuration is somewhat confusing to me
(too many files), I always change both files to play it safe:
conf/regex-urlfilter.txt and conf/crawl-urlfilter.txt.
Just FYI:
What we inserted are regular expressions, hence the many
backslashes ("\") in front of the dots. Same thing here: I
like to play it safe, even though they are missing in the
default line
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
I think this line would also match
http://MYXDOMAIN.NAME/ or http://MY.DOMAINXNAME/,
because an unescaped dot matches any character.
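
A quick way to test that suspicion, again in Python as a stand-in for Nutch's Java regex engine (the dot behaves the same in both). The names `DEFAULT`, `ESCAPED` and `TEST_URLS` exist only in this sketch:

```python
import re

# The default line as shipped, and the same line with the dots escaped.
DEFAULT = re.compile(r"^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/")
ESCAPED = re.compile(r"^http://([a-z0-9]*\.)*MY\.DOMAIN\.NAME/")

TEST_URLS = (
    "http://MY.DOMAIN.NAME/",
    "http://MYXDOMAIN.NAME/",
    "http://MY.DOMAINXNAME/",
)

if __name__ == "__main__":
    for url in TEST_URLS:
        # Columns: URL, found by the default line?, found by the escaped line?
        print(url, bool(DEFAULT.search(url)), bool(ESCAPED.search(url)))
```

The unescaped version accepts all three URLs; the escaped one accepts only the first.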
On 23.08.2010 05:41, Nemani, Raj wrote:
All,
I am currently using Nutch to crawl an intranet site. I start the crawl
with one seed url as shown below.
http://mysite<http://mysite/> .
Mydomain.com/guidance/wiki/index.php/sylebook.
What I would like to do is to tell Nutch to skip all URLs that do
not conform to the following pattern
http://mysite<http://mysite/> . Mydomain.com/guidance/........
Can anyone please help me with this issue?
I appreciate your help
Thanks
Raj