Hi Julien,
Many thanks for your help! Your answer will save me many hours of
research and testing. I have found the plugin and am testing it.
I have just one more question. Because I am a real noob, I have never installed a
plugin in Nutch. If I have understood correctly,
I need the following 3 files:
in conf, I have to add a domain-urlfilter.txt: this file declares
which TLDs will be crawled/fetched (I mean, URLs that don't match a listed TLD are
not fetched)
in plugins, I have to add a folder called urlfilter-domain with 2
files inside: DomainURLFilter.jar and plugin.xml. For these 2 files I
don't have to change anything in the configuration.
Am I right, or should I change something?
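To make my understanding of the filtering concrete, here is a minimal Python sketch of suffix-based filtering. It is not the plugin's code: the one-entry-per-line file layout with `#` comments is an assumption, and the names `load_suffixes` and `is_allowed` are made up for this example.

```python
from urllib.parse import urlparse


def load_suffixes(text):
    """Parse filter-file text: one entry per line, '#' starts a comment.

    The file layout is an assumption made for this sketch.
    """
    entries = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            entries.add(line.lower())
    return entries


def is_allowed(url, suffixes):
    """Return True if the host, or any dot-separated suffix of it, is listed."""
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    # For "www.example.ch" this checks "www.example.ch", "example.ch", "ch".
    return any(".".join(labels[i:]) in suffixes for i in range(len(labels)))


if __name__ == "__main__":
    allowed = load_suffixes("# only Swiss sites\nch\n")
    for url in ("http://www.example.ch/page", "http://www.example.com/"):
        print(url, is_allowed(url, allowed))
```

Comparing whole labels rather than raw string endings means a host such as munich.de is not mistaken for a .ch host.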
Thanks again for your help!
Warm regards
Mauro
Hi Mauro,
Have a look at the domain filter plugin in the SVN version of the code. It
will allow you to filter based on the TLD.
HTH
Julien
--
DigitalPebble Ltd
http://www.digitalpebble.com
2009/3/19 Mauro Vignati vig...@gmail.com
Hi,
I'm testing Nutch and so far everything works fine (OK, some hours spent
reading, testing, testing and testing, but that's normal).
I have a noob question: I have to crawl websites only within a ccTLD.
In crawl-urlfilter.txt, should I write this:
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*.ch/
or this:
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*ch/
The difference is the dot before the ch ccTLD. I mean, does the dot just before
the closing bracket already separate the ccTLD from the name (or the root from a
subdomain), or should I add one as in the first example? In the
installation guide I can see:
+^http://([a-z0-9]*\.)*apache.org/
Does this crawl every subdomain of apache.org (xxx.apache.org), or does it
crawl only apache.org?
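For illustration, the three patterns can be checked outside Nutch with a short Python sketch. It assumes the filter accepts a URL when the pattern is found in it (hence `search`), and that these simple patterns behave the same in Python's `re` module as in Java's regex engine. The `accepts` helper and the sample URLs are made up for this example.

```python
import re

# The patterns from this message, without the leading "+" that marks
# an accept rule in the filter file.
WITH_EXTRA_DOT = re.compile(r"^http://([a-z0-9]*\.)*.ch/")
WITHOUT_EXTRA_DOT = re.compile(r"^http://([a-z0-9]*\.)*ch/")
APACHE = re.compile(r"^http://([a-z0-9]*\.)*apache.org/")


def accepts(pattern, url):
    """Return True if the pattern is found in the URL (assumed filter behaviour)."""
    return pattern.search(url) is not None


if __name__ == "__main__":
    samples = [
        (WITHOUT_EXTRA_DOT, "http://www.example.ch/"),
        (WITH_EXTRA_DOT, "http://www.example.ch/"),
        (WITHOUT_EXTRA_DOT, "http://www.example.com/"),
        (APACHE, "http://apache.org/"),
        (APACHE, "http://lucene.apache.org/"),
    ]
    for pattern, url in samples:
        print(pattern.pattern, url, accepts(pattern, url))
```

Under those assumptions the version without the extra dot accepts ordinary .ch hosts. The version with the extra dot rejects them, because the unescaped `.` consumes one character of the last label. The apache.org pattern accepts both apache.org and its subdomains, since the bracketed group may repeat zero times.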
Many thanks for any help
Mauro