Re: Crawling a ccTLD

2009-03-23 Thread Mauro Vignati
Hi Julien,
Many thanks for your help! Your answer will save me many hours of
research and testing. I have found the plugin and am testing it.
I have just one more question. Because I am a real noob, I have never installed a
plugin in Nutch. If I have understood correctly,
I need the following 3 files:

In the conf directory I have to add a domain-urlfilter.txt: this file declares
which TLDs will be crawled/fetched (I mean, URLs that don't match the TLD are
not fetched).

In the plugins directory I have to add a folder called urlfilter-domain with 2
files inside: DomainURLFilter.jar and plugin.xml. For these 2 files I
don't have to change anything in the configuration.

Am I right or should I change something?

Thanks again for your help!
Warm regards
Mauro
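
[Editorial sketch] The idea described above (fetch a URL only if its host ends in an allowed TLD) can be illustrated in a few lines of Python. This is not the plugin's code: the function name host_in_allowed_tld and the set ALLOWED_SUFFIXES are hypothetical, and the set only stands in for whatever domain-urlfilter.txt would list. Note also that Nutch plugins generally have to be named in the plugin.includes property before they are loaded, so the "nothing to change in the configuration" assumption is worth checking.

```python
from urllib.parse import urlparse

# Hypothetical stand-in for the entries listed in domain-urlfilter.txt.
ALLOWED_SUFFIXES = {"ch"}


def host_in_allowed_tld(url, suffixes=ALLOWED_SUFFIXES):
    """Return True if the URL's host ends in one of the allowed TLDs."""
    host = urlparse(url).hostname  # lower-cased host, or None if absent
    if host is None:
        return False
    # The TLD is whatever follows the last dot of the host name.
    tld = host.rsplit(".", 1)[-1]
    return tld in suffixes


if __name__ == "__main__":
    for url in ("http://www.example.ch/page",
                "http://my-site.ch:8080/",
                "http://example.com/ch/"):
        print(url, host_in_allowed_tld(url))
```

Comparing the parsed host rather than the raw URL string means hyphens, ports and paths do not get in the way, which is the main attraction of a suffix check over a hand-written regex.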


Hi Mauro,

 Have a look at the domain filter plugin in the SVN version of the code. It
 will allow you to filter based on the TLD.

 HTH

 Julien

 --
 DigitalPebble Ltd
 http://www.digitalpebble.com


 2009/3/19 Mauro Vignati vig...@gmail.com

  Hi,
  I'm testing Nutch and until now everything works fine (ok, some hours spent
  in reading, testing, testing and testing, but that's normal).
  I have a noob question: I have to crawl websites only within a ccTLD.
 
  In the crawl-urlfilter.txt should I write this:
 
  # accept hosts in MY.DOMAIN.NAME
  +^http://([a-z0-9]*\.)*.ch/
 
 
  or this:
 
  # accept hosts in MY.DOMAIN.NAME
  +^http://([a-z0-9]*\.)*ch/
 
 
  The difference is the dot before the ch ccTLD. I mean, is the dot before
  the closing bracket already dividing the ccTLD and the name (or the root and a
  subdomain), or should I add one like in the first example? In the
  installation guide I can see:
 
  +^http://([a-z0-9]*\.)*apache.org/
 
  Is this crawling every subdomain of apache.org (xxx.apache.org), or is it
  crawling apache.org?
 
  Many thanks for any help
  Mauro
 



Re: Crawling a ccTLD

2009-03-21 Thread Julien Nioche
Hi Mauro,

Have a look at the domain filter plugin in the SVN version of the code. It
will allow you to filter based on the TLD.

HTH

Julien

-- 
DigitalPebble Ltd
http://www.digitalpebble.com


2009/3/19 Mauro Vignati vig...@gmail.com

 Hi,
 I'm testing Nutch and until now everything works fine (ok, some hours spent
 in reading, testing, testing and testing, but that's normal).
 I have a noob question: I have to crawl websites only within a ccTLD.

 In the crawl-urlfilter.txt should I write this:

 # accept hosts in MY.DOMAIN.NAME
 +^http://([a-z0-9]*\.)*.ch/


 or this:

 # accept hosts in MY.DOMAIN.NAME
 +^http://([a-z0-9]*\.)*ch/


 The difference is the dot before the ch ccTLD. I mean, is the dot before the
 closing bracket already dividing the ccTLD and the name (or the root and a
 subdomain), or should I add one like in the first example? In the
 installation guide I can see:

 +^http://([a-z0-9]*\.)*apache.org/

 Is this crawling every subdomain of apache.org (xxx.apache.org), or is it
 crawling apache.org?

 Many thanks for any help
 Mauro
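
[Editorial sketch] The regex question above can be checked directly. The Python below tries the three patterns from the message against a few sample URLs. It assumes the filter looks for the pattern anywhere in the URL (search semantics); since every pattern starts with ^, that amounts to matching from the start of the URL. The sample URLs and the helper name accepts are made up for illustration, and Python's re module is used only as a stand-in for the Java regex engine, which behaves the same way for these simple patterns.

```python
import re

# Patterns copied from the message, without the leading '+' that marks an
# "accept" rule in crawl-urlfilter.txt.
WITH_EXTRA_DOT = r"^http://([a-z0-9]*\.)*.ch/"
WITHOUT_EXTRA_DOT = r"^http://([a-z0-9]*\.)*ch/"
APACHE_ORG = r"^http://([a-z0-9]*\.)*apache.org/"


def accepts(pattern, url):
    """Return True if the pattern is found in the URL.

    Assumption: the filter searches for the pattern, so re.search is used.
    """
    return re.search(pattern, url) is not None


if __name__ == "__main__":
    samples = [
        "http://example.ch/",
        "http://www.example.ch/",
        "http://example.xch/",
        "http://my-site.ch/",
        "http://example.com/",
        "http://apache.org/",
        "http://lucene.apache.org/",
    ]
    for url in samples:
        print(url,
              accepts(WITH_EXTRA_DOT, url),
              accepts(WITHOUT_EXTRA_DOT, url),
              accepts(APACHE_ORG, url))
```

The group ([a-z0-9]*\.)* already consumes each label together with its trailing dot, so the second form (no extra dot) accepts example.ch and www.example.ch. In the first form the extra unescaped . demands one more arbitrary character before ch, so it rejects example.ch but accepts a host such as example.xch. Because the group may repeat zero times, the apache.org pattern accepts both apache.org itself and any subdomain. One caveat of the character class: a host containing a hyphen, such as my-site.ch, is rejected by all of these patterns.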