Hi,
I'm testing Nutch and so far everything works fine (OK, some hours spent
reading, testing, testing and testing, but that's normal).
I have a noob question: I have to crawl websites only within a ccTLD.
In crawl-urlfilter.txt, should I write something like this:
# accept hosts in MY.DOMAIN.NAME
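For illustration, a filter restricted to a ccTLD might look like the sketch below. The `.it` suffix is an assumption (substitute your own TLD); crawl-urlfilter.txt uses `+`/`-` prefixed regular expressions, first match wins:

```
# accept any host under the .it ccTLD (assumed example TLD)
+^http://([a-z0-9-]+\.)*[a-z0-9-]+\.it/
# reject everything else
-.
```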
We have 27 datanodes and the replication factor is 1 (data size is ~6.75 TB).
We have specified two different disks for the DFS data directory on each
datanode by using the
property dfs.data.dir in the hadoop-site.xml file of the conf directory.
(value of the dfs.data.dir property: /mnt/hadoop-dfs/data,
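For context, pointing dfs.data.dir at two disks in hadoop-site.xml would look roughly like this; the second path is a hypothetical example, since the original message is truncated:

```
<property>
  <name>dfs.data.dir</name>
  <!-- comma-separated list of local directories, one per disk;
       the second path here is illustrative only -->
  <value>/mnt/hadoop-dfs/data,/mnt2/hadoop-dfs/data</value>
</property>
```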
Hi, all:
I am new to Nutch. Can anyone please tell me how to index
some text files in a local directory using Nutch's crawler?
Thanks
Victor
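For reference, local-directory crawling in Nutch of this era generally meant seeding a file: URL and enabling the file protocol plugin; a sketch, with the directory path being a hypothetical example:

```
# 1. Seed a file: URL pointing at the local directory (path is illustrative).
mkdir -p urls
echo "file:///home/victor/docs/" > urls/seed.txt
# 2. In conf/nutch-site.xml, make sure plugin.includes contains
#    protocol-file, and that crawl-urlfilter.txt does not reject file: URLs.
# 3. Run the one-step crawl command:
bin/nutch crawl urls -dir crawl -depth 3
```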
Hi Mauro,
Have a look at the domain filter plugin in the SVN version of the code. It
will allow you to filter based on the TLD.
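As a sketch of how that plugin is typically configured (the `.it` suffix is an assumption for a ccTLD, and the plugin must also be enabled via the plugin.includes property), the domain filter reads a file of allowed domains or suffixes, one per line:

```
# domain-urlfilter.txt — hostnames, domains, or suffixes, one per line
# (hypothetical: restrict the crawl to the .it ccTLD)
it
```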
HTH
Julien
--
DigitalPebble Ltd
http://www.digitalpebble.com
2009/3/19 Mauro Vignati vig...@gmail.com
> Hi,
> I'm testing Nutch and so far everything works fine