Crawling a ccTLD

2009-03-21 Thread Mauro Vignati
Hi, I'm testing Nutch and so far everything works fine (OK, some hours spent reading, testing, testing and testing, but that's normal). I have a noob question: I have to crawl websites only within a ccTLD. In crawl-urlfilter.txt, should I write it like this: # accept hosts in MY.DOMAIN.NAME
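
A minimal sketch of the kind of rule the question is about, assuming crawl-urlfilter.txt holds one regular expression per line prefixed with + (accept) or - (reject), and assuming the expression is applied as a search against the URL. The helper names cctld_rule and accepts, and the example ccTLD "ch", are hypothetical and only illustrate the pattern; they are not part of Nutch.

```python
import re


def cctld_rule(cctld: str) -> str:
    """Build an accept rule (hypothetical helper) for hosts under one ccTLD."""
    # One or more dot-terminated host labels, then the ccTLD, an optional port,
    # and the slash that ends the host part of the URL.
    return r"+^https?://([a-z0-9-]+\.)+" + re.escape(cctld) + r"(:[0-9]+)?/"


def accepts(rule: str, url: str) -> bool:
    """Apply a single +/- rule to a URL (illustration only, not Nutch code)."""
    sign, pattern = rule[0], rule[1:]
    return sign == "+" and re.search(pattern, url) is not None


rule = cctld_rule("ch")  # "ch" is an arbitrary example ccTLD
print(rule)
print(accepts(rule, "http://www.example.ch/index.html"))
print(accepts(rule, "http://example.ch.evil.com/"))
```

Ending the pattern with a slash right after the ccTLD is what keeps a host such as example.ch.evil.com from slipping through.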

Problem: data distribution is non-uniform between two different disks on a datanode.

2009-03-21 Thread Vaibhav J
We have 27 datanodes and a replication factor of 1 (data size is ~6.75 TB). We have specified two different disks for the DFS data directory on each datanode, using the dfs.data.dir property in the hadoop-site.xml file in the conf directory. (value of property dfs.data.dir : /mnt/hadoop-dfs/data,
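
One way to quantify the imbalance described here is to compare how full the file system behind each configured directory is. The sketch below assumes dfs.data.dir is a comma-separated list of directories; the helper names and the placeholder paths in the usage comment are hypothetical, since the actual second directory is not shown in the message.

```python
import shutil


def split_data_dirs(value: str) -> list:
    """Split a comma-separated dfs.data.dir value into directory paths."""
    return [part.strip() for part in value.split(",") if part.strip()]


def used_fraction(path: str) -> float:
    """Fraction (0.0 to 1.0) of the file system holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def report(value: str) -> dict:
    """Map each configured directory to the used fraction of its disk."""
    return {path: used_fraction(path) for path in split_data_dirs(value)}


# Usage on a datanode (placeholder paths, replace with the real property value):
# print(report("/path/to/disk1/data,/path/to/disk2/data"))
```

Running this on each datanode gives a per-disk figure that can be compared across the two directories.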

Indexing the local file system

2009-03-21 Thread Huang, Zijian(Victor)
Hi all, I am new to Nutch. Can anyone please tell me what I need to do to index some text files in a local directory using Nutch's crawler? Thanks, Victor

Re: Crawling a ccTLD

2009-03-21 Thread Julien Nioche
Hi Mauro, Have a look at the domain filter plugin in the SVN version of the code. It will allow you to filter based on the TLD. HTH Julien -- DigitalPebble Ltd http://www.digitalpebble.com 2009/3/19 Mauro Vignati vig...@gmail.com Hi, I'm testing Nutch and until now everything works fine
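
The idea of filtering on the TLD instead of on a regular expression can be sketched as a host-suffix lookup. This is only an illustration of the technique, assuming an allow-list of suffixes; it does not reproduce the plugin's code or its configuration format, and the function name host_matches is hypothetical.

```python
from urllib.parse import urlparse


def host_matches(url: str, allowed: set) -> bool:
    """Return True if the URL's host, or any dot-suffix of it, is allowed."""
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    # For "www.example.ch" this checks "www.example.ch", "example.ch", "ch".
    return any(".".join(labels[i:]) in allowed for i in range(len(labels)))


allowed = {"ch"}  # arbitrary example ccTLD
print(host_matches("http://www.example.ch/page", allowed))
print(host_matches("http://ch.example.com/", allowed))
```

Because only whole label suffixes are compared, a host that merely contains the ccTLD as a subdomain label is not accepted.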