tittutomen wrote:
>
> Hi,
>
> I've been trying to set up a distributed Nutch/Hadoop environment to crawl
> a list of 3 million URLs.
>
> My experience so far:
>
> 1. Nutch works fine in a single-machine environment. There I wrote a
> script that first calls the nutch crawl command to crawl 1000 URLs, then
> crawls the next 1000 URLs. The two indexes produced by these runs are
> merged into a single merged.index. The script then repeatedly crawls
> another 1000 URLs and merges the result with the previous merged index.
> This is stable enough and runs smoothly.
>
> 2. I then tried to create a distributed environment with 4 machines.
> There are 2 master nodes with 2 GB RAM each, one for the NameNode and the
> other for the JobTracker. The remaining 2 machines have 1 GB RAM each. I
> made all 4 machines slave nodes. I run the same script to take 5000 URLs
> from the list of 3 million URLs and crawl them. Then the next 5000 are
> crawled and merged with the previous index. Here I found that the DFS
> environment is not stable. After running for 2 or 3 cycles it breaks in
> different ways: either the crawl fails or the merge fails.
>
> I have since tried several different configurations, such as running
> both masters on a single node or running only 3 slaves, but it still
> does not get beyond 2 or 3 cycles.
>
> Could anybody suggest where I'm going wrong, or whether there is a better
> alternative? I have read docs claiming Nutch runs on 100+ machines. Does
> that mean it runs only once? How long can the DFS environment be kept
> stable? Do I have to restart DFS before beginning every crawl/merge cycle?
>
> There are a lot of errors, such as Datanode missing,
> FileAlreadyCreatedException, JobFailed, RPCExceptions etc.
>
> I would appreciate any help in this regard, and I'm happy to share what I
> have learned so far. Please write!
>
> Thanks in advance!!!
>
>
>
Another improvement I found came when I restarted the DFS environment. It
takes time, but I think it makes the system more stable. I don't know
whether it is the correct way to go, though...
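
For reference, the crawl/merge cycle described in the quoted message can be
sketched roughly as below. This is only an illustration of the batching
logic, not the actual script: the names batches and run_cycles, and the
crawl, merge and restart_dfs callables, are hypothetical placeholders and
not Nutch or Hadoop APIs. In a real setup they would wrap whatever commands
the script already runs.

```python
from typing import Callable, Iterator, List, Optional, Sequence


def batches(urls: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `urls`, each with at most `size` entries."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(urls), size):
        yield list(urls[start:start + size])


def run_cycles(
    urls: Sequence[str],
    size: int,
    crawl: Callable[[List[str], int], str],
    merge: Callable[[str, str], str],
    restart_dfs: Optional[Callable[[], None]] = None,
) -> Optional[str]:
    """Crawl `urls` batch by batch, folding each new index into one merged index.

    crawl(batch, cycle) -> name of the index built for that batch (placeholder)
    merge(merged, new)  -> name of the combined index (placeholder)
    restart_dfs()       -> optional hook called before every cycle (placeholder)
    """
    merged: Optional[str] = None
    for cycle, batch in enumerate(batches(urls, size)):
        if restart_dfs is not None:
            restart_dfs()
        new_index = crawl(batch, cycle)
        # The first cycle has nothing to merge with; later cycles fold in.
        merged = new_index if merged is None else merge(merged, new_index)
    return merged


if __name__ == "__main__":
    # Dry run with stand-in callables; nothing here touches Nutch or Hadoop.
    demo_urls = [f"http://example.com/{i}" for i in range(2500)]
    result = run_cycles(
        demo_urls,
        1000,
        crawl=lambda batch, cycle: f"index-{cycle}",
        merge=lambda merged, new: f"({merged}+{new})",
    )
    print(result)
```

Passing restart_dfs as an optional hook makes it easy to compare runs with
and without a DFS restart between cycles, which is the open question here.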
Thanks
-Subas
--
Sent from the Nutch - User mailing list archive at Nabble.com.