On Fri, Oct 26, 2012 at 11:47 AM, Kartashov, Andy <[email protected]> wrote:
> I successfully ran a job on a cluster on foo1 in pseudo-distributed mode and
> am now trying a fully-distributed one.
>
> a. I created another instance foo2 on EC2.
It seems like you're trying to use the start-dfs.sh style startup scripts to
manually run a cluster on EC2. This is doable, but it's not very easy because
of the mismatch in expectations between EC2-style deployments and
start-dfs.sh. Setting up a manually started cluster requires a bit of
up-front work, and EC2 spin-up/spin-down cycles mean you end up redoing that
work frequently. You might consider Whirr (http://whirr.apache.org/) as a
more automated way of deploying Hadoop clusters on EC2. Of course, setting up
a manual cluster can be a really good way to understand how all the parts
work together, and doing it on EC2 should work just fine.

> Installed hadoop on it and copied conf/ folder from foo1 to foo2. I created
> /hadoop/dfs/data folder on the local linux system on foo2.
>
> b. on foo1 I created file conf/slaves and added:
> localhost
> <hostname-of-foo2>

I'd strongly recommend being consistent with the naming: don't mix
"localhost" and DNS names. EC2 has "ec2.internal" in /etc/resolv.conf by
default, so you can "ping ip-10-42-120-3" and it should work just fine. Then
make conf/master list your first host by name, and make conf/slaves list all
your hosts by name. Note that for small clusters, running a DN and a NN on a
single host is an acceptable compromise and works OK.

% cat conf/master
ip-10-42-120-3
% cat conf/slaves
ip-10-42-120-3
ip-10-42-115-32
%

You should also make sure that your user account can ssh to all the nodes:

% for h in $(cat conf/slaves); do ssh -oStrictHostKeyChecking=no $h hostname; done

- answer "yes" to any "allow untrusted certificate" messages
- if you get "permission denied" messages, you'll need to set up
  authorized_keys properly
- after this loop succeeds, you should be able to run it again and get a
  clean list of hostnames

> At this point I cannot find an answer on what to do next.
>
> I started NN, DN, SNN, JT, TT on foo1. After I ran "hadoop fsck /user/bar
> -files -blocks -locations", it showed # of datanodes as 1. I was expecting
> DN and TT on foo2 to be started by foo1. But it didn't happen, so I started
> them myself and tried the command again. Still one DN.

You don't need to start the daemons individually, and doing so is very
difficult to get right. I virtually never do so -- I use the start-*.sh
scripts (start-dfs.sh and start-mapred.sh) to start the daemons (NN, DN, JT,
TT, etc.). The "master" and "slaves" config files are parsed by the
start-*.sh scripts, not by the daemons themselves. And the daemons don't
start themselves -- for a manual cluster, the start-*.sh scripts are
responsible. (In a production deployment such as CDH, there is an
/etc/init.d script, managed by the distro packaging, that starts and manages
the daemons.)

-andy
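
P.S. For completeness, a minimal start-and-verify sequence run on the master,
sketched here assuming a stock Apache Hadoop 1.x tarball layout (the paths
and hostnames are only illustrative), would look roughly like:

% bin/start-dfs.sh              # reads conf/slaves and ssh's out to start the DNs
% bin/start-mapred.sh           # starts the JT locally and a TT on each slave
% bin/hadoop dfsadmin -report   # both datanodes should show up as live

If the report still shows only one live datanode, the datanode log on foo2
(under logs/) usually explains why it couldn't register with the NN.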
