Hello Hadoopers,

After weeks of struggle and plenty of error debugging, I finally managed to set up a 
fully distributed cluster, and I decided to share my experience with the newcomers.
If the experts here disagree with any of the facts mentioned herein, feel free to 
correct me or add your comments.

Example Cluster Topology:
Node 1 – NameNode+JobTracker
Node 2 – SecondaryNameNode
Node 3, 4, .., N – DataNodes 1,2,..N+TaskTrackers 1,2,..N

Configuration set-up after you installed Hadoop:

Firstly, you will need to find the fully qualified host name of each of your nodes by 
running:
$ hostname -f

Your /etc/hadoop/ folder contains subfolders with your configuration files. Your 
installation will create a default folder, conf.empty. Copy it to, say, 
conf.cluster and make sure the conf soft link points to conf.cluster.
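For example (a minimal sketch; adjust the paths if your distribution lays things out differently):
$ sudo cp -r /etc/hadoop/conf.empty /etc/hadoop/conf.cluster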

You can see what it currently points to by running:
$ alternatives --display hadoop-conf

Make a new link and set it to point to conf.cluster:
$ sudo alternatives --verbose --install /etc/hadoop/conf hadoop-conf 
/etc/hadoop/conf.cluster 50
$ sudo alternatives --set hadoop-conf /etc/hadoop/conf.cluster
Run the display command again to check that the configuration is correct:
$ alternatives --display hadoop-conf

Let’s go inside conf.cluster:
$ cd conf.cluster/

As a minimum, we will need to modify the following files:
1.      core-site.xml
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://<host-name>:8020/</value> # the host-name of your NameNode (Node 1), which you found with "hostname -f" above
</property>

2.      mapred-site.xml
  <property>
    <name>mapred.job.tracker</name>
    <!-- the host-name of your NameNode (Node 1) as well, since we intend to run the NameNode and JobTracker on the same machine -->
    <value>hdfs://ip-10-62-62-235.ec2.internal:8021</value>
  </property>

3.      masters # if this file doesn't exist yet, create it and add one line:
<host-name> # the host-name of your Node 2, which runs the SecondaryNameNode

4.      slaves # if this file doesn't exist yet, create it and add your host-names 
(one per line):
<host-name> # the host-name of your Node 3, running DataNode 1
<host-name> # the host-name of your Node 4, running DataNode 2
….
<host-name> # the host-name of your Node N, running DataNode N
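For illustration, the masters and slaves files could end up looking something like this (the EC2-internal host-names below are made-up placeholders):
# masters (one line, the SNN host)
ip-10-0-0-2.ec2.internal
# slaves (one DataNode/TaskTracker host per line)
ip-10-0-0-3.ec2.internal
ip-10-0-0-4.ec2.internal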


5.      If you are not comfortable touching hdfs-site.xml, no problem: after you 
format your NameNode, it will create the dfs/name, dfs/data, etc. folder structure 
in the local Linux default /tmp/hadoop-hdfs/ directory. You can later change this 
to a different folder by specifying it in hdfs-site.xml, but please first learn the 
file structure, permissions and owners of the directories (dfs/data, dfs/name, 
dfs/namesecondary, etc.) that were created for you by default.
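For reference, here is a minimal sketch of what those hdfs-site.xml properties could look like once you decide to move off /tmp (the paths below are assumptions, and newer Hadoop versions use the names dfs.namenode.name.dir / dfs.datanode.data.dir instead):
<property>
  <name>dfs.name.dir</name>
  <value>/data/1/dfs/name</value> # assumed path; NameNode metadata, relevant on Node1 only
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/data/1/dfs/data</value> # assumed path; DataNode block storage, relevant on Nodes 3..N
</property>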

Let’s format the HDFS namespace (note we format it as the hdfs user):
$ sudo -u hdfs hadoop namenode -format
NOTE: you run this command ONCE, and on the NameNode only!

I only added the following property to my hdfs-site.xml on the NameNode (Node 1) 
for the SecondaryNameNode to use:

<property>
  <name>dfs.namenode.http-address</name>
  <value>namenode.host.address:50070</value>   # I changed this to 0.0.0.0:50070 for the EC2 environment
  <description>
    Needed for running the SNN.
    The address and the base port on which the dfs NameNode Web UI will listen.
    If the port is 0, the server will start on a free port.
  </description>
</property>

There are other SNN-related properties for hdfs-site.xml as well.
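For completeness, those SNN-related properties are along these lines (a sketch only; the exact names and defaults depend on your Hadoop version, e.g. newer releases use dfs.namenode.checkpoint.period / dfs.namenode.checkpoint.dir):
<property>
  <name>fs.checkpoint.period</name>
  <value>3600</value> # how often, in seconds, the SNN merges the edits log into a new fsimage
</property>
<property>
  <name>fs.checkpoint.dir</name>
  <value>/tmp/hadoop-hdfs/dfs/namesecondary</value> # where the SNN keeps its checkpoint; assumed default-style path
</property>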

6.      Copy your /etc/hadoop/conf.cluster/ folder to every node in your cluster: 
Node 2 (SNN) and Nodes 3, 4, .., N (DNs + TTs). Make sure the conf soft link on 
each node points to this directory (see above).
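A quick sketch of that copy, assuming passwordless SSH and placeholder names (copy to a temp location first if you cannot write to /etc/hadoop/ remotely as your SSH user):
$ scp -r /etc/hadoop/conf.cluster <user>@<node-host-name>:/tmp/
# then, on that node: sudo mv /tmp/conf.cluster /etc/hadoop/ and re-run the alternatives commands from above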

7.      Now we are ready to start the daemons:
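The exact service names depend on your distribution and MR version; on a CDH-style MRv1 install they look roughly like this (treat these as a sketch, not gospel):
# On Node1 (NameNode + JobTracker):
$ sudo service hadoop-hdfs-namenode start
$ sudo service hadoop-0.20-mapreduce-jobtracker start
# On Node2 (SNN):
$ sudo service hadoop-hdfs-secondarynamenode start
# On Nodes 3..N (DataNodes + TaskTrackers):
$ sudo service hadoop-hdfs-datanode start
$ sudo service hadoop-0.20-mapreduce-tasktracker start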

Every time you start a daemon, a log report is written. This is the FIRST place to 
look for potential problems.
Unless you change the relevant property in hadoop-env.sh, found in your 
/etc/hadoop/conf.cluster/ directory (namely "#export 
HADOOP_LOG_DIR=/foo/bar/whatever"), the default logs are written on each 
respective node to:
NameNode, DataNode, SecondaryNameNode – the "/var/log/hadoop-hdfs/" directory
JobTracker, TaskTracker – "/var/log/hadoop-mapreduce/" or 
"/var/log/hadoop-0.20-mapreduce/" or similar, depending on the version of your MR.
Each daemon writes to its own file ending in .log.
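For example, to peek at the NameNode log on Node1 (the exact file name pattern varies by install; this is just an assumed example):
$ tail -n 50 /var/log/hadoop-hdfs/hadoop-hdfs-namenode-*.log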

I came across a lot of errors while playing with this, as follows:
a.      Error: connection refused
This is normally caused by your firewall. Try running "sudo 
/etc/init.d/iptables status". I bet it is running. Solution: either allow the 
required ports (see the iptables sketch a little further below) or temporarily 
turn off iptables by running "sudo service iptables stop".
Try to restart the daemon that is being refused the connection and check its 
respective /var/log/... .log file (DataNode, TaskTracker, etc.) again.
This solved my problems with connections. You can test a connection by running 
"telnet <ip-address> <port>" against the node you are trying to connect to.
b.      Binding exception.
This happens when you try to start a daemon on a machine that is not supposed to 
run it. For example, trying to start the JobTracker on a slave machine: the 
JobTracker is already running on your master (Node 1), hence the binding exception.
c.      Java heap size or Java child exceptions were thrown when I ran on too small 
an instance on EC2. Increasing the instance size (from tiny to small, or from small 
to medium) solved the issue.
d.      A DataNode running on a slave throws an exception about a DataNode id 
mismatch. This happened when I tried to duplicate an instance on EC2 and, as a 
result, ended up with two different DataNodes sharing the same id. Deleting the 
/tmp/hadoop-hdfs/dfs/data directory on the replicated instance and restarting the 
DataNode daemon solved this issue.
Now that you have fixed the above errors and restarted the respective daemons, your 
.log files should be clean of any errors.
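Here is the kind of iptables rule I mean for the "connection refused" case (a sketch only; 8020/8021 are the NameNode/JobTracker ports used in this write-up, and your real port list will be longer, e.g. the web UI and DataNode ports):
$ sudo iptables -I INPUT -p tcp --dport 8020 -j ACCEPT # NameNode RPC
$ sudo iptables -I INPUT -p tcp --dport 8021 -j ACCEPT # JobTracker RPC
$ sudo service iptables save # persist the rules on RHEL/CentOS-style systems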

Let’s now check that all of our DataNodes 1, 2, .., N (Nodes 3, 4, .., N) are 
registered with the master NameNode (Node 1):
$ hadoop dfsadmin -printTopology
This should display all the host addresses you listed in your conf.cluster/slaves 
file.


8.      Let’s create some structure inside HDFS.
It is very IMPORTANT to create the HDFS /tmp directory. Create it AFTER HDFS is up 
and running:
$ sudo -u hdfs hadoop fs -mkdir /tmp
$ sudo -u hdfs hadoop fs -chmod -R 1777 /tmp

Create the MapReduce /var directories (note that YARN requires a different structure):
$ sudo -u hdfs hadoop fs -mkdir /var
$ sudo -u hdfs hadoop fs -mkdir /var/lib
$ sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs
$ sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache
$ sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache/mapred
$ sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache/mapred/mapred
$ sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
$ sudo -u hdfs hadoop fs -chmod 1777 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
$ sudo -u hdfs hadoop fs -chown -R mapred /var/lib/hadoop-hdfs/cache/mapred

Verify the HDFS File Structure
$ sudo -u hdfs hadoop fs -ls -R /

You should see:
drwxrwxrwt   - hdfs     supergroup          0 2012-04-19 15:14 /tmp
drwxr-xr-x   - hdfs     supergroup          0 2012-04-19 15:16 /var
drwxr-xr-x   - hdfs     supergroup          0 2012-04-19 15:16 /var/lib
drwxr-xr-x   - hdfs     supergroup          0 2012-04-19 15:16 /var/lib/hadoop-hdfs
drwxr-xr-x   - hdfs     supergroup          0 2012-04-19 15:16 /var/lib/hadoop-hdfs/cache
drwxr-xr-x   - mapred   supergroup          0 2012-04-19 15:19 /var/lib/hadoop-hdfs/cache/mapred
drwxr-xr-x   - mapred   supergroup          0 2012-04-19 15:29 /var/lib/hadoop-hdfs/cache/mapred/mapred
drwxrwxrwt   - mapred   supergroup          0 2012-04-19 15:33 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging

Create a Home Directory for Each MapReduce User
It is best to do this on the NameNode; for example:
$ sudo -u hdfs hadoop fs -mkdir  /user/<user>
$ sudo -u hdfs hadoop fs -chown <user> /user/<user>
where <user> is the Linux username of each user.


p.s. Whenever you need to add more nodes running a DataNode/TaskTracker:
1. Check your firewall (iptables): is it running, and which ports are allowed?
2. Add the new node's hostname (from "$ hostname -f") to 
/etc/hadoop/conf.cluster/slaves on the NameNode (Node 1) ONLY!
3. Start the DataNode + TaskTracker on the newly added node.
4. Restart the NameNode / JobTracker on Node 1.
5. Check that your DataNode registered by running "hadoop dfsadmin 
-printTopology".
6. If I am duplicating an EC2 instance that is currently running a DataNode, before 
I start the above two daemons I make sure I delete the data inside the 
/var/log/hadoop-hdfs, /var/log/hadoop-mapreduce and /tmp/hadoop-hdfs folders. 
Starting the DataNode and TaskTracker daemons will recreate fresh files.

Happy Hadooping.
