Hi,
We are currently doing a proof of concept (POC) for HBase in our system.
We have written a bulk upload job that loads our data from a text file
into HBase. We are using a 3-node cluster: one master that also works as
a slave (running NameNode, JobTracker, HMaster, DataNode, TaskTracker,
HQuorumPeer and HRegionServer) and 2 slaves (each running DataNode,
TaskTracker, HQuorumPeer and HRegionServer). The problem is
that we are getting lower performance from the distributed cluster than
we were getting from a single-node pseudo-distributed setup. The
upload takes about 30 minutes on an individual machine, whereas
it takes 2 hours on the cluster. We have replication set to 3, so
all parts of the data should ideally be available on all nodes, and we
therefore doubt that network latency is the problem. Copying files
between nodes with scp gives about 12 MB/s, which I believe should be
good enough for this to work. Please correct me if I am wrong here. The
nodes are all 4-core machines with 8 GB RAM. We are running 4
simultaneous map tasks on each node, and the job has no reduce phase.
Any help is greatly appreciated.
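
To put the figures above in perspective, here is a rough
back-of-the-envelope sketch in Python. It uses only the numbers quoted
in this mail; the function names are made up for illustration, and the
scp rate is only a proxy for what HDFS and HBase traffic would actually
see. Note that 12 MB/s works out to 96 Mbit/s, which may point to a
100 Mbit/s link, but that is an inference and not something measured.

```python
# Rough arithmetic on the figures quoted in this mail (not a benchmark).
SCP_MB_PER_S = 12          # scp throughput measured between nodes
NODES = 3                  # cluster size
MAP_SLOTS_PER_NODE = 4     # simultaneous map tasks per node
SINGLE_NODE_MINUTES = 30   # upload time on one pseudo-distributed machine
CLUSTER_MINUTES = 120      # upload time on the 3-node cluster


def link_mbit_per_s(mb_per_s):
    """Convert megabytes per second to megabits per second."""
    return mb_per_s * 8


def total_map_slots(nodes, slots_per_node):
    """Number of map tasks that can run at once across the cluster."""
    return nodes * slots_per_node


def slowdown(cluster_minutes, single_minutes):
    """How many times slower the cluster run is than the single-node run."""
    return cluster_minutes / single_minutes


def max_transfer_gb(mb_per_s, minutes):
    """Upper bound on data one link could move in the given time (1 GB = 1024 MB)."""
    return mb_per_s * minutes * 60 / 1024


if __name__ == "__main__":
    print("link speed (Mbit/s):", link_mbit_per_s(SCP_MB_PER_S))
    print("concurrent map tasks:", total_map_slots(NODES, MAP_SLOTS_PER_NODE))
    print("slowdown factor:", slowdown(CLUSTER_MINUTES, SINGLE_NODE_MINUTES))
    print("max GB per link in cluster run:",
          max_transfer_gb(SCP_MB_PER_S, CLUSTER_MINUTES))
```

Comparing the last figure with the actual input size (not stated here)
would show whether the link alone could account for the 2-hour run.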
Thanks,
Hari Shankar