Thanks for your help. I have taken my replication down to 2, but if I am not mistaken, replication also has the benefit of making the cluster more fault tolerant by duplicating data on different nodes, so that if one goes down the data is not necessarily lost. In that case I would like to keep it at least at 2.
I have set dfs.replication to 2, but the processing time has not changed at all. How could I change my configuration to avoid the hotspot issue you talked about?

As Kevin advised, I have also raised:

hbase.hstore.blockingStoreFiles to 100
hbase.hregion.memstore.block.multiplier to 7
hbase.hregion.memstore.flush.size to 256 MB
hbase.regionserver.optionallogflushinterval to 30s

However, the ImportTsv map phase still takes around 1 minute for 1% of map tasks, so over an hour in total. Currently I have 42 running map tasks and an average of 28 tasks/node, and a lot of my map tasks end up in "failed to report status for 601 seconds".

My cluster is 3 Ubuntu machines: 2 cores, 4 threads, 3.4+ GHz, with 16 GB RAM.

With bulk load the process finishes in around 20 minutes. But I am surprised that it takes more than an hour to insert 5 GB of data into HBase without bulk load. I feel there is something I'm not getting.
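
To put rough numbers on why this seems slow to me, here is a back-of-the-envelope calculation of the average ingest rate. It is only a sketch: it assumes the data is about 5 GB (taken as 5 * 1024 MB), that the non-bulk import takes 60 minutes (it actually takes somewhat longer), and that bulk load takes 20 minutes. The function name throughput_mb_per_s is just a helper made up for this illustration.

```python
# Rough throughput estimate; all figures are the approximate ones quoted above.
DATA_MB = 5 * 1024  # about 5 GB of input, assumed to be 5 * 1024 MB


def throughput_mb_per_s(data_mb, minutes):
    """Average ingest rate in MB/s for a job that runs for the given minutes."""
    return data_mb / (minutes * 60)


put_rate = throughput_mb_per_s(DATA_MB, 60)   # ImportTsv without bulk load, "over an hour"
bulk_rate = throughput_mb_per_s(DATA_MB, 20)  # bulk load, "around 20 minutes"

print(f"without bulk load: {put_rate:.2f} MB/s")
print(f"with bulk load:    {bulk_rate:.2f} MB/s")
print(f"ratio:             {bulk_rate / put_rate:.1f}x")
```

Under those assumptions the whole cluster averages roughly 1.4 MB/s without bulk load, which is under 0.5 MB/s per machine across 3 machines.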
