Hi,

I've been posting questions to the mailing list quite often lately, and here goes another one, this time about data locality. I read the excellent blog post about data locality that Lars George wrote at http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html
I understand data locality in HBase as placing a region on the region server where most of its data blocks reside. That way fast data access is guaranteed when running an MR job, because each map/reduce task is run for one region, on the tasktracker where that region is co-located.

But what if the data blocks of the region are evenly spread over multiple region servers? Does an MR task have to access the data blocks remotely from the other region servers? How good is HBase at locating data blocks where a region resides?

Also, is it correct to say that data locality gets worse if I set a smaller data block size, and gets better if the data block size gets bigger?

Best regards,

--
*Benjamin Kim*
*benkimkimben at gmail*
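
P.S. To make the question concrete, here is a toy sketch of how I picture locality. It is only my mental model, not HBase or HDFS code: the host names, the replica layout, and the helper names `locality_index` and `best_host` are all made up for illustration. It treats a region's locality on a host as the fraction of the region's blocks that have a replica on that host.

```python
from collections import Counter


def locality_index(block_replicas, host):
    """Fraction of a region's blocks with at least one replica on `host`.

    `block_replicas` is a list of sets; each set holds the (hypothetical)
    host names that store a replica of one block.
    """
    if not block_replicas:
        return 0.0
    local = sum(1 for replicas in block_replicas if host in replicas)
    return local / len(block_replicas)


def best_host(block_replicas):
    """Host that stores replicas of the most blocks of the region."""
    counts = Counter(h for replicas in block_replicas for h in replicas)
    host, _ = counts.most_common(1)[0]
    return host


# Made-up layout: four blocks of one region, three replicas each.
blocks = [
    {"rs1", "rs2", "rs3"},
    {"rs1", "rs4", "rs5"},
    {"rs2", "rs4", "rs6"},
    {"rs1", "rs3", "rs6"},
]

print(best_host(blocks))              # rs1
print(locality_index(blocks, "rs1"))  # 0.75
print(locality_index(blocks, "rs5"))  # 0.25
```

In this picture a task running on rs1 would read one of the four blocks remotely, while a task on rs5 would read three of them remotely. That remote-read case is what my questions above are about.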
