Hi Lars, thank you very much for your reply. As you said, a region server writes its HFiles so that all of their data blocks end up on the same physical machine, unless the region server fails. I ran the following hadoop command to see the block locations of one HFile:
*hadoop fsck /hbase/testtable/9488ef7fbd23b62b9bf85b722c015e90/testcf/08dc1940944b4952b23f0cbee51bcea8 -files -locations -blocks*

Here is the output:

FSCK started by hadoop from /203.235.211.142 for path /hbase/testtable/9488ef7fbd23b62b9bf85b722c015e90/testcf/08dc1940944b4952b23f0cbee51bcea8 at Thu Jun 21 13:40:11 KST 2012
/hbase/testtable/9488ef7fbd23b62b9bf85b722c015e90/testcf/08dc1940944b4952b23f0cbee51bcea8 727156659 bytes, 11 block(s): OK
0. blk_1832396139416350298_1296638 len=67108864 repl=3 [hadoop-145:50010, hadoop-143:50010, hadoop-144:50010]
1. blk_8910330590545256327_1296640 len=67108864 repl=3 [hadoop-143:50010, hadoop-157:50010, hadoop-159:50010]
2. blk_-3868612696419011016_1296640 len=67108864 repl=3 [hadoop-145:50010, hadoop-143:50010, hadoop-156:50010]
3. blk_-7551946394410945015_1296640 len=67108864 repl=3 [hadoop-145:50010, hadoop-143:50010, hadoop-157:50010]
4. blk_-1875839158119319613_1296640 len=67108864 repl=3 [hadoop-145:50010, hadoop-143:50010, hadoop-146:50010]
5. blk_-6953623390282045248_1296640 len=67108864 repl=3 [hadoop-143:50010, hadoop-157:50010, hadoop-144:50010]
6. blk_-3016727256928339770_1296640 len=67108864 repl=3 [hadoop-143:50010, hadoop-146:50010, hadoop-159:50010]
7. blk_3526351456802007773_1296640 len=67108864 repl=3 [hadoop-143:50010, hadoop-160:50010, hadoop-156:50010]
8. blk_5134681308608742320_1296640 len=67108864 repl=3 [hadoop-145:50010, hadoop-143:50010, hadoop-144:50010]
9. blk_-6875541109589395450_1296640 len=67108864 repl=3 [hadoop-145:50010, hadoop-143:50010, hadoop-156:50010]
10. blk_-553661064097182668_1296640 len=56068019 repl=3 [hadoop-143:50010, hadoop-146:50010, hadoop-159:50010]

Status: HEALTHY
 Total size: 727156659 B
 Total dirs: 0
 Total files: 1
 Total blocks (validated): 11 (avg. block size 66105150 B)
 Minimally replicated blocks: 11 (100.0 %)
 Over-replicated blocks: 0 (0.0 %)
 Under-replicated blocks: 0 (0.0 %)
 Mis-replicated blocks: 0 (0.0 %)
 Default replication factor: 3
 Average block replication: 3.0
 Corrupt blocks: 0
 Missing replicas: 0 (0.0 %)
 Number of data-nodes: 9
 Number of racks: 1
FSCK ended at Thu Jun 21 13:40:11 KST 2012 in 4 milliseconds

As you can see, the data blocks of the HFile are stored across two different datanodes (hadoop-145 and hadoop-143). Say a map task runs on hadoop-145 and needs to access block 7. That task then has to read block 7 remotely from hadoop-143. Almost half of the data blocks are stored, and would be accessed, remotely. Judging from this example, it is hard to say that data locality is really being achieved in HBase.

Ben

On Fri, Jun 15, 2012 at 5:21 PM, Lars George <[email protected]> wrote:

> Hi Ben,
>
> See inline...
>
> On Jun 15, 2012, at 6:56 AM, Ben Kim wrote:
>
> > Hi,
> >
> > I've been posting questions in the mailing list quite often lately, and
> > here goes another one about data locality.
> > I read the excellent blog post about data locality that Lars George
> > wrote at http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html
> >
> > I understand data locality in HBase as locating a region in a
> > region server where most of its data blocks reside.
>
> The opposite is happening, i.e. the region server process causes all the
> data it writes to be located on the same physical machine.
>
> > So that way fast data access is guaranteed when running a MR job,
> > because each map/reduce task is run for a region on the tasktracker
> > where that region is co-located.
>
> Correct.
>
> > But what if the data blocks of the region are evenly spread over
> > multiple region servers?
>
> This will not happen unless the original server fails. Then the region
> is moved to another server that now needs to do a lot of remote reads
> over the network.
> This is why there is work being done to allow for custom placement
> policies in HDFS. That way you could store the entire region and all its
> copies as complete units on three data nodes. In case of a failure you
> could then move the region to one of the two copies. This is not
> available yet, though, but it is being worked on (so I heard).
>
> > Does a MR task have to remotely access the data blocks from other
> > region servers?
>
> For the above failure case, it would be the region server accessing the
> remote data, yes.
>
> > How good is HBase at locating data blocks where a region resides?
>
> That is again the wrong way around. HBase has no clue as to where blocks
> reside, nor does it even know that the file system uses separate blocks.
> HBase stores files; HDFS does the block magic under the hood,
> transparently to HBase.
>
> > Also, is it correct to say that if I set a smaller data block size,
> > data locality gets worse, and if the data block size gets bigger, data
> > locality gets better?
>
> This is not applicable here; I am assuming it stems from the above
> confusion about which system is handling the blocks, HBase or HDFS. See
> above.
>
> HTH,
> Lars
>
> > Best regards,
> >
> > *Benjamin Kim*
> > *benkimkimben at gmail*

--
*Benjamin Kim*
*benkimkimben at gmail*
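P.S. In case it helps to quantify the locality question, here is a small Python sketch (the helper names are my own, not anything from Hadoop or HBase; the block list is copied verbatim from the fsck output above) that parses the replica locations and computes, for a given host, what fraction of the HFile's blocks have a local replica:

```python
import re

# Replica locations copied verbatim from the fsck output above.
FSCK_BLOCKS = """\
0. blk_1832396139416350298_1296638 len=67108864 repl=3 [hadoop-145:50010, hadoop-143:50010, hadoop-144:50010]
1. blk_8910330590545256327_1296640 len=67108864 repl=3 [hadoop-143:50010, hadoop-157:50010, hadoop-159:50010]
2. blk_-3868612696419011016_1296640 len=67108864 repl=3 [hadoop-145:50010, hadoop-143:50010, hadoop-156:50010]
3. blk_-7551946394410945015_1296640 len=67108864 repl=3 [hadoop-145:50010, hadoop-143:50010, hadoop-157:50010]
4. blk_-1875839158119319613_1296640 len=67108864 repl=3 [hadoop-145:50010, hadoop-143:50010, hadoop-146:50010]
5. blk_-6953623390282045248_1296640 len=67108864 repl=3 [hadoop-143:50010, hadoop-157:50010, hadoop-144:50010]
6. blk_-3016727256928339770_1296640 len=67108864 repl=3 [hadoop-143:50010, hadoop-146:50010, hadoop-159:50010]
7. blk_3526351456802007773_1296640 len=67108864 repl=3 [hadoop-143:50010, hadoop-160:50010, hadoop-156:50010]
8. blk_5134681308608742320_1296640 len=67108864 repl=3 [hadoop-145:50010, hadoop-143:50010, hadoop-144:50010]
9. blk_-6875541109589395450_1296640 len=67108864 repl=3 [hadoop-145:50010, hadoop-143:50010, hadoop-156:50010]
10. blk_-553661064097182668_1296640 len=56068019 repl=3 [hadoop-143:50010, hadoop-146:50010, hadoop-159:50010]
"""

def replica_hosts(fsck_text):
    """For each block line, return the set of hostnames holding a replica."""
    blocks = []
    for line in fsck_text.splitlines():
        m = re.search(r"\[([^\]]+)\]", line)
        if m:
            # "hadoop-145:50010" -> "hadoop-145"
            blocks.append({e.split(":")[0].strip() for e in m.group(1).split(",")})
    return blocks

def local_fraction(fsck_text, host):
    """Fraction of the file's blocks that have at least one replica on `host`."""
    blocks = replica_hosts(fsck_text)
    return sum(host in b for b in blocks) / len(blocks)
```

Running it over the listing above shows, for example, that every one of the 11 blocks has a replica on hadoop-143, while hadoop-145 holds only 6 of the 11, which is what the remote-read scenario in my mail is based on.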
