> Date: Fri, 29 Oct 2010 10:01:24 -0700
> Subject: Re: HBase random access in HDFS and block indices
> From: [email protected]
> To: [email protected]
>
> On Fri, Oct 29, 2010 at 6:41 AM, Sean Bigdatafun
> <[email protected]> wrote:
> > I have the same doubt here. Let's say I have a totally random read pattern
> > (uniformly distributed).
> >
> > Now let's assume my total data size stored in HBase is 100 TB across 10
> > machines (not a big deal considering today's disks), and the total size of
> > my region servers' memory is 10 * 6 GB = 60 GB. That translates into a
> > 60 / (100 * 1000) = 0.06% cache hit probability. Under a random read
> > pattern, each read is bound to experience the "open -> read index -> ... ->
> > read data block" sequence, which would be expensive.
> >
> > Any comment?
> >
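Sean's back-of-envelope number checks out; a minimal sketch of the same arithmetic (using his decimal convention, 1 TB = 1000 GB):

```python
# Cache-hit estimate for a uniformly random read pattern, using the
# numbers from the thread: 10 region servers x 6 GB of cache each,
# against 100 TB of data.
cache_gb = 10 * 6        # total block cache across the cluster: 60 GB
data_gb = 100 * 1000     # 100 TB expressed in GB
hit_rate = cache_gb / data_gb
print(f"{hit_rate:.2%}")  # -> 0.06%
```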
>
> If totally random, as per Alvin's suggestion, yes, just turn off block
> caching since it is doing you no good.
>
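For reference, block caching is a per-column-family setting, so it can be switched off for just the randomly-read family. A sketch from the HBase shell ('mytable' and 'cf' are placeholder names; tables of this era had to be disabled before altering):

```
hbase> disable 'mytable'
hbase> alter 'mytable', {NAME => 'cf', BLOCKCACHE => 'false'}
hbase> enable 'mytable'
```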
> But totally random is unusual in practice, no?
>
> St.Ack
Uhm... not exactly.
One of the benefits of HBase is that it should scale in a *near* linear fashion.
So if we don't know how the data is to be accessed, or we know that there are a
couple of access patterns that are orthogonal to each other, putting the data
into the cloud in a 'random' fashion should provide consistent read access
times.
So the design of 'random' stored data shouldn't be that unusual. It just means
you're going to have a couple of different indexes. ;-)
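A toy sketch of the "couple of different indexes" idea, in plain Python rather than the HBase client API; the `user#date` row-key scheme and all names here are invented for illustration. Two orthogonal access patterns (by user, by date) are each served by their own index over the same rows:

```python
# Hypothetical rows keyed as "user#date" -- one natural access path.
rows = {
    "user1#2010-10-29": "event-a",
    "user2#2010-10-28": "event-b",
    "user1#2010-10-27": "event-c",
}

# Index 1: by user (the row-key prefix already serves this pattern).
by_user = {}
for key in rows:
    user, date = key.split("#")
    by_user.setdefault(user, []).append(key)

# Index 2: by date -- in HBase this would typically be a second table
# keyed the other way around, maintained alongside the first.
by_date = {}
for key in rows:
    user, date = key.split("#")
    by_date[date] = key

print(sorted(by_user["user1"]))   # all rows for user1
print(by_date["2010-10-28"])      # row for a given date
```

Each pattern pays for its own index, but reads stay consistent regardless of which dimension the client queries on.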