I'm not sure what you're attempting to do with your data. 

There are a couple of things to look at.

Looking at the issue, you have a (K,V) pair: key, value. 
But the value isn't necessarily a single element; it could be a set of 
elements. 

You have to weigh two options: store versions of a cell, using the 
timestamp to indicate the revision of the data (there are some design issues 
with this concept), or incorporate the timestamp into your column name. 
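The second option can be sketched in plain Java as building the timestamp into the column qualifier. The column name ("price") and the fixed 19-digit zero-padding are assumptions for illustration, not from the thread:

```java
// Sketch: encode the timestamp into the column qualifier instead of relying
// on cell versions. Reversing the timestamp (Long.MAX_VALUE - ts) makes the
// newest entry sort first lexicographically; zero-padding to 19 digits keeps
// all reversed timestamps the same width so string order matches numeric order.
public class QualifierWithTimestamp {
    static String qualifierFor(String column, long timestampMillis) {
        return String.format("%s_%019d", column, Long.MAX_VALUE - timestampMillis);
    }

    public static void main(String[] args) {
        String older = qualifierFor("price", 1340000000000L);
        String newer = qualifierFor("price", 1341000000000L);
        // The newer timestamp yields the lexicographically smaller qualifier.
        System.out.println(newer.compareTo(older) < 0);
    }
}
```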

While a get() is really a scan() that returns one row, it should still be 
faster than what you are experiencing. 
Schema design is a bit tricky to master because it's going to depend on your 
data as well as your use case. 


On Jun 25, 2012, at 2:32 AM, Marcin Cylke wrote:

> On 21/06/12 14:33, Michael Segel wrote:
>> I think the version issue is the killer factor here. 
>> Usually performing a simple get() where you are getting the latest version 
>> of the data on the row/cell occurs in some constant time k. This is constant 
>> regardless of the size of the cluster and should scale in a near linear 
>> curve.  
>> 
>> As JD C points out, if you're storing temporal data, you should make time 
>> part of your schema. 
> 
> I've rewritten my job to load data and not fill individual timestamps
> for columns, but rather add timestamp to rowkey. Now it looks like this
> 
> [previous key][Long.MAX_VALUE - timestamp]
> (without the brackets)
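The composite-key construction described here can be sketched in plain Java. Zero-padding the reversed timestamp to a fixed 19-digit width is an assumption; without a fixed width, lexicographic rowkey order does not reliably match time order:

```java
// Sketch of the composite rowkey [previous key][Long.MAX_VALUE - timestamp].
// The prefix value is taken from the thread; the timestamps are made up.
public class CompositeKey {
    static String rowKey(String previousKey, long timestampMillis) {
        // %019d pads the reversed timestamp so all suffixes are equal width.
        return previousKey + String.format("%019d", Long.MAX_VALUE - timestampMillis);
    }

    public static void main(String[] args) {
        String newer = rowKey("488892772259", 1340600000001L);
        String older = rowKey("488892772259", 1340600000000L);
        // Under the same prefix, newer events sort first (smaller key).
        System.out.println(newer.compareTo(older) < 0);
    }
}
```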
> 
> My keys look like this now:
> 
> 488892772259223372035596613844
> 
> and I'm issuing a scan like this:
> 
> Scan scan = new Scan(Bytes.toBytes("488892772259"));
> scan.setMaxVersions(1);
> 
> So I'm searching for my key without timestamp part added. What I'm
> getting back is all the rows that start with "488892772259".
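One way to bound such a prefix scan is to pass an explicit stop row, computed by incrementing the last byte of the prefix (this assumes the last byte is not 0xFF). A minimal sketch of that bound, using the prefix from the thread:

```java
// Sketch: compute an exclusive stop row for a prefix scan, so the scan ends
// right after the last row that starts with the prefix.
public class PrefixStopRow {
    static byte[] stopRowFor(byte[] prefix) {
        byte[] stop = java.util.Arrays.copyOf(prefix, prefix.length);
        stop[stop.length - 1]++; // exclusive upper bound; assumes last byte != 0xFF
        return stop;
    }

    public static void main(String[] args) {
        byte[] start = "488892772259".getBytes();
        byte[] stop = stopRowFor(start);
        System.out.println(new String(stop)); // ':' follows '9' in ASCII
    }
}
```

With the HBase client this would become `new Scan(startRow, stopRow)` followed by `scan.setMaxVersions(1)`.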
> 
> Now the performance is even worse than before (with versioned data).
> 
> What I'm also observing is the "hugeness" of my tables and influence of
> compression on the performance:
> 
> My initial data - stored in Hive table - is ~ 1.5GB. When I load it into
> HBase it takes ~8GB. Compressing my ColumnFamily with LZO gets the size
> down to ~1.5GB, but it also dramatically reduces performance.
> 
> To sum up, here are rough times of execution and rates of requests that
> I've been observing (for each option I've added GET/SCAN throughput and
> rough execution time):
> 
> - versioned data (uncompressed table)
>    - with misses (asking for non-existent key) - ~400 gets/sec - ~1h
>    - with hits (asking for existing keys) - ~150gets/sec - ~20h
> - single version (with complex key)
>    - uncompressed - ~30 scans/sec - ~25h
>    - compressed with LZO - ~15 scans/sec - ~30h
> 
> If that would be necessary I could provide complete data - with time
> distribution of the number of gets/scans.
> 
> These performance issues are very strange to me - do you have any
> suggestions as to what's causing such a big increase in execution time?
> 
> Regards
> Marcin
> 
