On 21/06/12 14:33, Michael Segel wrote:
> I think the version issue is the killer factor here. 
> Usually performing a simple get() where you are getting the latest version of 
> the data on the row/cell occurs in some constant time k. This is constant 
> regardless of the size of the cluster and should scale in a near linear 
> curve.  
> 
> As JD C points out, if you're storing temporal data, you should make time part 
> of your schema. 

I've rewritten my load job so that it no longer sets individual
timestamps on columns, but instead appends the timestamp to the row key.
The key now has this layout:

[previous key][Long.MAX_VALUE-timestamp]
(without the brackets)

My keys look like this now:

488892772259223372035596613844
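
As a minimal sketch of the key layout described above: the class and
method names are hypothetical, and the way the sample key is split here
(an 11-digit previous key plus a timestamp in seconds) is only a guess
from the digits, not something stated in this mail.

```java
final class RowKeys {
    // Hypothetical helper: appends the reversed timestamp to the existing key.
    // For non-negative timestamps of this magnitude the reversed value always
    // has 19 digits, so string order of the suffix matches numeric order.
    static String buildKey(String previousKey, long timestamp) {
        return previousKey + (Long.MAX_VALUE - timestamp);
    }

    public static void main(String[] args) {
        // Assumed split of the sample key: prefix and seconds-based timestamp.
        String older = buildKey("48889277225", 1258161963L);
        String newer = buildKey("48889277225", 1258161964L);

        System.out.println(older);                      // 488892772259223372035596613844
        System.out.println(newer.compareTo(older) < 0); // true: newer sorts first
    }
}
```

Because the timestamp is subtracted from Long.MAX_VALUE, the most recent
entry for a given prefix sorts first.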

and I'm issuing a scan like this:

// Scan takes the start row as byte[] (org.apache.hadoop.hbase.util.Bytes)
Scan scan = new Scan(Bytes.toBytes("488892772259"));
scan.setMaxVersions(1);

So I'm scanning on my key without the timestamp part appended. What I'm
getting back is all the rows that start with "488892772259".
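
For reference, a scan that is given only a start row has no upper bound
of its own. One common way to bound a prefix scan is to compute a stop
row from the prefix, sketched below in plain Java. The class and method
names are hypothetical, and whether this has any bearing on the timings
reported in this mail is not established here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class PrefixRange {
    // Returns the smallest row key that sorts after every key starting with
    // the given prefix. An empty array means "no upper bound" (empty prefix
    // or a prefix made only of 0xFF bytes).
    static byte[] stopRowFor(byte[] prefix) {
        int i = prefix.length - 1;
        // Trailing 0xFF bytes cannot be incremented, so skip past them.
        while (i >= 0 && prefix[i] == (byte) 0xFF) {
            i--;
        }
        if (i < 0) {
            return new byte[0];
        }
        byte[] stop = Arrays.copyOf(prefix, i + 1);
        stop[i]++;
        return stop;
    }

    public static void main(String[] args) {
        byte[] start = "488892772259".getBytes(StandardCharsets.UTF_8);
        byte[] stop = stopRowFor(start);
        System.out.println(new String(stop, StandardCharsets.UTF_8)); // 48889277225:
        // With the HBase client, the pair would be passed as new Scan(start, stop).
    }
}
```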

Now the performance is even worse than before (with versioned data).

I'm also observing how large my tables are, and how compression affects
performance:

My initial data, stored in a Hive table, is ~1.5GB. When I load it into
HBase it takes ~8GB. Compressing my ColumnFamily with LZO gets the size
down to ~1.5GB, but it also dramatically reduces performance.

To sum up, here are rough times of execution and rates of requests that
I've been observing (for each option I've added GET/SCAN throughput and
rough execution time):

- versioned data (uncompressed table)
    - with misses (asking for non-existent key) - ~400 gets/sec - ~1h
    - with hits (asking for existing keys) - ~150 gets/sec - ~20h
- single version (with complex key)
    - uncompressed - ~30 scans/sec - ~25h
    - compressed with LZO - ~15 scans/sec - ~30h

If necessary, I can provide the complete data, including the time
distribution of the number of gets/scans.

These performance issues are very strange to me - do you have any
suggestions as to what is causing such a big increase in execution time?

Regards
Marcin
