performance using versions as dimension

Oliver Meyn Wed, 19 May 2010 13:53:27 -0700

Hi All,

I'm new to hbase and columnar storage schemas, so any comments you have on the schema or the actual problem at hand are very much welcome. I'm using 0.20.4, initially testing as standalone on my development laptop (OS X), all settings default except for data directory, and accessing hbase through the Java api.

In my initial testing I have 50 Gateways, each of which is responsible for 100 unique Devices, each of which reports its power usage every second. So that's 5000 unique Devices in total. Here are the queries I'd like to answer:


1) What is the current power consumption of Device X?

2) What is the average power consumption of Device X between Date 1 and Date 2?

3) What is the current power consumption at Gateway Y?

4) What is the average power consumption at Gateway Y between Date 1 and Date 2?

I'm imagining this as two tables - "devices" and "gateways". The devices table has a column family called "device_samples" which has only one column, "power", and 5000 rows (one for each device). Every new sample gets written to the power column of its device at the timestamp from the original sample sent by the Device. Now I can answer query 1 with a simple get, and I can answer query 2 using the api setTimeRange call on another simple get (and do my own math to average the results). This works great so far - with 50k versions in each cell, query 1 takes less than 50ms, and query 2 takes only marginally more (on my dev machine, remember).
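To make the "do my own math" step concrete, here is a minimal sketch that models one versioned cell as an in-memory sorted map from timestamp to power reading. It is not HBase client code: the class and method names are hypothetical, and the map only stands in for the versions a get would return. The half-open range [minStamp, maxStamp) is chosen to mirror how setTimeRange treats its bounds.

```java
import java.util.NavigableMap;
import java.util.OptionalDouble;
import java.util.TreeMap;

// In-memory stand-in for one versioned "power" cell (timestamp -> reading).
// All names here are hypothetical; this does not call the HBase API.
final class DeviceCellSketch {

    // Query 1: the current reading is the version with the largest timestamp.
    static OptionalDouble latest(NavigableMap<Long, Double> versions) {
        return versions.isEmpty()
                ? OptionalDouble.empty()
                : OptionalDouble.of(versions.lastEntry().getValue());
    }

    // Query 2: average of all versions with minStamp <= timestamp < maxStamp.
    // Returns an empty OptionalDouble when no version falls in the range.
    static OptionalDouble averageInRange(NavigableMap<Long, Double> versions,
                                         long minStamp, long maxStamp) {
        return versions.subMap(minStamp, true, maxStamp, false)
                .values().stream()
                .mapToDouble(Double::doubleValue)
                .average();
    }

    public static void main(String[] args) {
        NavigableMap<Long, Double> power = new TreeMap<>();
        power.put(1000L, 10.0);
        power.put(2000L, 20.0);
        power.put(3000L, 30.0);

        System.out.println(latest(power));                       // newest reading
        System.out.println(averageInRange(power, 1000L, 3000L)); // excludes t=3000
    }
}
```

The average is computed client-side over every version in the range, so the cost of query 2 grows with the number of samples between the two dates.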

The gateways table could just hold the list of its device ids, and then I would have to manually fetch its 100 device entries from the devices table, but that proves to be quite slow. So at the cost of disk space I tried a schema such that it has a cf "gateway_samples" where each row is a gateway id (so exactly 50 rows), and it has a column for each of its 100 devices (so each row has 100 columns, but the cf has 5000 columns). Each sample is written to those cells in the same way as in the devices table. Then I should be able to answer query 3 with a "get latest versions from the whole row" and do my own sums, and similarly query 4. In practice, though, this works as expected (50ms) only with very little data in the gateways table (50k total keyvalues); once I've run the devices for a bit (~1.5M total keyvalues) a single row fetch takes 600ms.
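The "do my own sums" step for the gateway row can be sketched the same way, with the row modelled as an in-memory map from device column to its versions. Again, this is not HBase client code and every name is hypothetical. For query 4 the sketch assumes that the gateway's average power is the sum of the per-device averages, which holds only if each device reports at a steady rate over the range.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// In-memory stand-in for one gateway row: device column -> (timestamp -> reading).
// All names here are hypothetical; this does not call the HBase API.
final class GatewayRowSketch {

    // Query 3: sum the newest reading of every device column in the row.
    static double currentTotal(Map<String, NavigableMap<Long, Double>> row) {
        double total = 0.0;
        for (NavigableMap<Long, Double> versions : row.values()) {
            if (!versions.isEmpty()) {
                total += versions.lastEntry().getValue();
            }
        }
        return total;
    }

    // Query 4 (assumption: steady reporting): sum of per-device averages over
    // minStamp <= timestamp < maxStamp. Devices with no samples contribute 0.
    static double averageTotal(Map<String, NavigableMap<Long, Double>> row,
                               long minStamp, long maxStamp) {
        double total = 0.0;
        for (NavigableMap<Long, Double> versions : row.values()) {
            total += versions.subMap(minStamp, true, maxStamp, false)
                    .values().stream()
                    .mapToDouble(Double::doubleValue)
                    .average()
                    .orElse(0.0);
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, NavigableMap<Long, Double>> row = new HashMap<>();
        NavigableMap<Long, Double> dev1 = new TreeMap<>();
        dev1.put(1000L, 10.0);
        dev1.put(2000L, 20.0);
        NavigableMap<Long, Double> dev2 = new TreeMap<>();
        dev2.put(1000L, 5.0);
        dev2.put(2000L, 7.0);
        row.put("device-1", dev1);
        row.put("device-2", dev2);

        System.out.println(currentTotal(row));
        System.out.println(averageTotal(row, 1000L, 3000L));
    }
}
```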

Granted, these are performance numbers from a dev machine with hbase running in standalone mode, so they have no bearing on reality. But it feels like I'm doing something wrong when the devices table responds very quickly and the gateways table doesn't. I've tried moving hbase to an old linux machine with the client still running from my dev machine and got basically the same results, with a bit of extra time for the network.


Any and all advice is appreciated.

Thanks,
Oliver
