Hi Travis,

Thanks for the suggestions. As it happens I simplified the problem for my original question and now the details start to matter. The gateways actually have some smarts, and will only send a device sample if the device's power consumption has changed "significantly", where that significance is configurable per gateway. That means the latest power sample for a device could have been sent hours (or even days) ago, so scanning for the latest result is trickier than "go back at least one second and you're guaranteed a result". That's why I liked the versions style where the sample on top of the "stack" is the latest, regardless of when that was.
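For reference, under the versions schema "give me the latest sample" really is just a plain get - something like this rough sketch (0.20 Java API; devId stands in for whatever identifies the device row):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

HTable devices = new HTable(new HBaseConfiguration(), "devices");

// The newest version of the cell comes back by default, no matter how
// long ago that sample was actually sent.
Get g = new Get(Bytes.toBytes(devId));
g.addColumn(Bytes.toBytes("device_samples"), Bytes.toBytes("power"));
Result r = devices.get(g);
double latestPower = Bytes.toDouble(
    r.getValue(Bytes.toBytes("device_samples"), Bytes.toBytes("power")));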

If you have any thoughts on that wrinkle, I'm happy to hear them. In the meantime I'm trying out your schema anyway, and another variant in which each sample gets a new column (named after its timestamp) against a row key of gw_id.dev_id.
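For that second variant I mean something along these lines (sketch only - "samples" is a made-up family name, table/gwId/devId/sampleTs/watts are placeholders, and it leans on Bytes.toBytes(long) being big-endian so the qualifiers sort chronologically):

// imports as in the previous snippet, plus org.apache.hadoop.hbase.client.Put

// write: one new column per sample, qualifier = the sample's timestamp
Put p = new Put(Bytes.toBytes(gwId + "." + devId));
p.add(Bytes.toBytes("samples"), Bytes.toBytes(sampleTs), Bytes.toBytes(watts));
table.put(p);

// read the latest: the last (highest) qualifier in the row is the newest sample
Result r = table.get(new Get(Bytes.toBytes(gwId + "." + devId)));
byte[] newest = r.getFamilyMap(Bytes.toBytes("samples")).lastEntry().getValue();
double latest = Bytes.toDouble(newest);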

Thanks,
Oliver

On 20-May-10, at 9:34 AM, Hegner, Travis wrote:

Oliver,

It may just be an assumption I've made, but it seems to me that HBase is more efficient at handling a large number of rows than a large number of timestamps (versions), or even columns for that matter (I think it's on the HBase main page that I read "Billions of rows x Millions of columns x thousands of versions", which is what leads to my assumption).

Perhaps you should consider testing with each datapoint stored as an individual row, with a row id like: <unix_time><gw_id><dev_id> or <unix_time>.<gw_id>.<dev_id>

With that method, you could answer query 1 by finding the last entry for a given "<gw_id><dev_id>", query 3 by getting all of the latest <dev_id>'s for any given gw, and queries 2 and 4 by simply grabbing a range of rows and parsing through the results, since they are already ordered by the timestamp at which they arrived.

This way, you are really only "getting" what you actually need, and to scan for the latest entry of any given device, you're only having to scan through 5000 very small rows at most.
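As a rough sketch of what I have in mind for queries 2 and 4 (0.20 Java API; the "samples" table and the "d"/"power" family/qualifier are just placeholders, startTime/endTime/gwId/devId stand in for the obvious values, and it assumes the <unix_time> part of the key is fixed-width so the rows sort chronologically):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

HTable samples = new HTable(new HBaseConfiguration(), "samples");

// Row keys "<unix_time>.<gw_id>.<dev_id>" sort by time first, so a date
// range is just a row range (the stop row is exclusive, hence endTime + 1).
Scan scan = new Scan(Bytes.toBytes(startTime + "."),
                     Bytes.toBytes((endTime + 1) + "."));
ResultScanner scanner = samples.getScanner(scan);
double sum = 0;
long count = 0;
for (Result r : scanner) {
  String key = Bytes.toString(r.getRow());
  // query 2: keep only this device's rows; for query 4 match on gw_id instead
  if (key.endsWith("." + gwId + "." + devId)) {
    sum += Bytes.toDouble(r.getValue(Bytes.toBytes("d"), Bytes.toBytes("power")));
    count++;
  }
}
scanner.close();
double average = sum / count;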

Just a thought, HTH,

Travis Hegner
http://www.travishegner.com/


-----Original Message-----
From: Oliver Meyn [mailto:[email protected]]
Sent: Thursday, May 20, 2010 8:59 AM
To: [email protected]
Subject: Re: performance using versions as dimension

Thanks for the quick reply Jonathan:

On 19-May-10, at 5:04 PM, Jonathan Gray wrote:

What you have is 100 columns in each row.  Each column has lots of
versions.  You want values from all 100 columns within a specified
TimeRange... correct?


This is the second of my two gateway queries - I understand why this one is slow, and your explanation makes sense.  The first query is a simple "get me the latest version from every column for this row", and that is the one that is, to me, perplexingly slow.  To be clear, there's a good chance that each of those columns will have a different timestamp, but "the latest reading" is what I'm interested in.


I need to think on what the best way to implement this would be,
perhaps with a better understanding now you can too :)


I know it's something of a religious topic, but as of 0.20.4, is using versions as a data dimension legitimate?  Since I could easily approach millions of versions per column, am I in danger of running into the row-split problem mentioned elsewhere (a single row can't be split across regions, and each of my cell values is a double)?  I ask because if that's going to be a problem then I need to rethink my schema anyway, and we don't need to waste cycles on the current problem.

Thanks again,
Oliver


-----Original Message-----
From: Oliver Meyn [mailto:[email protected]]
Sent: Wednesday, May 19, 2010 1:53 PM
To: [email protected]
Subject: performance using versions as dimension

Hi All,

I'm new to hbase and columnar storage schemas, so any comments you
have on the schema or the actual problem at hand are very much
welcome.  I'm using 0.20.4, initially testing as standalone on my
development laptop (OS X), all settings default except for data
directory, and accessing hbase through the Java api.

In my initial testing I have 50 Gateways, each of which is responsible for 100 unique Devices, each of which reports its power usage every second.  So that's 5000 unique Devices in total.  Here are the queries I'd like to answer:

1) What is the current power consumption of Device X?
2) What is the average power consumption of Device X btw Date 1 and
Date 2?
3) What is the current power consumption at Gateway Y?
4) What is the average power consumption at Gateway Y btw Date 1 and
Date 2?

I'm imagining this as two tables - "devices" and "gateways".  The devices table has a column family called "device_samples", which has only one column, "power", and 5000 rows (one for each device).  Every new sample gets written to the power column of its device at the timestamp from the original sample sent by the Device.  Now I can answer query 1 with a simple get, and I can answer query 2 using the API's setTimeRange call on another simple get (and do my own math to average the results).  This works great so far - with 50k versions in each cell, query 1 takes less than 50ms, and query 2 is only marginally more (on my dev machine, remember).
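In code that's roughly the following (a trimmed sketch; devId, sampleTs, watts, date1 and date2 stand in for the obvious values):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

HTable devices = new HTable(new HBaseConfiguration(), "devices");

// each sample is written at the timestamp reported by the device itself
Put p = new Put(Bytes.toBytes(devId));
p.add(Bytes.toBytes("device_samples"), Bytes.toBytes("power"), sampleTs,
    Bytes.toBytes(watts));
devices.put(p);

// query 1: the latest version comes back by default
Get latest = new Get(Bytes.toBytes(devId));
latest.addColumn(Bytes.toBytes("device_samples"), Bytes.toBytes("power"));
Result r1 = devices.get(latest);

// query 2: every version between date1 and date2, averaged client-side
Get range = new Get(Bytes.toBytes(devId));
range.addColumn(Bytes.toBytes("device_samples"), Bytes.toBytes("power"));
range.setTimeRange(date1, date2);
range.setMaxVersions();  // all available versions
Result r2 = devices.get(range);
double sum = 0;
for (KeyValue kv : r2.list()) {
  sum += Bytes.toDouble(kv.getValue());
}
double average = sum / r2.size();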

The gateways table could just hold each gateway's list of device ids, and then I'd have to manually fetch its 100 device entries from the devices table, but that proves to be quite slow.  So, at the cost of disk space, I tried a schema with a cf "gateway_samples" where each row is a gateway id (so exactly 50 rows), and each row has a column for each of its 100 devices (so each row has 100 columns, but the cf has 5000 columns in total).  Each sample is written to those cells in the same way as in the devices table.  Then I should be able to answer query 3 with a "get the latest versions from the whole row" and do my own sums, and similarly query 4.  In practice, though, this works as expected (50ms) with very little data in the gateways table (50k total keyvalues), but once I've run the devices for a bit (~1.5M total keyvalues) a single row fetch takes 600ms.
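The query 3 fetch is basically this (same imports as the previous snippet; gwId is a placeholder):

HTable gateways = new HTable(new HBaseConfiguration(), "gateways");

// the latest version of each of the row's 100 device columns in one get
Get g = new Get(Bytes.toBytes(gwId));
g.addFamily(Bytes.toBytes("gateway_samples"));
Result r = gateways.get(g);
double total = 0;
for (byte[] value : r.getFamilyMap(Bytes.toBytes("gateway_samples")).values()) {
  total += Bytes.toDouble(value);
}

That single get is the call that goes from ~50ms to ~600ms as the table fills up.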

Granted, these are performance numbers from a dev machine with hbase running in standalone mode, so they have no bearing on reality.  But it feels like I'm doing something wrong when the devices table responds very quickly and the gateways table doesn't.  I've tried moving hbase to an old linux machine, with the client still running from my dev machine, and got basically the same results plus a bit of extra time for the network.

Any and all advice is appreciated.

Thanks,
Oliver