RE: schema help

Jimson K. James Thu, 25 Aug 2011 20:34:55 -0700

Hi Ian,

Could you point me to a reference on the key-sorted architecture in
HBase?
There doesn't seem to be much documentation out there.

-----Original Message-----
From: Ian Varley [mailto:[email protected]] 
Sent: Thursday, August 25, 2011 8:33 PM
To: [email protected]
Subject: Re: schema help

The rows don't need to be inserted in order; they're maintained in
key-sorted order on the disk based on the architecture of HBase, which
stores data sorted in memory and periodically flushes to immutable files
in HDFS (which are later compacted to make read access more efficient).
HBase keeps track of which physical files might contain a given key
range, and only reads the ones it needs to.
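
As a rough illustration of that write and read path, here is a toy
sketch in plain Java. It is not HBase code: the class and method names
(SortedStoreSketch, StoreFile, flush, compact) are made up for this
example, String keys stand in for byte-array row keys, and real HBase
adds a write-ahead log, regions, block indexes and much more. It only
shows the idea of a sorted in-memory buffer, immutable sorted files
that remember their key range, and reads that skip files whose range
cannot contain the key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy model of a key-sorted store; all names here are hypothetical. */
class SortedStoreSketch {
    /** An immutable, key-sorted "file" that knows its own key range. */
    static final class StoreFile {
        final TreeMap<String, String> rows;

        StoreFile(TreeMap<String, String> rows) {
            this.rows = new TreeMap<>(rows); // defensive copy, never modified again
        }

        boolean mightContain(String key) {
            return rows.firstKey().compareTo(key) <= 0
                && rows.lastKey().compareTo(key) >= 0;
        }
    }

    private final TreeMap<String, String> memstore = new TreeMap<>();
    private final List<StoreFile> files = new ArrayList<>();
    int filesRead = 0; // counts how many files a lookup actually opened

    /** Writes arrive in any order; the TreeMap keeps them sorted. */
    void put(String key, String value) {
        memstore.put(key, value);
    }

    /** Write the sorted in-memory rows out as a new immutable file. */
    void flush() {
        if (memstore.isEmpty()) return;
        files.add(new StoreFile(memstore));
        memstore.clear();
    }

    /** Check memory first, then only files whose key range covers the key. */
    String get(String key) {
        if (memstore.containsKey(key)) return memstore.get(key);
        for (int i = files.size() - 1; i >= 0; i--) { // newest file first
            StoreFile f = files.get(i);
            if (!f.mightContain(key)) continue;
            filesRead++;
            String value = f.rows.get(key);
            if (value != null) return value;
        }
        return null;
    }

    /** Merge all files into one so later reads touch fewer files. */
    void compact() {
        TreeMap<String, String> merged = new TreeMap<>();
        for (StoreFile f : files) merged.putAll(f.rows); // newer files overwrite older
        files.clear();
        if (!merged.isEmpty()) files.add(new StoreFile(merged));
    }

    int fileCount() {
        return files.size();
    }

    public static void main(String[] args) {
        SortedStoreSketch store = new SortedStoreSketch();
        store.put("b", "2");
        store.put("a", "1"); // inserted out of order on purpose
        store.flush();
        store.put("z", "26");
        store.flush();
        System.out.println(store.get("a") + " found after reading "
            + store.filesRead + " file(s)");
    }
}
```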

To do a query through the Java API, you could create a scanner with a
start row that is the concatenation of your value for fieldA and the
start time, and a stop row that is the same fieldA value concatenated
with the current time.

http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html
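
One way the start and stop rows might be built is sketched below in
plain Java, with no HBase dependency so it runs on its own. The key
layout is an assumption for illustration, not something the thread
specifies: fieldA as UTF-8 bytes, a 0x00 separator (so "zCORE" and
"zCORE2" don't interleave; this assumes fieldA never contains a zero
byte), then the epoch time as an 8-byte big-endian long. The TreeMap
with an unsigned byte comparator stands in for the table. In the real
client the two byte arrays would be what you hand to Scan as start and
stop row, and since the stop row there is exclusive, the sketch uses
"now + 1" to include rows stamped exactly "now". RowKeySketch, rowKey
and scan are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

/** Builds [fieldA][time] row keys and simulates a range scan over them. */
class RowKeySketch {
    /** Unsigned lexicographic byte order, the order row keys are sorted in. */
    static final Comparator<byte[]> BYTE_ORDER = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) return diff;
        }
        return a.length - b.length;
    };

    /** Assumed layout: fieldA bytes, a 0x00 separator, 8-byte big-endian time. */
    static byte[] rowKey(String fieldA, long epochSeconds) {
        byte[] prefix = fieldA.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(prefix.length + 1 + Long.BYTES)
            .put(prefix)
            .put((byte) 0)
            .putLong(epochSeconds) // ByteBuffer is big-endian by default
            .array();
    }

    /** Returns values for fieldA with from <= time <= to, in key order. */
    static List<String> scan(TreeMap<byte[], String> table,
                             String fieldA, long from, long to) {
        byte[] startRow = rowKey(fieldA, from);   // inclusive
        byte[] stopRow = rowKey(fieldA, to + 1);  // exclusive
        return new ArrayList<>(table.subMap(startRow, stopRow).values());
    }

    public static void main(String[] args) {
        TreeMap<byte[], String> table = new TreeMap<>(BYTE_ORDER);
        // Rows are inserted out of order; the sorted map orders them by key.
        table.put(rowKey("zCORE", 1314190000L), "late");
        table.put(rowKey("aCORE", 1314185000L), "other item");
        table.put(rowKey("zCORE", 1314180000L), "too early");
        table.put(rowKey("zCORE", 1314180693L), "start");

        long now = System.currentTimeMillis() / 1000L;
        System.out.println(scan(table, "zCORE", 1314180693L, now));
    }
}
```

Because the time is stored big-endian and is non-negative, byte order
and numeric order agree, which is what lets the scan stop as soon as it
passes the stop row instead of reading the whole table.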

Ian

On Aug 25, 2011, at 9:53 AM, Rita wrote:

Thanks for your response.

30 million rows is the best case :-)

A couple of questions about using [fieldA][time] as my key:
  Would I have to insert in order?
  If not, how would HBase know not to scan the entire table?
  What would a query actually look like, if my key was [fieldA time]?

As a matter of fact, I can make 100% of my queries fit this form; I will
leave the other 5% out of my project/schema.

On Thu, Aug 25, 2011 at 10:13 AM, Ian Varley
<[email protected]<mailto:[email protected]>> wrote:
Rita,

There's no need to create separate tables here--the table is really just
a "namespace" for keys. A better option would probably be having one
table with "[fieldA][time]" (the two fields concatenated) as your row
key. Then, you can seek directly to the start of your records in
constant time, and then scan forward until you get to the end of the
data (linear time in the size of data you expect to get back).

The downside of this is that for the 5% of your queries that aren't in
this form, you may have to do a full table scan. (Alternately, you could
also maintain secondary indexes that help you get the data back with
less than a full table scan; that would depend on the nature of the
queries).
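
To make the secondary-index idea concrete, here is a minimal sketch in
plain Java. It assumes the index is simply a second sorted table that
the application writes alongside the primary one, keyed by the field
the odd queries filter on (fieldB here, chosen only as an example) and
pointing back at the primary row key. The names (SecondaryIndexSketch,
byFieldB, findByFieldB) are hypothetical, String keys with a
zero-padded time replace byte keys for readability, and keeping the two
tables consistent when a write fails is not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Primary table keyed by [fieldA][time] plus an index keyed by [fieldB][time]. */
class SecondaryIndexSketch {
    final TreeMap<String, String> primary = new TreeMap<>();  // row key -> data
    final TreeMap<String, String> byFieldB = new TreeMap<>(); // index key -> row key

    /** Zero-pad so that string order matches numeric order. */
    static String time(long epochSeconds) {
        return String.format("%010d", epochSeconds);
    }

    /** Every write updates both the primary table and the index. */
    void put(String fieldA, boolean fieldB, long t, String data) {
        String rowKey = fieldA + "|" + time(t);
        primary.put(rowKey, data);
        byFieldB.put(fieldB + "|" + time(t) + "|" + fieldA, rowKey);
    }

    /** Range-scan the index, then fetch each matching row from the primary. */
    List<String> findByFieldB(boolean fieldB, long from, long to) {
        String start = fieldB + "|" + time(from);   // inclusive
        String stop = fieldB + "|" + time(to + 1);  // exclusive
        List<String> out = new ArrayList<>();
        for (String rowKey : byFieldB.subMap(start, stop).values()) {
            out.add(primary.get(rowKey));
        }
        return out;
    }

    public static void main(String[] args) {
        SecondaryIndexSketch tables = new SecondaryIndexSketch();
        tables.put("zCORE", true, 1314180693L, "row 1");
        tables.put("aCORE", false, 1314180700L, "row 2");
        tables.put("bCORE", true, 1314180800L, "row 3");
        System.out.println(tables.findByFieldB(true, 1314180000L, 1314181000L));
    }
}
```

The trade-off is extra writes and storage in exchange for avoiding the
full table scan on the queries the main row key cannot serve.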

In general, a good rule of thumb when designing a schema in HBase is,
think first about how you'd ideally like to access the data. Then
structure the data to match that access pattern. (This is obviously not
ideal if you have lots of different access patterns, but then, that's
what relational databases are for. Most commercial relational DBs
wouldn't blink at doing analytical queries against 30 million rows.)

Ian

On Aug 25, 2011, at 9:03 AM, Rita wrote:

Hello,

I am trying to solve a time related problem. I can certainly use
opentsdb for this but was wondering if anyone had a clever way to
create this type of schema.

I have an inventory table,

time (unix epoch), fieldA, fieldB, data

There are about 30 million of these entries.

95% of my queries will look like this:
show me where fieldA=zCORE from range [1314180693 to now]

For fieldA, there are about 4000 possible unique values.
For fieldB, there are 2 possible unique values (bool).

So, I was thinking of creating 4000*2 tables and placing the data like
that so I can easily scan.

Any thoughts about this? Will HBase freak out if I have 8000 tables?

--
--- Get your facts first, then you can distort them as you please.--


