+1 on everything said so far... Sean, you might also want to check this: http://hbase.apache.org/book.html#architecture
On 8/26/11 2:50 PM, "lars hofhansl" <[email protected]> wrote:

>In a nutshell, a change to HBase is performed like this:
>1. The WAL entry is written and synced to disk.
>2. The memstore is updated (that's just a cache in memory).
>3. When the memstore reaches a certain size, it is flushed to create a
>new file.
>4. When a certain number of files is reached, they are compacted
>(combined into fewer files).
>
>When you do a read, HBase scans the memstore and all relevant store files.
>It does that similarly to what a mergesort does.
>
>-- Lars
>
>________________________________
>From: Sheng Chen <[email protected]>
>To: [email protected]
>Sent: Thursday, August 25, 2011 11:08 PM
>Subject: Re: schema help
>
>If the rows are added with random keys and flushed periodically, is it
>possible that every HFile holds almost the whole key range?
>Will it affect random read performance before the compaction is done?
>
>Thanks.
>
>Sean
>
>2011/8/25 Ian Varley <[email protected]>
>
>> The rows don't need to be inserted in order; they're maintained in
>> key-sorted order on disk based on the architecture of HBase, which
>> stores data sorted in memory and periodically flushes to immutable
>> files in HDFS (which are later compacted to make read access more
>> efficient). HBase keeps track of which physical files might contain a
>> given key range, and only reads the ones it needs to.
>>
>> To do a query through the Java API, you could create a scanner with a
>> startrow that is the concatenation of your value for fieldA and the
>> start time, and an endrow that has the current time.
>>
>> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html
>>
>> Ian
>>
>> On Aug 25, 2011, at 9:53 AM, Rita wrote:
>>
>> Thanks for your response.
>>
>> 30 million rows is the best case :-)
>>
>> A couple of questions about using [fieldA][time] as my key:
>> Would I have to insert in order?
>> If not, how would HBase know to stop scanning the entire table?
>> What would a query actually look like if my key was [fieldA][time]?
>>
>> As a matter of fact, I can do 100% of my queries that way. I will
>> leave the 5% out of my project/schema.
>>
>>
>> On Thu, Aug 25, 2011 at 10:13 AM, Ian Varley <[email protected]
>> <mailto:[email protected]>> wrote:
>> Rita,
>>
>> There's no need to create separate tables here--the table is really
>> just a "namespace" for keys. A better option would probably be having
>> one table with "[fieldA][time]" (the two fields concatenated) as your
>> row key. Then you can seek directly to the start of your records in
>> constant time, and scan forward until you get to the end of the data
>> (linear time in the size of the data you expect to get back).
>>
>> The downside of this is that for the 5% of your queries that aren't in
>> this form, you may have to do a full table scan. (Alternately, you
>> could also maintain secondary indexes that help you get the data back
>> with less than a full table scan; that would depend on the nature of
>> the queries.)
>>
>> In general, a good rule of thumb when designing a schema in HBase is:
>> think first about how you'd ideally like to access the data. Then
>> structure the data to match that access pattern. (This is obviously
>> not ideal if you have lots of different access patterns, but then,
>> that's what relational databases are for. Most commercial relational
>> DBs wouldn't blink at doing analytical queries against 30 million
>> rows.)
>>
>> Ian
>>
>> On Aug 25, 2011, at 9:03 AM, Rita wrote:
>>
>> Hello,
>>
>> I am trying to solve a time-related problem. I can certainly use
>> OpenTSDB for this, but was wondering if anyone had a clever way to
>> create this type of schema.
>>
>> I have an inventory table:
>>
>> time (unix epoch), fieldA, fieldB, data
>>
>> There are about 30 million of these entries.
>>
>> 95% of my queries will look like this:
>> show me where fieldA=zCORE in the range [1314180693 to now]
>>
>> For fieldA, there is a possibility of 4000 unique items.
>> For fieldB, there is a possibility of 2 unique items (bool).
>>
>> So, I was thinking of creating 4000*2 tables and placing the data like
>> that so I can easily scan.
>>
>> Any thoughts about this? Will HBase freak out if I have 8000 tables?
>>
>> --
>> --- Get your facts first, then you can distort them as you please. --
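The merge-style read Lars describes (and the overlap Sean asks about) can be sketched in a few lines of Python. This is an illustration of the idea only, not HBase code; the data and names are made up:

```python
import heapq

# The memstore and each store file are individually sorted by row key; a
# read merges them the way the merge step of a mergesort does. Because
# these rows were written with "random" keys and flushed periodically
# (Sean's scenario), the two files' key ranges overlap, so both must be
# consulted on every read until a compaction combines them.

memstore = [("row03", "v7"), ("row11", "v8")]  # in-memory, sorted cache
hfile1 = [("row01", "v1"), ("row09", "v2")]    # first flush
hfile2 = [("row02", "v3"), ("row10", "v4")]    # second flush; overlaps hfile1

def merged_scan(*sorted_sources):
    """Merge already-sorted (key, value) sequences into one sorted stream."""
    return list(heapq.merge(*sorted_sources))

for key, value in merged_scan(memstore, hfile1, hfile2):
    print(key, value)  # keys emerge in global sorted order
```

Compaction, in these terms, just replaces several overlapping sorted inputs with one merged file, so later reads have fewer sources to consult.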

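The "[fieldA][time]" row key Ian suggests depends on lexicographic (byte) order matching time order, which is why the epoch should be fixed-width. A sketch, again in Python rather than the HBase Java API; the helper names and the "#" separator are illustrative assumptions:

```python
# Build "[fieldA][time]" row keys. Epoch seconds are zero-padded to a
# fixed width so that the byte order HBase sorts by matches numeric
# time order (unpadded, "9" would sort after "10").

def row_key(field_a, epoch_seconds):
    return f"{field_a}#{epoch_seconds:010d}"

def scan_range(field_a, start_epoch, end_epoch):
    """Start (inclusive) and stop (exclusive) row keys for a time-range scan."""
    return row_key(field_a, start_epoch), row_key(field_a, end_epoch + 1)

# Rita's 95% query: fieldA=zCORE from 1314180693 to "now".
start, stop = scan_range("zCORE", 1314180693, 1314190000)

# A scanner seeks straight to `start` and reads forward until it passes
# `stop` -- which is how HBase knows when to stop scanning the table,
# even though rows were inserted in no particular order.
table = sorted([
    row_key("zCORE", 1314180000),  # before the range
    row_key("zCORE", 1314180700),  # inside the range
    row_key("zEDGE", 1314180700),  # different fieldA
])
print([k for k in table if start <= k < stop])
```

In the real Java API, `start` and `stop` would become the start and stop rows of the `Scan` object Ian links above.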