+1 on everything said so far... Sean, you might also want to check this: http://hbase.apache.org/book.html#architecture
On 8/26/11 2:50 PM, "lars hofhansl" <[email protected]> wrote:

>In a nutshell, a change to HBase is performed like this:
>1. The WAL entry is written and synced to disk.
>2. The memstore is updated (that's just a cache in memory).
>3. When the memstore reaches a certain size, it is flushed to create a
>new file.
>4. When a certain number of files is reached, they are compacted
>(combined into fewer files).
>
>When you do a read, HBase scans the memstore and all relevant store files.
>It does that similarly to what a mergesort does.
>
>-- Lars
>
>________________________________
>From: Sheng Chen <[email protected]>
>To: [email protected]
>Sent: Thursday, August 25, 2011 11:08 PM
>Subject: Re: schema help
>
>If the rows are added with random keys and flushed periodically, is it
>possible that every HFile holds almost the whole key range?
>Will it affect random read performance before the compaction is done?
>
>Thanks.
>
>Sean
>
>2011/8/25 Ian Varley <[email protected]>
>
>> The rows don't need to be inserted in order; they're maintained in
>> key-sorted order on disk based on the architecture of HBase, which
>> stores data sorted in memory and periodically flushes to immutable
>> files in HDFS (which are later compacted to make read access more
>> efficient). HBase keeps track of which physical files might contain a
>> given key range, and only reads the ones it needs to.
>>
>> To do a query through the Java API, you could create a scanner with a
>> startrow that is the concatenation of your value for fieldA and the
>> start time, and an endrow that has the current time.
>>
>> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html
>>
>> Ian
>>
>> On Aug 25, 2011, at 9:53 AM, Rita wrote:
>>
>> Thanks for your response.
>>
>> 30 million rows is the best case :-)
>>
>> A couple of questions about using [fieldA][time] as my key:
>> Would I have to insert in order?
>> If not, how would HBase know to stop scanning the entire table?
>> What would a query actually look like if my key was [fieldA][time]?
>>
>> As a matter of fact, I can do 100% of my queries that way. I will
>> leave the 5% out of my project/schema.
>>
>>
>> On Thu, Aug 25, 2011 at 10:13 AM, Ian Varley <[email protected]
>> <mailto:[email protected]>> wrote:
>> Rita,
>>
>> There's no need to create separate tables here--the table is really
>> just a "namespace" for keys. A better option would probably be having
>> one table with "[fieldA][time]" (the two fields concatenated) as your
>> row key. Then you can seek directly to the start of your records in
>> constant time, and scan forward until you get to the end of the data
>> (linear time in the size of the data you expect to get back).
>>
>> The downside of this is that for the 5% of your queries that aren't in
>> this form, you may have to do a full table scan. (Alternately, you
>> could also maintain secondary indexes that help you get the data back
>> with less than a full table scan; that would depend on the nature of
>> the queries.)
>>
>> In general, a good rule of thumb when designing a schema in HBase is:
>> think first about how you'd ideally like to access the data. Then
>> structure the data to match that access pattern. (This is obviously
>> not ideal if you have lots of different access patterns, but then,
>> that's what relational databases are for. Most commercial relational
>> DBs wouldn't blink at doing analytical queries against 30 million
>> rows.)
>>
>> Ian
>>
>> On Aug 25, 2011, at 9:03 AM, Rita wrote:
>>
>> Hello,
>>
>> I am trying to solve a time-related problem. I can certainly use
>> OpenTSDB for this, but was wondering if anyone had a clever way to
>> create this type of schema.
>>
>> I have an inventory table:
>>
>> time (unix epoch), fieldA, fieldB, data
>>
>> There are about 30 million of these entries.
>>
>> 95% of my queries will look like this:
>> show me where fieldA=zCORE in the range [1314180693 to now]
>>
>> For fieldA, there is a possibility of 4000 unique items.
>> For fieldB, there is a possibility of 2 unique items (bool).
>>
>> So, I was thinking of creating 4000*2 tables and placing the data like
>> that so I can easily scan.
>>
>> Any thoughts about this? Will HBase freak out if I have 8000 tables?
>>
>> --
>> --- Get your facts first, then you can distort them as you please. --
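The merge-style read Lars describes (and the overlap Sean asks about) can be sketched in a few lines of Python. This is an illustration of the idea only, not HBase code; the data and names are made up:

```python
import heapq

# The memstore and each store file are individually sorted by row key; a
# read merges them the way the merge step of a mergesort does. Because
# these rows were written with "random" keys and flushed periodically
# (Sean's scenario), the two files' key ranges overlap, so both must be
# consulted on every read until a compaction combines them.

memstore = [("row03", "v7"), ("row11", "v8")]  # in-memory, sorted cache
hfile1 = [("row01", "v1"), ("row09", "v2")]    # first flush
hfile2 = [("row02", "v3"), ("row10", "v4")]    # second flush; overlaps hfile1

def merged_scan(*sorted_sources):
    """Merge already-sorted (key, value) sequences into one sorted stream."""
    return list(heapq.merge(*sorted_sources))

for key, value in merged_scan(memstore, hfile1, hfile2):
    print(key, value)  # keys emerge in global sorted order
```

Compaction, in these terms, just replaces several overlapping sorted inputs with one merged file, so later reads have fewer sources to consult.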

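The "[fieldA][time]" row key Ian suggests depends on lexicographic (byte) order matching time order, which is why the epoch should be fixed-width. A sketch, again in Python rather than the HBase Java API; the helper names and the "#" separator are illustrative assumptions:

```python
# Build "[fieldA][time]" row keys. Epoch seconds are zero-padded to a
# fixed width so that the byte order HBase sorts by matches numeric
# time order (unpadded, "9" would sort after "10").

def row_key(field_a, epoch_seconds):
    return f"{field_a}#{epoch_seconds:010d}"

def scan_range(field_a, start_epoch, end_epoch):
    """Start (inclusive) and stop (exclusive) row keys for a time-range scan."""
    return row_key(field_a, start_epoch), row_key(field_a, end_epoch + 1)

# Rita's 95% query: fieldA=zCORE from 1314180693 to "now".
start, stop = scan_range("zCORE", 1314180693, 1314190000)

# A scanner seeks straight to `start` and reads forward until it passes
# `stop` -- which is how HBase knows when to stop scanning the table,
# even though rows were inserted in no particular order.
table = sorted([
    row_key("zCORE", 1314180000),  # before the range
    row_key("zCORE", 1314180700),  # inside the range
    row_key("zEDGE", 1314180700),  # different fieldA
])
print([k for k in table if start <= k < stop])
```

In the real Java API, `start` and `stop` would become the start and stop rows of the `Scan` object Ian links above.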