I think moving all of the column 2 data into another table would help use the block cache more efficiently.
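
To make the block-cache point concrete, here is a rough Python sketch. It is a toy model, not HBase code. It assumes the cells of one table and column family are packed in sorted order into fixed-size blocks, so column 1 and column 2 of the same row sit next to each other in the same block. The 64 KB block size is the HBase default; the 60 bytes per cell and the name blocks_needed are my own assumptions for illustration.

```python
# Toy model (not HBase code): count how many fixed-size blocks must be read
# to fetch every "column 2" cell under two layouts.
import math

BLOCK_SIZE = 64 * 1024  # bytes per block (HBase's default block size)
CELL_SIZE = 60          # ASSUMED on-disk bytes per cell: key overhead + ~20-byte value


def blocks_needed(num_rows: int, cells_per_row: int) -> int:
    """Blocks touched when reading num_rows rows that each store cells_per_row cells."""
    cells_per_block = BLOCK_SIZE // CELL_SIZE
    return math.ceil(num_rows * cells_per_row / cells_per_block)


rows = 1_000_000  # hypothetical sample size
mixed = blocks_needed(rows, cells_per_row=2)  # column 1 and column 2 interleaved
split = blocks_needed(rows, cells_per_row=1)  # column 2 alone in its own table

print(f"mixed layout: {mixed} blocks, split layout: {split} blocks")
```

With these assumed numbers, a query that only wants column 2 pulls roughly twice as many blocks through the cache in the mixed layout. The real ratio will depend on key sizes, compression, and how many cells are empty.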

On Wed, Aug 11, 2010 at 4:45 PM, Yu Bady <[email protected]> wrote:
> Thank you very much, St.Ack, for the helpful answers.
>
> Inline also.
>
> On Wed, Aug 11, 2010 at 11:11 PM, Stack <[email protected]> wrote:
>
> > Inline below.
> >
> > On Tue, Aug 10, 2010 at 10:55 PM, Yu Bady <[email protected]> wrote:
> > > Hi,
> > >
> > > We are going to use HBase to store our large volume of pretty
> > > structured data.
> > >
> > > Every day, we will have about 24 new roles added to one table.
> > > After three months, there will be about 4,000,000,000 new rows in
> > > the table.
> >
> > Sounds fine.
> >
> > > By the way, in the table, each row will have about 8 column
> > > families and each column family will have 2-3 columns. But each
> > > cell contains just 20 bytes of data.
> >
> > Why 8 column families? You'll be doing accesses against individual
> > column families? If you could do with fewer, that'd be better but 8
> > should be fine.
> >
> > > So I have the following questions:
> > >
> > > 1. How many rows can HBase support in one table?
> >
> > I don't know. I know of tables of 30B small rows.
> >
> > > 2. After one year, there will be about 16,000,000,000 rows in the
> > > table. If the number of rows is too large, does it help to split
> > > the original table into several tables? How do we split one table
> > > into several tables?
> >
> > How big are your cells?
>
> Each cell contains a string of less than 20 bytes. In fact, each cell
> holds either an integer or a double. Quite a few cells will have no
> value, which means the value is 0/0.0.
>
> > As far as hbase is concerned, there is no real difference hosting
> > many vs one table.
> >
> > > 3. Any other suggestions?
> >
> > Tell us more about how you intend to access the table -- the kind of
> > queries -- otherwise, sounds fine. Can you try things out in the
> > small first to learn the edge cases yourself?
>
> Let me give an example here. To ease the description, suppose we only
> have 2 column families instead of 8.
>
> We have some logs. Each log line contains several fields as follows:
> user 1|val_a | val_b | ....
>
> After processing the logs, the values will be loaded into HBase by
> map/reduce:
>
>        | column family a     | column family b     |
>        | column 1 | column 2 | column 1 | column 2 |
> ---------------------------------------------------
> user 1 | val_a    |          | val_b    |          |
>
> Then we will run map/reduce against the HBase table to aggregate some
> values of column 1 for each column family, and the result will be
> written to column 2. That is, the map/reduce will read the value in
> column 1 and write the result to column 2 for each column family.
>
> Queries against the HBase table will only access the value in column 2,
> but they may access both column families at the same time.
>
> Of course, we can merge the two column families into one as follows:
>
>        | column family a-b                                 |
>        | column_a_1 | column_a_2 | column_b_1 | column_b_2 |
> -----------------------------------------------------------
> user 1 | val_a      |            | val_b      |            |
>
> Does this benefit performance? What is the rule for column family
> organization?
>
> What's your suggestion on the placement of column 2? Leave it as in the
> current design, or move it out into another table? If we move all of
> column 2 into another table, will that increase space consumption?
>
> > St.Ack
