Re: Do we need to split the table into two when there are too many rows in one table?

Ted Yu Thu, 12 Aug 2010 10:53:53 -0700

I think moving all of column 2 into another table would help utilize the
block cache more efficiently.
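
A rough way to see why, as a toy model rather than HBase code: assume cells
are stored sorted by row and then by column, and packed into fixed-size
blocks, so each cached block holds the same mix of columns as a row does.
The 64 KB block size, the 60-byte cell size, and the function name below are
illustrative assumptions, not measured values.

```python
# Toy model, not HBase code: estimate what fraction of each cached block
# holds column-2 cells, the only cells the queries read.

BLOCK_BYTES = 64 * 1024   # assumed block size
CELL_BYTES = 60           # assumed size of one stored cell (key + 20-byte value)


def useful_fraction(columns_per_row, wanted_columns):
    """Fraction of a block's cells that a column-2-only query needs.

    Cells are laid out row by row, so every block contains the same mix
    of columns as a single row does.
    """
    cells_per_block = BLOCK_BYTES // CELL_BYTES
    rows_per_block = cells_per_block // columns_per_row
    wanted_cells = rows_per_block * wanted_columns
    return wanted_cells / (rows_per_block * columns_per_row)


# Column 1 and column 2 stored together: half of each block is column 1.
mixed = useful_fraction(columns_per_row=2, wanted_columns=1)
# Column 2 stored on its own (separate table): every cell is wanted.
separate = useful_fraction(columns_per_row=1, wanted_columns=1)

print(f"together: {mixed:.0%} useful, separate: {separate:.0%} useful")
```

Under these assumptions, the same amount of cache covers roughly twice as
many rows of column 2 once column 1 is stored elsewhere.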


On Wed, Aug 11, 2010 at 4:45 PM, Yu Bady <[email protected]> wrote:

> Thanks very much, St.Ack, for the helpful answers.
>
> Inline also.
>
> On Wed, Aug 11, 2010 at 11:11 PM, Stack <[email protected]> wrote:
>
> > Inline below.
> >
> > On Tue, Aug 10, 2010 at 10:55 PM, Yu Bady <[email protected]> wrote:
> > > Hi,
> > >
> > >
> > > We are going to use HBase to store our large volume of pretty structured
> > > data.
> > >
> > > Every day, we will have about 24 new roles added to one table. After three
> > > months, there will be about 4,000,000,000 new rows in the table.
> > >
> >
> > Sounds fine.
> >
> > > By the way, in the table, each row will have about 8 column families and
> > > each column family will have 2-3 columns. But each cell contains just 20
> > > bytes of data.
> > >
> >
> > Why 8 column families?  Will you be doing accesses against individual
> > column families?  If you could do with fewer, that'd be better, but 8
> > should be fine.
> >
> >
> > >
> > > So I have following questions:
> > >
> > > 1. How many rows can HBase support in one table?
> > >
> >
> > I don't know.  I know of tables of 30B small rows.
> >
> >
> > > 2. After one year, there will be about 16,000,000,000 rows in the table. If
> > > the number of rows is too large, would it help to split the original
> > > table into several tables? How would we split one table into several
> > > tables?
> > >
> >
> > How big are your cells?
> >
>
>
> Each cell contains a string of less than 20 bytes. In fact, each cell holds
> either an integer or a double. Quite a few cells will have no
> value, which means the value is 0/0.0.
>
>
> > As far as hbase is concerned, there is no real difference hosting many
> > vs one table.
> >
> > > 3. Any other suggestions?
> > >
> >
> > Tell us more about how you intend to access the table -- the kind of
> > queries -- otherwise, sounds fine.  Can you try things out in the
> > small first to learn the edge cases yourself?
> >
> >
> Let me give an example here.  To ease the description, suppose we only have
> 2 column families instead of 8.
>
> We have some logs. Each log line contains several fields as follows:
>         user 1|val_a | val_b | ....
>
> After processing the logs, the values will be loaded into HBase by
> map/reduce:
>
>        | column family a     | column family b     |
>        | column 1 | column 2 | column 1 | column 2 |
> -------+----------+----------+----------+----------+
> user 1 | val_a    |          | val_b    |          |
>
> Then we will run map/reduce against the HBase table to aggregate some values
> of column 1 for each column family, and the result will be written to
> column 2. That is, the map/reduce job will read the value in column 1 and
> write the result to column 2 for each column family.
>
> Queries against the HBase table will only access the value in column 2, but
> they may access both column families at the same time.
>
> Of course, we can merge the two column families into one as follows:
>        | column family a-b                                 |
>        | column_a_1 | column_a_2 | column_b_1 | column_b_2 |
> -------+------------+------------+------------+------------+
> user 1 | val_a      |            | val_b      |            |
>
> Would merging them benefit performance? What are the rules for organizing
> column families?
>
> What's your suggestion on the placement of column 2? Leave it as in the
> current design, or move it out into another table? If we move all of column 2
> into another table, will that increase space consumption?
>
>
>
>
>
>
> > St.Ack
> >
>
