Re: schema optimisation - go for multiple tables, rows or column families?

kisalay Mon, 09 Jan 2012 05:17:48 -0800

Tom,

I would like to add to what Jonathan suggested. Approach (1), having
multiple tables, has multiple problems:
a> As Jonathan pointed out, regions are created on a per-table basis, so data
from different tables will fall in different regions. There is no guarantee
about which servers these regions are assigned to.
b> The greater problem that I see with approach (1) is that a small
metadata table may not be split well into regions (as splitting is size
based) and hence can become a hot-spot, as a lot of keys will fall in one
region.
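
To make point (b) concrete, here is a toy Python model of size-based
splitting. It is not HBase code: the count_regions helper, the 256 MB
threshold and the table sizes are all assumptions picked for illustration
(in HBase the size limit is governed by hbase.hregion.max.filesize and the
configured split policy).

```python
def count_regions(table_bytes, max_region_bytes):
    """Toy model: split the largest region in half until all fit the limit."""
    regions = [table_bytes]
    while max(regions) > max_region_bytes:
        biggest = max(regions)
        regions.remove(biggest)
        half = biggest // 2
        regions.extend([half, biggest - half])
    return len(regions)


MB = 1024 * 1024
SPLIT_LIMIT = 256 * MB  # assumed threshold, chosen only for illustration

metadata_regions = count_regions(50 * MB, SPLIT_LIMIT)             # low volume
measurement_regions = count_regions(100 * 1024 * MB, SPLIT_LIMIT)  # high volume

print("metadata regions:", metadata_regions)
print("measurement regions:", measurement_regions)
```

Under these assumed numbers the small metadata table never crosses the
threshold and stays in one region, so every metadata lookup lands on the
same server, while the measurement table is spread over many regions.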


There is more. If you store the two kinds of data in different column
families, they will in turn be stored in different store files. So when you
fetch both of them, you will in fact be reading from two different store
files, and possibly from two different physical nodes.

So I would ask you: can you store both meta and measurement data as two
different columns in the same column family? In that case, one fetch on the
key for both data points will resolve to the same region and the same store
file.
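
A rough way to picture the difference is the toy Python model below. It is
only a sketch and does not use the HBase client API: ToyRegion, the family
names ("meta", "data", "d") and the sample row are hypothetical, and a real
store can hold several store files rather than the single dict used here.

```python
from collections import defaultdict


class ToyRegion:
    """Toy model of one region: one store (a dict) per column family."""

    def __init__(self):
        # family -> {(row, qualifier): value}
        self.stores = defaultdict(dict)

    def put(self, row, family, qualifier, value):
        self.stores[family][(row, qualifier)] = value

    def get(self, row, columns):
        """Return the requested values and the set of stores that were read."""
        touched = set()
        values = {}
        for family, qualifier in columns:
            touched.add(family)
            values[(family, qualifier)] = self.stores[family].get((row, qualifier))
        return values, touched


# Layout A: meta and measurement data in two column families.
two_families = ToyRegion()
two_families.put("id42", "meta", "info", "sensor A")
two_families.put("id42", "data", "value", "17.3")
_, touched_two = two_families.get("id42", [("meta", "info"), ("data", "value")])

# Layout B: both stored as two columns of a single column family.
one_family = ToyRegion()
one_family.put("id42", "d", "info", "sensor A")
one_family.put("id42", "d", "value", "17.3")
values_one, touched_one = one_family.get("id42", [("d", "info"), ("d", "value")])

print("stores read, two families:", len(touched_two))
print("stores read, one family:", len(touched_one))
```

In this model the same row lookup reads two stores in layout A and only one
in layout B, which is the saving I am suggesting.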

Just a thought.

~Kisalay


On Mon, Jan 9, 2012 at 5:21 PM, Jonathan Hsieh <[email protected]> wrote:

> Hi Tom,
>
> In the case you describe -- two HTables -- there is no guarantee that they
> will end up going to the same region server.  If you have multiple tables,
> these are different regions, which can (and most likely will) be
> distributed to different regionserver machines.  The fact that both tables
> use the same rowkeys doesn't matter.
>
> If you use (2), the single table with column family approach, they would be
> located in the same region and thus the same regionserver.
>
> Given your concerns, and depending on your read patterns (do you do a lot
> of scans of only the meta data?), I'd probably take approach (2) or (3).
>
> Jon.
>
> On Mon, Jan 9, 2012 at 2:01 AM, Tom <[email protected]> wrote:
>
> > Hello,
> >
> > I got most, but not all, answers about schemas from the HBase Book and the
> > "Definitive Guide".
> > Let's say there is a single row key and I use this key to add to two
> > tables, one row each (case (1)).
> > Could someone please confirm that even though the tables are different,
> > based on the key, this data will end up in the same or at least adjacent
> > regions? (I.e. my hbase client has to deal with two HTable instances but
> > only one region server needs to be looked up)?
> >
> > Thank you,
> > Tom
> >
> > Background:
> > I have two types of data: meta data (low volume) and measurement data
> > (high volume); and I get requests coming in where, based on an ID, I need
> > my HBase client to be able to access both metadata and measurement data for
> > this ID quickly. I want to reduce communication overhead (lookups, number
> > of TCP connections etc).
> >
> > With regard to handling the two types of data in HBase, I see these
> > three design choices. Which one should I go for?
> >
> > (1) Multiple tables - single key - single column family
> >
> > (2) Single table - single key - multiple column families (the HBase Book
> > advises against that in section 6.2).
> >
> > (3) Single table - multiple keys (all made in such a way that they will be
> > co-located and system-wide hot spots are avoided) - single column family
> >
> >
>
>
> --
> // Jonathan Hsieh (shay)
> // Software Engineer, Cloudera
> // [email protected]
>
