We use Pig to join HBase tables using HBaseStorage, which has worked well. If you're using HBase >= 0.89 you'll need to build Pig from trunk or from the Pig 0.8 branch.
On Tue, May 31, 2011 at 5:18 PM, Jason Rutherglen <[email protected]> wrote:

> > The Hive-HBase integration allows you to create Hive tables that are
> > backed by HBase
>
> In addition, HBase can be made to go faster for MapReduce jobs if the
> HFiles could be used directly in HDFS, rather than proxying through the
> RegionServer.
>
> I'd imagine that join operations do not require realtime-ness, and so
> faster batch jobs using Hive -> frozen HBase files in HDFS could be the
> optimal way to go?
>
> On Tue, May 31, 2011 at 1:41 PM, Patrick Angeles <[email protected]> wrote:
>
> > On Tue, May 31, 2011 at 3:19 PM, Eran Kutner <[email protected]> wrote:
> >
> >> For my need I don't really need the general case, but even if I did I
> >> think it can probably be done more simply.
> >> The main problem is getting the data from both tables into the same
> >> MR job without resorting to lookups. So without the theoretical
> >> MultiTableInputFormat, I could just copy all the data from both
> >> tables into a temp table, appending the source table name to the row
> >> keys to make sure there are no conflicts. When all the data from both
> >> tables is in the same temp table, run a MR job. For each row the
> >> mapper should emit a key composed of all the values of the join
> >> fields in that row (the value can be emitted as is). This will cause
> >> all the rows from both tables with the same join field values to
> >> arrive at the reducer together. The reducer could then iterate over
> >> them and produce the Cartesian product as needed.
> >>
> >> I still don't like having to copy all the data into a temp table just
> >> because I can't feed two tables into the MR job.
> >
> > Loading the smaller table in memory is called a map join, versus a
> > reduce-side join (a.k.a. common join).
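Eran's reduce-side scheme (tag each row with its source table, shuffle on a composite key built from the join fields, and take the Cartesian product of the two sides in the reducer) can be sketched in plain Python. This is only a simulation of the MR data flow; the table names, columns, and the `dept` join field are invented for illustration:

```python
from itertools import product
from collections import defaultdict

# Hypothetical stand-ins for the two HBase tables; each row is
# (row_key, {column: value}). Names and columns are made up.
users = [("u1", {"dept": "eng", "name": "alice"}),
         ("u2", {"dept": "ops", "name": "bob"})]
depts = [("d1", {"dept": "eng", "floor": "3"}),
         ("d2", {"dept": "eng", "floor": "4"})]

def map_phase(table_name, rows, join_field):
    """Mapper: emit (join-key, (source-table, row-key, columns)).
    Tagging each value with its source table plays the role of
    appending the table name to the row key in the temp-table trick."""
    for row_key, cols in rows:
        yield cols[join_field], (table_name, row_key, cols)

def shuffle(pairs):
    """Stand-in for the MR shuffle: group all values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: split each group by source table and emit the
    Cartesian product of the two sides."""
    for key, values in groups.items():
        left = [v for v in values if v[0] == "users"]
        right = [v for v in values if v[0] == "depts"]
        for l, r in product(left, right):
            yield key, l[1], r[1]

pairs = list(map_phase("users", users, "dept")) + \
        list(map_phase("depts", depts, "dept"))
joined = sorted(reduce_phase(shuffle(pairs)))
# key "eng" matches one user against two dept rows, so two output tuples
```

Note that one unmatched row ("ops" on the users side here) simply produces no output, which is inner-join semantics; outer joins would emit the unpaired side with nulls.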
> > One reason to prefer a map join is that you avoid the shuffle phase,
> > which potentially involves several trips to disk for the intermediate
> > records due to spills, and also one pass through the network to get
> > each intermediate KV pair to the right reducer. With a map join,
> > everything is local, except for the part where you load the small
> > table.
> >
> >> As Jason Rutherglen mentioned above, Hive can do joins. I don't know
> >> if it can do them for HBase and it will not suit my needs, but it
> >> would be interesting to know how it does them, if anyone knows.
> >
> > The Hive-HBase integration allows you to create Hive tables that are
> > backed by HBase. You can do joins on those tables (and also with
> > standard Hive tables). It might be worth trying out in your case as
> > it lets you easily see the load characteristics and the job runtime
> > without much coding investment.
> >
> > There are probably some specific optimizations that can be applied to
> > your situation, but it's hard to say without knowing your use-case.
> >
> > Regards,
> >
> > - Patrick
> >
> >> -eran
> >>
> >> On Tue, May 31, 2011 at 22:02, Ted Dunning <[email protected]> wrote:
> >>
> >> > The Cartesian product often makes an honest-to-god join not such a
> >> > good idea on large data. The common alternative is co-group, which
> >> > is basically like doing the hard work of the join but stopping
> >> > just before emitting the Cartesian product. This allows you to
> >> > inject whatever cleverness you need at this point.
> >> >
> >> > Common kinds of cleverness include down-sampling of problematically
> >> > large sets of candidates.
> >> >
> >> > On Tue, May 31, 2011 at 11:56 AM, Michael Segel
> >> > <[email protected]> wrote:
> >> >
> >> > > So the underlying problem that the OP was trying to solve was
> >> > > how to join two tables from HBase.
> >> > > Unfortunately I goofed.
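Patrick's map join can be sketched the same way: the small table is loaded into memory once and the large table is streamed past it, with no shuffle at all. Again the tables and the `dept` join field are hypothetical:

```python
# Map-side join sketch: the small table fits in memory as a dict keyed
# by the join field; the large table is streamed through the "mappers".
small = {"eng": {"floor": "3"}, "ops": {"floor": "9"}}  # in memory
large = [("u1", {"dept": "eng", "name": "alice"}),
         ("u2", {"dept": "ops", "name": "bob"}),
         ("u3", {"dept": "hr", "name": "carol"})]       # streamed

def map_join(stream, lookup, join_field):
    """Each 'map task' probes the in-memory copy of the small table;
    rows with no match are dropped (inner-join semantics)."""
    for row_key, cols in stream:
        match = lookup.get(cols[join_field])
        if match is not None:
            yield row_key, {**cols, **match}

result = list(map_join(large, small, "dept"))
```

Because every probe is a local hash lookup, the only network cost is distributing the small table to each map task (which Hive does with a distributed cache in its real map-join implementation).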
> >> > > I gave a quick and dirty solution that is a bit incomplete. The
> >> > > row key in the temp table has to be unique, and I forgot about
> >> > > the Cartesian product, so my solution wouldn't work in the
> >> > > general case.
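Ted's co-group alternative, mentioned earlier in the thread, stops just before the Cartesian product, so oversized groups can be down-sampled before any output is emitted. A minimal sketch, with made-up keys and a hypothetical per-side cap:

```python
from collections import defaultdict

# Co-group sketch: do the grouping work of the join, but emit one
# record per key with both value lists, not len(left) * len(right)
# joined records.
left = [("eng", "u1"), ("eng", "u2"), ("eng", "u3"), ("ops", "u4")]
right = [("eng", "d1"), ("eng", "d2"), ("ops", "d3")]

def cogroup(left, right):
    """Group both inputs by key: key -> (left_values, right_values)."""
    groups = defaultdict(lambda: ([], []))
    for k, v in left:
        groups[k][0].append(v)
    for k, v in right:
        groups[k][1].append(v)
    return groups

MAX_PER_SIDE = 2  # hypothetical cap on candidates per side

for key, (ls, rs) in sorted(cogroup(left, right).items()):
    # the "cleverness" goes here: e.g. down-sample problematically
    # large candidate sets before producing any pairs
    ls, rs = ls[:MAX_PER_SIDE], rs[:MAX_PER_SIDE]
    # ...then emit only what the application actually needs
```

A full join would expand each `(ls, rs)` pair into its Cartesian product; co-group leaves that decision to the caller.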
