Thanks everyone for all the helpful insights! -eran
On Wed, Jun 1, 2011 at 03:41, Jason Rutherglen <[email protected]> wrote:

> > I'd imagine that join operations do not require realtime-ness, and so
> > faster batch jobs using Hive -> frozen HBase files in HDFS could be
> > the optimal way to go?
>
> In addition to lessening the load on the perhaps live RegionServer.
> There's no Jira for this, I'm tempted to open one.
>
> On Tue, May 31, 2011 at 5:18 PM, Jason Rutherglen
> <[email protected]> wrote:
> >> The Hive-HBase integration allows you to create Hive tables that are
> >> backed by HBase
> >
> > In addition, HBase can be made to go faster for MapReduce jobs, if the
> > HFiles could be used directly in HDFS, rather than proxying through
> > the RegionServer.
> >
> > I'd imagine that join operations do not require realtime-ness, and so
> > faster batch jobs using Hive -> frozen HBase files in HDFS could be
> > the optimal way to go?
> >
> > On Tue, May 31, 2011 at 1:41 PM, Patrick Angeles <[email protected]> wrote:
> >> On Tue, May 31, 2011 at 3:19 PM, Eran Kutner <[email protected]> wrote:
> >>
> >>> For my need I don't really need the general case, but even if I did I
> >>> think it can probably be done simpler.
> >>> The main problem is getting the data from both tables into the same MR
> >>> job, without resorting to lookups. So without the theoretical
> >>> MultiTableInputFormat, I could just copy all the data from both tables
> >>> into a temp table, appending the source table name to the row keys to
> >>> make sure there are no conflicts. When all the data from both tables is
> >>> in the same temp table, run a MR job. For each row the mapper should
> >>> emit a key which is composed of all the values of the join fields in
> >>> that row (the value can be emitted as is). This will cause all the rows
> >>> from both tables with the same join field values to arrive at the
> >>> reducer together. The reducer could then iterate over them and produce
> >>> the Cartesian product as needed.
> >>>
> >>> I still don't like having to copy all the data into a temp table just
> >>> because I can't feed two tables into the MR job.
> >>>
> >>
> >> Loading the smaller table in memory is called a map join, versus a
> >> reduce-side join (a.k.a. common join). One reason to prefer a map join
> >> is that you avoid the shuffle phase, which potentially involves several
> >> trips to disk for the intermediate records due to spills, and also one
> >> trip through the network to get each intermediate KV pair to the right
> >> reducer. With a map join, everything is local, except for the part
> >> where you load the small table.
> >>
> >>> As Jason Rutherglen mentioned above, Hive can do joins. I don't know
> >>> if it can do them for HBase and it will not suit my needs, but it
> >>> would be interesting to know how it does them, if anyone knows.
> >>
> >> The Hive-HBase integration allows you to create Hive tables that are
> >> backed by HBase. You can do joins on those tables (and also with
> >> standard Hive tables). It might be worth trying out in your case as it
> >> lets you easily see the load characteristics and the job runtime
> >> without much coding investment.
> >>
> >> There are probably some specific optimizations that can be applied to
> >> your situation, but it's hard to say without knowing your use-case.
> >> Regards,
> >>
> >> - Patrick
> >>
> >>> -eran
> >>>
> >>> On Tue, May 31, 2011 at 22:02, Ted Dunning <[email protected]> wrote:
> >>>
> >>> > The Cartesian product often makes an honest-to-god join not such a
> >>> > good idea on large data. The common alternative is co-group, which
> >>> > is basically like doing the hard work of the join, but involves
> >>> > stopping just before emitting the Cartesian product. This allows you
> >>> > to inject whatever cleverness you need at this point.
> >>> >
> >>> > Common kinds of cleverness include down-sampling of problematically
> >>> > large sets of candidates.
> >>> >
> >>> > On Tue, May 31, 2011 at 11:56 AM, Michael Segel
> >>> > <[email protected]> wrote:
> >>> >
> >>> > > So the underlying problem that the OP was trying to solve was how
> >>> > > to join two tables from HBase.
> >>> > > Unfortunately I goofed.
> >>> > > I gave a quick and dirty solution that is a bit incomplete. The
> >>> > > row key in the temp table has to be unique and I forgot about the
> >>> > > Cartesian product. So my solution wouldn't work in the general
> >>> > > case.
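
For concreteness, here is a minimal sketch of the reduce-side join Eran describes
above: a single MR job over a temp table whose row keys carry a source-table prefix
such as "tableA|" or "tableB|". The column family and qualifier names ("cf",
"join_col"), the prefixes, and the plain-text output are hypothetical placeholders,
not anything specified in the thread.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TempTableJoin {

  // Mapper over the temp table: emit (join-field value -> temp-table row key).
  // The temp-table row key already carries the source prefix, e.g. "tableA|origKey".
  public static class JoinMapper extends TableMapper<Text, Text> {
    @Override
    protected void map(ImmutableBytesWritable rowKey, Result row, Context context)
        throws IOException, InterruptedException {
      // Hypothetical single join column; a real job would concatenate every join field.
      byte[] join = row.getValue(Bytes.toBytes("cf"), Bytes.toBytes("join_col"));
      if (join == null) {
        return; // row has no join value, nothing to emit
      }
      String tempKey = Bytes.toString(rowKey.get(), rowKey.getOffset(), rowKey.getLength());
      context.write(new Text(Bytes.toString(join)), new Text(tempKey));
    }
  }

  // Reducer: rows from both source tables with the same join value arrive together;
  // split them by their source prefix and emit the Cartesian product.
  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text joinValue, Iterable<Text> rowKeys, Context context)
        throws IOException, InterruptedException {
      List<String> fromA = new ArrayList<String>();
      List<String> fromB = new ArrayList<String>();
      for (Text k : rowKeys) {
        if (k.toString().startsWith("tableA|")) {
          fromA.add(k.toString());
        } else {
          fromB.add(k.toString());
        }
      }
      for (String a : fromA) {
        for (String b : fromB) {
          context.write(joinValue, new Text(a + "\t" + b));
        }
      }
    }
  }
}

The job itself would be wired up with TableMapReduceUtil.initTableMapperJob against
the temp table, with JoinReducer set as the reducer class.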
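
A minimal sketch of the map join Patrick contrasts it with, using the HTable client
API current at the time: the mapper scans the large table while the small table is
loaded into an in-memory HashMap in setup(), so each row is joined locally and no
shuffle or reduce phase is needed. The table name "small_table" and the column names
are made up for illustration, and this only works when the small table fits in the
mapper's heap.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;

public class MapJoinMapper extends TableMapper<Text, Text> {

  // join value -> row key of the small table, built once per mapper in setup()
  private final Map<String, String> smallTable = new HashMap<String, String>();

  @Override
  protected void setup(Context context) throws IOException {
    Configuration conf = HBaseConfiguration.create(context.getConfiguration());
    HTable table = new HTable(conf, "small_table");
    ResultScanner scanner = table.getScanner(new Scan());
    try {
      for (Result r : scanner) {
        byte[] join = r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("join_col"));
        if (join != null) {
          smallTable.put(Bytes.toString(join), Bytes.toString(r.getRow()));
        }
      }
    } finally {
      scanner.close();
      table.close();
    }
  }

  @Override
  protected void map(ImmutableBytesWritable rowKey, Result row, Context context)
      throws IOException, InterruptedException {
    byte[] join = row.getValue(Bytes.toBytes("cf"), Bytes.toBytes("join_col"));
    if (join == null) {
      return;
    }
    String match = smallTable.get(Bytes.toString(join));
    if (match != null) {
      // Emit the joined pair of row keys directly; no shuffle, no reducer.
      context.write(new Text(Bytes.toString(row.getRow())), new Text(match));
    }
  }
}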
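
Finally, a rough sketch of the co-group idea Ted mentions, as a drop-in replacement
for the JoinReducer above: gather the candidate set from each side of the join key,
but cap the sets before emitting anything so that a single hot key cannot explode
into an enormous Cartesian product. The cap value and the keep-the-first-N sampling
are arbitrary choices for illustration; any application-specific cleverness could go
at that point instead.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CoGroupReducer extends Reducer<Text, Text, Text, Text> {

  private static final int MAX_PER_SIDE = 1000; // hypothetical cap on each candidate set

  @Override
  protected void reduce(Text joinValue, Iterable<Text> rowKeys, Context context)
      throws IOException, InterruptedException {
    List<String> fromA = new ArrayList<String>();
    List<String> fromB = new ArrayList<String>();
    for (Text k : rowKeys) {
      String v = k.toString();
      // Down-sample: ignore candidates past the cap instead of letting one hot
      // join key produce an unmanageable Cartesian product.
      if (v.startsWith("tableA|")) {
        if (fromA.size() < MAX_PER_SIDE) {
          fromA.add(v);
        }
      } else if (fromB.size() < MAX_PER_SIDE) {
        fromB.add(v);
      }
    }
    // The two candidate sets are now co-grouped; emit whatever the application
    // actually needs -- here, a bounded Cartesian product of row-key pairs.
    for (String a : fromA) {
      for (String b : fromB) {
        context.write(joinValue, new Text(a + "\t" + b));
      }
    }
  }
}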
