> you somehow need to flush all in-memory data *and* perform a
> major compaction

This makes sense. Without compaction the linear HDFS scan isn't
possible. I suppose one could compact 'offline' in a different
MapReduce job, although that would have its own issues.

> The files do have a flag if they were made by a major compaction,
> so you scan only those and ignore the newer ones - but then you are trailing

This could be OK in many cases. The key would be to create a synced
cut-off point, enabling a frozen point-in-time 'view' of the data. I'm
not sure how that would be implemented.

On Wed, Jun 1, 2011 at 6:54 AM, Lars George <[email protected]> wrote:
> Hi Jason,
>
> This was discussed in the past, using the HFileInputFormat. The issue
> is that you somehow need to flush all in-memory data *and* perform a
> major compaction - or else you would need all the logic of the
> ColumnTracker in the HFIF. Since that needs to scan all storage files
> in parallel to achieve its job, the MR task would not really be able
> to use the same approach.
>
> Running a major compaction creates a lot of churn, so it is
> questionable what the outcome is. The files do have a flag if they
> were made by a major compaction, so you scan only those and ignore the
> newer ones - but then you are trailing, and you still do not handle
> delete markers/updates in newer files. No easy feat.
>
> Lars
>
> On Wed, Jun 1, 2011 at 2:41 AM, Jason Rutherglen
> <[email protected]> wrote:
>>> I'd imagine that join operations do not require realtime-ness, and so
>>> faster batch jobs using Hive -> frozen HBase files in HDFS could be
>>> the optimal way to go?
>>
>> In addition to lessening the load on the (perhaps live) RegionServer.
>> There's no Jira for this; I'm tempted to open one.
>>
>> On Tue, May 31, 2011 at 5:18 PM, Jason Rutherglen
>> <[email protected]> wrote:
>>>> The Hive-HBase integration allows you to create Hive tables that are
>>>> backed by HBase
>>>
>>> In addition, HBase can be made to go faster for MapReduce jobs if the
>>> HFiles could be used directly in HDFS, rather than proxying through
>>> the RegionServer.
>>>
>>> I'd imagine that join operations do not require realtime-ness, and so
>>> faster batch jobs using Hive -> frozen HBase files in HDFS could be
>>> the optimal way to go?
>>>
>>> On Tue, May 31, 2011 at 1:41 PM, Patrick Angeles <[email protected]>
>>> wrote:
>>>> On Tue, May 31, 2011 at 3:19 PM, Eran Kutner <[email protected]> wrote:
>>>>
>>>>> For my needs I don't really need the general case, but even if I
>>>>> did, I think it can probably be done more simply.
>>>>> The main problem is getting the data from both tables into the same
>>>>> MR job without resorting to lookups. So, lacking the theoretical
>>>>> MultiTableInputFormat, I could copy all the data from both tables
>>>>> into a temp table, appending the source table name to the row keys
>>>>> to make sure there are no conflicts. When all the data from both
>>>>> tables is in the same temp table, run an MR job. For each row, the
>>>>> mapper should emit a key composed of all the values of the join
>>>>> fields in that row (the value can be emitted as is). This will
>>>>> cause all the rows from both tables with the same join-field values
>>>>> to arrive at the reducer together. The reducer can then iterate
>>>>> over them and produce the Cartesian product as needed.
>>>>>
>>>>> I still don't like having to copy all the data into a temp table
>>>>> just because I can't feed two tables into the MR job.
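
A minimal sketch of the mapper and reducer that Eran's temp-table
approach implies, assuming a single join column "f:joinkey" and row
keys prefixed "A:"/"B:" for the two source tables. All class, table,
and column names here are hypothetical:

// Sketch of the MR job Eran describes. Assumes the temp table's row
// keys were prefixed with their source table ("A:" or "B:") and that
// there is a single join column "f:joinkey". Names are hypothetical.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ReduceSideJoin {

  // Emit the join-field value as the key so matching rows from both
  // source tables meet at the same reducer. (Wired up with
  // TableMapReduceUtil.initTableMapperJob over the temp table.)
  static class JoinMapper extends TableMapper<Text, Text> {
    @Override
    protected void map(ImmutableBytesWritable row, Result columns,
        Context ctx) throws IOException, InterruptedException {
      String rowKey = Bytes.toString(row.get()); // e.g. "A:row17"
      byte[] joinVal =
          columns.getValue(Bytes.toBytes("f"), Bytes.toBytes("joinkey"));
      // The row key already carries its source prefix, so it doubles as
      // the tag the reducer needs; a real job would also emit whatever
      // columns the joined output requires.
      ctx.write(new Text(Bytes.toString(joinVal)), new Text(rowKey));
    }
  }

  // Split each key group by source table and emit the Cartesian product.
  static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text joinKey, Iterable<Text> rows, Context ctx)
        throws IOException, InterruptedException {
      List<String> left = new ArrayList<String>();
      List<String> right = new ArrayList<String>();
      for (Text row : rows) {
        if (row.toString().startsWith("A:")) left.add(row.toString());
        else right.add(row.toString());
      }
      for (String l : left)
        for (String r : right)
          ctx.write(joinKey, new Text(l + "|" + r)); // one joined pair
    }
  }
}

Note that the reducer buffers an entire key group in memory before
emitting the cross product, which is exactly where Ted's warning below
about the Cartesian product bites.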
>>>>
>>>> Loading the smaller table into memory is called a map join, versus a
>>>> reduce-side join (a.k.a. common join). One reason to prefer a map
>>>> join is that you avoid the shuffle phase, which potentially involves
>>>> several trips to disk for the intermediate records due to spills,
>>>> plus one trip through the network to get each intermediate KV pair
>>>> to the right reducer. With a map join, everything is local, except
>>>> for the part where you load the small table.
>>>>
>>>>> As Jason Rutherglen mentioned above, Hive can do joins. I don't
>>>>> know if it can do them for HBase, and it will not suit my needs,
>>>>> but it would be interesting to know how it does them, if anyone
>>>>> knows.
>>>>
>>>> The Hive-HBase integration allows you to create Hive tables that are
>>>> backed by HBase. You can do joins on those tables (and also with
>>>> standard Hive tables). It might be worth trying out in your case, as
>>>> it lets you easily see the load characteristics and the job runtime
>>>> without much coding investment.
>>>>
>>>> There are probably some specific optimizations that can be applied
>>>> to your situation, but it's hard to say without knowing your use
>>>> case.
>>>>
>>>> Regards,
>>>>
>>>> - Patrick
>>>>
>>>>> -eran
>>>>>
>>>>> On Tue, May 31, 2011 at 22:02, Ted Dunning <[email protected]> wrote:
>>>>>
>>>>> > The Cartesian product often makes an honest-to-god join not such
>>>>> > a good idea on large data. The common alternative is co-group,
>>>>> > which is basically like doing the hard work of the join but
>>>>> > stopping just before emitting the Cartesian product. This allows
>>>>> > you to inject whatever cleverness you need at that point.
>>>>> >
>>>>> > Common kinds of cleverness include down-sampling of
>>>>> > problematically large sets of candidates.
>>>>> >
>>>>> > On Tue, May 31, 2011 at 11:56 AM, Michael Segel
>>>>> > <[email protected]> wrote:
>>>>> >
>>>>> > > So the underlying problem that the OP was trying to solve was
>>>>> > > how to join two tables from HBase.
>>>>> > > Unfortunately I goofed.
>>>>> > > I gave a quick and dirty solution that is a bit incomplete. The
>>>>> > > row key in the temp table has to be unique, and I forgot about
>>>>> > > the Cartesian product. So my solution wouldn't work in the
>>>>> > > general case.
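
For contrast with the reduce-side sketch above, here is roughly what
the map join Patrick describes looks like when written directly
against HBase rather than through Hive: the small table is loaded into
each map task's memory up front, so the job needs no shuffle or reduce
phase at all. A sketch only, assuming the small table fits in a task's
heap; the table and column names are hypothetical:

// Sketch of a map join as a raw MapReduce job over the big table with
// no reduce phase (setNumReduceTasks(0)). Table and column names are
// hypothetical; assumes the small table fits in a task's heap.
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;

public class MapSideJoinMapper extends TableMapper<Text, Text> {
  private final Map<String, String> smallTable =
      new HashMap<String, String>();

  @Override
  protected void setup(Context ctx) throws IOException {
    // Load the small table once per map task. This is the "except for
    // the part where you load the small table" cost Patrick mentions.
    HTable small = new HTable(ctx.getConfiguration(), "small_table");
    ResultScanner scanner = small.getScanner(new Scan());
    try {
      for (Result r : scanner) {
        String joinVal = Bytes.toString(
            r.getValue(Bytes.toBytes("f"), Bytes.toBytes("joinkey")));
        smallTable.put(joinVal, Bytes.toString(r.getRow()));
      }
    } finally {
      scanner.close();
      small.close();
    }
  }

  @Override
  protected void map(ImmutableBytesWritable row, Result columns,
      Context ctx) throws IOException, InterruptedException {
    String joinVal = Bytes.toString(
        columns.getValue(Bytes.toBytes("f"), Bytes.toBytes("joinkey")));
    String match = smallTable.get(joinVal);
    if (match != null) {
      // The join happens right here in the mapper - no shuffle, no
      // reducer, no intermediate spills.
      ctx.write(new Text(Bytes.toString(row.get())), new Text(match));
    }
  }
}

The trade-off Patrick describes falls out directly: all join work is
local to the mapper, at the cost of re-reading the small table once
per task.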

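And on Lars' point that you need to flush all in-memory data and run a
major compaction before HFiles can be scanned directly: a minimal
sketch of how a client could at least request those two steps,
assuming the 0.90-era HBaseAdmin API ("mytable" is a hypothetical
table name). Both calls return without waiting for the work to finish,
which is exactly Jason's unsolved 'synced cut-off point' problem:

// Sketch only: asking HBase to flush and major-compact a table before
// a job scans its HFiles directly. Assumes the 0.90-era HBaseAdmin
// API; "mytable" is a hypothetical table name.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class FreezeTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    admin.flush("mytable");        // ask the servers to flush MemStores
    admin.majorCompact("mytable"); // request a major compaction
    // Both calls are asynchronous requests: they return before the
    // work completes, so a job would still need some way to confirm
    // the compaction finished before trusting the on-disk files.
  }
}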