My previous post had some stale text after my signature - sorry. Reposting after chopping the stale text off.
Turns out that my assumption of tables being partitioned the same way may be too restrictive. I need to account for join partitions not being co-located on the same tablet server. So the CompositeInputFormat is not applicable as I'd initially thought. That said, I hadn't gotten very far with it, and in particular, couldn't for the life of me figure out how to configure the mapred.join.expr to work on Accumulo's rfile directory structure. I ended up extending AccumuloInputFormat to do the join. The record reader would read table A using AccumuloInputFormat's scannerIterator and issue BatchScanner lookups to get table B's matching records, similar to Keith's suggestion above. Thanks, Ameet On Wed, Oct 17, 2012 at 10:10 AM, ameet kini <[email protected]> wrote: > > Turns out that my assumption of tables being partitioned the same way may > be too restrictive. I need to account for join partitions not being > co-located on the same tablet server. So the CompositeInputFormat is not > applicable as I'd initially thought. That said, I hadn't gotten very far > with it, and in particular, couldn't for the life of me figure out how to > configure the mapred.join.expr to work on Accumulo's rfile directory > structure. > > I ended up extending AccumuloInputFormat to do the join. The record reader > would read table A using AccumuloInputFormat's scannerIterator and issue > BatchScanner lookups to get table B's matching records, similar to Keith's > suggestion above. > > Thanks, > Ameet > > > > > > > On Thu, Oct 11, 2012 at 2:57 PM, Billie Rinaldi <[email protected]> wrote: > >> On Wed, Oct 10, 2012 at 7:22 AM, ameet kini <[email protected]> wrote: >> >>> I have a related problem where I need to do a 1-1 join (every row in >>> table A joins with a unique row in table B and vice versa). My join >>> key is the row id of the table. In the past, I've used Hadoop's >>> CompositeInputFormat to do a map-side join over data in HDFS >>> (described here >>> http://www.congiu.com/joins-in-hadoop-using-compositeinputformat/) My >>> tables in Accumulo seem to fit the eligibility criteria of >>> CompositeInputFormat: both tables are sorted by the join key, since >>> the join key is the row id in my case, and the tables are partitioned >>> the same way (i.e., same split points). >>> >>> Has anyone tried using CompositeInputFormat over Accumulo tables? Is >>> it possible to configure CompositeInputFormat with >>> AccumuloInputFormat? >>> >> >> I haven't tried it. If you do, let us know how it works out. >> >> Billie >> >> >>> >>> Thanks, >>> Ameet >>> >>> >>> On Tue, Aug 21, 2012 at 8:23 AM, Keith Turner <[email protected]> wrote: >>> > Yeah, that would certainly work. >>> > >>> > You could run two map only jobs (could run concurrently). A job that >>> > reads D1 and writes to Table3 and a job that reads D2 and writes >>> > Table3. Map reduce may be faster, unless you want the final result >>> > in Accumulo in which case this may be faster. The two map reduce jobs >>> > could also produce files to bulk import into table3. >>> > >>> > Keith >>> > >>> > On Mon, Aug 20, 2012 at 8:26 PM, David Medinets >>> > <[email protected]> wrote: >>> >> Can you use a new table to join and then scan the new table? Use the >>> foreign >>> >> key as the rowid. Basically create your own materialized view. >>> >> >> >
