Consider using Pig to perform the join. There is an Accumulo-Pig project on GitHub. You can load all three tables and then join them fairly easily. Pig basically writes the M/R jobs for you.
Using a common row value, I've run many M/R jobs in parallel to load data into an Accumulo table, which creates an effective join. This technique was fast enough for my particular project. Its effectiveness depends on many variables.

On Tue, Apr 16, 2013 at 5:28 PM, Aji Janis <[email protected]> wrote:

> Hello,
>
> I am interested in learning what the best solution/practices might be to
> join 3 accumulo tables by running a map reduce job. Interested in getting
> feedback on best practices and such. Heres a pseudo code of what I want to
> accomplish:
>
>
> AccumuloInputFormat accepts tableA
> Global variable <table_list> has table names: tableB, tableC
>
> In a mapper, for example, you would do something like this:
>
> for each row in TableA
>   if (row.family == "abc" && row.qualifier == "xyz") value = getValue()
>   if (foundvalue) {
>
>     for each table in table_list
>       scan table with (this rowid && family = "def")
>       for each entry found in scan
>         write to final_table (rowid, value_as_family,
>           tablename_as_qualifier, entry_as_value_string)
>
>   }//end if foundvalue
>
> }//end for loop
>
>
> This is a simple version of what I want to do. In my non mapreduce java
> code I would do this by calling a using different scanners per table in the
> list. Couple questions:
>
>
> - how bad/good is performance when using scanners withing mappers?
> - if I get one mapper per range in tableA, do I reset scanners? how? or
>   would I set up a scanner in the setup() of mapper ? --> i have no clue how
>   this will play out so thinking out loud here.
> - any optimization suggestions? or examples of creating
>   join_tables/indexes out there that I can refer to?
>
>
> Thank you for all suggestions.
>
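To illustrate the common-row-value load mentioned in the reply, here is a minimal in-memory sketch. It is not Accumulo or MapReduce code: the class name CommonRowJoinSketch, the Cell and JoinedTable types, and the "table:family:qualifier" column naming are all hypothetical stand-ins chosen for this example. The idea it shows is only that when every loader writes its cells under the same row ID, a sorted target table ends up holding the joined record for each row.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CommonRowJoinSketch {

    /** Hypothetical cell type: one key/value entry of a source table. */
    static final class Cell {
        final String row, family, qualifier, value;

        Cell(String row, String family, String qualifier, String value) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
            this.value = value;
        }
    }

    /** In-memory stand-in for the target table: row ID -> (column -> value), both sorted. */
    static final class JoinedTable {
        final TreeMap<String, TreeMap<String, String>> rows = new TreeMap<>();

        // Synchronized so that several loader threads could write to the same table.
        synchronized void put(String row, String column, String value) {
            rows.computeIfAbsent(row, r -> new TreeMap<>()).put(column, value);
        }
    }

    /** Stand-in for one load job: copies a source table's cells under the shared row ID. */
    static void load(String tableName, List<Cell> source, JoinedTable target) {
        for (Cell c : source) {
            // Prefix the column with the source table name (an assumption of this
            // sketch) so that columns from different tables do not collide.
            target.put(c.row, tableName + ":" + c.family + ":" + c.qualifier, c.value);
        }
    }

    /** Runs one load per source table; here sequentially, in practice they could run in parallel. */
    static JoinedTable joinAll(Map<String, List<Cell>> sources) {
        JoinedTable target = new JoinedTable();
        for (Map.Entry<String, List<Cell>> e : sources.entrySet()) {
            load(e.getKey(), e.getValue(), target);
        }
        return target;
    }

    public static void main(String[] args) {
        Map<String, List<Cell>> sources = new TreeMap<>();
        sources.put("tableA", Arrays.asList(new Cell("row1", "abc", "xyz", "v1")));
        sources.put("tableB", Arrays.asList(
                new Cell("row1", "def", "q1", "b1"),
                new Cell("row2", "def", "q1", "b2")));
        sources.put("tableC", Arrays.asList(new Cell("row1", "def", "q2", "c1")));

        JoinedTable joined = joinAll(sources);
        // All columns written under "row1" by the three loads now sit in one row.
        System.out.println(joined.rows.get("row1"));
    }
}
```

In a real deployment each load would presumably be its own M/R job writing mutations to the shared table, and whether that is fast enough depends on data volume, key distribution, and cluster setup.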
