Keith,

You hit the problem that I purposely didn't ask about:

- AccumuloInputFormat doesn't support multiple tables at this point, and
- I can't run three mappers in parallel on different tables and combine/send their output to a reducer (that I know of).

If all three tables had the same rowid (e.g. rowA exists in tables 1, 2 and 3), then we can write the row from each table with a different family/qualifier/value to a new table. So it will be three mappers run sequentially, and the end result is a join... this is the best I have come up with so far. If rowids are different across the three tables, then I would have to reformat (normalize) the rowids from all three tables prior to writing the fourth/final table.

Is calling a scanner on the other two tables from within a mapper (that takes the first table as the input) bad? Any clues on how that could be done in mapreduce?

On Wed, Apr 17, 2013 at 10:59 AM, Keith Turner wrote:
> If I am understanding you correctly, you are proposing for each row a
> mapper gets to look that row up in two other tables? This would
> result in a lot of little round trip RPC calls and random disk
> accesses.
>
> I think a better solution would be to read all three tables into your
> mappers, and do the join in the reduce. This solution will avoid all
> of the little RPC calls and do lots of sequential I/O instead of
> random accesses. Between the map and reduce, you could track which
> table each row came from. Any filtering could be done in the mapper
> or by iterators. Unfortunately Accumulo does not have the needed
> input format for this out of the box. There is a ticket,
> ACCUMULO-391.
>
> On Tue, Apr 16, 2013 at 5:28 PM, Aji Janis wrote:
> > Hello,
> >
> > I am interested in learning what the best solution/practices might be to
> > join 3 accumulo tables by running a map reduce job. Interested in getting
> > feedback on best practices and such.
> > Here's pseudocode of what I want to accomplish:
> >
> > AccumuloInputFormat accepts tableA
> > Global variable <table_list> has table names: tableB, tableC
> >
> > In a mapper, for example, you would do something like this:
> >
> > for each row in TableA {
> >   if (row.family == "abc" && row.qualifier == "xyz") value = getValue()
> >   if (foundvalue) {
> >
> >     for each table in table_list
> >       scan table with (this rowid && family = "def")
> >       for each entry found in scan
> >         write to final_table (rowid, value_as_family,
> >           tablename_as_qualifier, entry_as_value_string)
> >
> >   }//end if foundvalue
> >
> > }//end for loop
> >
> > This is a simple version of what I want to do. In my non-mapreduce java
> > code I would do this by using a different scanner per table in the list.
> > Couple questions:
> >
> > - how bad/good is performance when using scanners within mappers?
> > - if I get one mapper per range in tableA, do I reset scanners? how? or
> >   would I set up a scanner in the setup() of mapper? --> i have no clue
> >   how this will play out so thinking out loud here.
> > - any optimization suggestions? or examples of creating
> >   join_tables/indexes out there that I can refer to?
> >
> > Thank you for all suggestions.
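
The reduce-side join suggested in the quoted reply (read all three tables in the mappers, tag each row with the table it came from, join in the reduce) can be sketched roughly as follows. This is only an illustration of the idea, not working Accumulo or Hadoop code: plain Java collections stand in for the table scans and for the shuffle, and the class name, method names, table names and row data are all made up for the example. It also assumes the rowids already match across tables and that each table contributes at most one value per rowid.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: in-memory maps stand in for Accumulo tables and for the
// MapReduce shuffle. None of these names come from Accumulo or Hadoop.
class ReduceSideJoinSketch {

    /** A value tagged with the name of the table it was read from. */
    static final class Tagged {
        final String table;
        final String value;
        Tagged(String table, String value) {
            this.table = table;
            this.value = value;
        }
    }

    /** "Map" step: for every row of one table, emit rowid -> (table, value). */
    static void map(String table, Map<String, String> rows,
                    Map<String, List<Tagged>> shuffle) {
        for (Map.Entry<String, String> row : rows.entrySet()) {
            shuffle.computeIfAbsent(row.getKey(), k -> new ArrayList<>())
                   .add(new Tagged(table, row.getValue()));
        }
    }

    /** "Reduce" step: keep only rowids seen in every table (an inner join). */
    static Map<String, Map<String, String>> reduce(
            Map<String, List<Tagged>> shuffle, int tableCount) {
        Map<String, Map<String, String>> joined = new TreeMap<>();
        for (Map.Entry<String, List<Tagged>> group : shuffle.entrySet()) {
            // Use the table name as the "qualifier" of the joined row.
            Map<String, String> byTable = new TreeMap<>();
            for (Tagged t : group.getValue()) {
                byTable.put(t.table, t.value);
            }
            if (byTable.size() == tableCount) {
                joined.put(group.getKey(), byTable);
            }
        }
        return joined;
    }

    /** Runs the map step over every table, then the reduce step once. */
    static Map<String, Map<String, String>> join(
            Map<String, Map<String, String>> tables) {
        Map<String, List<Tagged>> shuffle = new TreeMap<>();
        for (Map.Entry<String, Map<String, String>> t : tables.entrySet()) {
            map(t.getKey(), t.getValue(), shuffle);
        }
        return reduce(shuffle, tables.size());
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> tables = new TreeMap<>();
        tables.put("tableA", Map.of("rowA", "a1", "rowB", "a2"));
        tables.put("tableB", Map.of("rowA", "b1"));
        tables.put("tableC", Map.of("rowA", "c1", "rowB", "c2"));
        System.out.println(join(tables));
    }
}
```

In this toy run only rowA appears in all three tables, so it is the only joined row; rowB is dropped because tableB has no entry for it. In a real job the tagging would happen in the mapper output value and the grouping would be done by the framework's shuffle, which is what avoids the per-row scanner lookups asked about above.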
