Re: joining accumulo tables with mapreduce

David Medinets Wed, 17 Apr 2013 18:03:44 -0700

Consider using Pig to perform the join. There is an Accumulo-Pig GitHub
project. You can load all three tables and then join them fairly easily. Pig
basically writes the M/R jobs for you.


Using a common row value, I've run many M/R jobs in parallel to load data
into an Accumulo table, which creates an effective join. This technique was
fast enough for my particular project. Its effectiveness depends on many
variables.
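
To illustrate the common-row-value idea, here is a minimal Java sketch. It is
not Accumulo code: an in-memory TreeMap stands in for the target table, and
plain threads stand in for the parallel M/R jobs. The names RowJoinSketch,
Table, load and joinAll, and the sample rows, are hypothetical and exist only
for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class RowJoinSketch {

    // Stand-in for the target table (NOT the Accumulo API):
    // row ID -> (column name -> value), kept sorted by row ID.
    static final class Table {
        final Map<String, Map<String, String>> rows = new TreeMap<>();

        // synchronized so that concurrent loaders can write safely
        synchronized void put(String row, String column, String value) {
            rows.computeIfAbsent(row, r -> new TreeMap<>()).put(column, value);
        }
    }

    // Hypothetical loader, one per source table. Each value is written under
    // its original row ID, with the source table's name as the column.
    static void load(String sourceName, Map<String, String> source, Table target) {
        source.forEach((row, value) -> target.put(row, sourceName, value));
    }

    // Runs one loader thread per source and returns the combined rows.
    static Map<String, Map<String, String>> joinAll() {
        // Made-up sample data: source table name -> (row ID -> value)
        Map<String, Map<String, String>> sources = Map.of(
            "tableA", Map.of("row1", "a1", "row2", "a2"),
            "tableB", Map.of("row1", "b1", "row3", "b3"),
            "tableC", Map.of("row1", "c1", "row2", "c2"));

        Table target = new Table();
        List<Thread> loaders = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> s : sources.entrySet()) {
            Thread t = new Thread(() -> load(s.getKey(), s.getValue(), target));
            loaders.add(t);
            t.start();
        }
        for (Thread t : loaders) {
            try {
                t.join(); // wait until every loader has finished
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while loading", e);
            }
        }
        return target.rows;
    }

    public static void main(String[] args) {
        joinAll().forEach((row, columns) -> System.out.println(row + " -> " + columns));
    }
}
```

Because every loader writes under the same row ID, reading one row of the
target returns the values from all sources side by side. A row that is
missing from a source simply lacks that column, so the result behaves like
an outer join.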


On Tue, Apr 16, 2013 at 5:28 PM, Aji Janis <[email protected]> wrote:

> Hello,
>
>  I am interested in learning what the best solution/practices might be to
> join 3 Accumulo tables by running a MapReduce job. Interested in getting
> feedback on best practices and such. Here is pseudocode for what I want to
> accomplish:
>
>
> AccumuloInputFormat accepts tableA
> Global variable <table_list> has table names: tableB, tableC
>
> In a mapper, for example, you would do something like this:
>
> for each row in tableA {
>   if (row.family == "abc" && row.qualifier == "xyz") value = getValue()
>   if (foundvalue) {
>     for each table in table_list
>       scan table with (this rowid && family == "def")
>       for each entry found in scan
>         write to final_table (rowid, value_as_family,
>                               tablename_as_qualifier, entry_as_value_string)
>   } // end if foundvalue
> } // end for loop
>
>
> This is a simple version of what I want to do. In my non-MapReduce Java
> code I would do this by using a different scanner per table in the
> list. A couple of questions:
>
>
> - How bad/good is performance when using scanners within mappers?
> - If I get one mapper per range in tableA, do I reset scanners? How? Or
> would I set up a scanner in the setup() of the mapper? --> I have no clue how
> this will play out, so I'm thinking out loud here.
> - Any optimization suggestions? Or examples of creating
> join_tables/indexes out there that I can refer to?
>
>
> Thank you for all suggestions.
>
