Re: Using Accumulo as input to a MapReduce job frequently hangs due to lost Zookeeper connection

ameet kini Wed, 17 Oct 2012 07:14:19 -0700

My previous post had some stale text after my signature - sorry. Reposting
after chopping the stale text off.


Turns out that my assumption of tables being partitioned the same way may
be too restrictive. I need to account for join partitions not being
co-located on the same tablet server. So the CompositeInputFormat is not
applicable as I'd initially thought. That said, I hadn't gotten very far
with it, and in particular, couldn't for the life of me figure out how to
configure the mapred.join.expr to work on Accumulo's rfile directory
structure.

I ended up extending AccumuloInputFormat to do the join. The record reader
would read table A using AccumuloInputFormat's scannerIterator and issue
BatchScanner lookups to get table B's matching records, similar to Keith's
suggestion above.
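
For readers who want the shape of that lookup join, here is a minimal sketch in plain Java. It does not use any Accumulo classes: a sorted in-memory map stands in for the scan over table A, and a function stands in for the batched lookup against table B. The names LookupJoinSketch, join, inMemoryLookup and batchSize are hypothetical and only illustrate the pattern. In the actual record reader, the iterator would presumably be AccumuloInputFormat's scannerIterator, and the lookup function would wrap a BatchScanner queried with the batched row ids.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

/** Plain-Java model of a batched lookup join keyed on row id (hypothetical names). */
final class LookupJoinSketch {

    /**
     * Streams table A in row order, buffers up to batchSize rows, and asks
     * lookupB for the matching table B rows one batch at a time.
     * Returns rowId -> [valueA, valueB] for every row id present in both tables.
     */
    static Map<String, List<String>> join(
            Iterator<Map.Entry<String, String>> scanA,
            Function<Collection<String>, Map<String, String>> lookupB,
            int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be at least 1");
        }
        Map<String, List<String>> joined = new LinkedHashMap<>();
        Map<String, String> batch = new LinkedHashMap<>();
        while (scanA.hasNext()) {
            Map.Entry<String, String> rowA = scanA.next();
            batch.put(rowA.getKey(), rowA.getValue());
            // Flush when the batch is full or table A is exhausted.
            if (batch.size() == batchSize || !scanA.hasNext()) {
                Map<String, String> rowsB = lookupB.apply(batch.keySet());
                for (Map.Entry<String, String> e : batch.entrySet()) {
                    String valueB = rowsB.get(e.getKey());
                    if (valueB != null) {
                        joined.put(e.getKey(), Arrays.asList(e.getValue(), valueB));
                    }
                }
                batch.clear();
            }
        }
        return joined;
    }

    /** Stand-in for the table B lookup: returns only the row ids found in tableB. */
    static Function<Collection<String>, Map<String, String>> inMemoryLookup(
            Map<String, String> tableB) {
        return rowIds -> {
            Map<String, String> hits = new HashMap<>();
            for (String rowId : rowIds) {
                String value = tableB.get(rowId);
                if (value != null) {
                    hits.put(rowId, value);
                }
            }
            return hits;
        };
    }

    public static void main(String[] args) {
        TreeMap<String, String> tableA = new TreeMap<>();
        tableA.put("r1", "a1");
        tableA.put("r2", "a2");
        Map<String, String> tableB = new HashMap<>();
        tableB.put("r1", "b1");
        tableB.put("r2", "b2");
        System.out.println(join(tableA.entrySet().iterator(), inMemoryLookup(tableB), 100));
    }
}
```

Batching the row ids keeps the number of lookups against table B proportional to the number of batches rather than the number of rows, and it does not depend on the two tables sharing split points or tablet servers.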

Thanks,
Ameet

On Wed, Oct 17, 2012 at 10:10 AM, ameet kini <[email protected]> wrote:

>
> On Thu, Oct 11, 2012 at 2:57 PM, Billie Rinaldi <[email protected]> wrote:
>
>> On Wed, Oct 10, 2012 at 7:22 AM, ameet kini <[email protected]> wrote:
>>
>>> I have a related problem where I need to do a 1-1 join (every row in
>>> table A joins with a unique row in table B and vice versa). My join
>>> key is the row id of the table. In the past, I've used Hadoop's
>>> CompositeInputFormat to do a map-side join over data in HDFS
>>> (described here
>>> http://www.congiu.com/joins-in-hadoop-using-compositeinputformat/). My
>>> tables in Accumulo seem to fit the eligibility criteria of
>>> CompositeInputFormat: both tables are sorted by the join key, since
>>> the join key is the row id in my case, and the tables are partitioned
>>> the same way (i.e., same split points).
>>>
>>> Has anyone tried using CompositeInputFormat over Accumulo tables? Is
>>> it possible to configure CompositeInputFormat with
>>> AccumuloInputFormat?
>>>
>>
>> I haven't tried it.  If you do, let us know how it works out.
>>
>> Billie
>>
>>
>>>
>>> Thanks,
>>> Ameet
>>>
>>>
>>> On Tue, Aug 21, 2012 at 8:23 AM, Keith Turner <[email protected]> wrote:
>>> > Yeah, that would certainly work.
>>> >
>>> > You could run two map-only jobs (could run concurrently).  A job that
>>> > reads D1 and writes to Table3 and a job that reads D2 and writes to
>>> > Table3.  Map reduce may be faster, unless you want the final result
>>> > in Accumulo, in which case this may be faster.  The two map reduce jobs
>>> > could also produce files to bulk import into Table3.
>>> >
>>> > Keith
>>> >
>>> > On Mon, Aug 20, 2012 at 8:26 PM, David Medinets
>>> > <[email protected]> wrote:
>>> >> Can you use a new table to join and then scan the new table? Use the
>>> >> foreign key as the rowid. Basically create your own materialized view.
>>>
>>
>>
>
