On Thu, Dec 6, 2012 at 2:32 PM, Aji Janis <[email protected]> wrote:

> Thank you for the clarification. You mentioned that "Using the input
> format, unless you override the autosplitting in it, you will get 1
> mapper per tablet." in your initial response. Again, pardon me for the
> newbie question, but how do I find out if autosplitting is overridden
> or not?

You override autosplitting by calling
AccumuloInputFormat.disableAutoAdjustRanges(Configuration). So if you
haven't called that anywhere, the job will fit mappers to tablets.
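The quickest check is to grep your job setup code for that call. For
reference, here is a rough sketch of a typical setup against the
1.4-era API (aside from disableAutoAdjustRanges itself, which came up
earlier in this thread, the configurator calls, instance name,
zookeeper host, credentials, and table name are illustrative
placeholders, not something from this thread):

    import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat;
    import org.apache.accumulo.core.security.Authorizations;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    Job job = new Job(new Configuration(), "mytable-job");
    Configuration conf = job.getConfiguration();

    job.setInputFormatClass(AccumuloInputFormat.class);
    // NOTE: illustrative values; substitute your own instance,
    // zookeepers, credentials, and scan authorizations.
    AccumuloInputFormat.setZooKeeperInstance(conf, "myInstance",
        "zkhost:2181");
    AccumuloInputFormat.setInputInfo(conf, "user", "secret".getBytes(),
        "MyTable", new Authorizations());

    // Default behavior: ranges are auto-adjusted to tablet boundaries,
    // giving one mapper per tablet. Uncommenting the next line overrides
    // that, giving one mapper per Range passed to setRanges(...) instead.
    // AccumuloInputFormat.disableAutoAdjustRanges(conf);

If that last call appears nowhere in your code (or is commented out, as
above), autosplitting has not been overridden and you will get one
mapper per tablet.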
Billie

> Aji
>
> On Tue, Dec 4, 2012 at 8:36 PM, John Vines <[email protected]> wrote:
>
>> Your first two presumptions are correct. You will get 3 mappers, and
>> each mapper will have data for only one tablet.
>>
>> Each mapper will function exactly as a scanner for the range of the
>> tablet, so you will get things in lexicographical order. So the mapper
>> for tablet A will get all items for rowA in order before getting items
>> for rowC.
>>
>> John
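To make John's point above concrete: with AccumuloInputFormat, each
mapper is handed Key/Value pairs for exactly its tablet's range, already
sorted. A minimal sketch (the class name and destination table are
placeholders, and the body just copies entries across unchanged):

    import java.io.IOException;

    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Input K/V types come from AccumuloInputFormat; the output types
    // (Text = destination table name, Mutation = the writes) feed
    // AccumuloOutputFormat.
    public class TabletMapper extends Mapper<Key, Value, Text, Mutation> {
        @Override
        protected void map(Key key, Value value, Context context)
                throws IOException, InterruptedException {
            // In the example quoted below, the mapper covering Tablet A
            // sees rowA f1:q1, rowA f2:q2, rowA f3:q3, then rowC f1:q1,
            // never interleaved and never entries from another tablet.
            Mutation m = new Mutation(key.getRow());
            m.put(key.getColumnFamily(), key.getColumnQualifier(), value);
            context.write(new Text("MyTable2"), m);
        }
    }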
>> On Tue, Dec 4, 2012 at 6:55 PM, Aji Janis <[email protected]> wrote:
>>
>>> Thank you, John, for your response. I do have a few follow-up
>>> questions. Let me use a better example. Let's say my table and tablet
>>> server distributions are as follows:
>>>
>>> ---------------------------------------------
>>> MyTable:
>>>
>>> rowA | f1 | q1 | v1
>>> rowA | f2 | q2 | v2
>>> rowA | f3 | q3 | v3
>>>
>>> rowB | f1 | q1 | v1
>>> rowB | f1 | q2 | v2
>>>
>>> rowC | f1 | q1 | v1
>>>
>>> rowD | f1 | q1 | v1
>>> rowD | f1 | q2 | v2
>>>
>>> rowE | f1 | q1 | v1
>>>
>>> ---------------------------------------------
>>>
>>> TabletServer1: Tablet A: rowA, rowC
>>> TabletServer2: Tablet B: rowB
>>> TabletServer2: Tablet C: rowD
>>>
>>> --------------------------------------------
>>>
>>> In this example, suppose I have a MapReduce job that reads from the
>>> table above and writes to a table MyTable2, using
>>> org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat and
>>> org.apache.accumulo.core.client.mapreduce.AccumuloOutputFormat.
>>>
>>> Let's not focus on what the MapReduce job itself does. From your
>>> explanation below, it sounds like if autosplitting is not overridden,
>>> then we get three mappers total. Is that right?
>>>
>>> Further, am I right in assuming that a mapper will NOT get data from
>>> multiple tablets?
>>>
>>> I am also very confused about what the order of input to the mapper
>>> will be. Would mapper_at_tabletA get
>>> - all data from rowA before it gets all data from rowC, or
>>> - all data from rowC before it gets all data from rowA, or
>>> - something like:
>>> rowA | f1 | q1 | v1
>>> rowA | f2 | q2 | v2
>>> rowC | f1 | q1 | v1
>>> rowA | f3 | q3 | v3
>>>
>>> I know these are a lot of questions, but I would really like to get a
>>> good understanding of the architecture. Thank you!
>>> Aji
>>>
>>> On Tue, Dec 4, 2012 at 5:45 PM, John Vines <[email protected]> wrote:
>>>
>>>> A tablet consists of both an in-memory portion and zero to many
>>>> files in HDFS. Each file may be one or many HDFS blocks. Accumulo
>>>> gets a performance boost from the natural locality you get when you
>>>> write data to HDFS, but if a tablet migrates, that locality could be
>>>> lost until the data is compacted (rewritten). Some locality could be
>>>> retained thanks to data replication, but Accumulo does not go to
>>>> extraordinary effort to hold on to a little bit of locality, as the
>>>> data will eventually be rewritten and locality restored.
>>>>
>>>> As for your example, if all data for a given row is inserted at the
>>>> same time, then it is guaranteed to be in the same file. There is no
>>>> such guarantee regarding HDFS blocks, though, so depending on the
>>>> block size and the amount of data in the file (and its
>>>> distribution), it is possible for a few adjacent entries to span a
>>>> block boundary.
>>>>
>>>> Using the input format, unless you override the autosplitting in it,
>>>> you will get one mapper per tablet. If you disable auto-splitting,
>>>> then you get one mapper per range you specify.
>>>>
>>>> Hope this helps; let me know if you have other questions or need
>>>> clarification.
>>>>
>>>> John
>>>>
>>>> On Tue, Dec 4, 2012 at 5:21 PM, Aji Janis <[email protected]> wrote:
>>>>
>>>>> NOTE: I am fairly sure this hasn't been asked on here yet; my
>>>>> apologies if it was already asked, in which case please forward me
>>>>> a link to the answers. Thank you.
>>>>>
>>>>> Suppose my environment setup is as follows:
>>>>> - 64MB HDFS block size
>>>>> - 5 tablet servers
>>>>> - 10 tablets of size 1GB each per tablet server
>>>>>
>>>>> And suppose I have a table like below:
>>>>> rowA | f1 | q1 | v1
>>>>> rowA | f1 | q2 | v2
>>>>>
>>>>> rowB | f1 | q1 | v3
>>>>>
>>>>> rowC | f1 | q1 | v4
>>>>> rowC | f2 | q1 | v5
>>>>> rowC | f3 | q3 | v6
>>>>>
>>>>> From the little documentation I have found, I know all data about
>>>>> rowA goes to one tablet, which may or may not contain data about
>>>>> other rows, i.e., it's all or none. So my questions are:
>>>>>
>>>>> How are the tablets mapped to a datanode or HDFS block? Obviously,
>>>>> one tablet is split into multiple HDFS blocks (16 in this case), so
>>>>> would they be stored on the same or different datanode(s), or does
>>>>> it not matter?
>>>>>
>>>>> In the example above, would all the data about rowC (or rowA or
>>>>> rowB) go onto the same HDFS block or different HDFS blocks?
>>>>>
>>>>> When executing a MapReduce job, how many mappers would I get? (One
>>>>> per HDFS block? Per tablet? Per server?)
>>>>>
>>>>> Thank you in advance for any and all suggestions.
