Thank you for the clarification. You mentioned in your initial response that "Using the input format, unless you override the autosplitting in it, you will get 1 mapper per tablet." Again, pardon the newbie question, but how do I find out whether autosplitting is overridden or not?
Aji

On Tue, Dec 4, 2012 at 8:36 PM, John Vines <[email protected]> wrote:

> Your first two presumptions are correct. You will get 3 mappers, and each
> mapper will have data for only one tablet.
>
> Each mapper will function exactly as a scanner over the range of the
> tablet, so you will get things in lexicographical order. So the mapper for
> tablet A will get all items for rowA, in order, before getting items for rowB.
>
> John
>
>
> On Tue, Dec 4, 2012 at 6:55 PM, Aji Janis <[email protected]> wrote:
>
>> Thank you, John, for your response. I do have a few follow-up questions.
>> Let me use a better example. Let's say my table and tablet server
>> distributions are as follows:
>>
>> ---------------------------------------------
>> MyTable:
>>
>> rowA | f1 | q1 | v1
>> rowA | f2 | q2 | v2
>> rowA | f3 | q3 | v3
>>
>> rowB | f1 | q1 | v1
>> rowB | f1 | q2 | v2
>>
>> rowC | f1 | q1 | v1
>>
>> rowD | f1 | q1 | v1
>> rowD | f1 | q2 | v2
>>
>> rowE | f1 | q1 | v1
>>
>> ---------------------------------------------
>>
>> TabletServer1: Tablet A: rowA, rowC
>> TabletServer2: Tablet B: rowB
>> TabletServer2: Tablet C: rowD
>>
>> --------------------------------------------
>>
>> In this example, suppose I have a MapReduce job that reads from the table
>> above and writes to the table MyTable2 using
>> org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat
>> and org.apache.accumulo.core.client.mapreduce.AccumuloOutputFormat.
>>
>> Let's not focus on what the MapReduce job itself does. From your
>> explanation below, it sounds like if autosplitting is not overridden, then
>> we get three mappers total. Is that right?
>>
>> Further, would I be right in assuming that a mapper will NOT get data
>> from multiple tablets?
>>
>> I am also very confused about what the order of input to the mapper
>> will be.
>> Would mapper_at_tabletA get:
>> - all data from rowA before it gets all data from rowC, or
>> - all data from rowC before it gets all data from rowA, or
>> - something interleaved, like:
>> rowA | f1 | q1 | v1
>> rowA | f2 | q2 | v2
>> rowC | f1 | q1 | v1
>> rowA | f3 | q3 | v3
>>
>> I know these are a lot of questions, but I would really like to get a
>> good understanding of the architecture. Thank you!
>> Aji
>>
>>
>>
>> On Tue, Dec 4, 2012 at 5:45 PM, John Vines <[email protected]> wrote:
>>
>>> A tablet consists of both an in-memory portion and zero to many files in
>>> HDFS. Each file may be one or many HDFS blocks. Accumulo gets a
>>> performance boost from the natural locality you get when you write data
>>> to HDFS, but if a tablet migrates, that locality could be lost until the
>>> data is compacted (rewritten). Locality could also be retained thanks to
>>> data replication, but Accumulo does not make extraordinary efforts to
>>> hold on to a little bit of locality, as the data will eventually be
>>> rewritten and locality restored.
>>>
>>> As for your example, if all data for a given row is inserted at the same
>>> time, then it is guaranteed to be in the same file. There is no atomicity
>>> guarantee regarding HDFS blocks, though, so depending on the block size
>>> and the amount of data in the file (and its distribution), it is possible
>>> for a few entries to span block boundaries even though they are adjacent.
>>>
>>> Using the input format, unless you override the autosplitting in it, you
>>> will get 1 mapper per tablet. If you disable auto-splitting, then you get
>>> one mapper per range you specify.
>>>
>>> Hope this helps; let me know if you have other questions or need
>>> clarification.
>>>
>>> John
>>>
>>>
>>>
>>> On Tue, Dec 4, 2012 at 5:21 PM, Aji Janis <[email protected]> wrote:
>>>
>>>> NOTE: I am fairly sure this hasn't been asked on here yet. My apologies
>>>> if it was already asked, in which case please forward me a link to the
>>>> answers. Thank you.
>>>>
>>>> If my environment setup is as follows:
>>>> - 64MB HDFS block size
>>>> - 5 tablet servers
>>>> - 10 tablets of size 1GB each per tablet server
>>>>
>>>> and I have a table like the one below:
>>>>
>>>> rowA | f1 | q1 | v1
>>>> rowA | f1 | q2 | v2
>>>>
>>>> rowB | f1 | q1 | v3
>>>>
>>>> rowC | f1 | q1 | v4
>>>> rowC | f2 | q1 | v5
>>>> rowC | f3 | q3 | v6
>>>>
>>>> From the little documentation available, I know all data about rowA
>>>> will go to one tablet, which may or may not contain data about other
>>>> rows, i.e., it's all or none. So my questions are:
>>>>
>>>> How are the tablets mapped to a DataNode or HDFS block? Obviously, one
>>>> tablet is split into multiple HDFS blocks (16 in this case), so would
>>>> they be stored on the same or different DataNode(s), or does it not
>>>> matter?
>>>>
>>>> In the example above, would all data about rowC (or rowA or rowB) go
>>>> onto the same HDFS block or different HDFS blocks?
>>>>
>>>> When executing a MapReduce job, how many mappers would I get? (One per
>>>> HDFS block? Per tablet? Per server?)
>>>>
>>>> Thank you in advance for any and all suggestions.
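On the ordering question in the middle of the thread: since each mapper behaves as a scanner over its tablet's range, the mapper for Tablet A sees every rowA entry, in order, before any rowC entry; interleaving cannot happen. A minimal sketch of that sorted-scan behavior, using a plain sorted map of "row/family/qualifier" strings as a stand-in for a tablet's sorted key space (not the real Accumulo Key/Value API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// A tablet stores its entries sorted by key, so a mapper scanning
// Tablet A (rowA, rowC) sees every rowA entry before any rowC entry,
// no matter what order the data was inserted in.
public class TabletScanOrder {
    // A scan simply walks the tablet's keys in sorted order.
    public static List<String> scan(TreeMap<String, String> tablet) {
        return new ArrayList<>(tablet.keySet());
    }

    public static void main(String[] args) {
        TreeMap<String, String> tabletA = new TreeMap<>();
        // Inserted out of order on purpose.
        tabletA.put("rowC/f1/q1", "v1");
        tabletA.put("rowA/f3/q3", "v3");
        tabletA.put("rowA/f1/q1", "v1");
        tabletA.put("rowA/f2/q2", "v2");
        // Prints [rowA/f1/q1, rowA/f2/q2, rowA/f3/q3, rowC/f1/q1]
        System.out.println(scan(tabletA));
    }
}
```

So of the three options listed above, it is the first one: all of rowA, then all of rowC.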

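On the autosplitting question at the top of the thread: range auto-adjusting is on by default in the input format, so it is only "overridden" if your job-setup code explicitly disables it (in the 1.4-era API that call lives on InputFormatBase as disableAutoAdjustRanges; treat the exact method name as an assumption and check it against the javadoc for your version). The resulting mapper counts, plus the block arithmetic from the original question, can be sketched as:

```java
// Sketch of the split rule John describes: with auto-adjusting on
// (the default), supplied ranges are split at tablet boundaries and
// you get one mapper per tablet covered; with it disabled, you get
// exactly one mapper per range you supplied.
public class SplitMath {
    public static int mappers(boolean autoAdjust, int tabletsCovered, int ranges) {
        return autoAdjust ? tabletsCovered : ranges;
    }

    public static void main(String[] args) {
        int tabletSizeMb = 1024;    // 1GB tablets, from the question
        int blockSizeMb = 64;       // 64MB HDFS blocks
        int tabletServers = 5;
        int tabletsPerServer = 10;

        // 1024 / 64 = 16 HDFS blocks per tablet (not 8).
        int blocksPerTablet = tabletSizeMb / blockSizeMb;
        int totalTablets = tabletServers * tabletsPerServer;    // 50

        System.out.println("HDFS blocks per tablet: " + blocksPerTablet);
        // Scanning the whole table with defaults: one mapper per tablet.
        System.out.println("Mappers, default: " + mappers(true, totalTablets, 1));
        // Auto-adjust disabled with 3 explicit ranges: 3 mappers.
        System.out.println("Mappers, 3 fixed ranges: " + mappers(false, totalTablets, 3));
    }
}
```

Mapper count follows tablets, not HDFS blocks or servers, which also answers the last question in the original post.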