On Thu, Dec 6, 2012 at 2:32 PM, Aji Janis <[email protected]> wrote:

> Thank you for the clarification. You mentioned that "Using the input
> format, unless you override the autosplitting in it, you will get 1
> mapper per tablet." in your initial response. Again, pardon me for the
> newbie question, but how do I find out if autosplitting is overridden
> or not?

You override autosplitting by calling
AccumuloInputFormat.disableAutoAdjustRanges(Configuration). So if you
haven't called that anywhere, the job will fit mappers to tablets.
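The quickest check is to grep your job setup code for that call. For
reference, here is a rough sketch of a typical setup against the
1.4-era API (aside from disableAutoAdjustRanges itself, which came up
earlier in this thread, the configurator calls, instance name,
zookeeper host, credentials, and table name are illustrative
placeholders, not something from this thread):

    import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat;
    import org.apache.accumulo.core.security.Authorizations;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    Job job = new Job(new Configuration(), "mytable-job");
    Configuration conf = job.getConfiguration();

    job.setInputFormatClass(AccumuloInputFormat.class);
    // NOTE: illustrative values; substitute your own instance,
    // zookeepers, credentials, and scan authorizations.
    AccumuloInputFormat.setZooKeeperInstance(conf, "myInstance",
        "zkhost:2181");
    AccumuloInputFormat.setInputInfo(conf, "user", "secret".getBytes(),
        "MyTable", new Authorizations());

    // Default behavior: ranges are auto-adjusted to tablet boundaries,
    // giving one mapper per tablet. Uncommenting the next line overrides
    // that, giving one mapper per Range passed to setRanges(...) instead.
    // AccumuloInputFormat.disableAutoAdjustRanges(conf);

If that last call appears nowhere in your code (or is commented out, as
above), autosplitting has not been overridden and you will get one
mapper per tablet.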
Billie

> Aji
>
> On Tue, Dec 4, 2012 at 8:36 PM, John Vines <[email protected]> wrote:
>
>> Your first two presumptions are correct. You will get 3 mappers, and
>> each mapper will have data for only one tablet.
>>
>> Each mapper will function exactly as a scanner for the range of the
>> tablet, so you will get things in lexicographical order. So the mapper
>> for tablet A will get all items for rowA in order before getting items
>> for rowC.
>>
>> John
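To make John's point above concrete: with AccumuloInputFormat, each
mapper is handed Key/Value pairs for exactly its tablet's range, already
sorted. A minimal sketch (the class name and destination table are
placeholders, and the body just copies entries across unchanged):

    import java.io.IOException;

    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Input K/V types come from AccumuloInputFormat; the output types
    // (Text = destination table name, Mutation = the writes) feed
    // AccumuloOutputFormat.
    public class TabletMapper extends Mapper<Key, Value, Text, Mutation> {
        @Override
        protected void map(Key key, Value value, Context context)
                throws IOException, InterruptedException {
            // In the example quoted below, the mapper covering Tablet A
            // sees rowA f1:q1, rowA f2:q2, rowA f3:q3, then rowC f1:q1,
            // never interleaved and never entries from another tablet.
            Mutation m = new Mutation(key.getRow());
            m.put(key.getColumnFamily(), key.getColumnQualifier(), value);
            context.write(new Text("MyTable2"), m);
        }
    }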
>> On Tue, Dec 4, 2012 at 6:55 PM, Aji Janis <[email protected]> wrote:
>>
>>> Thank you, John, for your response. I do have a few follow-up
>>> questions. Let me use a better example. Let's say my table and tablet
>>> server distributions are as follows:
>>>
>>> ---------------------------------------------
>>> MyTable:
>>>
>>> rowA | f1 | q1 | v1
>>> rowA | f2 | q2 | v2
>>> rowA | f3 | q3 | v3
>>>
>>> rowB | f1 | q1 | v1
>>> rowB | f1 | q2 | v2
>>>
>>> rowC | f1 | q1 | v1
>>>
>>> rowD | f1 | q1 | v1
>>> rowD | f1 | q2 | v2
>>>
>>> rowE | f1 | q1 | v1
>>>
>>> ---------------------------------------------
>>>
>>> TabletServer1: Tablet A: rowA, rowC
>>> TabletServer2: Tablet B: rowB
>>> TabletServer2: Tablet C: rowD
>>>
>>> --------------------------------------------
>>>
>>> In this example, suppose I have a MapReduce job that reads from the
>>> table above and writes to a table MyTable2, using
>>> org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat and
>>> org.apache.accumulo.core.client.mapreduce.AccumuloOutputFormat.
>>>
>>> Let's not focus on what the MapReduce job itself does. From your
>>> explanation below, it sounds like if autosplitting is not overridden,
>>> then we get three mappers total. Is that right?
>>>
>>> Further, am I right in assuming that a mapper will NOT get data from
>>> multiple tablets?
>>>
>>> I am also very confused about what the order of input to the mapper
>>> will be. Would mapper_at_tabletA get
>>> - all data from rowA before it gets all data from rowC, or
>>> - all data from rowC before it gets all data from rowA, or
>>> - something like:
>>> rowA | f1 | q1 | v1
>>> rowA | f2 | q2 | v2
>>> rowC | f1 | q1 | v1
>>> rowA | f3 | q3 | v3
>>>
>>> I know these are a lot of questions, but I would really like to get a
>>> good understanding of the architecture. Thank you!
>>> Aji
>>>
>>> On Tue, Dec 4, 2012 at 5:45 PM, John Vines <[email protected]> wrote:
>>>
>>>> A tablet consists of both an in-memory portion and zero to many
>>>> files in HDFS. Each file may be one or many HDFS blocks. Accumulo
>>>> gets a performance boost from the natural locality you get when you
>>>> write data to HDFS, but if a tablet migrates, that locality could be
>>>> lost until the data is compacted (rewritten). Some locality could be
>>>> retained thanks to data replication, but Accumulo does not go to
>>>> extraordinary effort to hold on to a little bit of locality, as the
>>>> data will eventually be rewritten and locality restored.
>>>>
>>>> As for your example, if all data for a given row is inserted at the
>>>> same time, then it is guaranteed to be in the same file. There is no
>>>> such guarantee regarding HDFS blocks, though, so depending on the
>>>> block size and the amount of data in the file (and its
>>>> distribution), it is possible for a few adjacent entries to span a
>>>> block boundary.
>>>>
>>>> Using the input format, unless you override the autosplitting in it,
>>>> you will get one mapper per tablet. If you disable auto-splitting,
>>>> then you get one mapper per range you specify.
>>>>
>>>> Hope this helps; let me know if you have other questions or need
>>>> clarification.
>>>>
>>>> John
>>>>
>>>> On Tue, Dec 4, 2012 at 5:21 PM, Aji Janis <[email protected]> wrote:
>>>>
>>>>> NOTE: I am fairly sure this hasn't been asked on here yet; my
>>>>> apologies if it was already asked, in which case please forward me
>>>>> a link to the answers. Thank you.
>>>>>
>>>>> Suppose my environment setup is as follows:
>>>>> - 64MB HDFS block size
>>>>> - 5 tablet servers
>>>>> - 10 tablets of size 1GB each per tablet server
>>>>>
>>>>> And suppose I have a table like below:
>>>>> rowA | f1 | q1 | v1
>>>>> rowA | f1 | q2 | v2
>>>>>
>>>>> rowB | f1 | q1 | v3
>>>>>
>>>>> rowC | f1 | q1 | v4
>>>>> rowC | f2 | q1 | v5
>>>>> rowC | f3 | q3 | v6
>>>>>
>>>>> From the little documentation I have found, I know all data about
>>>>> rowA goes to one tablet, which may or may not contain data about
>>>>> other rows, i.e., it's all or none. So my questions are:
>>>>>
>>>>> How are the tablets mapped to a datanode or HDFS block? Obviously,
>>>>> one tablet is split into multiple HDFS blocks (16 in this case), so
>>>>> would they be stored on the same or different datanode(s), or does
>>>>> it not matter?
>>>>>
>>>>> In the example above, would all the data about rowC (or rowA or
>>>>> rowB) go onto the same HDFS block or different HDFS blocks?
>>>>>
>>>>> When executing a MapReduce job, how many mappers would I get? (One
>>>>> per HDFS block? Per tablet? Per server?)
>>>>>
>>>>> Thank you in advance for any and all suggestions.
