Thank you, John, for your response. I do have a few follow-up questions. Let me use a better example. Let's say my table and tablet server distributions are as follows:
---------------------------------------------
MyTable:
rowA | f1 | q1 | v1
rowA | f2 | q2 | v2
rowA | f3 | q3 | v3
rowB | f1 | q1 | v1
rowB | f1 | q2 | v2
rowC | f1 | q1 | v1
rowD | f1 | q1 | v1
rowD | f1 | q2 | v2
rowE | f1 | q1 | v1
---------------------------------------------
TabletServer1: Tablet A: rowA, rowC
TabletServer2: Tablet B: rowB
TabletServer2: Tablet C: rowD
---------------------------------------------

In this example, suppose I have a MapReduce job that reads from the table above and writes to the MyTable2 table using org.apache.accumulo.core.client.mapreduce.*AccumuloInputFormat* and org.apache.accumulo.core.client.mapreduce.*AccumuloOutputFormat*. Let's not focus on what the MapReduce job itself does.

From your explanation below, it sounds like if autosplitting is not overridden, then we get *three mappers* total. Is that right? Further, am I right in assuming that a mapper will NOT get data from multiple tablets?

I am also confused about what the *order of input to the mapper* will be. Would the mapper for Tablet A get
- all data from rowA before it gets all data from rowC, or
- all data from rowC before it gets all data from rowA, or
- something interleaved, like:
  rowA | f1 | q1 | v1
  rowA | f2 | q2 | v2
  rowC | f1 | q1 | v1
  rowA | f3 | q3 | v3

I know these are a lot of questions, but I would really like to get a good understanding of the architecture. Thank you!

Aji

On Tue, Dec 4, 2012 at 5:45 PM, John Vines <[email protected]> wrote:

> A tablet consists of both an in-memory portion and 0 to many files in
> HDFS. Each file may be one or many HDFS blocks. Accumulo gets a performance
> boost from the natural locality you get when you write data to HDFS, but if
> a tablet migrates, that locality could be lost until data is compacted
> (rewritten). Some locality could be retained due to data replication, but
> Accumulo does not make extraordinary efforts to regain a little bit of
> locality, as data will eventually be rewritten and locality restored.
>
> As for your example, if all data for a given row is inserted at the same
> time, then it is guaranteed to be in the same file. There is no atomicity
> guarantee regarding HDFS blocks, though, so depending on the block size and
> the amount of data in the file (and its distribution), it is possible for
> a few entries to span files even though they are adjacent.
>
> Using the input format, unless you override the autosplitting in it, you
> will get 1 mapper per tablet. If you disable auto-splitting, then you get
> one mapper per range you specify.
>
> Hope this helps, let me know if you have other questions or need
> clarification.
>
> John
>
>
> On Tue, Dec 4, 2012 at 5:21 PM, Aji Janis <[email protected]> wrote:
>
>> NOTE: I am fairly sure this hasn't been asked on here yet - my apologies
>> if it was already asked, in which case please forward me a link to the
>> answers. Thank you.
>>
>> If my environment setup is as follows:
>> - 64MB HDFS block
>> - 5 tablet servers
>> - 10 tablets of size 1GB each per tablet server
>>
>> and I have a table like below:
>> rowA | f1 | q1 | v1
>> rowA | f1 | q2 | v2
>>
>> rowB | f1 | q1 | v3
>>
>> rowC | f1 | q1 | v4
>> rowC | f2 | q1 | v5
>> rowC | f3 | q3 | v6
>>
>> From the little documentation available, I know all data about rowA will
>> go to one tablet, which may or may not contain data about other rows,
>> i.e., it's all or none. So my questions are:
>>
>> How are the tablets mapped to a DataNode or HDFS block? Obviously, one
>> tablet is split into multiple HDFS blocks (16 in this case, at 1GB per
>> tablet and 64MB per block), so would they be stored on the same or
>> different DataNode(s), or does it not matter?
>>
>> In the example above, would all data about rowC (or rowA or rowB) go onto
>> the same HDFS block or different HDFS blocks?
>>
>> When executing a MapReduce job, how many mappers would I get? (One per
>> HDFS block? Per tablet? Per server?)
>>
>> Thank you in advance for any and all suggestions.
>>
>
>
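To make John's "1 mapper per tablet" point concrete, here is a small, self-contained Java sketch. It is a hypothetical simulation, not Accumulo client code: the class and method names (TabletMapperSim, tabletFor, mapperCount, mapperInput) are made up for illustration. It models tablets as contiguous row ranges defined by split points (which is how Accumulo actually partitions rows; the split points rowB and rowD below are illustrative and differ from the tablet layout in the question), counts one mapper per non-empty tablet, and shows that within a single tablet a mapper's input arrives in sorted key order, since Accumulo stores and scans keys sorted.

```java
import java.util.*;

// Hypothetical simulation, not Accumulo client code: tablets hold contiguous,
// sorted row ranges defined by split points, and the input format (with
// default autosplitting) yields one mapper per tablet.
public class TabletMapperSim {

    // A row key belongs to the first tablet whose end row (split point,
    // inclusive) is >= the key; keys past the last split go to the last tablet.
    static String tabletFor(NavigableSet<String> splits, String row) {
        String end = splits.ceiling(row);
        return end == null ? "tablet(+inf)" : "tablet(<=" + end + ")";
    }

    // One mapper per tablet that holds at least one entry.
    static int mapperCount(NavigableSet<String> splits, List<String> rowKeys) {
        Set<String> tablets = new HashSet<>();
        for (String row : rowKeys) {
            tablets.add(tabletFor(splits, row));
        }
        return tablets.size();
    }

    // Entries a single mapper would see, in sorted key order (Accumulo scans
    // return keys sorted, so all rowA entries precede all rowB entries, etc.).
    static List<String> mapperInput(NavigableSet<String> splits,
                                    List<String> entries, String tablet) {
        List<String> out = new ArrayList<>();
        for (String e : entries) {
            String row = e.split(" \\| ")[0];
            if (tabletFor(splits, row).equals(tablet)) {
                out.add(e);
            }
        }
        Collections.sort(out);
        return out;
    }

    public static void main(String[] args) {
        // Illustrative split points rowB and rowD give three tablets:
        // (-inf, rowB], (rowB, rowD], (rowD, +inf).
        NavigableSet<String> splits = new TreeSet<>(Arrays.asList("rowB", "rowD"));
        List<String> entries = Arrays.asList(
            "rowA | f1 | q1 | v1", "rowA | f2 | q2 | v2", "rowA | f3 | q3 | v3",
            "rowB | f1 | q1 | v1", "rowB | f1 | q2 | v2",
            "rowC | f1 | q1 | v1",
            "rowD | f1 | q1 | v1", "rowD | f1 | q2 | v2",
            "rowE | f1 | q1 | v1");
        List<String> rowKeys = new ArrayList<>();
        for (String e : entries) {
            rowKeys.add(e.split(" \\| ")[0]);
        }

        // Nine entries, but only three tablets -> three mappers.
        System.out.println("mappers: " + mapperCount(splits, rowKeys));

        // The mapper for the first tablet sees rowA and rowB entries, sorted.
        for (String e : mapperInput(splits, entries, "tablet(<=rowB)")) {
            System.out.println(e);
        }
    }
}
```

Note the design choice mirrored from Accumulo: because tablets are keyed on sorted, contiguous row ranges, a mapper never receives data from two tablets, and within its tablet the rows it sees are never interleaved out of key order.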
