Thank you for the clarification. You mentioned in your initial response that "Using the input format, unless you override the autosplitting in it, you will get 1 mapper per tablet." Again, pardon the newbie question, but how do I find out whether autosplitting is overridden or not?
Aji

On Tue, Dec 4, 2012 at 8:36 PM, John Vines <[email protected]> wrote:

> Your first two presumptions are correct. You will get 3 mappers, and each
> mapper will have data for only one tablet.
>
> Each mapper will function exactly as a scanner over the range of the
> tablet, so you will get things in lexicographical order. So the mapper for
> tablet A will get all items for rowA, in order, before getting items for rowB.
>
> John
>
>
> On Tue, Dec 4, 2012 at 6:55 PM, Aji Janis <[email protected]> wrote:
>
>> Thank you, John, for your response. I do have a few follow-up questions.
>> Let me use a better example. Let's say my table and tablet server
>> distributions are as follows:
>>
>> ---------------------------------------------
>> MyTable:
>>
>> rowA | f1 | q1 | v1
>> rowA | f2 | q2 | v2
>> rowA | f3 | q3 | v3
>>
>> rowB | f1 | q1 | v1
>> rowB | f1 | q2 | v2
>>
>> rowC | f1 | q1 | v1
>>
>> rowD | f1 | q1 | v1
>> rowD | f1 | q2 | v2
>>
>> rowE | f1 | q1 | v1
>>
>> ---------------------------------------------
>>
>> TabletServer1: Tablet A: rowA, rowC
>> TabletServer2: Tablet B: rowB
>> TabletServer2: Tablet C: rowD
>>
>> --------------------------------------------
>>
>> In this example, suppose I have a MapReduce job that reads from the table
>> above and writes to the table MyTable2 using
>> org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat
>> and org.apache.accumulo.core.client.mapreduce.AccumuloOutputFormat.
>>
>> Let's not focus on what the MapReduce job itself does. From your
>> explanation below, it sounds like if autosplitting is not overridden, then
>> we get three mappers total. Is that right?
>>
>> Further, would I be right in assuming that a mapper will NOT get data
>> from multiple tablets?
>>
>> I am also very confused about what the order of input to the mapper
>> will be.
>> Would mapper_at_tabletA get:
>> - all data from rowA before it gets all data from rowC, or
>> - all data from rowC before it gets all data from rowA, or
>> - something interleaved, like:
>> rowA | f1 | q1 | v1
>> rowA | f2 | q2 | v2
>> rowC | f1 | q1 | v1
>> rowA | f3 | q3 | v3
>>
>> I know these are a lot of questions, but I would really like to get a
>> good understanding of the architecture. Thank you!
>> Aji
>>
>>
>>
>> On Tue, Dec 4, 2012 at 5:45 PM, John Vines <[email protected]> wrote:
>>
>>> A tablet consists of both an in-memory portion and zero to many files in
>>> HDFS. Each file may be one or many HDFS blocks. Accumulo gets a
>>> performance boost from the natural locality you get when you write data
>>> to HDFS, but if a tablet migrates, that locality could be lost until the
>>> data is compacted (rewritten). Locality could also be retained thanks to
>>> data replication, but Accumulo does not make extraordinary efforts to
>>> hold on to a little bit of locality, as the data will eventually be
>>> rewritten and locality restored.
>>>
>>> As for your example, if all data for a given row is inserted at the same
>>> time, then it is guaranteed to be in the same file. There is no atomicity
>>> guarantee regarding HDFS blocks, though, so depending on the block size
>>> and the amount of data in the file (and its distribution), it is possible
>>> for a few entries to span block boundaries even though they are adjacent.
>>>
>>> Using the input format, unless you override the autosplitting in it, you
>>> will get 1 mapper per tablet. If you disable auto-splitting, then you get
>>> one mapper per range you specify.
>>>
>>> Hope this helps; let me know if you have other questions or need
>>> clarification.
>>>
>>> John
>>>
>>>
>>>
>>> On Tue, Dec 4, 2012 at 5:21 PM, Aji Janis <[email protected]> wrote:
>>>
>>>> NOTE: I am fairly sure this hasn't been asked on here yet. My apologies
>>>> if it was already asked, in which case please forward me a link to the
>>>> answers. Thank you.
>>>>
>>>> If my environment setup is as follows:
>>>> - 64MB HDFS block size
>>>> - 5 tablet servers
>>>> - 10 tablets of size 1GB each per tablet server
>>>>
>>>> and I have a table like the one below:
>>>>
>>>> rowA | f1 | q1 | v1
>>>> rowA | f1 | q2 | v2
>>>>
>>>> rowB | f1 | q1 | v3
>>>>
>>>> rowC | f1 | q1 | v4
>>>> rowC | f2 | q1 | v5
>>>> rowC | f3 | q3 | v6
>>>>
>>>> From the little documentation available, I know all data about rowA
>>>> will go to one tablet, which may or may not contain data about other
>>>> rows, i.e., it's all or none. So my questions are:
>>>>
>>>> How are the tablets mapped to a DataNode or HDFS block? Obviously, one
>>>> tablet is split into multiple HDFS blocks (16 in this case), so would
>>>> they be stored on the same or different DataNode(s), or does it not
>>>> matter?
>>>>
>>>> In the example above, would all data about rowC (or rowA or rowB) go
>>>> onto the same HDFS block or different HDFS blocks?
>>>>
>>>> When executing a MapReduce job, how many mappers would I get? (One per
>>>> HDFS block? Per tablet? Per server?)
>>>>
>>>> Thank you in advance for any and all suggestions.
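On the ordering question in the middle of the thread: since each mapper behaves as a scanner over its tablet's range, the mapper for Tablet A sees every rowA entry, in order, before any rowC entry; interleaving cannot happen. A minimal sketch of that sorted-scan behavior, using a plain sorted map of "row/family/qualifier" strings as a stand-in for a tablet's sorted key space (not the real Accumulo Key/Value API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// A tablet stores its entries sorted by key, so a mapper scanning
// Tablet A (rowA, rowC) sees every rowA entry before any rowC entry,
// no matter what order the data was inserted in.
public class TabletScanOrder {
    // A scan simply walks the tablet's keys in sorted order.
    public static List<String> scan(TreeMap<String, String> tablet) {
        return new ArrayList<>(tablet.keySet());
    }

    public static void main(String[] args) {
        TreeMap<String, String> tabletA = new TreeMap<>();
        // Inserted out of order on purpose.
        tabletA.put("rowC/f1/q1", "v1");
        tabletA.put("rowA/f3/q3", "v3");
        tabletA.put("rowA/f1/q1", "v1");
        tabletA.put("rowA/f2/q2", "v2");
        // Prints [rowA/f1/q1, rowA/f2/q2, rowA/f3/q3, rowC/f1/q1]
        System.out.println(scan(tabletA));
    }
}
```

So of the three options listed above, it is the first one: all of rowA, then all of rowC.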

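On the autosplitting question at the top of the thread: range auto-adjusting is on by default in the input format, so it is only "overridden" if your job-setup code explicitly disables it (in the 1.4-era API that call lives on InputFormatBase as disableAutoAdjustRanges; treat the exact method name as an assumption and check it against the javadoc for your version). The resulting mapper counts, plus the block arithmetic from the original question, can be sketched as:

```java
// Sketch of the split rule John describes: with auto-adjusting on
// (the default), supplied ranges are split at tablet boundaries and
// you get one mapper per tablet covered; with it disabled, you get
// exactly one mapper per range you supplied.
public class SplitMath {
    public static int mappers(boolean autoAdjust, int tabletsCovered, int ranges) {
        return autoAdjust ? tabletsCovered : ranges;
    }

    public static void main(String[] args) {
        int tabletSizeMb = 1024;    // 1GB tablets, from the question
        int blockSizeMb = 64;       // 64MB HDFS blocks
        int tabletServers = 5;
        int tabletsPerServer = 10;

        // 1024 / 64 = 16 HDFS blocks per tablet (not 8).
        int blocksPerTablet = tabletSizeMb / blockSizeMb;
        int totalTablets = tabletServers * tabletsPerServer;    // 50

        System.out.println("HDFS blocks per tablet: " + blocksPerTablet);
        // Scanning the whole table with defaults: one mapper per tablet.
        System.out.println("Mappers, default: " + mappers(true, totalTablets, 1));
        // Auto-adjust disabled with 3 explicit ranges: 3 mappers.
        System.out.println("Mappers, 3 fixed ranges: " + mappers(false, totalTablets, 3));
    }
}
```

Mapper count follows tablets, not HDFS blocks or servers, which also answers the last question in the original post.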