Ok. Thanks.

On Wed, Apr 10, 2013 at 2:01 PM, Jean-Marc Spaggiari <[email protected]> wrote:

> Hi Graeme,
>
> No. The reducer will simply write to the table the same way you do a
> regular Put. If a split is required because of the size, then the region
> will be split, but in the end there will not necessarily be any region
> split.
>
> In the use case described below, all 600 lines will "simply" go into the
> only region in the table and no split will occur.
>
> The goal is to partition the data for the reducer only. Not in the table.
>
> JM
>
> 2013/4/10 Graeme Wallace <[email protected]>
>
> > What's the behavior then if you return hash % num_reducers and you have
> > no splits defined? When the reducer writes to the table, does the region
> > server local to the reducer create a new region?
> >
> > Graeme
> >
> >
> > On Wed, Apr 10, 2013 at 1:26 PM, Jean-Marc Spaggiari <[email protected]> wrote:
> >
> > > So.
> > >
> > > I looked at the code, and I have one comment/suggestion here.
> > >
> > > If the table we are outputting to has regions, then partitions are
> > > built around that, and that's fine. But if the table is totally empty
> > > with a single region, even if we setNumReduceTasks to 2 or more, all
> > > the keys will go to the same first reducer because of this:
> > >
> > >     if (this.startKeys.length == 1){
> > >       return 0;
> > >     }
> > >
> > > I think it would be better to return something like
> > > keycrc%numPartitions instead. That still allows the application to
> > > spread the reducing process over multiple nodes (racks) even if there
> > > is only one region in the table.
> > >
> > > In my use case, I have millions of lines producing some statistics.
> > > At the end, I will have only about 600 lines, but it will take a lot
> > > of map and reduce time to go from millions to 600, which is why I'm
> > > looking to have more than one reducer. However, with only 600 lines,
> > > it's very difficult to pre-split the table. Keys are all very close.
> > >
> > > Does anyone see anything wrong with changing this default behaviour
> > > when startKeys.length == 1? If not, I will open a JIRA and upload a
> > > patch.
> > >
> > > JM
> > >
> > > 2013/4/10 Jean-Marc Spaggiari <[email protected]>
> > >
> > > > Thanks Ted.
> > > >
> > > > It's exactly where I was looking now. I was close. I will take a
> > > > deeper look.
> > > >
> > > > Thanks Nitin for the link. I will read that too.
> > > >
> > > > JM
> > > >
> > > > 2013/4/10 Nitin Pawar <[email protected]>
> > > >
> > > >> To add to what Ted said,
> > > >>
> > > >> the same discussion happened on the question Jean asked:
> > > >>
> > > >> https://issues.apache.org/jira/browse/HBASE-1287
> > > >>
> > > >>
> > > >> On Wed, Apr 10, 2013 at 7:28 PM, Ted Yu <[email protected]> wrote:
> > > >>
> > > >> > Jean-Marc:
> > > >> > Take a look at HRegionPartitioner, which is in both the mapred
> > > >> > and mapreduce packages:
> > > >> >
> > > >> >  * This is used to partition the output keys into groups of keys.
> > > >> >  * Keys are grouped according to the regions that currently exist
> > > >> >  * so that each reducer fills a single region so load is distributed.
> > > >> >
> > > >> > Cheers
> > > >> >
> > > >> > On Wed, Apr 10, 2013 at 6:54 AM, Jean-Marc Spaggiari <[email protected]> wrote:
> > > >> >
> > > >> > > Hi Nitin,
> > > >> > >
> > > >> > > You got my question correctly.
> > > >> > >
> > > >> > > However, I'm wondering how it works when it's done in HBase.
> > > >> > > Do we have default partitioners, so that we have the same
> > > >> > > guarantee that records mapping to one key go to the same
> > > >> > > reducer? Or do we have to implement this on our own?
> > > >> > >
> > > >> > > JM
> > > >> > >
> > > >> > > 2013/4/10 Nitin Pawar <[email protected]>:
> > > >> > > > I hope I understood what you are asking. If not, then
> > > >> > > > pardon me :)
> > > >> > > > A few lines from the Hadoop developer handbook:
> > > >> > > >
> > > >> > > > The *Partitioner* class determines which partition a given
> > > >> > > > (key, value) pair will go to. The default partitioner
> > > >> > > > computes a hash value for the key and assigns the partition
> > > >> > > > based on this result. It guarantees that all the records
> > > >> > > > mapping to one key go to the same reducer.
> > > >> > > >
> > > >> > > > You can write your custom partitioner as well.
> > > >> > > > Here is the link:
> > > >> > > > http://developer.yahoo.com/hadoop/tutorial/module5.html#partitioning
> > > >> > > >
> > > >> > > >
> > > >> > > > On Wed, Apr 10, 2013 at 6:19 PM, Jean-Marc Spaggiari <[email protected]> wrote:
> > > >> > > >
> > > >> > > >> Hi,
> > > >> > > >>
> > > >> > > >> Quick question. How is the data from the map tasks
> > > >> > > >> partitioned for the reducers?
> > > >> > > >>
> > > >> > > >> If there is 1 reducer, it's easy, but if there are more,
> > > >> > > >> are all the same keys guaranteed to end up on the same
> > > >> > > >> reducer? Or not necessarily? If they are not, how can we
> > > >> > > >> provide a partitioning function?
> > > >> > > >>
> > > >> > > >> Thanks,
> > > >> > > >>
> > > >> > > >> JM
> > > >> > > >>
> > > >> > > >
> > > >> > > >
> > > >> > > > --
> > > >> > > > Nitin Pawar
> > > >> > >
> > > >> >
> > > >>
> > > >>
> > > >> --
> > > >> Nitin Pawar
> > > >>
> > > >
> > >
> >
> >
> > --
> > Graeme Wallace
> > CTO
> > FareCompare.com
> > O: 972 588 1414
> > M: 214 681 9018
> >
>

--
Graeme Wallace
CTO
FareCompare.com
O: 972 588 1414
M: 214 681 9018
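
For anyone reading this thread later: the fallback suggested above (return something like keycrc%numPartitions when the table has only one region) could look roughly like the sketch below. This is only an illustration in plain Java, not the actual HRegionPartitioner code and not a submitted patch. The class name CrcPartitionSketch and the method partitionFor are made up for this example, and the choice of java.util.zip.CRC32 as the "keycrc" is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/**
 * Hypothetical sketch (not HBase code): spreads row keys over reducers
 * with a CRC of the key, as suggested for the single-region case.
 */
final class CrcPartitionSketch {

    private CrcPartitionSketch() {}

    /**
     * Returns a partition in [0, numPartitions) for the given row key.
     * The same key always maps to the same partition.
     */
    static int partitionFor(byte[] rowKey, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        CRC32 crc = new CRC32();
        crc.update(rowKey, 0, rowKey.length);
        // getValue() is an unsigned 32-bit value held in a long, so the
        // remainder is never negative and always fits in an int.
        return (int) (crc.getValue() % numPartitions);
    }

    public static void main(String[] args) {
        int reducers = 4; // assumed value, as if set via setNumReduceTasks(4)
        for (String key : new String[] {"stat-0001", "stat-0002", "stat-0003"}) {
            byte[] bytes = key.getBytes(StandardCharsets.UTF_8);
            System.out.println(key + " -> reducer " + partitionFor(bytes, reducers));
        }
    }
}
```

As discussed above, this only decides which reducer handles a key. The table still has a single region, and every reducer writes into that region with regular Puts.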
