Re: MapReduce: Reducers partitions.

Graeme Wallace Wed, 10 Apr 2013 12:06:51 -0700

Ok. Thanks.


On Wed, Apr 10, 2013 at 2:01 PM, Jean-Marc Spaggiari <
[email protected]> wrote:

> Hi Greame,
>
> No. The reducer will simply write on the table the same way you are doing a
> regular Put. If a split is required because of the size, then the region
> will be split, but at the end, there will not necessary be any region
> split.
>
> In the usecase described below, all the 600 lines will "simply" go into the
> only region in the table and no split will occur.
>
> The goal is to partition the data for the reducer only. Not in the table.
>
> JM
>
> 2013/4/10 Graeme Wallace <[email protected]>
>
> > Whats the behavior then if you return hash % num_reducers and you have no
> > splits defined. When the reducer writes to the table does the region
> server
> > local to the reducer create a new region ?
> >
> > Graeme
> >
> >
> > On Wed, Apr 10, 2013 at 1:26 PM, Jean-Marc Spaggiari <
> > [email protected]> wrote:
> >
> > > So.
> > >
> > > I looked at the code, and I have one comment/suggestion here.
> > >
> > > If the table we are outputing to has regions, then partitions are build
> > > around that, and that's fine. But if the table is totally empty with a
> > > single region, even if we setNumReduceTasks to 2 or more, all the keys
> > will
> > > go on the same first reducer because of this:
> > >     if (this.startKeys.length == 1){
> > >       return 0;
> > >     }
> > > I think it will be better to return something like keycrc%numPartitions
> > > instead. That still allow the application to spread the reducing
> process
> > > over multinode(racks) even if there is only one region in the table.
> > >
> > > In my usecase, I have millions of lines producing some statistics. At
> the
> > > end, I will have only about 600 lines, but it will take a lot of map
> and
> > > reduce time to go from millions to 600, that's why I'm looking to have
> > more
> > > than one reducer. However, with only 600 lines, it's very difficult to
> > > pre-split the table. Keys are all very close.
> > >
> > > Does anyone see anything wrong with changing this default behaviour
> when
> > > startKeys.length == 1? If not, I will open a JIRA and upload a patch.
> > >
> > > JM
> > >
> > > 2013/4/10 Jean-Marc Spaggiari <[email protected]>
> > >
> > > > Thanks Ted.
> > > >
> > > > It's exactly where I was looking at now. I was close. I will take a
> > > deeper
> > > > look.
> > > >
> > > > Thanks Nitin for the link. I will read that too.
> > > >
> > > > JM
> > > >
> > > > 2013/4/10 Nitin Pawar <[email protected]>
> > > >
> > > >> To add what Ted said,
> > > >>
> > > >> the same discussion happened on the question Jean asked
> > > >>
> > > >> https://issues.apache.org/jira/browse/HBASE-1287
> > > >>
> > > >>
> > > >> On Wed, Apr 10, 2013 at 7:28 PM, Ted Yu <[email protected]>
> wrote:
> > > >>
> > > >> > Jean-Marc:
> > > >> > Take a look at HRegionPartitioner which is in both mapred and
> > > mapreduce
> > > >> > packages:
> > > >> >
> > > >> >  * This is used to partition the output keys into groups of keys.
> > > >> >
> > > >> >  * Keys are grouped according to the regions that currently exist
> > > >> >
> > > >> >  * so that each reducer fills a single region so load is
> > distributed.
> > > >> >
> > > >> > Cheers
> > > >> >
> > > >> > On Wed, Apr 10, 2013 at 6:54 AM, Jean-Marc Spaggiari <
> > > >> > [email protected]> wrote:
> > > >> >
> > > >> > > Hi Nitin,
> > > >> > >
> > > >> > > You got my question correctly.
> > > >> > >
> > > >> > > However, I'm wondering how it's working when it's done into
> HBase.
> > > Do
> > > >> > > we have defaults partionners so we have the same garantee that
> > > records
> > > >> > > mapping to one key go to the same reducer. Or do we have to
> > > implement
> > > >> > > this one our own.
> > > >> > >
> > > >> > > JM
> > > >> > >
> > > >> > > 2013/4/10 Nitin Pawar <[email protected]>:
> > > >> > > > I hope i understood what you are asking is this . If not then
> > > >> pardon me
> > > >> > > :)
> > > >> > > > from the hadoop developer handbook few lines
> > > >> > > >
> > > >> > > > The*Partitioner* class determines which partition a given
> (key,
> > > >> value)
> > > >> > > pair
> > > >> > > > will go to. The default partitioner computes a hash value for
> > the
> > > >> key
> > > >> > and
> > > >> > > > assigns the partition based on this result. It garantees that
> > all
> > > >> the
> > > >> > > > records mapping to one key go to same reducer
> > > >> > > >
> > > >> > > > You can write your custom partitioner as well
> > > >> > > > here is the link :
> > > >> > > >
> > > >>
> http://developer.yahoo.com/hadoop/tutorial/module5.html#partitioning
> > > >> > > >
> > > >> > > >
> > > >> > > >
> > > >> > > >
> > > >> > > > On Wed, Apr 10, 2013 at 6:19 PM, Jean-Marc Spaggiari <
> > > >> > > > [email protected]> wrote:
> > > >> > > >
> > > >> > > >> Hi,
> > > >> > > >>
> > > >> > > >> quick question. How are the data from the map tasks
> > partitionned
> > > >> for
> > > >> > > >> the reducers?
> > > >> > > >>
> > > >> > > >> If there is 1 reducer, it's easy, but if there is more, are
> all
> > > >> they
> > > >> > > >> same keys garanteed to end on the same reducer? Or not
> > necessary?
> > > >>  If
> > > >> > > >> they are not, how can we provide a partionning function?
> > > >> > > >>
> > > >> > > >> Thanks,
> > > >> > > >>
> > > >> > > >> JM
> > > >> > > >>
> > > >> > > >
> > > >> > > >
> > > >> > > >
> > > >> > > > --
> > > >> > > > Nitin Pawar
> > > >> > >
> > > >> >
> > > >>
> > > >>
> > > >>
> > > >> --
> > > >> Nitin Pawar
> > > >>
> > > >
> > > >
> > >
> >
> >
> >
> > --
> > Graeme Wallace
> > CTO
> > FareCompare.com
> > O: 972 588 1414
> > M: 214 681 9018
> >
>



-- 
Graeme Wallace
CTO
FareCompare.com
O: 972 588 1414
M: 214 681 9018

Re: MapReduce: Reducers partitions.

Reply via email to