Re: virtual nodes + map reduce = too many mappers

2013-02-17 Thread cem
Thanks Eric for the appreciation :) Default split size is 64K rows. ColumnFamilyInputFormat first collects all tokens and creates a split for each. If you have 256 vnodes for each node, then it creates 256 splits even if you have no data at all. The current split size will only work if you have a vnode t
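
A rough sketch of the arithmetic described in this message, in Python. It is a simplified model and not the actual ColumnFamilyInputFormat logic: it assumes every token range yields at least one split and is otherwise cut every `split_size` rows. The name `estimate_splits`, the reading of "64K" as 64 * 1024, and the row counts are hypothetical and chosen only for illustration.

```python
import math


def estimate_splits(rows_per_range, split_size=64 * 1024):
    """Estimate the number of input splits (and so mappers) for a job.

    Assumed model: each token range produces at least one split, even
    when it is empty, plus one more for every split_size rows it holds.
    """
    return sum(max(1, math.ceil(rows / split_size)) for rows in rows_per_range)


# Hypothetical cluster: 4 nodes with 256 vnodes each gives 1024 token ranges.
nodes, vnodes = 4, 256
empty_ranges = [0] * (nodes * vnodes)        # no data at all
small_ranges = [1_000] * (nodes * vnodes)    # well below one split per range
large_ranges = [200_000] * (nodes * vnodes)  # several splits per range

print(estimate_splits(empty_ranges))
print(estimate_splits(small_ranges))
print(estimate_splits(large_ranges))
```

Under this model the split size setting only changes the count once a single vnode range holds more rows than one split. Below that point, the number of token ranges sets the floor.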

Re: virtual nodes + map reduce = too many mappers

2013-02-16 Thread Edward Capriolo
Split size does not have to equal block size. http://hadoop.apache.org/docs/r1.1.1/api/org/apache/hadoop/mapred/lib/CombineFileInputFormat.html An abstract InputFormat that returns CombineFileSplit's in InputFormat.getSplits(JobConf, int) method. Splits are constructed from the files under the in

Re: virtual nodes + map reduce = too many mappers

2013-02-16 Thread Jonathan Ellis
Wouldn't you have more than 256 splits anyway, given a normal amount of data? (Default split size is 64k rows.) On Fri, Feb 15, 2013 at 7:01 PM, Edward Capriolo wrote: > Seems like the hadoop Input format should combine the splits that are > on the same node into the same map task, like Hadoop's

Re: virtual nodes + map reduce = too many mappers

2013-02-16 Thread Eric Evans
On Sat, Feb 16, 2013 at 9:13 AM, Edward Capriolo wrote: > No one had ever tried vnodes with Hadoop until the OP did, or they > would have noticed this. No one extensively used it with secondary > indexes either, from the last ticket I mentioned. > > My mistake, they are not a default. > > I do think

Re: virtual nodes + map reduce = too many mappers

2013-02-16 Thread Edward Capriolo
No one had ever tried vnodes with Hadoop until the OP did, or they would have noticed this. No one extensively used it with secondary indexes either, from the last ticket I mentioned. My mistake, they are not a default. I do think vnodes are awesome; it's great that C* has the longer release cycle.

Re: virtual nodes + map reduce = too many mappers

2013-02-15 Thread Eric Evans
On Fri, Feb 15, 2013 at 7:01 PM, Edward Capriolo wrote: > Seems like the Hadoop input format should combine the splits that are > on the same node into the same map task, like Hadoop's > CombineFileInputFormat can. I am not sure who recommends vnodes as the > default, because this is now the second p

Re: virtual nodes + map reduce = too many mappers

2013-02-15 Thread Edward Capriolo
Seems like the Hadoop input format should combine the splits that are on the same node into the same map task, like Hadoop's CombineFileInputFormat can. I am not sure who recommends vnodes as the default, because this is now the second problem (that I know of) of this class where vnodes have extra over
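
The idea of combining splits on the same node can be sketched in Python as below. This is only an illustration of grouping by host; it is not Hadoop's CombineFileInputFormat and not any Cassandra API. The `Split` type, the `combine_by_host` function, the `max_per_group` parameter, and the host names are all hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Split(NamedTuple):
    """One per-vnode token range and the host that owns it (hypothetical type)."""
    start_token: int
    end_token: int
    host: str


def combine_by_host(splits, max_per_group):
    """Group per-vnode splits so that one map task covers many ranges on one host.

    Returns a list of groups; each group holds at most max_per_group splits,
    all located on the same host.
    """
    by_host = defaultdict(list)
    for split in splits:
        by_host[split.host].append(split)

    combined = []
    for host in sorted(by_host):
        ranges = by_host[host]
        for i in range(0, len(ranges), max_per_group):
            combined.append(ranges[i:i + max_per_group])
    return combined


# Hypothetical cluster: 3 hosts with 256 vnode ranges each gives 768 small splits.
splits = [
    Split(h * 256 + v, h * 256 + v + 1, f"node{h}")
    for h in range(3)
    for v in range(256)
]
groups = combine_by_host(splits, max_per_group=64)
print(len(splits), "splits combined into", len(groups), "map tasks")
```

Capping the group size keeps some parallelism per host while still cutting the number of map tasks well below one per vnode.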

virtual nodes + map reduce = too many mappers

2013-02-15 Thread cem
Hi All, I have just started to use virtual nodes. I set the number of vnodes per node to 256 as recommended. The problem is that when I run a MapReduce job, it creates nodes * 256 splits and therefore nodes * 256 mappers. This hurts performance, since the range queries have a lot of overhead. Any s