My point is that if you have over 16MB of data per node, you're going
to get thousands of map tasks (that is: hundreds per node) with or
without vnodes.
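
A rough sketch of that arithmetic, in Python. It is a simplified model, not the actual input format code: it assumes every vnode range yields at least one split, and that data size and split size are measured in the same unit. The function names are made up for illustration.

```python
import math

DEFAULT_SPLIT_SIZE = 64 * 1024  # the "64k" default split size from this thread


def map_task_count(nodes, vnodes_per_node, data_per_node,
                   split_size=DEFAULT_SPLIT_SIZE):
    """Estimate the number of map tasks for a job over the whole cluster.

    Assumption: each vnode range produces at least one split, and a range
    holding more than `split_size` of data is cut into several splits.
    """
    ranges = nodes * vnodes_per_node
    splits_per_range = max(
        1, math.ceil(data_per_node / (vnodes_per_node * split_size)))
    return ranges * splits_per_range


def waves(tasks, nodes, slots_per_node):
    """Number of rounds needed to run `tasks` on the available map slots."""
    return math.ceil(tasks / (nodes * slots_per_node))


if __name__ == "__main__":
    threshold = 64 * 1024 * 256  # 64K * 256 vnodes = 16M per node
    for label, data in (("toy", 1000), ("threshold", threshold)):
        for vnodes in (1, 256):
            tasks = map_task_count(10, vnodes, data)
            print(label, "vnodes=%d" % vnodes, "tasks=%d" % tasks,
                  "waves=%d" % waves(tasks, 10, 5))
```

Under this model, at the 16M-per-node threshold the 10-node cluster gets 2,560 tasks with 1 or with 256 vnodes per node. Only the toy data set shows a gap (10 tasks versus 2,560).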

On Fri, Mar 29, 2013 at 9:42 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
> Every map reduce task typically has a minimum Xmx of 256MB memory. See
> mapred.child.java.opts...
> So if you have a 10 node cluster with 256 vnodes... You will need to spawn
> 2,560 map tasks to complete a job.
> And a 10 node Hadoop cluster with 5 map slots a node... You have 50 map
> slots.
>
> Wouldn't it be better if the input format spawned 10 map tasks instead of
> 2,560?
>
>
> On Fri, Mar 29, 2013 at 10:28 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
>>
>> I still don't see the hole in the following reasoning:
>>
>> - Input splits are 64k by default.  At this size, map processing time
>> dominates job creation.
>> - Therefore, if job creation time dominates, you have a toy data set
>> (< 64K * 256 vnodes = 16 MB)
>>
>> Adding complexity to our inputformat to improve performance for this
>> niche does not sound like a good idea to me.
>>
>> On Thu, Mar 28, 2013 at 8:40 AM, cem <cayiro...@gmail.com> wrote:
>> > Hi Alicia ,
>> >
>> > The Cassandra input format creates as many mappers as there are vnodes.
>> > It is a known issue. You need to lower the number of vnodes :(
>> >
>> > I have a simple solution for that and am ready to write a patch. Should I
>> > create a ticket for it? I don't know the procedure.
>> >
>> >  Regards,
>> > Cem
>> >
>> >
>> > On Thu, Mar 28, 2013 at 2:30 PM, Alicia Leong <lccali...@gmail.com>
>> > wrote:
>> >>
>> >> Hi All,
>> >>
>> >> I have 3 nodes running Cassandra 1.2.3 and edited cassandra.yaml to
>> >> enable vnodes.
>> >>
>> >> When I execute an M/R job, the console shows HUNDREDS of map tasks.
>> >>
>> >> May I know, is this normal with vnodes? If so, it has slowed down
>> >> the M/R job considerably.
>> >>
>> >>
>> >> Thanks
>> >
>> >
>>
>>
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder, http://www.datastax.com
>> @spyced
>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder, http://www.datastax.com
@spyced
