Yes, it is a problem at the first stage. What I'm wondering, though, is whether the intermediate results - the ones produced after the mapper phase - can be optimized.
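For what it's worth, the kind of intermediate-stage tuning I have in mind looks roughly like this (a rough sketch using the Hadoop 1.x property names; the driver class name is made up for illustration, and SnappyCodec assumes the native Snappy libraries are installed on the nodes):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;

// Hypothetical driver fragment - just illustrates the knobs, not a full job.
public class IntermediateTuningExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Compress map output so less intermediate data hits local disk
        // and less data crosses the network during the shuffle.
        conf.setBoolean("mapred.compress.map.output", true);
        conf.setClass("mapred.map.output.compression.codec",
                      SnappyCodec.class, CompressionCodec.class);

        // A bigger in-memory sort buffer means fewer spill files per map task.
        conf.setInt("io.sort.mb", 256);
        // Merge more spill segments per pass during the on-disk merge.
        conf.setInt("io.sort.factor", 50);

        // ... build and submit the Job with this conf as usual ...
    }
}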
On Tue, Apr 30, 2013 at 3:38 PM, Mohammad Tariq <donta...@gmail.com> wrote:

> Hmmm. I was actually thinking about the very first step. How are you going
> to create the maps? Suppose you are on a block-less filesystem and you have
> a custom Format that is going to give you the splits dynamically. This means
> that you are going to store the file as a whole and create the splits as
> you continue to read the file. Wouldn't that be a bottleneck from a 'disk'
> point of view? Are you not moving away from the distributed paradigm?
>
> Am I taking it the right way? Please correct me if I am getting it wrong.
>
> Warm Regards,
> Tariq
> https://mtariq.jux.com/
> cloudfront.blogspot.com
>
>
> On Wed, May 1, 2013 at 12:34 AM, Jay Vyas <jayunit...@gmail.com> wrote:
>
>> Well, to be more clear, I'm wondering how hadoop-mapreduce can be
>> optimized on a block-less filesystem... and am thinking about
>> application-tier ways to simulate blocks - i.e., by making the
>> granularity of partitions smaller.
>>
>> I'm wondering if there is a way to hack an increased number of partitions
>> as a mechanism to simulate blocks - or whether this is just a bad idea
>> altogether :)
>>
>>
>> On Tue, Apr 30, 2013 at 2:56 PM, Mohammad Tariq <donta...@gmail.com> wrote:
>>
>>> Hello Jay,
>>>
>>> What are you going to do in your custom InputFormat and partitioner?
>>> Is your InputFormat going to create larger splits that will overlap with
>>> larger blocks? If that is the case, then IMHO you are going to reduce the
>>> number of mappers, thus reducing the parallelism. Also, a much larger
>>> block size will add extra overhead when it comes to disk I/O.
>>>
>>> Warm Regards,
>>> Tariq
>>> https://mtariq.jux.com/
>>> cloudfront.blogspot.com
>>>
>>>
>>> On Wed, May 1, 2013 at 12:16 AM, Jay Vyas <jayunit...@gmail.com> wrote:
>>>
>>>> Hi guys:
>>>>
>>>> I'm wondering - if I'm running mapreduce jobs on a cluster with large
>>>> block sizes - can I increase performance with either:
>>>>
>>>> 1) A custom FileInputFormat
>>>>
>>>> 2) A custom partitioner
>>>>
>>>> 3) -DnumReducers
>>>>
>>>> Clearly, (3) could be an issue, since it might overload tasks and
>>>> network traffic... but maybe (1) or (2) would be a precise way to "use"
>>>> partitions as a "poor man's" block.
>>>>
>>>> Just a thought - not sure if anyone has tried (1) or (2) before in
>>>> order to simulate blocks and increase locality by utilizing the
>>>> partition API.
>>>>
>>>> --
>>>> Jay Vyas
>>>> http://jayunit100.blogspot.com
>>>
>>
>>
>> --
>> Jay Vyas
>> http://jayunit100.blogspot.com
>

--
Jay Vyas
http://jayunit100.blogspot.com
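P.S. In case it helps make the "simulated blocks" idea concrete, here is a rough, untested sketch of the kind of custom InputFormat being discussed, against the org.apache.hadoop.mapreduce API. FixedSizeSplitInputFormat and the 64 MB figure are made up for illustration; the point is only that getSplits() can carve a whole file into fixed-size logical splits even when the filesystem reports no blocks:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Hypothetical: carve every input file into fixed-size logical splits,
// ignoring whatever layout the (block-less) filesystem reports.
public class FixedSizeSplitInputFormat extends TextInputFormat {

    // A made-up "virtual block" size of 64 MB - tune to taste.
    private static final long SPLIT_SIZE = 64L * 1024 * 1024;

    @Override
    public List<InputSplit> getSplits(JobContext job) throws IOException {
        List<InputSplit> splits = new ArrayList<InputSplit>();
        for (FileStatus file : listStatus(job)) {
            Path path = file.getPath();
            long offset = 0;
            long remaining = file.getLen();
            while (remaining > 0) {
                long length = Math.min(SPLIT_SIZE, remaining);
                // No real block locations exist on a block-less FS,
                // so the locality hint is left empty.
                splits.add(new FileSplit(path, offset, length, new String[0]));
                offset += length;
                remaining -= length;
            }
        }
        return splits;
    }
}

Wiring it in is just job.setInputFormatClass(FixedSizeSplitInputFormat.class). The record reader inherited from TextInputFormat already handles records that straddle split boundaries, so the splits do not need to fall on record boundaries.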