What do you mean by "increasing the size"? I'm talking more about increasing the number of partitions... which actually decreases individual file size.
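To make that concrete, here is a minimal sketch of the kind of thing I mean - just bumping the reduce-side partition count on an otherwise pass-through job. The class name, the number 200, and the use of the identity mapper/reducer are only illustrative, and it assumes a reasonably recent build of the new org.apache.hadoop.mapreduce API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ManyPartitionsJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "many-partitions");
        job.setJarByClass(ManyPartitionsJob.class);

        // Default (identity) mapper and reducer: TextInputFormat hands us
        // (LongWritable, Text) pairs and we just write them back out.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // More reduce tasks => more output partitions => smaller individual files.
        job.setNumReduceTasks(200);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With 200 reduce tasks the same input gets spread across 200 part-r-* files, so each output partition is correspondingly smaller.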
On Apr 30, 2013, at 4:09 PM, Mohammad Tariq <[email protected]> wrote:

> Increasing the size can help us to an extent, but increasing it further might
> cause problems during copy and shuffle. If the partitions are too big to be
> held in memory, we'll end up with a disk-based shuffle, which is gonna be
> slower than a RAM-based shuffle, thus delaying the entire reduce phase.
> Furthermore, the network might get overwhelmed.
>
> I think keeping it "considerably" high will definitely give you some boost.
> But it'll require some high-level tinkering.
>
> Warm Regards,
> Tariq
> https://mtariq.jux.com/
> cloudfront.blogspot.com
>
>
> On Wed, May 1, 2013 at 1:29 AM, Jay Vyas <[email protected]> wrote:
>> Yes, it is a problem at the first stage. What I'm wondering, though, is
>> whether the intermediate results - which happen after the mapper phase - can
>> be optimized.
>>
>>
>> On Tue, Apr 30, 2013 at 3:38 PM, Mohammad Tariq <[email protected]> wrote:
>>> Hmmm. I was actually thinking about the very first step: how are you going
>>> to create the maps? Suppose you are on a block-less filesystem and you have
>>> a custom InputFormat that is going to give you the splits dynamically. This
>>> means that you are going to store the file as a whole and create the splits
>>> as you continue to read the file. Wouldn't that be a bottleneck from the
>>> disk's point of view? Aren't you moving away from the distributed paradigm?
>>>
>>> Am I looking at it the correct way? Please correct me if I am getting it
>>> wrong.
>>>
>>> Warm Regards,
>>> Tariq
>>> https://mtariq.jux.com/
>>> cloudfront.blogspot.com
>>>
>>>
>>> On Wed, May 1, 2013 at 12:34 AM, Jay Vyas <[email protected]> wrote:
>>>> Well, to be more clear, I'm wondering how hadoop-mapreduce can be
>>>> optimized in a block-less filesystem... and I'm thinking about
>>>> application-tier ways to simulate blocks - i.e. by making the granularity
>>>> of partitions smaller.
>>>>
>>>> Wondering if there is a way to hack an increased number of partitions as
>>>> a mechanism to simulate blocks - or whether this is just a bad idea
>>>> altogether :)
>>>>
>>>>
>>>>
>>>> On Tue, Apr 30, 2013 at 2:56 PM, Mohammad Tariq <[email protected]> wrote:
>>>>> Hello Jay,
>>>>>
>>>>> What are you going to do in your custom InputFormat and partitioner?
>>>>> Is your InputFormat going to create larger splits which will overlap
>>>>> with larger blocks? If that is the case, IMHO, you are going to reduce
>>>>> the number of mappers, thus reducing the parallelism. Also, a much
>>>>> larger block size will add extra overhead when it comes to disk I/O.
>>>>>
>>>>> Warm Regards,
>>>>> Tariq
>>>>> https://mtariq.jux.com/
>>>>> cloudfront.blogspot.com
>>>>>
>>>>>
>>>>> On Wed, May 1, 2013 at 12:16 AM, Jay Vyas <[email protected]> wrote:
>>>>>> Hi guys:
>>>>>>
>>>>>> I'm wondering - if I'm running mapreduce jobs on a cluster with large
>>>>>> block sizes - can I increase performance with any of:
>>>>>>
>>>>>> 1) A custom FileInputFormat
>>>>>>
>>>>>> 2) A custom partitioner
>>>>>>
>>>>>> 3) -DnumReducers
>>>>>>
>>>>>> Clearly, (3) will be an issue due to the fact that it might overload
>>>>>> tasks and network traffic... but maybe (1) or (2) will be a precise way
>>>>>> to "use" partitions as a "poor man's" block.
>>>>>>
>>>>>> Just a thought - not sure if anyone has tried (1) or (2) before in order
>>>>>> to simulate blocks and increase locality by utilizing the partition API.
>>>>>>
>>>>>> --
>>>>>> Jay Vyas
>>>>>> http://jayunit100.blogspot.com
>>>>
>>>>
>>>> --
>>>> Jay Vyas
>>>> http://jayunit100.blogspot.com
>>
>>
>> --
>> Jay Vyas
>> http://jayunit100.blogspot.com
>
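A rough sketch of what option (1) above could look like - a FileInputFormat subclass that caps split sizes so a block-less store still yields many map tasks. This assumes the new org.apache.hadoop.mapreduce API; the class name and the 64 MB figure are illustrative only, not anything from the thread:

import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Illustrative sketch: treat every "block" as 64 MB, even if the underlying
// filesystem reports one giant block (or no meaningful block size at all).
public class SimulatedBlockInputFormat extends TextInputFormat {

    private static final long SIMULATED_BLOCK_SIZE = 64L * 1024 * 1024;

    @Override
    protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
        // Cap whatever FileInputFormat would normally compute at the simulated
        // block size, so large files are still carved into many map splits.
        return Math.min(SIMULATED_BLOCK_SIZE,
                        super.computeSplitSize(blockSize, minSize, maxSize));
    }
}

Wiring it in is just job.setInputFormatClass(SimulatedBlockInputFormat.class); in many cases you can skip the subclass entirely and get a similar effect with FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024). Whether the extra map tasks actually buy any locality on a block-less filesystem is the open question in the thread above.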
