Hmmm. I was actually thinking about the very first step: how are you going to create the map tasks? Suppose you are on a block-less filesystem and you have a custom InputFormat that is going to give you the splits dynamically. This means you are going to store the file as a whole and create the splits as you continue reading the file. Wouldn't that be a bottleneck from the disk's point of view? Aren't you moving away from the distributed paradigm?
Am I taking it in the correct way? Please correct me if I am getting it wrong.

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com


On Wed, May 1, 2013 at 12:34 AM, Jay Vyas <[email protected]> wrote:

> Well, to be more clear, I'm wondering how hadoop-mapreduce can be
> optimized in a block-less filesystem... and am thinking about
> application-tier ways to simulate blocks - i.e. by making the granularity
> of partitions smaller.
>
> Wondering if there is a way to hack an increased number of partitions as
> a mechanism to simulate blocks - or whether this is just a bad idea
> altogether :)
>
>
> On Tue, Apr 30, 2013 at 2:56 PM, Mohammad Tariq <[email protected]> wrote:
>
>> Hello Jay,
>>
>> What are you going to do in your custom InputFormat and partitioner? Is
>> your InputFormat going to create larger splits which will overlap with
>> larger blocks? If that is the case, IMHO, you are going to reduce the
>> number of mappers, thus reducing the parallelism. Also, a much larger
>> block size will add extra overhead when it comes to disk I/O.
>>
>> Warm Regards,
>> Tariq
>> https://mtariq.jux.com/
>> cloudfront.blogspot.com
>>
>>
>> On Wed, May 1, 2013 at 12:16 AM, Jay Vyas <[email protected]> wrote:
>>
>>> Hi guys:
>>>
>>> I'm wondering - if I'm running mapreduce jobs on a cluster with large
>>> block sizes - can I increase performance with either:
>>>
>>> 1) A custom FileInputFormat
>>>
>>> 2) A custom partitioner
>>>
>>> 3) -DnumReducers
>>>
>>> Clearly, (3) will be an issue due to the fact that it might overload
>>> tasks and network traffic... but maybe (1) or (2) will be a precise way
>>> to "use" partitions as a "poor man's" block.
>>>
>>> Just a thought - not sure if anyone has tried (1) or (2) before in
>>> order to simulate blocks and increase locality by utilizing the
>>> partition API.
>>>
>>> --
>>> Jay Vyas
>>> http://jayunit100.blogspot.com
>>>
>
> --
> Jay Vyas
> http://jayunit100.blogspot.com
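[Editor's note] Jay's idea of shrinking split granularity can be sketched without the Hadoop APIs. This is a minimal, self-contained model of what a custom `getSplits()` might compute on a filesystem that reports no block boundaries: chop the file into fixed-size ranges, where the target size plays the role of a "simulated block". The class and method names here are illustrative, not Hadoop's.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    // Model of a FileSplit: just a byte offset and a length.
    static final class Split {
        final long offset;
        final long length;
        Split(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }
    }

    // Chop a file of the given length into fixed-size splits, the way a
    // custom InputFormat could on a block-less filesystem. More, smaller
    // splits means more map tasks and therefore more parallelism.
    static List<Split> computeSplits(long fileLength, long targetSplitSize) {
        List<Split> splits = new ArrayList<>();
        for (long off = 0; off < fileLength; off += targetSplitSize) {
            splits.add(new Split(off, Math.min(targetSplitSize, fileLength - off)));
        }
        return splits;
    }

    public static void main(String[] args) {
        // A 1 GiB file with 64 MiB "simulated blocks" yields 16 map tasks.
        List<Split> splits = SplitSketch.computeSplits(1L << 30, 64L << 20);
        System.out.println(splits.size());
    }
}
```

Note that this only models the *count* of map tasks; Tariq's point still stands that on a block-less store every split ultimately reads from the same whole file, so the disk can become the bottleneck regardless of how many splits you hand out.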

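[Editor's note] One clarification on option (2) in Jay's original question: a partitioner decides which *reducer* a key is routed to, not where a map task runs, so it cannot by itself improve map-side locality. A standalone model of the arithmetic used by Hadoop's default HashPartitioner (mask off the sign bit, then take the hash modulo the reduce-task count):

```java
public class PartitionSketch {
    // Mirrors the logic of Hadoop's default HashPartitioner: clear the
    // sign bit so the result is non-negative, then bucket by modulo.
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Every key maps deterministically to one of numReduceTasks buckets.
        System.out.println(PartitionSketch.getPartition("hello", 10));
    }
}
```

A custom partitioner can change *which* bucket a key lands in (e.g. to balance skewed keys), but the number of buckets is still the number of reduce tasks - which is why tuning (2) and (3) go hand in hand.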