Yes, it is a problem at the first stage.  What I'm wondering, though, is
whether the intermediate results - which are produced after the mapper
phase - can be optimized.
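
For context, the kind of optimization I have in mind is compressing the map
output and pre-aggregating with a combiner - a minimal sketch, assuming the
Hadoop 2 mapreduce.* property names (older releases use the mapred.*
equivalents); the combiner argument is a placeholder for whatever
reducer-style pre-aggregation fits the job:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;

public class IntermediateTuning {
  public static Job configure(Configuration conf,
                              Class<? extends Reducer> combiner) throws Exception {
    // Compress the intermediate map output before it is spilled and shuffled.
    conf.setBoolean("mapreduce.map.output.compress", true);
    conf.setClass("mapreduce.map.output.compress.codec",
                  SnappyCodec.class, CompressionCodec.class);

    Job job = Job.getInstance(conf, "intermediate-tuning");
    // A combiner pre-aggregates on the map side, so less intermediate
    // data crosses the network during the shuffle.
    job.setCombinerClass(combiner);
    return job;
  }
}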


On Tue, Apr 30, 2013 at 3:38 PM, Mohammad Tariq <donta...@gmail.com> wrote:

> Hmmm. I was actually thinking about the very first step: how are you
> going to create the map tasks? Suppose you are on a block-less filesystem
> and you have a custom InputFormat that is going to give you the splits
> dynamically. This means that you are going to store the file as a whole
> and create the splits as you continue to read the file. Wouldn't that be a
> bottleneck from a 'disk' point of view? And aren't you moving away from
> the distributed paradigm?
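>
> (Concretely, I picture the custom format doing something like this - a
> hypothetical sketch against the new mapreduce API; the class name and the
> 64 MB figure are made up:)
>
> import java.io.IOException;
> import java.util.ArrayList;
> import java.util.List;
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.mapreduce.InputSplit;
> import org.apache.hadoop.mapreduce.JobContext;
> import org.apache.hadoop.mapreduce.lib.input.FileSplit;
> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
>
> // Carve fixed-size splits out of a whole file by byte offset, since a
> // block-less filesystem gives us no block boundaries to lean on.
> public class OffsetSplitInputFormat extends TextInputFormat {
>   private static final long SPLIT_BYTES = 64L * 1024 * 1024; // arbitrary
>
>   @Override
>   public List<InputSplit> getSplits(JobContext job) throws IOException {
>     List<InputSplit> splits = new ArrayList<InputSplit>();
>     for (FileStatus file : listStatus(job)) {
>       long offset = 0;
>       long remaining = file.getLen();
>       while (remaining > 0) {
>         long length = Math.min(SPLIT_BYTES, remaining);
>         // No block locations to report, so the hosts array stays empty.
>         splits.add(new FileSplit(file.getPath(), offset, length, new String[0]));
>         offset += length;
>         remaining -= length;
>       }
>     }
>     return splits;
>   }
> }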
>
> Am I understanding this correctly? Please correct me if I am getting it
> wrong.
>
> Warm Regards,
> Tariq
> https://mtariq.jux.com/
> cloudfront.blogspot.com
>
>
> On Wed, May 1, 2013 at 12:34 AM, Jay Vyas <jayunit...@gmail.com> wrote:
>
>> Well, to be more clear, I'm wondering how hadoop-mapreduce can be
>> optimized on a block-less filesystem... and am thinking about
>> application-tier ways to simulate blocks - i.e. by making the granularity
>> of partitions smaller.
>>
>> Wondering if there is a way to hack an increased number of partitions as
>> a mechanism to simulate blocks - or whether this is just a bad idea
>> altogether :)
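>>
>> (To make the hack concrete, the bluntest version is just capping the
>> split size - a sketch against the new-API FileInputFormat; the class name
>> and the 64 MB cap are made up, not recommendations:)
>>
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.mapreduce.Job;
>> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
>>
>> public class VirtualBlocks {
>>   public static void main(String[] args) throws Exception {
>>     Job job = Job.getInstance(new Configuration(), "virtual-blocks");
>>     // FileInputFormat picks splitSize = max(minSize, min(maxSize, blockSize)),
>>     // so lowering maxSize yields more, smaller splits - in effect "virtual
>>     // blocks" - even when the filesystem reports one giant block per file.
>>     FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
>>   }
>> }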
>>
>>
>>
>>
>> On Tue, Apr 30, 2013 at 2:56 PM, Mohammad Tariq <donta...@gmail.com> wrote:
>>
>>> Hello Jay,
>>>
>>>     What are you going to do in your custom InputFormat and
>>> partitioner? Is your InputFormat going to create larger splits that will
>>> overlap with larger blocks? If that is the case, IMHO, you are going to
>>> reduce the number of mappers and thus the parallelism. Also, a much
>>> larger block size will add extra overhead when it comes to disk I/O.
>>>
>>> Warm Regards,
>>> Tariq
>>> https://mtariq.jux.com/
>>> cloudfront.blogspot.com
>>>
>>>
>>> On Wed, May 1, 2013 at 12:16 AM, Jay Vyas <jayunit...@gmail.com> wrote:
>>>
>>>> Hi guys:
>>>>
>>>> I'm wondering - if I'm running mapreduce jobs on a cluster with large
>>>> block sizes - can I increase performance with any of the following:
>>>>
>>>> 1) A custom FileInputFormat
>>>>
>>>> 2) A custom partitioner
>>>>
>>>> 3) -DnumReducers
>>>>
>>>> Clearly, (3) could be a problem, since it might overload tasks and
>>>> network traffic... but maybe (1) or (2) would be a precise way to "use"
>>>> partitions as a "poor man's" block.
>>>>
>>>> Just a thought - not sure if anyone has tried (1) or (2) before in
>>>> order to simulate blocks and increase locality by utilizing the
>>>> partitioner API. A rough sketch of what I mean by (2) is below.
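>>>>
>>>> (Sketch only - BlockishPartitioner is a made-up name, the
>>>> Text/IntWritable key and value types are placeholders, and the reducer
>>>> count is arbitrary:)
>>>>
>>>> import org.apache.hadoop.io.IntWritable;
>>>> import org.apache.hadoop.io.Text;
>>>> import org.apache.hadoop.mapreduce.Partitioner;
>>>>
>>>> // Spread keys over a deliberately large number of partitions so each
>>>> // reducer's input approximates a bounded-size "block".
>>>> public class BlockishPartitioner extends Partitioner<Text, IntWritable> {
>>>>   @Override
>>>>   public int getPartition(Text key, IntWritable value, int numPartitions) {
>>>>     // Mask the sign bit before the modulo so the index is never negative.
>>>>     return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
>>>>   }
>>>> }
>>>>
>>>> // Wiring it up:
>>>> //   job.setPartitionerClass(BlockishPartitioner.class);
>>>> //   job.setNumReduceTasks(128);  // the "block count"; tune per cluster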
>>>>
>>>> --
>>>> Jay Vyas
>>>> http://jayunit100.blogspot.com
>>>>
>>>
>>>
>>
>>
>> --
>> Jay Vyas
>> http://jayunit100.blogspot.com
>>
>
>


-- 
Jay Vyas
http://jayunit100.blogspot.com
