What do you mean by "increasing the size"? I'm talking more about increasing the
number of partitions... which actually decreases individual file size.
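
Concretely, something like the following is what I'm picturing - capping the max
split size so that a single big file turns into many small map partitions. Just a
rough sketch against the new mapreduce API; the class name and the 8/32 MB numbers
are made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SmallSplitJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "small-split-demo");
    job.setJarByClass(SmallSplitJob.class);
    job.setInputFormatClass(TextInputFormat.class);

    // Cap splits at 32 MB so one big file is carved into many map partitions,
    // independent of any physical block size on the underlying filesystem.
    FileInputFormat.setMinInputSplitSize(job, 8L * 1024 * 1024);
    FileInputFormat.setMaxInputSplitSize(job, 32L * 1024 * 1024);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Identity mapper/reducer defaults are fine here; the only point is
    // how the input gets carved into splits.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}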

On Apr 30, 2013, at 4:09 PM, Mohammad Tariq <[email protected]> wrote:

> Increasing the size can help us to an extent, but increasing it further might
> cause problems during copy and shuffle. If the partitions are too big to be
> held in memory, we'll end up with a disk-based shuffle, which is going to be
> slower than a RAM-based shuffle, thus delaying the entire reduce phase.
> Furthermore, the network might get overwhelmed.
> 
> I think keeping it "considerably" high will definitely give you some boost,
> but it'll require some high-level tinkering.
> 
> Warm Regards,
> Tariq
> https://mtariq.jux.com/
> cloudfront.blogspot.com
> 
> 
> On Wed, May 1, 2013 at 1:29 AM, Jay Vyas <[email protected]> wrote:
>> Yes, it is a problem at the first stage. What I'm wondering, though, is
>> whether the intermediate results - which happen after the mapper phase - can
>> be optimized.
>> 
>> 
>> On Tue, Apr 30, 2013 at 3:38 PM, Mohammad Tariq <[email protected]> wrote:
>>> Hmmm. I was actually thinking about the very first step: how are you going
>>> to create the maps? Suppose you are on a block-less filesystem and you have
>>> a custom InputFormat that is going to give you the splits dynamically. This
>>> means that you are going to store the file as a whole and create the splits
>>> as you continue to read the file. Wouldn't that be a bottleneck from a
>>> 'disk' point of view? Aren't you moving away from the distributed paradigm?
>>> 
>>> Am I understanding this correctly? Please correct me if I am getting it
>>> wrong.
>>> 
>>> Warm Regards,
>>> Tariq
>>> https://mtariq.jux.com/
>>> cloudfront.blogspot.com
>>> 
>>> 
>>> On Wed, May 1, 2013 at 12:34 AM, Jay Vyas <[email protected]> wrote:
>>>> Well, to be clearer, I'm wondering how hadoop-mapreduce can be optimized
>>>> on a block-less filesystem... and I'm thinking about application-tier ways
>>>> to simulate blocks - i.e. by making the granularity of partitions smaller.
>>>> 
>>>> I'm wondering if there is a way to hack an increased number of partitions
>>>> as a mechanism to simulate blocks - or whether this is just a bad idea
>>>> altogether :)
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Tue, Apr 30, 2013 at 2:56 PM, Mohammad Tariq <[email protected]> wrote:
>>>>> Hello Jay,
>>>>> 
>>>>>     What are you going to do in your custom InputFormat and partitioner?
>>>>> Is your InputFormat going to create larger splits which will overlap with
>>>>> larger blocks? If that is the case, IMHO, then you are going to reduce the
>>>>> number of mappers, thus reducing the parallelism. Also, a much larger
>>>>> block size will put extra overhead on disk I/O.
>>>>> 
>>>>> Warm Regards,
>>>>> Tariq
>>>>> https://mtariq.jux.com/
>>>>> cloudfront.blogspot.com
>>>>> 
>>>>> 
>>>>> On Wed, May 1, 2013 at 12:16 AM, Jay Vyas <[email protected]> wrote:
>>>>>> Hi guys:
>>>>>> 
>>>>>> I'm wondering - if I'm running MapReduce jobs on a cluster with large
>>>>>> block sizes - can I increase performance with any of:
>>>>>> 
>>>>>> 1) A custom FileInputFormat
>>>>>> 
>>>>>> 2) A custom partitioner 
>>>>>> 
>>>>>> 3) -DnumReducers
>>>>>> 
>>>>>> Clearly, (3) will be an issue due to the fact that it might overload
>>>>>> tasks and network traffic... but maybe (1) or (2) would be a precise way
>>>>>> to "use" partitions as a "poor man's" block.
>>>>>> 
>>>>>> Just a thought - not sure if anyone has tried (1) or (2) before in order
>>>>>> to simulate blocks and increase locality by utilizing the Partitioner API.
>>>>>> 
>>>>>> -- 
>>>>>> Jay Vyas
>>>>>> http://jayunit100.blogspot.com
>>>> 
>>>> 
>>>> 
>>>> -- 
>>>> Jay Vyas
>>>> http://jayunit100.blogspot.com
>> 
>> 
>> 
>> -- 
>> Jay Vyas
>> http://jayunit100.blogspot.com
> 
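
PS: to make the "poor man's block" Partitioner idea from my first mail (quoted
above) a bit more concrete, this is roughly what I was picturing - a sketch only,
with the class name and the bucket math made up:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Spreads keys over a reducer count chosen to mimic a block count, so each
// reduce output file ends up roughly "block" sized on a block-less filesystem.
public class PseudoBlockPartitioner extends Partitioner<Text, LongWritable> {
  @Override
  public int getPartition(Text key, LongWritable value, int numReduceTasks) {
    // Plain hash partitioning; the interesting part is that numReduceTasks
    // would be set to roughly (estimated output size / desired pseudo-block
    // size) via job.setNumReduceTasks(...), not the hashing itself.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}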
