I haven't tried it, but this should also work: hadoop fs -Ddfs.block.size=<NEW BLOCK SIZE> -cp src dest
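For example, for a 512 MB target block size it would be something like hadoop fs -Ddfs.block.size=536870912 -cp /path/to/input /path/to/input-512m (untested; the paths are placeholders, the value is in bytes, and newer releases spell the property dfs.blocksize).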
Raj

>________________________________
> From: Anna Lahoud <[email protected]>
>To: [email protected]; [email protected]
>Sent: Tuesday, October 2, 2012 7:17 AM
>Subject: Re: File block size use
>
>Thank you. I will try today.
>
>On Tue, Oct 2, 2012 at 12:23 AM, Bejoy KS <[email protected]> wrote:
>
>>Hi Anna
>>
>>If you want to increase the block size of existing files, you can use an identity mapper with no reducer. Set the min and max split sizes to your requirement (512 MB), and use SequenceFileInputFormat and SequenceFileOutputFormat for your job. That should be all you need.
>>
>>Regards
>>Bejoy KS
>>
>>Sent from handheld, please excuse typos.
>>________________________________
>>From: Chris Nauroth <[email protected]>
>>Date: Mon, 1 Oct 2012 21:12:58 -0700
>>To: <[email protected]>
>>ReplyTo: [email protected]
>>Subject: Re: File block size use
>>
>>Hello Anna,
>>
>>If I understand correctly, you have a set of multiple sequence files, each much smaller than the desired block size, and you want to concatenate them into a set of fewer files, each one more closely aligned to your desired block size. Presumably, the goal is to improve throughput of MapReduce jobs using those files as input, by running fewer map tasks that each read a larger number of input records.
>>
>>Whenever I've had this kind of requirement, I've run a custom MapReduce job to implement the file consolidation. In my case, I was typically working with TextInputFormat (not sequence files). I used IdentityMapper and a custom reducer that passed through all values, but with the key set to NullWritable, because the keys (input file offsets in the case of TextInputFormat) were not valuable data. For my input data, this was sufficient to achieve fairly even distribution of data across the reducer tasks, and I could reasonably predict the input data set size, so I could reasonably set the number of reducers and get decent results. (This may or may not be true for your data set, though.)
>>
>>A weakness of this approach is that the keys must pass from the map tasks to the reduce tasks, only to get discarded before writing the final output. Also, the distribution of input records to reduce tasks is not truly random, and therefore the reduce output files may be uneven in size. This could be solved by writing NullWritable keys out of the map task instead of the reduce task and writing a custom implementation of Partitioner to distribute them randomly.
>>
>>To expand on this idea, it could be possible to inspect the FileStatus of each input, sum the values of FileStatus.getLen(), and then use that information to decide how many reducers to run (and therefore approximately set a target output file size). I'm not aware of any built-in or external utilities that do this for you, though.
>>
>>Hope this helps,
>>--Chris
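I haven't run it, but a rough Hadoop 2-style sketch of the job Chris describes above might look like this. The Text value type, the class names, and the hard-coded reducer count are placeholders; adjust them to whatever your sequence files actually hold.

import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class ConsolidateSeqFiles {

  // Drop the keys in the map task so they are not shuffled only to be
  // discarded later; the values pass through unchanged.
  public static class DropKeyMapper extends Mapper<Writable, Text, NullWritable, Text> {
    @Override
    protected void map(Writable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(NullWritable.get(), value);
    }
  }

  // Spread records across reducers at random; with a single NullWritable key,
  // the default hash partitioner would send everything to one reducer.
  public static class RandomPartitioner extends Partitioner<NullWritable, Text> {
    private final Random random = new Random();
    @Override
    public int getPartition(NullWritable key, Text value, int numPartitions) {
      return random.nextInt(numPartitions);
    }
  }

  // Pass-through reducer: each reducer writes one consolidated sequence file.
  public static class PassThroughReducer
      extends Reducer<NullWritable, Text, NullWritable, Text> {
    @Override
    protected void reduce(NullWritable key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      for (Text value : values) {
        ctx.write(key, value);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "consolidate sequence files");
    job.setJarByClass(ConsolidateSeqFiles.class);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setMapperClass(DropKeyMapper.class);
    job.setPartitionerClass(RandomPartitioner.class);
    job.setReducerClass(PassThroughReducer.class);
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    // Number of output files; per Chris's note, this could instead be computed
    // by summing FileStatus.getLen() over the inputs and dividing by the
    // desired output file size.
    job.setNumReduceTasks(8);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

You would run it with the input and output directories as arguments, e.g. hadoop jar <your jar> ConsolidateSeqFiles <input dir> <output dir>.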
>>On Mon, Oct 1, 2012 at 11:30 AM, Anna Lahoud <[email protected]> wrote:
>>
>>>I would like to be able to resize a set of inputs, already in SequenceFile format, to be larger.
>>>
>>>I have tried 'hadoop distcp -Ddfs.block.size=$[64*1024*1024]' and did not get what I expected. The outputs were exactly the same as the inputs.
>>>
>>>I also tried running a job with an IdentityMapper and IdentityReducer. Although that approaches a better solution, it still requires that I know in advance how many reducers I need to get better file sizes.
>>>
>>>I was looking at the SequenceFile.Writer constructors and noticed that there are block size parameters that can be used. Using a writer constructed with a 512 MB block size, there is nothing that splits the output, and I simply get a single file the size of my inputs.
>>>
>>>What is the current standard for combining sequence files to create larger files for map-reduce jobs? I have seen code that tracks what it writes into the file, but that seems like the long version. I am hoping there is a shorter path.
>>>
>>>Thank you.
>>>
>>>Anna
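For completeness, Bejoy's map-only variant only needs a driver, since the stock Mapper in the new API is already an identity mapper. Again untested, and the Text key/value classes are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class ResizeSeqFileSplits {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "resize sequence file splits");
    job.setJarByClass(ResizeSeqFileSplits.class);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    // The base Mapper is already an identity mapper; with zero reducers,
    // each map task writes one output file directly.
    job.setNumReduceTasks(0);
    // Ask for ~512 MB splits, per Bejoy's suggestion.
    FileInputFormat.setMinInputSplitSize(job, 512L * 1024 * 1024);
    FileInputFormat.setMaxInputSplitSize(job, 512L * 1024 * 1024);
    // Placeholders: set these to the key/value classes your files contain.
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

One caveat with the split-size approach: input splits never span files, so this resizes the splits of already-large inputs but will not merge many small files into fewer, larger ones.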
