I haven't tried it, but this should also work: hadoop fs -Ddfs.block.size=<NEW BLOCK SIZE> -cp src dest
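For example, for a 512 MB target block size it would be something like hadoop fs -Ddfs.block.size=536870912 -cp /path/to/input /path/to/input-512m (untested; the paths are placeholders, the value is in bytes, and newer releases spell the property dfs.blocksize).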
Raj

>________________________________
> From: Anna Lahoud <[email protected]>
>To: [email protected]; [email protected]
>Sent: Tuesday, October 2, 2012 7:17 AM
>Subject: Re: File block size use
>
>Thank you. I will try today.
>
>On Tue, Oct 2, 2012 at 12:23 AM, Bejoy KS <[email protected]> wrote:
>
>>Hi Anna
>>
>>If you want to increase the block size of existing files, you can use an identity mapper with no reducer. Set the min and max split sizes to your requirement (512 MB), and use SequenceFileInputFormat and SequenceFileOutputFormat for your job. That should be all you need.
>>
>>Regards
>>Bejoy KS
>>
>>Sent from handheld, please excuse typos.
>>________________________________
>>From: Chris Nauroth <[email protected]>
>>Date: Mon, 1 Oct 2012 21:12:58 -0700
>>To: <[email protected]>
>>ReplyTo: [email protected]
>>Subject: Re: File block size use
>>
>>Hello Anna,
>>
>>If I understand correctly, you have a set of multiple sequence files, each much smaller than the desired block size, and you want to concatenate them into a set of fewer files, each one more closely aligned to your desired block size. Presumably, the goal is to improve throughput of MapReduce jobs using those files as input, by running fewer map tasks that each read a larger number of input records.
>>
>>Whenever I've had this kind of requirement, I've run a custom MapReduce job to implement the file consolidation. In my case, I was typically working with TextInputFormat (not sequence files). I used IdentityMapper and a custom reducer that passed through all values, but with the key set to NullWritable, because the keys (input file offsets in the case of TextInputFormat) were not valuable data. For my input data, this was sufficient to achieve fairly even distribution of data across the reducer tasks, and I could reasonably predict the input data set size, so I could reasonably set the number of reducers and get decent results. (This may or may not be true for your data set, though.)
>>
>>A weakness of this approach is that the keys must pass from the map tasks to the reduce tasks, only to get discarded before writing the final output. Also, the distribution of input records to reduce tasks is not truly random, and therefore the reduce output files may be uneven in size. This could be solved by writing NullWritable keys out of the map task instead of the reduce task and writing a custom implementation of Partitioner to distribute them randomly.
>>
>>To expand on this idea, it could be possible to inspect the FileStatus of each input, sum the values of FileStatus.getLen(), and then use that information to decide how many reducers to run (and therefore approximately set a target output file size). I'm not aware of any built-in or external utilities that do this for you, though.
>>
>>Hope this helps,
>>--Chris
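I haven't run it, but a rough Hadoop 2-style sketch of the job Chris describes above might look like this. The Text value type, the class names, and the hard-coded reducer count are placeholders; adjust them to whatever your sequence files actually hold.

import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class ConsolidateSeqFiles {

  // Drop the keys in the map task so they are not shuffled only to be
  // discarded later; the values pass through unchanged.
  public static class DropKeyMapper extends Mapper<Writable, Text, NullWritable, Text> {
    @Override
    protected void map(Writable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(NullWritable.get(), value);
    }
  }

  // Spread records across reducers at random; with a single NullWritable key,
  // the default hash partitioner would send everything to one reducer.
  public static class RandomPartitioner extends Partitioner<NullWritable, Text> {
    private final Random random = new Random();
    @Override
    public int getPartition(NullWritable key, Text value, int numPartitions) {
      return random.nextInt(numPartitions);
    }
  }

  // Pass-through reducer: each reducer writes one consolidated sequence file.
  public static class PassThroughReducer
      extends Reducer<NullWritable, Text, NullWritable, Text> {
    @Override
    protected void reduce(NullWritable key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      for (Text value : values) {
        ctx.write(key, value);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "consolidate sequence files");
    job.setJarByClass(ConsolidateSeqFiles.class);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setMapperClass(DropKeyMapper.class);
    job.setPartitionerClass(RandomPartitioner.class);
    job.setReducerClass(PassThroughReducer.class);
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    // Number of output files; per Chris's note, this could instead be computed
    // by summing FileStatus.getLen() over the inputs and dividing by the
    // desired output file size.
    job.setNumReduceTasks(8);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

You would run it with the input and output directories as arguments, e.g. hadoop jar <your jar> ConsolidateSeqFiles <input dir> <output dir>.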
>>On Mon, Oct 1, 2012 at 11:30 AM, Anna Lahoud <[email protected]> wrote:
>>
>>>I would like to be able to resize a set of inputs, already in SequenceFile format, to be larger.
>>>
>>>I have tried 'hadoop distcp -Ddfs.block.size=$[64*1024*1024]' and did not get what I expected. The outputs were exactly the same as the inputs.
>>>
>>>I also tried running a job with an IdentityMapper and IdentityReducer. Although that approaches a better solution, it still requires that I know in advance how many reducers I need to get better file sizes.
>>>
>>>I was looking at the SequenceFile.Writer constructors and noticed that there are block size parameters that can be used. Using a writer constructed with a 512 MB block size, there is nothing that splits the output, and I simply get a single file the size of my inputs.
>>>
>>>What is the current standard for combining sequence files to create larger files for map-reduce jobs? I have seen code that tracks what it writes into the file, but that seems like the long version. I am hoping there is a shorter path.
>>>
>>>Thank you.
>>>
>>>Anna
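For completeness, Bejoy's map-only variant only needs a driver, since the stock Mapper in the new API is already an identity mapper. Again untested, and the Text key/value classes are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class ResizeSeqFileSplits {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "resize sequence file splits");
    job.setJarByClass(ResizeSeqFileSplits.class);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    // The base Mapper is already an identity mapper; with zero reducers,
    // each map task writes one output file directly.
    job.setNumReduceTasks(0);
    // Ask for ~512 MB splits, per Bejoy's suggestion.
    FileInputFormat.setMinInputSplitSize(job, 512L * 1024 * 1024);
    FileInputFormat.setMaxInputSplitSize(job, 512L * 1024 * 1024);
    // Placeholders: set these to the key/value classes your files contain.
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

One caveat with the split-size approach: input splits never span files, so this resizes the splits of already-large inputs but will not merge many small files into fewer, larger ones.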
