Raj - I was not able to get this to work either.

On Tue, Oct 2, 2012 at 10:52 AM, Raj Vishwanathan <[email protected]> wrote:
> I haven't tried it, but this should also work:
>
> hadoop fs -Ddfs.block.size=<NEW BLOCK SIZE> -cp src dest
>
> Raj
>
> ------------------------------
> From: Anna Lahoud <[email protected]>
> To: [email protected]; [email protected]
> Sent: Tuesday, October 2, 2012 7:17 AM
> Subject: Re: File block size use
>
> Thank you. I will try today.
>
> On Tue, Oct 2, 2012 at 12:23 AM, Bejoy KS <[email protected]> wrote:
>
> Hi Anna
>
> If you want to increase the block size of existing files, you can use an
> identity mapper with no reducer. Set the min and max split sizes to your
> requirement (512MB). Use SequenceFileInputFormat and
> SequenceFileOutputFormat for your job. Your job should be done.
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
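A minimal driver for the map-only job Bejoy describes might look like the following. This is an illustrative sketch, not code from the thread: it assumes the Hadoop 1.x-era "new" mapreduce API, and the class name ConsolidateSeqFiles and the Text/Text key/value types are placeholders that must match whatever the sequence files actually contain.

    // Map-only consolidation job along the lines Bejoy describes:
    // identity mapper, no reducers, ~512MB splits, sequence files in
    // and out.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class ConsolidateSeqFiles {
      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "consolidate sequence files");
        job.setJarByClass(ConsolidateSeqFiles.class);

        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);

        // The base Mapper class is the identity mapper in the new API,
        // and with zero reduce tasks each mapper writes its own output
        // file directly.
        job.setMapperClass(Mapper.class);
        job.setNumReduceTasks(0);

        // Must match the key/value types stored in the input files.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Aim for ~512MB per split, per Bejoy's suggestion.
        long target = 512L * 1024 * 1024;
        FileInputFormat.setMinInputSplitSize(job, target);
        FileInputFormat.setMaxInputSplitSize(job, target);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

One caveat worth noting: FileInputFormat never packs more than one file into a split, so a map-only job like this only re-chunks files that are already larger than the split size. Truly merging many small files into fewer large ones needs either something like CombineFileInputFormat or the reduce-side approach Chris describes next.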
> ------------------------------
> From: Chris Nauroth <[email protected]>
> Date: Mon, 1 Oct 2012 21:12:58 -0700
> To: <[email protected]>
> Reply-To: [email protected]
> Subject: Re: File block size use
>
> Hello Anna,
>
> If I understand correctly, you have a set of multiple sequence files, each
> much smaller than the desired block size, and you want to concatenate them
> into a set of fewer files, each one more closely aligned to your desired
> block size. Presumably, the goal is to improve throughput of map reduce
> jobs using those files as input by running fewer map tasks, each reading a
> larger number of input records.
>
> Whenever I've had this kind of requirement, I've run a custom map reduce
> job to implement the file consolidation. In my case, I was typically
> working with TextInputFormat (not sequence files). I used IdentityMapper
> and a custom reducer that passed through all values but with the key set
> to NullWritable, because the keys (input file offsets in the case of
> TextInputFormat) were not valuable data. For my input data, this was
> sufficient to achieve fairly even distribution of data across the reducer
> tasks, and I could reasonably predict the input data set size, so I could
> reasonably set the number of reducers and get decent results. (This may
> or may not be true for your data set though.)
>
> A weakness of this approach is that the keys must pass from the map tasks
> to the reduce tasks, only to get discarded before writing the final
> output. Also, the distribution of input records to reduce tasks is not
> truly random, and therefore the reduce output files may be uneven in size.
> This could be solved by writing NullWritable keys out of the map task
> instead of the reduce task and writing a custom implementation of
> Partitioner to distribute them randomly.
>
> To expand on this idea, it could be possible to inspect the FileStatus of
> each input, sum the values of FileStatus.getLen(), and then use that
> information to make a decision about how many reducers to run (and
> therefore approximately set a target output file size). I'm not aware of
> any built-in or external utilities that do this for you, though.
>
> Hope this helps,
> --Chris
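The pieces of Chris's refinement might look like this. Again a sketch under assumptions: it uses the TextInputFormat-style LongWritable/Text records from his example, and the class name ConsolidateHelpers and the helper reducersForTargetSize are illustrative, not from the thread.

    import java.io.IOException;
    import java.util.Random;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class ConsolidateHelpers {

      // Pass values through with a NullWritable key, so the original
      // keys (file offsets, carrying no useful data) never cross the
      // shuffle.
      public static class NullKeyMapper
          extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          context.write(NullWritable.get(), value);
        }
      }

      // Scatter records uniformly at random so the reduce output
      // files come out roughly even in size.
      public static class RandomPartitioner
          extends Partitioner<NullWritable, Text> {
        private final Random random = new Random();
        @Override
        public int getPartition(NullWritable key, Text value,
            int numPartitions) {
          return random.nextInt(numPartitions);
        }
      }

      // Sum FileStatus.getLen() over the inputs and derive a reducer
      // count targeting roughly targetBytes of output per reducer.
      public static int reducersForTargetSize(FileSystem fs, Path inputDir,
          long targetBytes) throws IOException {
        long total = 0;
        for (FileStatus stat : fs.listStatus(inputDir)) {
          if (!stat.isDir()) {
            total += stat.getLen();
          }
        }
        return Math.max(1, (int) ((total + targetBytes - 1) / targetBytes));
      }
    }

Wired into a job with job.setMapperClass(NullKeyMapper.class), job.setPartitionerClass(RandomPartitioner.class), and job.setNumReduceTasks(reducersForTargetSize(fs, inputDir, targetBytes)), the default new-API Reducer then passes the NullWritable-keyed records straight through to the output files.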
> On Mon, Oct 1, 2012 at 11:30 AM, Anna Lahoud <[email protected]> wrote:
>
> I would like to be able to resize a set of inputs, already in SequenceFile
> format, to be larger.
>
> I have tried 'hadoop distcp -Ddfs.block.size=$[64*1024*1024]' and did not
> get what I expected. The outputs were exactly the same as the inputs.
>
> I also tried running a job with an IdentityMapper and IdentityReducer.
> Although that approaches a better solution, it still requires that I know
> in advance how many reducers I need to get better file sizes.
>
> I was looking at the SequenceFile.Writer constructors and noticed that
> there are block size parameters that can be used. Using a writer
> constructed with a 512MB block size, there is nothing that splits the
> output and I simply get a single file the size of my inputs.
>
> What is the current standard for combining sequence files to create larger
> files for map-reduce jobs? I have seen code that tracks what it writes
> into the file, but that seems like the long version. I am hoping there is
> a shorter path.
>
> Thank you.
>
> Anna
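For reference, the block size parameter Anna mentions on SequenceFile.createWriter sets the HDFS block size recorded for the single output file; it is a storage-layer attribute, not a split point, which is consistent with the one large file she observed. A sketch of that overload follows, with the path and Text/Text key/value types as illustrative placeholders:

    // The blockSize argument below sets the HDFS block size of the
    // single output file. It does not split the output into multiple
    // files, matching the behavior described above.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.SequenceFile.CompressionType;
    import org.apache.hadoop.io.Text;

    public class WriterBlockSizeDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path(args[0]); // illustrative output path

        long blockSize = 512L * 1024 * 1024; // 512MB HDFS blocks
        int bufferSize = conf.getInt("io.file.buffer.size", 4096);

        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, out, Text.class, Text.class,
            bufferSize, fs.getDefaultReplication(), blockSize,
            CompressionType.NONE, null, null,
            new SequenceFile.Metadata());
        try {
          writer.append(new Text("key"), new Text("value"));
        } finally {
          writer.close();
        }
      }
    }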