Hi,

I have been thinking recently about various functions I plan to implement in MapReduce. One common theme is that I use many reducers to increase parallelism, and hence performance. However, this means that I potentially end up with many separate files in HDFS, one from each reducer. This isn't necessarily a problem in and of itself, but it's a little awkward having lots of small files holding the contents of one logical file, and presumably it also puts extra load on the NameNode. (Of course I could produce a single output file by using a single reducer, but that seems to defeat the purpose of a parallel solution.)
So this led me to wonder what the best way would be to concatenate all these HDFS files into one file on HDFS. There doesn't seem to be an FsShell command to do this. I could write something in Java that just uses the FSDataInputStream/FSDataOutputStream classes to open all the input files and write them into a single output file, but that seems a little primitive, with little scope for parallelism. It would also seem sensible to have some way of doing this that avoids holding two copies of the data in HDFS, or something that merges at the block level and so never needs more than one or two blocks of overhead. I see that there's HDFS-222, but that seems to be targeted at the case where all of the files, bar the last, consist of an integral number of whole blocks, which doesn't apply in my case.

Is there already something to do this, or have I missed something?

Regards,
Peter Marron
Trillium Software UK Limited
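P.S. For concreteness, here is a rough sketch of the naive single-writer approach I mention above. I've written it with plain java.io streams so it stands alone; on HDFS the equivalents would be FileSystem.open() (giving an FSDataInputStream) and FileSystem.create() (giving an FSDataOutputStream). The class and method names here are just illustrative, not from any existing tool.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class ConcatFiles {
    // Copy each input file, in order, onto the end of the output stream.
    // On HDFS you would swap Files.newInputStream/newOutputStream for
    // FileSystem.open() and FileSystem.create() on the cluster FileSystem.
    public static void concat(List<Path> inputs, Path output) throws IOException {
        byte[] buffer = new byte[64 * 1024];
        try (OutputStream out = Files.newOutputStream(output)) {
            for (Path input : inputs) {
                try (InputStream in = Files.newInputStream(input)) {
                    int n;
                    while ((n = in.read(buffer)) != -1) {
                        out.write(buffer, 0, n);
                    }
                }
            }
        }
    }
}
```

As you can see, it is strictly sequential and briefly holds a second full copy of the data, which is exactly the drawback I complain about above.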
