Thanks to both of you for your responses. I was indeed talking about developing a codec utility as the Hadoop application itself.
In particular, thanks to Bertrand for the lengthy response. I'm actually learning Hadoop at the moment, so I've been trying to find a suitable (very modestly sized) application for a student project (1-2 weeks max).

I had previously written a codec utility in Perl that uses a combination of dictionary (LZW) and arithmetic coding techniques. Compression rates aren't that bad, but it's very slow. In any case, I just thought that it might be interesting to Hadoop-ify the program, since compression/decompression is compute-intensive and could probably benefit from parallelization. I'm thinking now that it might not be such a good fit after all.

Also, if anyone reading this has any novel ideas for demonstrating Hadoop's capabilities within a short development window, I'd love to hear about them. At the moment, I'm leaning towards a distributed grep, most likely with some kind of agrep-like functionality. Not really a searingly inventive idea, but if anyone can suggest some way I could make it more exciting, I'd love to hear about that too.

-Rob

On 31 March 2013 10:38, Bertrand Dechoux <[email protected]> wrote:

> Your question could be interpreted in another way: should I use Hadoop in
> order to perform massive compression/decompression using my own
> (possibly proprietary) utility?
>
> So yes, Hadoop can be used to parallelize the work. But the real answer
> will depend on your context, as always.
> How many files need to be processed? What is the average size? Is your
> utility parallelizable? How will the data be used after
> compression/decompression?
>
> The number of files and their size are important because Hadoop is designed
> to deal with a relatively small number of relatively big files: a few
> million gigabyte-sized files instead of billions of megabyte-sized
> files. Many small files could become a performance issue. But a
> huge file is not necessarily better, because if your utility is not
> parallelizable then, regardless of Hadoop, uncompressing a 2GB file requires
> a single process to read the whole file, and then the uncompressed version
> needs to be stored somewhere.
>
> So the final question is: for what purpose? If it is for massive
> decompression, keeping the compressed version inside Hadoop seems a sane
> strategy. So it might be better to rely on a standard compression utility
> and uncompress only before processing inside Hadoop itself. If it is for
> compression, well, it might not be that massive because you might not
> receive that many files at the same time.
>
> The common strategy in Hadoop is not to compress a whole file but instead
> to compress the parts (blocks) of the file. This way the size of the
> compression work is bounded, and the work can be parallelized even
> with a non-parallelizable compression utility. The drawback is that the
> "list of compressed blocks" is not a standard compressed file, so
> interoperability with other parts of your system is not guaranteed without
> extra work.
>
> Bertrand
>
>
> On Sat, Mar 30, 2013 at 8:15 PM, Jens Scheidtmann <
> [email protected]> wrote:
>
>> Dear Robert,
>>
>> SequenceFiles support record, block, or no compression. You can
>> configure which codec (gzip, bzip2, etc.) is used. Have a look at
>> http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/compress/CompressionCodec.html
>>
>> Best regards,
>>
>> Jens
>>
>
> --
> Bertrand Dechoux
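
For anyone following the thread, the block-wise strategy Bertrand describes can be sketched outside Hadoop in a few lines. This is only an illustration under my own assumptions, not how Hadoop implements it: the block size, the function names, and the use of Python's zlib and a thread pool are all hypothetical choices made to keep the example self-contained.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor
from typing import List

# Hypothetical block size, chosen only for illustration.
BLOCK_SIZE = 64 * 1024


def split_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Cut the input into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def compress_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Compress each block independently of the others.

    Because no block depends on another, the work per task is bounded by
    the block size and the blocks can be handed to separate workers, even
    though the compressor itself is not parallel.
    """
    with ThreadPoolExecutor() as pool:
        return list(pool.map(zlib.compress, split_blocks(data, block_size)))


def decompress_blocks(blocks: List[bytes]) -> bytes:
    """Decompress every block on its own and join the results in order."""
    with ThreadPoolExecutor() as pool:
        return b"".join(pool.map(zlib.decompress, blocks))


if __name__ == "__main__":
    original = b"hadoop " * 50000
    blocks = compress_blocks(original)
    print(len(blocks), "blocks,", sum(len(b) for b in blocks), "compressed bytes")
    assert decompress_blocks(blocks) == original
```

Any single block can be decompressed without touching the others, which is what makes the approach parallelizable. The result is a list of independent compressed blocks rather than one standard compressed file, which is the interoperability drawback mentioned above.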
