Your question could be interpreted in another way: should I use Hadoop to perform massive compression/decompression using my own (possibly proprietary) utility?
So yes, Hadoop can be used to parallelize the work. But the real answer depends on your context, as always. How many files need to be processed? What is their average size? Is your utility parallelizable? How will the data be used after compression/decompression?

The number of files and their size matter because Hadoop is designed to deal with a relatively small number of relatively large files: a few million gigabyte-sized files rather than billions of megabyte-sized files. Many small files can become a performance issue. But a huge file is not necessarily better: if your utility is not parallelizable then, regardless of Hadoop, uncompressing a 2GB file requires a single process to read the whole file, and the uncompressed version then needs to be stored somewhere.

So the final question is: for what purpose? If it is for massive decompression, keeping the compressed version inside Hadoop seems a sane strategy. So it might be better to rely on a standard compression utility and uncompress only just before processing, inside Hadoop itself. If it is for compression, well, it might not be that massive, because you might not receive that many files at the same time.

The common strategy in Hadoop is not to compress a whole file but instead to compress the parts (blocks) of the file. This way the size of each compression task is bounded, and the work can be parallelized even with a non-parallelizable compression utility. The drawback is that the "list of compressed blocks" is not a standard compressed file, so interoperability with other parts of your system is not guaranteed without extra work.

Bertrand

On Sat, Mar 30, 2013 at 8:15 PM, Jens Scheidtmann <[email protected]> wrote:
> Dear Robert,
>
> SequenceFiles do have either record, block or no compression. You can
> configure, which codec (gzip, bzip2, etc.) is used.
> Have a look at
> http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/compress/CompressionCodec.html
>
> Best regards,
>
> Jens

--
Bertrand Dechoux
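
P.S. To make the "compress the blocks, not the whole file" idea from the reply above concrete, here is a minimal sketch in plain Python. It does not use Hadoop or the SequenceFile format. The block size, the function names, and the length-prefixed container layout are all assumptions made up for illustration, and a thread pool merely stands in for parallel tasks.

```python
import struct
import zlib
from concurrent.futures import ThreadPoolExecutor

# Hypothetical block size for the illustration; real HDFS blocks are far larger.
BLOCK_SIZE = 64 * 1024


def split_blocks(data, block_size=BLOCK_SIZE):
    """Cut the input into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def compress_blocks(data, block_size=BLOCK_SIZE, workers=4):
    """Compress every block independently and pack the results together.

    Each block is a bounded unit of work, so the blocks can be handed to
    separate workers even though zlib.compress itself is single-threaded.
    """
    blocks = split_blocks(data, block_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        compressed = list(pool.map(zlib.compress, blocks))
    # Made-up container layout: 4-byte big-endian length, then the payload.
    return b"".join(struct.pack(">I", len(c)) + c for c in compressed)


def decompress_blocks(container):
    """Walk the length-prefixed container and decompress block by block."""
    parts = []
    offset = 0
    while offset < len(container):
        (length,) = struct.unpack_from(">I", container, offset)
        offset += 4
        parts.append(zlib.decompress(container[offset:offset + length]))
        offset += length
    return b"".join(parts)
```

Calling `decompress_blocks(compress_blocks(data))` should give back `data`. Passing the container straight to `zlib.decompress` fails, because the length prefixes make it a "list of compressed blocks" rather than one standard compressed stream. That is the interoperability cost mentioned above.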
