Re: Using Hadoop for codec functionality

Bertrand Dechoux Sun, 31 Mar 2013 02:38:34 -0700

Your question could be interpreted in another way: should I use Hadoop in
order to perform massive compression/decompression using my own
(possibly proprietary) utility?

So yes, Hadoop can be used to parallelize the work. But the real answer
will depend on your context, as always.
How many files need to be processed? What is their average size? Is your
utility parallelizable? How will the data be used after
compression/decompression?

The number of files and their size are important because Hadoop is designed
to deal with a relatively small number of relatively big files: a few
million gigabyte-sized files rather than billions of megabyte-sized
files. Many small files could become a performance issue. But a
huge file is not necessarily better: if your utility is not
parallelizable then, regardless of Hadoop, uncompressing a 2GB file requires
a single process to read the whole file, and the uncompressed version
then needs to be stored somewhere.
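
To illustrate the single-reader point: a gzip stream has to be decoded from
its first byte onward, so one process walks the whole file. A minimal
sketch in Python using only the standard library (this is not Hadoop code;
the function name `decompress_stream` and the chunk size are hypothetical,
chosen for illustration):

```python
import gzip
import io


def decompress_stream(src, dst, chunk_size=1 << 20):
    """Sequentially decompress a gzip stream from src into dst.

    Returns the number of uncompressed bytes written. Only one reader can
    do this work, because the stream is decoded front to back.
    """
    total = 0
    with gzip.GzipFile(fileobj=src, mode="rb") as gz:
        while True:
            chunk = gz.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)  # the uncompressed output must be stored somewhere
            total += len(chunk)
    return total


if __name__ == "__main__":
    payload = b"hadoop " * 100000
    out = io.BytesIO()
    written = decompress_stream(io.BytesIO(gzip.compress(payload)), out)
    print(written == len(payload))
```

Reading in fixed-size chunks keeps memory bounded, but it does not make the
work parallel: the whole file still passes through one process.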

So the final question is: for what purpose? If it is for massive
decompression, keeping the compressed version inside Hadoop seems a sane
strategy. In that case it might be better to rely on a standard compression
utility and uncompress only right before processing inside Hadoop itself.
If it is for compression, well, it might not be that massive because you
might not receive that many files at the same time.

The common strategy in Hadoop is not to compress a whole file but instead
to compress the parts (blocks) of the file. This way the size of each
compression task is bounded and the work can be parallelized even
with a non-parallelizable compression utility. The drawback is that the
"list of compressed blocks" is not a standard compressed file, so
interoperability with other parts of your system is not guaranteed without
extra work.
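
The block idea can be sketched outside Hadoop as follows. This is only an
illustration under stated assumptions: it uses Python's standard zlib
module as a stand-in for the compression utility, the block size and the
helper names (`split_blocks`, `compress_blocks`, `decompress_blocks`) are
hypothetical, and it does not reproduce any actual Hadoop file format.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 64 * 1024  # illustrative value, not a Hadoop default


def split_blocks(data, block_size=BLOCK_SIZE):
    """Cut the input into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def compress_blocks(data, block_size=BLOCK_SIZE, workers=4):
    """Compress each block independently; the blocks can run in parallel
    even though each individual zlib.compress call is sequential."""
    blocks = split_blocks(data, block_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, so block order is preserved
        return list(pool.map(zlib.compress, blocks))


def decompress_blocks(compressed_blocks):
    """Decompress every block and join the results back together."""
    return b"".join(zlib.decompress(block) for block in compressed_blocks)


if __name__ == "__main__":
    data = bytes(range(256)) * 2000
    blocks = compress_blocks(data)
    print(len(blocks), decompress_blocks(blocks) == data)
```

The result is a list of independently compressed blocks rather than one
standard compressed stream, which is exactly the interoperability cost
mentioned above: a reader has to know where each block begins and ends.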

Bertrand

On Sat, Mar 30, 2013 at 8:15 PM, Jens Scheidtmann <
[email protected]> wrote:

> Dear Robert,
>
> SequenceFiles support record, block, or no compression. You can
> configure which codec (gzip, bzip2, etc.) is used. Have a look at
> http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/compress/CompressionCodec.html
>
> Best regards,
>
> Jens
>
>
>

-- 
Bertrand Dechoux
