Hi all, I need to generate Zip (yes, really) files and I am looking to do this as efficiently as possible. The zips will hold the output of Hive queries, but the Hive bit is irrelevant; this is a straight MapReduce (MR) problem.
Ideally I'd compress text files in MR world and then merge them into a Zip on the way out of the cluster, so that:
a) the reducers are compressing blocks in parallel
b) data coming out of Hadoop is compressed, so it is bandwidth efficient
c) I can simply merge the compressed data on the way out of HDFS, so there is no single bottleneck of the kind normally associated with Zip.

I notice the default compression codec uses Deflate, but it writes headers etc. to the .deflate file. In order to merge Deflate streams into a Zip you need a few things:
a) the length of the uncompressed data
b) the length of the compressed data
c) the CRC-32 of the uncompressed data
...and the deflated content needs to have been created headerless (e.g. with the no wrap option, and with SYNC_FLUSH mode).

Has anyone here tackled this problem before, or seen it tackled? Or has anyone got any tricks for getting Zips out of text data in HDFS efficiently?

Thanks,
Tim
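
P.S. To make the three requirements above concrete, here is a rough sketch in plain java.util.zip (no Hadoop involved) of what I imagine each reducer would have to produce: raw Deflate bytes made with the "no wrap" option, plus the CRC-32 and both lengths. The class and method names (RawDeflateSketch, Chunk, compress, decompress) are made up for illustration. Note that it finishes each stream with finish() rather than using SYNC_FLUSH, and it does not write any Zip headers, so treat it as a starting point under those assumptions and not a working solution.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.CRC32;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class RawDeflateSketch {

    /** Hypothetical holder for the values a Zip entry would need. */
    static final class Chunk {
        final byte[] compressed;      // raw Deflate bytes, no zlib header or trailer
        final long crc32;             // CRC-32 of the uncompressed data
        final long uncompressedSize;  // length of the uncompressed data

        Chunk(byte[] compressed, long crc32, long uncompressedSize) {
            this.compressed = compressed;
            this.crc32 = crc32;
            this.uncompressedSize = uncompressedSize;
        }

        long compressedSize() {
            return compressed.length;
        }
    }

    /** Compresses input as a single headerless Deflate stream and records its metadata. */
    static Chunk compress(byte[] input) {
        CRC32 crc = new CRC32();
        crc.update(input, 0, input.length);

        // nowrap = true: raw Deflate, without the zlib header and Adler-32 trailer
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true);
        deflater.setInput(input);
        deflater.finish();

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return new Chunk(out.toByteArray(), crc.getValue(), input.length);
    }

    /** Inflates a chunk again; only here to check that the round trip works. */
    static byte[] decompress(Chunk chunk) {
        Inflater inflater = new Inflater(true); // nowrap = true to match compress()
        // The Inflater docs ask for one extra dummy input byte in nowrap mode.
        inflater.setInput(Arrays.copyOf(chunk.compressed, chunk.compressed.length + 1));
        byte[] result = new byte[(int) chunk.uncompressedSize];
        int off = 0;
        try {
            while (off < result.length && !inflater.finished()) {
                int n = inflater.inflate(result, off, result.length - off);
                if (n == 0 && inflater.needsInput()) {
                    break; // ran out of input before filling the buffer
                }
                off += n;
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("not a valid raw Deflate stream", e);
        } finally {
            inflater.end();
        }
        return Arrays.copyOf(result, off);
    }

    public static void main(String[] args) {
        byte[] data = "one row of query output\n".repeat(1000)
                .getBytes(StandardCharsets.UTF_8);
        Chunk chunk = compress(data);
        System.out.println("uncompressed size: " + chunk.uncompressedSize);
        System.out.println("compressed size:   " + chunk.compressedSize());
        System.out.println("crc32:             " + Long.toHexString(chunk.crc32));
        System.out.println("round trip ok:     " + Arrays.equals(decompress(chunk), data));
    }
}
```

The part I am still missing is how to get Hadoop to emit this per reducer and then stitch the pieces into a valid Zip on the way out.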
