Hi Niels,
Thanks for the reply.
Changing the Avro files is not really an option for me, as it would
require a lot of time (I have a lot of them).
The Avro files themselves are already compressed a bit.
Still, bzip2 gives 50% compression on one Avro file.
So what I want is to use bzip2-compressed files as input to my MR jobs.
Bzip2 is splittable.
It should be possible somehow, but I can't seem to find out how at the moment.
On 22.09.2014 17:13, Niels Basjes wrote:
Hi,
You can use GZip compression inside the Avro files and still have
splittable Avro files.
This has to do with the fact that there is a block structure inside
the Avro file and these blocks are gzipped.
I suggest you simply try it.
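To illustrate the block idea, here is a minimal sketch using only
java.util.zip. It is not Avro's actual file format: the class name
BlockCompressionSketch and the helpers deflateBlock and inflateBlock are
made up for this example. It only shows that blocks compressed
independently can be read independently, which is what makes such a file
splittable.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Illustration only: this is NOT Avro's on-disk format, just the same idea.
class BlockCompressionSketch {

    // Compress one block on its own, so it can be read without its neighbours.
    static byte[] deflateBlock(byte[] raw) {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION);
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompress a single block; no other block is needed to do this.
    static byte[] inflateBlock(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalStateException("truncated or invalid block");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt block", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // A "file" made of three independently compressed blocks.
        List<byte[]> file = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            StringBuilder records = new StringBuilder();
            for (int r = 0; r < 1000; r++) {
                records.append("record ").append(r).append(" in block ").append(i).append('\n');
            }
            file.add(deflateBlock(records.toString().getBytes(StandardCharsets.UTF_8)));
        }

        // A task assigned only the last block can decode it directly.
        byte[] lastBlock = inflateBlock(file.get(2));
        String text = new String(lastBlock, StandardCharsets.UTF_8);
        System.out.println("compressed size of block 2: " + file.get(2).length);
        System.out.println("first record: " + text.substring(0, text.indexOf('\n')));
    }
}
```

In Avro's own Java API the equivalent choice is made on the writer, for
example DataFileWriter.setCodec(CodecFactory.deflateCodec(6)). The codec
name is stored in the file header, so readers should pick it up without a
change to the job's input format.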
Niels
On Mon, Sep 22, 2014 at 4:40 PM, Georgi Ivanov wrote:
Hi guys,
I would like to compress the files on HDFS to save some storage.
As far as I can see, bzip2 is the only format that is splittable (and
it is slow).
The actual files are Avro.
So in my driver class I have:
job.setInputFormatClass(AvroKeyInputFormat.class);
I have a number of jobs processing Avro files, so I would
like to keep the code changes to a minimum.
Is it possible to compress these Avro files with bzip2 and keep
the code of the MR jobs the same (or with little change)?
If it is, please give me some hints, as so far I haven't been able to
find any good resources on the Internet.
Georgi
--
Best regards / Met vriendelijke groeten,
Niels Basjes