Compression is not something YARN itself deals with. If you want to store files compressed, compress them before (or while) loading them into HDFS. The codecs available for files on HDFS are listed in the "io.compression.codecs" property in core-site.xml. If you want to use a compression format that is not recognized by default, set "STORED AS INPUTFORMAT" to the input-format class that handles that compression, e.g. "com.hadoop.mapred.DeprecatedLzoTextInputFormat".

To your questions:
1. Compress each file in the dir rather than the whole dir.
2. For compression ratio, bzip2 > gzip > lzo; decompression speed is in the opposite order, so it is a trade-off. gzip is the popular choice as far as I know.
3. No, nothing extra is needed.
4. Yes, and the process is transparent to the user.
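Point 1 can be sketched locally like this (a minimal sketch, assuming the dir1/file*.txt layout from your mail; the file contents here are just placeholders). Each file is gzipped individually, since tar'ing the whole directory would give the job one opaque archive instead of five separate inputs:

```shell
# Recreate the example layout (placeholder contents, just for illustration).
mkdir -p dir1
for i in 1 2 3 4 5; do
  echo "sample line $i" > "dir1/file$i.txt"
done

# gzip each file in place: file1.txt becomes file1.txt.gz, and so on.
for f in dir1/*.txt; do
  gzip "$f"
done

ls dir1
```

After `hadoop fs -put dir1 /input`, wordcount can read the .gz files directly: the framework matches the .gz suffix to the GzipCodec registered in "io.compression.codecs" and decompresses mapper input for you, which is the transparency mentioned in point 4.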
2013/10/16 xeon <[email protected]>

> Hi,
>
> I want to execute wordcount in YARN with compression enabled on a dir
> with several files, but for that I must compress the input.
>
> dir1/file1.txt
> dir1/file2.txt
> dir1/file3.txt
> dir1/file4.txt
> dir1/file5.txt
>
> 1 - Should I compress the whole dir or each file in the dir?
>
> 2 - Should I use gzip or bzip2?
>
> 3 - Do I need to set up any YARN configuration file?
>
> 4 - When the job is running, are the files decompressed before the
> mappers run and compressed again after the reducers execute?
>
> --
> Thanks,
