> ...
>
> That will be my approach for now, or disabling compression altogether for
> these files. The only problem I have is that compression is so efficient
> that any operation in the mapper (so on the uncompressed data) just makes
> the mapper throw an OOM exception, no matter how much memory I give it.
>
> What partly works though, is setting a low mapred.max.split.size. In a
> directory containing 34 files, I get 33 mappers (???). When setting
> hive.merge.mapfiles to false (and leaving mapred.max.split.size at its fs
> blocksize default), it doesn't seem to have any effect and I get 20 mappers
> only.

You can still use compression if you use a splittable format, like bzip2 with block compression. Gzip isn't splittable.
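For example (just a sketch using the old Hadoop 1.x property names; the table names are made up, and mapred.output.compression.type only matters if you write SequenceFiles), you could write a query's output with the bzip2 codec like this:

    -- enable compressed output with a splittable codec
    SET hive.exec.compress.output=true;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec;
    SET mapred.output.compression.type=BLOCK;  -- block compression for SequenceFile output

    -- hypothetical tables, just to show the pattern
    INSERT OVERWRITE TABLE logs_bz2
    SELECT * FROM logs_raw;

Because bzip2 files can be split, each mapper then only decompresses its own split instead of a whole file.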
If you're running out of memory, you could also increase the heap size for the client VMs. See the "Real-World Cluster Configurations" section of this page:
http://hadoop.apache.org/docs/r1.0.3/cluster_setup.html

By the way, you could also experiment with turning on intermediate compression (compression of the data sent between the mapper and reducer tasks), compression of the output, etc., as discussed here:
https://cwiki.apache.org/Hive/adminmanual-configuration.html
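To make that concrete (again only a sketch with Hadoop 1.x property names; the heap value is a placeholder, and I'm assuming the OOM is in the task child JVMs rather than the client), these can be set per-session in Hive:

    -- more heap for the map/reduce child JVMs (placeholder value)
    SET mapred.child.java.opts=-Xmx2048m;

    -- compress the intermediate map output; it is never split, so a
    -- non-splittable codec like gzip is fine here
    SET hive.exec.compress.intermediate=true;
    SET mapred.compress.map.output=true;
    SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

Compressing the intermediate data mostly saves shuffle I/O between the map and reduce stages, so it helps with disk and network pressure rather than mapper heap, but it's cheap to try alongside the heap increase.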