Hi,
I'm using a map-reduce job with HFileOutputFormat followed by bulk loads/merges
to create and populate a table with multiple column families. I would like to
understand how compression works, and how to specify a non-default compression
in this setup. So:
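For context, the load step looks roughly like this (paths and the table name are
placeholders of mine):

```
# MR job writes HFiles under /tmp/hfiles via HFileOutputFormat, then the
# bulk-load tool moves them into the table's regions
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/hfiles mytable
```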
AFAIK, there are two relevant switches: the per-column-family compression
configuration and hfile.compression. Are there any others?
Can the compression format be deduced from the contents of an HFile, or does the
format of a region store file have to match the family's configuration?
Can a column family's compression format be changed once it already contains
data? If so, how is this done? Are the family's store files converted to the
new format before the table comes back online, or is it a lazy update, or
purely a compaction-time thing?
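In case it helps to be concrete, I was imagining something like the following
shell sequence (untested, just my guess at the incantation):

```
disable 'mytable'
alter 'mytable', {NAME => 'cf1', COMPRESSION => 'GZ'}
enable 'mytable'
```

Is that the right sequence, and does it touch the existing store files at all?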
Is it possible to write updates for multiple families with different
compression formats in the same map-reduce job?
Can HFileOutputFormat.configureIncrementalLoad infer the compression format from
an existing table, just as it does for partitioning?
Is there a way to specify a default compression other than NONE, so that new
tables and families are automatically compressed (with gzip, for example)?
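For reference, what I had in mind was something along these lines in
hbase-site.xml, though I'm not sure this is the intended use of the property
(the value "gz" is my guess):

```xml
<!-- hbase-site.xml: attempted cluster-wide compression default -->
<property>
  <name>hfile.compression</name>
  <value>gz</value>
</property>
```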
I have seen archived discussions which refer to RECORD vs BLOCK compression,
but I don't see those options in later versions. Have they gone away?
Thanks,
--Adam