Most of them are Parquet settings. From
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java:
* # The block size is the size of a row group being buffered in memory
* # this limits the memory usage when writing
* # Larger values will improve the IO when reading but consume more memory when writing
* parquet.block.size=134217728 # in bytes, default = 128 * 1024 * 1024
*
* # The page size is for compression. When reading, each page can be decompressed independently.
* # A block is composed of pages. The page is the smallest unit that must be read fully to access a single record.
* # If this value is too small, the compression will deteriorate
* parquet.page.size=1048576 # in bytes, default = 1 * 1024 * 1024
*
* # There is one dictionary page per column per row group when dictionary encoding is used.
* # The dictionary page size works like the page size but for dictionary
* parquet.dictionary.page.size=1048576 # in bytes, default = 1 * 1024 * 1024
*
* # The compression algorithm used to compress pages
* parquet.compression=UNCOMPRESSED # one of: UNCOMPRESSED, SNAPPY, GZIP, LZO. Default: UNCOMPRESSED. Supersedes mapred.output.compress*
*
* # The write support class to convert the records written to the OutputFormat into the events accepted by the record consumer
* # Usually provided by a specific ParquetOutputFormat subclass
* parquet.write.support.class= # fully qualified name
*
* # To enable/disable dictionary encoding
* parquet.enable.dictionary=true # false to disable dictionary encoding
*
* # To enable/disable summary metadata aggregation at the end of a MR job
* # The default is true (enabled)
* parquet.enable.summary-metadata=true # false to disable summary aggregation
public class ParquetOutputFormat<T> extends FileOutputFormat<Void, T> {
  private static final Log LOG = Log.getLog(ParquetOutputFormat.class);

  public static final String BLOCK_SIZE = "parquet.block.size";
  public static final String PAGE_SIZE = "parquet.page.size";
  public static final String COMPRESSION = "parquet.compression";
  public static final String WRITE_SUPPORT_CLASS = "parquet.write.support.class";
  public static final String DICTIONARY_PAGE_SIZE = "parquet.dictionary.page.size";
  public static final String ENABLE_DICTIONARY = "parquet.enable.dictionary";
  public static final String VALIDATION = "parquet.validation";
  public static final String WRITER_VERSION = "parquet.writer.version";
  public static final String ENABLE_JOB_SUMMARY = "parquet.enable.summary-metadata";
  public static final String MEMORY_POOL_RATIO = "parquet.memory.pool.ratio";
  public static final String MIN_MEMORY_ALLOCATION = "parquet.memory.min.chunk.size";

Some of these properties (e.g. parquet.enable.summary-metadata) may not
currently be exposed via Hive. Others (parquet.block.size, parquet.compression,
parquet.enable.dictionary) have been exposed through Hive-specific JIRAs
(HIVE-7685, HIVE-7858, HIVE-8823).
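
As a rough illustration of how the three Hive-exposed properties might be put to use, here is a minimal Java sketch that starts from the Parquet defaults quoted in the Javadoc above, applies an override, and prints the result as Hive SET statements. The class and method names are hypothetical. The defaults are Parquet's documented ones, not necessarily what Hive applies. Whether a given Hive version honors each key as a session-level setting (rather than, say, a table property) is an assumption to check against the JIRAs listed above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: the class and method names are hypothetical, and whether Hive
// honors each key as a session-level SET depends on the Hive version.
final class HiveParquetSettingsSketch {

    // Defaults taken from the ParquetOutputFormat Javadoc quoted above.
    static Map<String, String> parquetDefaults() {
        Map<String, String> defaults = new LinkedHashMap<>();
        defaults.put("parquet.block.size", String.valueOf(128 * 1024 * 1024));
        defaults.put("parquet.compression", "UNCOMPRESSED");
        defaults.put("parquet.enable.dictionary", "true");
        return defaults;
    }

    // Applies overrides on top of the defaults and rejects any other key,
    // so a typo in a property name fails loudly instead of being ignored.
    static Map<String, String> withOverrides(Map<String, String> overrides) {
        Map<String, String> merged = parquetDefaults();
        for (Map.Entry<String, String> e : overrides.entrySet()) {
            if (!merged.containsKey(e.getKey())) {
                throw new IllegalArgumentException("Unexpected key: " + e.getKey());
            }
            merged.put(e.getKey(), e.getValue());
        }
        return merged;
    }

    // Renders one "SET key=value;" line per entry, in insertion order.
    static String toSetStatements(Map<String, String> settings) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : settings.entrySet()) {
            sb.append("SET ").append(e.getKey()).append('=')
              .append(e.getValue()).append(";\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> overrides = new LinkedHashMap<>();
        overrides.put("parquet.compression", "SNAPPY");
        System.out.print(toSetStatements(withOverrides(overrides)));
    }
}
```

The generated statements could be pasted at the top of a Hive script before an INSERT into a Parquet table. Treat them as something to test on your own cluster rather than a guaranteed-working configuration.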
I'm also aware of these Hive-specific ones:
hive.parquet.timestamp.skip.conversion (HIVE-9482), documented here:
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.parquet.timestamp.skip.conversion
parquet.column.index.access (HIVE-6938, HIVE-7800), documented here:
https://cwiki.apache.org/confluence/display/Hive/Parquet
There may be others that I'm not aware of, but they would likely be in one of
the JIRAs tracked by HIVE-8120.
It would certainly be nice if there were a single place to find all of this in
the Hive documentation. Many of the individual JIRA notes indicate that it
"should be documented in Hive's wiki," yet that doesn't appear to have happened.
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties would
seem like a logical location, but it is definitely incomplete currently.
From: John Omernik [mailto:[email protected]]
Sent: Tuesday, August 18, 2015 1:06 PM
To: [email protected]
Subject: Parquet Files in Hive - Settings
Is there a good writeup on the settings that can be tweaked in Hive when
writing Parquet files? For example, in some obscure pages I've found settings
like parquet.compression, parquet.dictionary.page.size and
parquet.enable.dictionary, but they were in reference to stock MapReduce jobs,
not Hive, so I don't even know what the defaults for these are when using
Hive. I tried doing hive -e "set"|grep "parquet\." but these settings aren't
there.
Any documentation on what these are, what Hive uses as defaults, etc., and how I
can optimize my Parquet writing with Hive would be appreciated.