RE: Parquet Files in Hive - Settings

Ryan Harris Tue, 18 Aug 2015 13:45:27 -0700

Most of these are Parquet settings rather than Hive ones.

from 
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java:
 * # The block size is the size of a row group being buffered in memory
 * # this limits the memory usage when writing
 * # Larger values will improve the IO when reading but consume more memory when writing
 * parquet.block.size=134217728 # in bytes, default = 128 * 1024 * 1024
 *
 * # The page size is for compression. When reading, each page can be decompressed independently.
 * # A block is composed of pages. The page is the smallest unit that must be read fully to access a single record.
 * # If this value is too small, the compression will deteriorate
 * parquet.page.size=1048576 # in bytes, default = 1 * 1024 * 1024
 *
 * # There is one dictionary page per column per row group when dictionary encoding is used.
 * # The dictionary page size works like the page size but for dictionary
 * parquet.dictionary.page.size=1048576 # in bytes, default = 1 * 1024 * 1024
 *
 * # The compression algorithm used to compress pages
 * parquet.compression=UNCOMPRESSED # one of: UNCOMPRESSED, SNAPPY, GZIP, LZO. Default: UNCOMPRESSED. Supersedes mapred.output.compress*
 *
 * # The write support class to convert the records written to the OutputFormat into the events accepted by the record consumer
 * # Usually provided by a specific ParquetOutputFormat subclass
 * parquet.write.support.class= # fully qualified name
 *
 * # To enable/disable dictionary encoding
 * parquet.enable.dictionary=true # false to disable dictionary encoding
 *
 * # To enable/disable summary metadata aggregation at the end of a MR job
 * # The default is true (enabled)
 * parquet.enable.summary-metadata=true # false to disable summary aggregation


public class ParquetOutputFormat<T> extends FileOutputFormat<Void, T> {
  private static final Log LOG = Log.getLog(ParquetOutputFormat.class);

  public static final String BLOCK_SIZE           = "parquet.block.size";
  public static final String PAGE_SIZE            = "parquet.page.size";
  public static final String COMPRESSION          = "parquet.compression";
  public static final String WRITE_SUPPORT_CLASS  = "parquet.write.support.class";
  public static final String DICTIONARY_PAGE_SIZE = "parquet.dictionary.page.size";
  public static final String ENABLE_DICTIONARY    = "parquet.enable.dictionary";
  public static final String VALIDATION           = "parquet.validation";
  public static final String WRITER_VERSION       = "parquet.writer.version";
  public static final String ENABLE_JOB_SUMMARY   = "parquet.enable.summary-metadata";
  public static final String MEMORY_POOL_RATIO    = "parquet.memory.pool.ratio";
  public static final String MIN_MEMORY_ALLOCATION = "parquet.memory.min.chunk.size";

Some of these variables (e.g. parquet.enable.summary-metadata) may not
currently be exposed via Hive. Others (parquet.block.size, parquet.compression,
parquet.enable.dictionary) have been exposed by Hive-specific JIRAs
(HIVE-7685, HIVE-7858, HIVE-8823).
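
As a rough illustration only, here is a small self-contained Java sketch that takes the three settings named above, with the defaults quoted from the ParquetOutputFormat Javadoc, and prints them as Hive "SET" statements. The class name ParquetHiveSettings and both method names are made up for this example. Whether a given Hive release honors these as session-level settings, or only as table properties, is an assumption here and depends on which of the JIRAs above are in your build.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Hive or parquet-mr.
public class ParquetHiveSettings {

    // Keys and default values copied from the ParquetOutputFormat Javadoc quoted above.
    static Map<String, String> documentedDefaults() {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("parquet.block.size", String.valueOf(128 * 1024 * 1024));
        settings.put("parquet.compression", "UNCOMPRESSED");
        settings.put("parquet.enable.dictionary", "true");
        return settings;
    }

    // Renders each entry as a "SET key=value;" line, in insertion order.
    // Assumption: the target Hive version passes these through to the Parquet writer.
    static String toHiveSetStatements(Map<String, String> settings) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : settings.entrySet()) {
            sb.append("SET ").append(e.getKey()).append('=')
              .append(e.getValue()).append(";\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> settings = documentedDefaults();
        settings.put("parquet.compression", "SNAPPY"); // override one default
        System.out.print(toHiveSetStatements(settings));
    }
}
```

The printed lines could be pasted into a Hive session or script ahead of the INSERT that writes the Parquet table. It is worth comparing file sizes and metadata afterwards to confirm the settings actually took effect.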

I'm also aware of these Hive-specific ones:
hive.parquet.timestamp.skip.conversion (HIVE-9482), documented here:
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.parquet.timestamp.skip.conversion
parquet.column.index.access (HIVE-6938, HIVE-7800), documented here:
https://cwiki.apache.org/confluence/display/Hive/Parquet

There may be others that I'm not aware of, but they would likely be in one of
the JIRAs tracked by HIVE-8120.

It would certainly be nice if there were a single place to find all of this in
the Hive documentation. Many of the individual JIRA notes indicate that it
"should be documented in Hive's wiki," yet that doesn't appear to have occurred.

https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties would
seem like a logical location, but it is definitely incomplete currently.






From: John Omernik [mailto:[email protected]]
Sent: Tuesday, August 18, 2015 1:06 PM
To: [email protected]
Subject: Parquet Files in Hive - Settings

Is there a good writeup on the settings that can be tweaked in Hive for
writing Parquet files? For example, in some obscure pages I've found settings
like parquet.compression, parquet.dictionary.page.size and
parquet.enable.dictionary, but they were in reference to stock MapReduce jobs,
not Hive, and thus I don't even know what the defaults for these are when
using Hive.  I tried doing hive -e "set"|grep "parquet\." but these settings
aren't there.

Any documentation on what these are, what Hive uses as defaults etc., and how I
can optimize my Parquet writing with Hive would be appreciated.



