Hi Hyukjin,
Thanks for bringing this up. Could you please make a PR for this one? We
didn't use PARQUET_2_0 mostly because it's less mature than PARQUET_1_0,
but we should let users choose the writer version, as long as
PARQUET_1_0 remains the default option.
Cheng
On 10/8/15 11:04 PM, Hyukjin Kwon wrote:
Hi all,
While writing some Parquet files with Spark, I found that it only ever
writes the files with writer version 1.
This affects the encoding types used in the file.
Is the writer version fixed intentionally for some reason?
I changed the code to write with writer version 2 and tested it, and it
looks fine.
In more detail, I found that the writer version is hard-coded in
org.apache.spark.sql.execution.datasources.parquet.CatalystWriteSupport.scala:
    def setSchema(schema: StructType, configuration: Configuration): Unit = {
      schema.map(_.name).foreach(CatalystSchemaConverter.checkFieldName)
      configuration.set(SPARK_ROW_SCHEMA, schema.json)
      configuration.set(
        ParquetOutputFormat.WRITER_VERSION,
        ParquetProperties.WriterVersion.PARQUET_1_0.toString)
    }
I changed it as follows, so that it respects a user-provided
configuration and falls back to PARQUET_1_0 when nothing is set:
    def setSchema(schema: StructType, configuration: Configuration): Unit = {
      schema.map(_.name).foreach(CatalystSchemaConverter.checkFieldName)
      configuration.set(SPARK_ROW_SCHEMA, schema.json)
      configuration.set(
        ParquetOutputFormat.WRITER_VERSION,
        configuration.get(
          ParquetOutputFormat.WRITER_VERSION,
          ParquetProperties.WriterVersion.PARQUET_1_0.toString))
    }
and then set the version to writer version 2:
    sc.hadoopConfiguration.set(
      ParquetOutputFormat.WRITER_VERSION,
      ParquetProperties.WriterVersion.PARQUET_2_0.toString)