Hi Hyukjin,

Thanks for bringing this up. Could you please make a PR for this one? We didn't use PARQUET_2_0 mostly because it's less mature than PARQUET_1_0, but we should let users choose the writer version, as long as PARQUET_1_0 remains the default option.

Cheng

On 10/8/15 11:04 PM, Hyukjin Kwon wrote:
Hi all,

While writing some Parquet files with Spark, I found that it actually only writes Parquet files with writer version 1.

This affects which encoding types are used in the file (writer version 2 enables the newer encodings such as DELTA_BINARY_PACKED and DELTA_BYTE_ARRAY).

Was this fixed intentionally for some reason?


I changed the code to write with writer version 2 and tested it, and it looks fine.

In more detail, I found that the writer version is hardcoded in org.apache.spark.sql.execution.datasources.parquet.CatalystWriteSupport.scala:

def setSchema(schema: StructType, configuration: Configuration): Unit = {
  schema.map(_.name).foreach(CatalystSchemaConverter.checkFieldName)
  configuration.set(SPARK_ROW_SCHEMA, schema.json)
  configuration.set(
    ParquetOutputFormat.WRITER_VERSION,
    ParquetProperties.WriterVersion.PARQUET_1_0.toString)
}

I changed it to the following, in order to respect a writer version already present in the given configuration:

def setSchema(schema: StructType, configuration: Configuration): Unit = {
  schema.map(_.name).foreach(CatalystSchemaConverter.checkFieldName)
  configuration.set(SPARK_ROW_SCHEMA, schema.json)
  // Keep the writer version if one is already set; otherwise fall back
  // to PARQUET_1_0 as before.
  configuration.set(
    ParquetOutputFormat.WRITER_VERSION,
    configuration.get(
      ParquetOutputFormat.WRITER_VERSION,
      ParquetProperties.WriterVersion.PARQUET_1_0.toString))
}

and then set the version to writer version 2:

sc.hadoopConfiguration.set(
  ParquetOutputFormat.WRITER_VERSION,
  ParquetProperties.WriterVersion.PARQUET_2_0.toString)
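
For anyone who wants to double-check the effect, below is a minimal sketch of how the result can be verified. It assumes a spark-shell session with the change above applied (so sc and sqlContext are in scope); the output path and file layout are just examples.

import scala.collection.JavaConverters._
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.parquet.column.ParquetProperties
import org.apache.parquet.hadoop.{ParquetFileReader, ParquetOutputFormat}

// Opt in to writer version 2 (this only takes effect with the change
// above, since the current code always overwrites the setting).
sc.hadoopConfiguration.set(
  ParquetOutputFormat.WRITER_VERSION,
  ParquetProperties.WriterVersion.PARQUET_2_0.toString)

// Write a small DataFrame to an example location.
val outputPath = "/tmp/parquet-writer-version-test"
sqlContext.range(1000).write.parquet(outputPath)

// Pick one of the written part files and print the per-column encodings
// recorded in its footer.
val fs = FileSystem.get(sc.hadoopConfiguration)
val partFile = fs.listStatus(new Path(outputPath))
  .map(_.getPath)
  .find(_.getName.endsWith(".parquet"))
  .get

val footer = ParquetFileReader.readFooter(sc.hadoopConfiguration, partFile)
footer.getBlocks.asScala.foreach { block =>
  block.getColumns.asScala.foreach { col =>
    // With PARQUET_2_0 this should report v2 encodings such as
    // DELTA_BINARY_PACKED for the id column, instead of PLAIN.
    println(s"${col.getPath}: ${col.getEncodings}")
  }
}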
