Hi Hyukjin,
Thanks for bringing this up. Could you please make a PR for this one? We
didn't use PARQUET_2_0 mostly because it's less mature than PARQUET_1_0,
but we should let users choose the writer version, as long as
PARQUET_1_0 remains the default option.
Cheng
On 10/8/15 11:04 PM, Hyukjin Kwon wrote:
Hi all,
While writing some Parquet files with Spark, I found that it only ever
writes the files with writer version 1.
This affects the encoding types used in the file.
Is the writer version fixed intentionally for some reason?
I changed the code to write with writer version 2 and tested it, and it
looks fine.
In more detail, I found that the writer version is hard-coded in
org.apache.spark.sql.execution.datasources.parquet.CatalystWriteSupport.scala:
    def setSchema(schema: StructType, configuration: Configuration): Unit = {
      schema.map(_.name).foreach(CatalystSchemaConverter.checkFieldName)
      configuration.set(SPARK_ROW_SCHEMA, schema.json)
      configuration.set(
        ParquetOutputFormat.WRITER_VERSION,
        ParquetProperties.WriterVersion.PARQUET_1_0.toString)
    }
I changed it as follows, so that it respects a user-provided
configuration and falls back to PARQUET_1_0 when nothing is set:
    def setSchema(schema: StructType, configuration: Configuration): Unit = {
      schema.map(_.name).foreach(CatalystSchemaConverter.checkFieldName)
      configuration.set(SPARK_ROW_SCHEMA, schema.json)
      configuration.set(
        ParquetOutputFormat.WRITER_VERSION,
        configuration.get(
          ParquetOutputFormat.WRITER_VERSION,
          ParquetProperties.WriterVersion.PARQUET_1_0.toString))
    }
and then set the version to writer version 2:
    sc.hadoopConfiguration.set(
      ParquetOutputFormat.WRITER_VERSION,
      ParquetProperties.WriterVersion.PARQUET_2_0.toString)