I'm using Spark 1.5.1 with Parquet 1.7.0, trying to write Avro-backed Parquet files. I have this code:
```scala
sc.hadoopConfiguration.set(ParquetOutputFormat.WRITE_SUPPORT_CLASS,
  classOf[AvroWriteSupport].getName)
AvroWriteSupport.setSchema(sc.hadoopConfiguration, MyClass.SCHEMA$)

myDF.write.parquet(outputPath)
```

The problem is that the write support class gets overwritten in `org.apache.spark.sql.execution.datasources.parquet.ParquetRelation#prepareJobForWrite`:

```scala
val writeSupportClass =
  if (dataSchema.map(_.dataType).forall(ParquetTypesConverter.isPrimitiveType)) {
    classOf[MutableRowWriteSupport]
  } else {
    classOf[RowWriteSupport]
  }
ParquetOutputFormat.setWriteSupportClass(job, writeSupportClass)
```

So it doesn't seem to actually write Avro data. When I look at the metadata of the Parquet files it writes, I see this:

```
extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"foo","type":"string","nullable":true,"metadata":{}},{"name":"bar","type":"long","nullable":true,"metadata":{}}]}
```

I would expect to see an `extra: avro.schema` entry instead.
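For what it's worth, bypassing the DataFrame writer and going through the Hadoop output format directly should avoid this, since `AvroParquetOutputFormat` constructs its own `AvroWriteSupport` and Spark SQL never gets a chance to swap it out. Here's a minimal sketch of that route, assuming the data is available as an `RDD[MyClass]` (the `records` val is hypothetical; I haven't verified this end to end):

```scala
import org.apache.hadoop.mapreduce.Job
import org.apache.parquet.avro.AvroParquetOutputFormat
import org.apache.spark.rdd.RDD

// Hypothetical: the same data as an RDD of Avro specific records,
// before any conversion to a DataFrame.
val records: RDD[MyClass] = ???

// Register the Avro schema on the job; AvroParquetOutputFormat wires in
// AvroWriteSupport itself, so the write support class is not overridden.
val job = Job.getInstance(sc.hadoopConfiguration)
AvroParquetOutputFormat.setSchema(job, MyClass.SCHEMA$)

records
  .map(r => (null.asInstanceOf[Void], r)) // ParquetOutputFormat ignores the key
  .saveAsNewAPIHadoopFile(
    outputPath,
    classOf[Void],
    classOf[MyClass],
    classOf[AvroParquetOutputFormat],
    job.getConfiguration)
```

But I'd still like to know: is there a way to make `myDF.write.parquet` honor `AvroWriteSupport` so the written files carry `avro.schema` metadata?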