I should have phrased it differently. An Avro schema carries additional
properties such as required fields, whereas the JSON data I have currently
gets stored as optional fields in the Parquet file. Is there a way to model
the Parquet file schema so it stays close to the Avro schema? I tried
sqc.read.schema(avroSchema).jsonRDD(jsonRDD).toDF(), but it has issues with
longType data. I use the code below to convert the Avro schema to a
Spark-specific schema:

  import java.io.File
  import org.apache.avro.Schema
  import org.apache.avro.file.DataFileWriter
  import org.apache.avro.generic.GenericDatumWriter
  import org.apache.spark.sql.SQLContext
  import org.apache.spark.sql.types.StructType

  def getSparkSchemaForAvro(sqc: SQLContext, avroSchema: Schema): StructType = {
    // Write an empty Avro file that carries only the schema, then let
    // spark-avro read it back and report the equivalent Spark SQL schema.
    val dummyFile = File.createTempFile("avroSchema_dummy", "avro")
    val datumWriter = new GenericDatumWriter[wuser]()
    datumWriter.setSchema(avroSchema)
    val writer = new DataFileWriter(datumWriter)
      .create(wuser.getClassSchema, dummyFile)
    writer.flush()
    writer.close()
    val df = sqc.read
      .format("com.databricks.spark.avro")
      .load(dummyFile.getAbsolutePath)
    df.schema
  }
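
One alternative I am considering is spark-avro's SchemaConverters, which maps
an Avro schema straight to a Spark SQL type without the temp-file round trip.
I am not sure toSqlType is accessible in every spark-avro version, so treat
this only as a sketch:

  import com.databricks.spark.avro.SchemaConverters
  import org.apache.avro.Schema
  import org.apache.spark.sql.types.StructType

  // Sketch: convert the Avro schema directly, assuming SchemaConverters is
  // public in the spark-avro version on the classpath.
  def avroToSparkSchema(avroSchema: Schema): StructType =
    SchemaConverters.toSqlType(avroSchema).dataType.asInstanceOf[StructType]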


So the requirement is: validate the incoming data against the Avro schema,
handle the bad records, and store the good data in Parquet format with a
schema matching the Avro schema that I have.
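
For the validation piece, this is roughly what I have in mind (a minimal
sketch using Avro's JsonDecoder; the helper name is mine, and Avro's JSON
decoding expects the Avro JSON encoding for unions, so it may need adjusting
for plain JSON input):

  import org.apache.avro.Schema
  import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
  import org.apache.avro.io.DecoderFactory

  // Returns true if the JSON record can be decoded with the Avro schema.
  def isValidAgainstAvro(json: String, schema: Schema): Boolean = {
    try {
      val reader  = new GenericDatumReader[GenericRecord](schema)
      val decoder = DecoderFactory.get().jsonDecoder(schema, json)
      reader.read(null, decoder)
      true
    } catch {
      case _: Exception => false
    }
  }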


The approach I have taken is:

Try converting each JSON record to the Avro object and return a tuple of
string and boolean (json, valid), then filter out the valid records and write
the JSON data directly as Parquet files. These Parquet files have fields with
the type

message root {
  optional group FIELD_FOO {
    optional binary string (UTF8);
  }
   .
   .
   .
}

Similarly, the invalid records are filtered out as corrupt data.

     This causes two scans over the RDDs (see the sketch below):
1) filtering valid data
2) filtering invalid data
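
Concretely, something like this (a sketch only; it assumes jsonRDD:
RDD[String], avroSchemaString holding the schema as JSON text since
org.apache.avro.Schema is not serializable in the Avro version I use, the
isValidAgainstAvro helper above, and the getSparkSchemaForAvro conversion;
paths are illustrative):

  // Tag each record with a validity flag, parsing the schema per partition.
  val tagged = jsonRDD.mapPartitions { iter =>
    val schema = new org.apache.avro.Schema.Parser().parse(avroSchemaString)
    iter.map(json => (json, isValidAgainstAvro(json, schema)))
  }

  // Each filter below triggers its own scan over the tagged records.
  val validJson   = tagged.filter { case (_, ok) => ok }.map(_._1)
  val corruptJson = tagged.filter { case (_, ok) => !ok }.map(_._1)

  // Write the good records as Parquet with the schema derived from Avro.
  val sparkSchema = getSparkSchemaForAvro(sqc, avroSchema)
  sqc.read.schema(sparkSchema).json(validJson).write.parquet("/path/to/valid")

  // Keep the bad records aside as corrupt data.
  corruptJson.saveAsTextFile("/path/to/corrupt")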


 If there is a better approach, please advise.



On Mon, Mar 21, 2016 at 11:07 PM, Michael Armbrust <mich...@databricks.com>
wrote:

> But when I tried using Spark Streaming I could not find a way to store the
>> data with the Avro schema information. The closest that I got was to create
>> a DataFrame using the JSON RDDs and store them as Parquet. Here the Parquet
>> files had a Spark-specific schema in their footer.
>>
>
> Does this cause a problem?  This is just extra information that we use to
> store metadata that parquet doesn't directly support, but I would still
> expect other systems to be able to read it.
>
