We use this, but I'm not sure how the schema is stored:

Job job = Job.getInstance();
ParquetOutputFormat.setWriteSupportClass(job, AvroWriteSupport.class);
AvroParquetOutputFormat.setSchema(job, schema);
LazyOutputFormat.setOutputFormatClass(job, ParquetOutputFormat.class);
// Suppress the _SUCCESS marker and the Parquet summary (_metadata) files.
job.getConfiguration().set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false");
job.getConfiguration().set("parquet.enable.summary-metadata", "false");
// Save the file.
rdd.mapToPair(me -> new Tuple2<>(null, me))
   .saveAsNewAPIHadoopFile(
       String.format("%s/%s", path, timeStamp.milliseconds()),
       Void.class,
       clazz,
       LazyOutputFormat.class,
       job.getConfiguration());

On Mon, 21 Mar 2016, 05:55 Manivannan Selvadurai, <smk.manivan...@gmail.com> wrote:

> Hi All,
>
> In my current project there is a requirement to store Avro data
> (JSON format) as Parquet files. I was able to use AvroParquetWriter
> separately to create the Parquet files. Those files, along with the
> data, also had the Avro schema stored in their footer.
>
> But when I tried using Spark Streaming I could not find a way to
> store the data with the Avro schema information. The closest I got
> was to create a DataFrame from the JSON RDDs and store it as
> Parquet. There the Parquet files had a Spark-specific schema in
> their footer.
>
> Is this the right approach, or is there a better one? Please guide
> me.
>
> We are using Spark 1.4.1.
>
> Thanks in advance!!
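
For reference, here is a minimal standalone AvroParquetWriter sketch of the kind the original mail describes: it writes GenericRecords and the Avro schema ends up in the Parquet footer. It assumes the old parquet.* package names bundled with Spark 1.4.1 (parquet-mr 1.6.x; later releases use org.apache.parquet.*), and the schema, field names, and output path are illustrative only.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import parquet.avro.AvroParquetWriter;

public class AvroToParquetSketch {

    public static void main(String[] args) throws Exception {
        // Illustrative schema; replace with your own record schema.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"},"
            + "{\"name\":\"body\",\"type\":\"string\"}]}");

        GenericRecord record = new GenericData.Record(schema);
        record.put("id", 1L);
        record.put("body", "hello");

        // AvroParquetWriter stores the Avro schema in the file footer metadata,
        // so readers can recover the original Avro types.
        AvroParquetWriter<GenericRecord> writer =
            new AvroParquetWriter<GenericRecord>(new Path("/tmp/events.parquet"), schema);
        try {
            writer.write(record);
        } finally {
            writer.close();
        }
    }
}

As far as I can tell, AvroWriteSupport (used by AvroParquetOutputFormat in the snippet above) records the Avro schema in the footer metadata in the same way, so the saveAsNewAPIHadoopFile route from Spark Streaming should keep the schema too.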