Re: Spark SQL: Storing AVRO Schema in Parquet

2015-01-11 Thread Raghavendra Pandey
I think the AvroWriteSupport class already saves the Avro schema as part of the
Parquet metadata. You could consider using parquet-mr
(https://github.com/Parquet/parquet-mr) directly.
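
For example, a minimal sketch with parquet-mr's AvroParquetWriter could look
like this (the schema, field names, and output path are only placeholders; the
class lives in the parquet.avro package in the releases available right now and
in org.apache.parquet.avro in later ones):

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.hadoop.fs.Path
import parquet.avro.AvroParquetWriter

object WriteAvroParquet {
  def main(args: Array[String]): Unit = {
    // Placeholder Avro schema; AvroWriteSupport stores its JSON in the
    // Parquet file footer metadata alongside the converted Parquet schema.
    val schema = new Schema.Parser().parse(
      """{"type":"record","name":"User","fields":[
        |  {"name":"name","type":"string"},
        |  {"name":"age","type":"int"}]}""".stripMargin)

    val writer = new AvroParquetWriter[GenericRecord](new Path("/tmp/users.parquet"), schema)
    try {
      val record = new GenericData.Record(schema)
      record.put("name", "jerry")
      record.put("age", 30)
      writer.write(record)
    } finally {
      writer.close()
    }
  }
}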

Raghavendra

On Fri, Jan 9, 2015 at 10:32 PM, Jerry Lam chiling...@gmail.com wrote:

 Hi Raghavendra,

 This makes a lot of sense. Thank you.
 The problem is that I'm using Spark SQL right now to generate the Parquet
 file.

 What I think I need to do is use Spark directly, transform all rows of the
 SchemaRDD into Avro objects, and pass them to saveAsNewAPIHadoopFile (from the
 PairRDD API). From there, I can supply the Avro schema to Parquet via
 AvroParquetOutputFormat.

 It is not difficult, just not as simple as I would like: SchemaRDD can already
 write a Parquet file using its own schema, and if I could supply the Avro
 schema to Parquet, it would save me the step of transforming rows into Avro
 objects.

 I'm thinking of overriding the saveAsParquetFile method to allow me to
 persist the Avro schema inside Parquet. Is this possible at all?

 Best Regards,

 Jerry


 On Fri, Jan 9, 2015 at 2:05 AM, Raghavendra Pandey 
 raghavendra.pan...@gmail.com wrote:

 I came across this: http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/.
 You can take a look.


 On Fri Jan 09 2015 at 12:08:49 PM Raghavendra Pandey 
 raghavendra.pan...@gmail.com wrote:

 I have a similar requirement where I want to push Avro data
 into Parquet, but it seems you have to do it on your own. There is
 the parquet-mr project that uses Hadoop to do so. I am trying to write a
 Spark job to do something similar.

 On Fri, Jan 9, 2015 at 3:20 AM, Jerry Lam chiling...@gmail.com wrote:

 Hi spark users,

 I'm using Spark SQL to create Parquet files on HDFS. I would like to
 store the Avro schema in the Parquet metadata so that non-Spark SQL
 applications can read the data with the Avro Parquet reader without needing
 the Avro schema separately. Currently, schemaRDD.saveAsParquetFile does not
 allow me to do that. Is there another API that allows me to do this?

 Best Regards,

 Jerry






Re: Spark SQL: Storing AVRO Schema in Parquet

2015-01-09 Thread Jerry Lam
Hi Raghavendra,

This makes a lot of sense. Thank you.
The problem is that I'm using Spark SQL right now to generate the Parquet
file.

What I think I need to do is use Spark directly, transform all rows of the
SchemaRDD into Avro objects, and pass them to saveAsNewAPIHadoopFile (from the
PairRDD API). From there, I can supply the Avro schema to Parquet via
AvroParquetOutputFormat.
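
Roughly, I imagine something like the following untested sketch (the Avro
schema, column accessors, and output path are placeholders, and sc / schemaRDD
are the existing SparkContext and SchemaRDD); the Row-to-GenericRecord
conversion in the middle is exactly the step I would like to avoid:

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SparkContext._   // pair-RDD functions such as saveAsNewAPIHadoopFile
import parquet.avro.AvroParquetOutputFormat

// Hand-written Avro schema matching the SchemaRDD's columns (placeholder).
val schemaJson =
  """{"type":"record","name":"User","fields":[
    |  {"name":"name","type":"string"},
    |  {"name":"age","type":"int"}]}""".stripMargin

val job = Job.getInstance(sc.hadoopConfiguration)
AvroParquetOutputFormat.setSchema(job, new Schema.Parser().parse(schemaJson))

// Convert each Row into a GenericRecord; the schema is re-parsed per partition
// so the closure does not capture the (non-serializable) Schema object.
val records = schemaRDD.mapPartitions { rows =>
  val schema = new Schema.Parser().parse(schemaJson)
  rows.map { row =>
    val r = new GenericData.Record(schema)
    r.put("name", row.getString(0))
    r.put("age", row.getInt(1))
    (null, r)   // the key is ignored by the Parquet output format
  }
}

records.saveAsNewAPIHadoopFile(
  "/tmp/users-avro.parquet",
  classOf[Void],
  classOf[GenericRecord],
  classOf[AvroParquetOutputFormat],
  job.getConfiguration)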

It is not difficult, just not as simple as I would like: SchemaRDD can already
write a Parquet file using its own schema, and if I could supply the Avro
schema to Parquet, it would save me the step of transforming rows into Avro
objects.

I'm thinking of overriding the saveAsParquetFile method to allow me to
persist the Avro schema inside Parquet. Is this possible at all?

Best Regards,

Jerry


On Fri, Jan 9, 2015 at 2:05 AM, Raghavendra Pandey 
raghavendra.pan...@gmail.com wrote:

 I came across this: http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/.
 You can take a look.


 On Fri Jan 09 2015 at 12:08:49 PM Raghavendra Pandey 
 raghavendra.pan...@gmail.com wrote:

 I have a similar requirement where I want to push Avro data
 into Parquet, but it seems you have to do it on your own. There is
 the parquet-mr project that uses Hadoop to do so. I am trying to write a
 Spark job to do something similar.

 On Fri, Jan 9, 2015 at 3:20 AM, Jerry Lam chiling...@gmail.com wrote:

 Hi spark users,

 I'm using Spark SQL to create Parquet files on HDFS. I would like to
 store the Avro schema in the Parquet metadata so that non-Spark SQL
 applications can read the data with the Avro Parquet reader without needing
 the Avro schema separately. Currently, schemaRDD.saveAsParquetFile does not
 allow me to do that. Is there another API that allows me to do this?

 Best Regards,

 Jerry





Re: Spark SQL: Storing AVRO Schema in Parquet

2015-01-08 Thread Raghavendra Pandey
I came across this: http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/.
You can take a look.

On Fri Jan 09 2015 at 12:08:49 PM Raghavendra Pandey 
raghavendra.pan...@gmail.com wrote:

 I have a similar requirement where I want to push Avro data into
 Parquet, but it seems you have to do it on your own. There is the parquet-mr
 project that uses Hadoop to do so. I am trying to write a Spark job to do
 something similar.

 On Fri, Jan 9, 2015 at 3:20 AM, Jerry Lam chiling...@gmail.com wrote:

 Hi spark users,

 I'm using Spark SQL to create Parquet files on HDFS. I would like to
 store the Avro schema in the Parquet metadata so that non-Spark SQL
 applications can read the data with the Avro Parquet reader without needing
 the Avro schema separately. Currently, schemaRDD.saveAsParquetFile does not
 allow me to do that. Is there another API that allows me to do this?

 Best Regards,

 Jerry





Re: Spark SQL: Storing AVRO Schema in Parquet

2015-01-08 Thread Raghavendra Pandey
I have a similar requirement where I want to push Avro data into
Parquet, but it seems you have to do it on your own. There is the parquet-mr
project that uses Hadoop to do so. I am trying to write a Spark job to do
something similar.

On Fri, Jan 9, 2015 at 3:20 AM, Jerry Lam chiling...@gmail.com wrote:

 Hi spark users,

 I'm using Spark SQL to create Parquet files on HDFS. I would like to store
 the Avro schema in the Parquet metadata so that non-Spark SQL applications
 can read the data with the Avro Parquet reader without needing the Avro
 schema separately. Currently, schemaRDD.saveAsParquetFile does not allow me
 to do that. Is there another API that allows me to do this?

 Best Regards,

 Jerry
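
For the reading side mentioned in the question above, a minimal sketch of a
standalone (non-Spark) consumer with parquet-mr's AvroParquetReader might look
like this (the path is a placeholder; the reader picks up the Avro schema
embedded in the file footer, so none has to be supplied):

import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.fs.Path
import parquet.avro.AvroParquetReader

object ReadAvroParquet {
  def main(args: Array[String]): Unit = {
    val reader = new AvroParquetReader[GenericRecord](new Path("/tmp/users.parquet"))
    try {
      // read() returns null once the file is exhausted.
      var record = reader.read()
      while (record != null) {
        println(record)
        record = reader.read()
      }
    } finally {
      reader.close()
    }
  }
}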