schema for schema

2014-09-18 Thread Eric Friedman
I have a SchemaRDD which I've gotten from a parquetFile. Did some transforms on it and now want to save it back out as parquet again. Getting a SchemaRDD proves challenging because some of my fields can be null/None and SQLContext.inferSchema rejects those. So, I decided to use the schema

Re: schema for schema

2014-09-18 Thread Michael Armbrust
challenging because some of my fields can be null/None and SQLContext.inferSchema rejects those. So, I decided to use the schema on the original RDD with SQLContext.applySchema. This works, but only if I add a map function to turn my Row objects into a list. (pyspark) applied = sq.applySchema

Re: schema for schema

2014-09-18 Thread Davies Liu
the schema on the original RDD with SQLContext.applySchema. This works, but only if I add a map function to turn my Row objects into a list. (pyspark) applied = sq.applySchema(transformed_rows.map(lambda r: list(r)), original_parquet_file.schema()) This seems a bit kludgy. Is there a better way
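The workaround quoted above turns each pyspark `Row` back into a plain list before re-applying the saved schema. The Row-to-list step can be sketched outside of Spark with a named tuple standing in for `pyspark.sql.Row` (which is likewise a tuple subclass); the field names and sample values here are illustrative assumptions, not from the thread:

```python
from collections import namedtuple

# Hypothetical stand-in for pyspark.sql.Row; field names are made up.
Row = namedtuple("Row", ["name", "age"])

# Rows produced by the transforms, some with null/None fields.
transformed_rows = [Row("alice", 34), Row("bob", None)]

# The thread's workaround: list(r) flattens each Row into a plain list,
# matching what the old SQLContext.applySchema expected, e.g.
#   applied = sq.applySchema(transformed_rows.map(lambda r: list(r)),
#                            original_parquet_file.schema())
as_lists = [list(r) for r in transformed_rows]
print(as_lists)  # [['alice', 34], ['bob', None]]
```

In an actual Spark job the list comprehension would be the `.map(lambda r: list(r))` on the RDD, as shown in the quoted call.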

Re: schema for schema

2014-09-18 Thread Eric Friedman
fields can be null/None and SQLContext.inferSchema rejects those. So, I decided to use the schema on the original RDD with SQLContext.applySchema. This works, but only if I add a map function to turn my Row objects into a list. (pyspark) applied = sq.applySchema