I have a SchemaRDD which I've gotten from a parquetFile.
Did some transforms on it and now want to save it back out as parquet again.
Getting a SchemaRDD proves challenging because some of my fields can be
null/None and SQLContext.inferSchema rejects those.
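For what it's worth, here's a minimal sketch of the kind of input that
trips inferSchema up for me (sc and sq are the usual SparkContext and
SQLContext; the field names are made up):

from pyspark.sql import Row

# age is None in the first row, so inferSchema can't pin down its type;
# this is the case it rejects
rows = sc.parallelize([Row(name="alice", age=None),
                       Row(name="bob", age=2)])
inferred = sq.inferSchema(rows)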
So, I decided to use the schema on the original RDD with
SQLContext.applySchema.
This works, but only if I add a map function to turn my Row objects into a
list. (pyspark)
applied = sq.applySchema(transformed_rows.map(lambda r: list(r)),
original_parquet_file.schema())
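For completeness, the whole round trip looks roughly like this on my end
(a sketch; the paths and my_transform are stand-ins):

original_parquet_file = sq.parquetFile("/path/to/input.parquet")
# my transforms keep the Row shape but can leave some fields as None
transformed_rows = original_parquet_file.map(my_transform)
# applySchema only seems happy with lists/tuples, hence the list(r) map
applied = sq.applySchema(transformed_rows.map(lambda r: list(r)),
                         original_parquet_file.schema())
applied.saveAsParquetFile("/path/to/output.parquet")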
This seems a bit kludgy. Is there a better way?