Hi All,

I'm having a bit of trouble with nested data structures and saveAsParquetFile in pyspark. I'm running master (as of yesterday) with this pull request added: https://github.com/apache/spark/pull/1802.
# these all work
> sqlCtx.jsonRDD(sc.parallelize(['{"record": null}'])).saveAsParquetFile('/tmp/test0')
> sqlCtx.jsonRDD(sc.parallelize(['{"record": []}'])).saveAsParquetFile('/tmp/test1')
> sqlCtx.jsonRDD(sc.parallelize(['{"record": {"children": null}}'])).saveAsParquetFile('/tmp/test2')
> sqlCtx.jsonRDD(sc.parallelize(['{"record": {"children": []}}'])).saveAsParquetFile('/tmp/test3')
> sqlCtx.jsonRDD(sc.parallelize(['{"record": [{"children": "foobar"}]}'])).saveAsParquetFile('/tmp/test4')

# this FAILS
> sqlCtx.jsonRDD(sc.parallelize(['{"record": [{"children": null}]}'])).saveAsParquetFile('/tmp/test5')
Py4JJavaError: An error occurred while calling o706.saveAsParquetFile.
: java.lang.RuntimeException: Unsupported datatype NullType

# this FAILS
> sqlCtx.jsonRDD(sc.parallelize(['{"record": [{"children": []}]}'])).saveAsParquetFile('/tmp/test6')
Py4JJavaError: An error occurred while calling o719.saveAsParquetFile.
: java.lang.RuntimeException: Unsupported datatype NullType

Based on the documentation and the examples that work, it seems like the failing examples are meant to be supported features. I was unable to find an open issue for this. Does anybody know if there is an open issue, or whether one should be created?

best,
-Brad
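P.S. In case it's useful to anyone hitting the same error: a possible workaround sketch is to pass an explicit schema to jsonRDD so that the nested "children" field never gets inferred as NullType in the first place. This assumes jsonRDD's optional schema argument and the schema classes in pyspark.sql on current master, and the ArrayType(StringType()) element type below is just a placeholder I picked, not something from my actual data.

from pyspark.sql import StructType, StructField, ArrayType, StringType

# sc and sqlCtx are the usual objects from the pyspark shell, as in the
# examples above.

# Spell out the nested schema explicitly: "record" is an array of structs,
# and "children" gets a concrete (placeholder) type instead of NullType.
schema = StructType([
    StructField("record", ArrayType(
        StructType([
            StructField("children", ArrayType(StringType(), True), True)
        ])
    ), True)
])

# The previously failing cases, with schema inference bypassed.
rdd = sc.parallelize(['{"record": [{"children": null}]}',
                      '{"record": [{"children": []}]}'])
srdd = sqlCtx.jsonRDD(rdd, schema)
srdd.saveAsParquetFile('/tmp/test5_workaround')

I haven't verified this beyond my own quick testing, so treat it as a sketch rather than a fix; the underlying NullType handling in the Parquet writer still looks like it needs an issue.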