Hi Brad,

It is a bug. I have filed https://issues.apache.org/jira/browse/SPARK-2908 to track it. It will be fixed soon.

Thanks,

Yin

On Thu, Aug 7, 2014 at 10:55 AM, Brad Miller <bmill...@eecs.berkeley.edu> wrote:

> Hi All,
>
> I'm having a bit of trouble with nested data structures in pyspark with
> saveAsParquetFile. I'm running master (as of yesterday) with this pull
> request added: https://github.com/apache/spark/pull/1802.
>
> *# these all work*
>
> sqlCtx.jsonRDD(sc.parallelize(['{"record": null}'])).saveAsParquetFile('/tmp/test0')
>
> sqlCtx.jsonRDD(sc.parallelize(['{"record": []}'])).saveAsParquetFile('/tmp/test1')
>
> sqlCtx.jsonRDD(sc.parallelize(['{"record": {"children": null}}'])).saveAsParquetFile('/tmp/test2')
>
> sqlCtx.jsonRDD(sc.parallelize(['{"record": {"children": []}}'])).saveAsParquetFile('/tmp/test3')
>
> sqlCtx.jsonRDD(sc.parallelize(['{"record": [{"children": "foobar"}]}'])).saveAsParquetFile('/tmp/test4')
>
> *# this FAILS*
>
> sqlCtx.jsonRDD(sc.parallelize(['{"record": [{"children": null}]}'])).saveAsParquetFile('/tmp/test5')
> Py4JJavaError: An error occurred while calling o706.saveAsParquetFile.
> : java.lang.RuntimeException: *Unsupported datatype NullType*
>
> *# this FAILS*
>
> sqlCtx.jsonRDD(sc.parallelize(['{"record": [{"children": []}]}'])).saveAsParquetFile('/tmp/test6')
> Py4JJavaError: An error occurred while calling o719.saveAsParquetFile.
> : java.lang.RuntimeException: *Unsupported datatype NullType*
>
> Based on the documentation and the examples that work, it seems like the
> failing examples are probably meant to be supported features. I was unable
> to find an open issue for this. Does anybody know if there is an open
> issue, or whether an issue should be created?
>
> best,
> -Brad
>
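Until the fix lands, it may help to check input for the pattern that fails above before calling saveAsParquetFile. The sketch below is plain Python and does not touch Spark. It lists field paths whose every observed value is null or an empty array, then flags those that sit inside an array of objects, which is what the two failing examples have in common. The helper names (`null_only_paths`, `nested_in_array`) and the path notation (`[]` for an array element) are hypothetical, and the check only mirrors the seven examples in this thread. It is not a description of how Spark infers schemas.

```python
import json


def null_only_paths(json_lines):
    """Return sorted field paths where only null or empty-array values were seen.

    Paths use "." for object fields and "[]" for array elements,
    e.g. "record[].children". This notation is made up for illustration.
    """
    seen = {}  # path -> True once a concretely typed value has been observed

    def visit(value, path):
        if value is None:
            # A null carries no type information; do not override a known type.
            seen.setdefault(path, False)
        elif isinstance(value, list):
            seen[path] = True  # the field itself is known to be an array
            element_path = path + "[]"
            seen.setdefault(element_path, False)  # element type unknown so far
            for item in value:
                visit(item, element_path)
        elif isinstance(value, dict):
            seen[path] = True
            for key, child in value.items():
                visit(child, (path + "." + key) if path else key)
        else:
            seen[path] = True  # string, number, or boolean

    for line in json_lines:
        visit(json.loads(line), "")

    return sorted(p for p, typed in seen.items() if p and not typed)


def nested_in_array(path):
    """True if the path goes through a field of an object inside an array."""
    return "[]." in path


def risky_paths(json_lines):
    """Paths matching the pattern of the two failing examples in this thread."""
    return [p for p in null_only_paths(json_lines) if nested_in_array(p)]
```

For instance, `risky_paths(['{"record": [{"children": null}]}'])` reports the `children` field under `record`, while the five working inputs from the original message produce an empty list.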