Hi All,

I'm having a bit of trouble with nested data structures in PySpark when using
saveAsParquetFile.  I'm running master (as of yesterday) with this pull
request applied: https://github.com/apache/spark/pull/1802.

# these all work
> sqlCtx.jsonRDD(sc.parallelize(['{"record": null}'])).saveAsParquetFile('/tmp/test0')
> sqlCtx.jsonRDD(sc.parallelize(['{"record": []}'])).saveAsParquetFile('/tmp/test1')
> sqlCtx.jsonRDD(sc.parallelize(['{"record": {"children": null}}'])).saveAsParquetFile('/tmp/test2')
> sqlCtx.jsonRDD(sc.parallelize(['{"record": {"children": []}}'])).saveAsParquetFile('/tmp/test3')
> sqlCtx.jsonRDD(sc.parallelize(['{"record": [{"children": "foobar"}]}'])).saveAsParquetFile('/tmp/test4')

# this FAILS
> sqlCtx.jsonRDD(sc.parallelize(['{"record": [{"children": null}]}'])).saveAsParquetFile('/tmp/test5')
Py4JJavaError: An error occurred while calling o706.saveAsParquetFile.
: java.lang.RuntimeException: Unsupported datatype NullType

# this FAILS
> sqlCtx.jsonRDD(sc.parallelize(['{"record": [{"children": []}]}'])).saveAsParquetFile('/tmp/test6')
Py4JJavaError: An error occurred while calling o719.saveAsParquetFile.
: java.lang.RuntimeException: Unsupported datatype NullType
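
In case it's useful context: in both failing cases the array element has a field
that is always null or empty, so jsonRDD presumably infers that field as
NullType, which the Parquet writer then rejects.  A rough, untested sketch of
sidestepping the inference by passing an explicit schema to jsonRDD (the field
types, variable names, and output path here are just illustrative, and I'm
assuming the pyspark.sql type classes and jsonRDD's optional schema argument):

> from pyspark.sql import StructType, StructField, ArrayType, StringType
> # declare "children" as a nullable string instead of letting it be inferred as NullType
> element_type = StructType([StructField("children", StringType(), True)])
> schema = StructType([StructField("record", ArrayType(element_type, True), True)])
> srdd = sqlCtx.jsonRDD(sc.parallelize(['{"record": [{"children": null}]}']), schema)
> srdd.saveAsParquetFile('/tmp/test5_with_schema')

I haven't verified whether this actually avoids the error, though, and it
obviously doesn't help when the real data has no non-null values to base a
schema on.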

Based on the documentation and the examples that do work, it seems like the
failing cases are probably meant to be supported.  I was unable to find an
open issue for this.  Does anybody know if one exists, or whether a new issue
should be created?

best,
-Brad
