Actually, the issue is that if the values of a field are always null (or the field is missing), we cannot infer its data type, so we use NullType, which is an internal data type. Right now, we have a step that converts NullType to StringType, and this logic has a bug in master.
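To make the idea concrete, here is a toy sketch in plain Python. It is not Spark's actual code: the function names (infer_type, null_to_string) and the tuple-based type representation are made up for this illustration. It assumes the conversion has to recurse through arrays and structs, since the failing cases in this thread have the null nested inside a struct inside an array.

```python
import json

NULL_TYPE = "NullType"      # placeholder meaning "no evidence about the type"
STRING_TYPE = "StringType"


def infer_type(value):
    """Infer a toy type description for one parsed JSON value (hypothetical helper)."""
    if value is None:
        return NULL_TYPE
    if isinstance(value, str):
        return STRING_TYPE
    if isinstance(value, list):
        # An empty or all-null array tells us nothing about its element type.
        known = [t for t in map(infer_type, value) if t != NULL_TYPE]
        return ("ArrayType", known[0] if known else NULL_TYPE)
    if isinstance(value, dict):
        return ("StructType", {k: infer_type(v) for k, v in value.items()})
    return type(value).__name__  # e.g. "int", "float", "bool"


def null_to_string(data_type):
    """Replace every NullType with StringType, at any nesting depth."""
    if data_type == NULL_TYPE:
        return STRING_TYPE
    if isinstance(data_type, tuple) and data_type[0] == "ArrayType":
        return ("ArrayType", null_to_string(data_type[1]))
    if isinstance(data_type, tuple) and data_type[0] == "StructType":
        fields = {k: null_to_string(t) for k, t in data_type[1].items()}
        return ("StructType", fields)
    return data_type


if __name__ == "__main__":
    # The two records that fail in the report quoted below.
    for line in ('{"record": [{"children": null}]}',
                 '{"record": [{"children": []}]}'):
        inferred = infer_type(json.loads(line))
        print(inferred, "->", null_to_string(inferred))
```

If the conversion step skipped the array or struct branches, a NullType would survive in the nested schema, which is the kind of leftover a writer could then reject.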
We will have a better story to handle NullType columns (https://issues.apache.org/jira/browse/SPARK-2695). But, we still will not expose NullType to users.

On Thu, Aug 7, 2014 at 1:41 PM, Brad Miller <bmill...@eecs.berkeley.edu> wrote:

> Thanks Yin!
>
> best,
> -Brad
>
>
> On Thu, Aug 7, 2014 at 1:39 PM, Yin Huai <yh...@databricks.com> wrote:
>
>> Hi Brad,
>>
>> It is a bug. I have filed
>> https://issues.apache.org/jira/browse/SPARK-2908 to track it. It will be
>> fixed soon.
>>
>> Thanks,
>>
>> Yin
>>
>>
>> On Thu, Aug 7, 2014 at 10:55 AM, Brad Miller <bmill...@eecs.berkeley.edu>
>> wrote:
>>
>>> Hi All,
>>>
>>> I'm having a bit of trouble with nested data structures in pyspark with
>>> saveAsParquetFile. I'm running master (as of yesterday) with this pull
>>> request added: https://github.com/apache/spark/pull/1802.
>>>
>>> *# these all work*
>>> > sqlCtx.jsonRDD(sc.parallelize(['{"record":
>>> null}'])).saveAsParquetFile('/tmp/test0')
>>> > sqlCtx.jsonRDD(sc.parallelize(['{"record":
>>> []}'])).saveAsParquetFile('/tmp/test1')
>>> > sqlCtx.jsonRDD(sc.parallelize(['{"record": {"children":
>>> null}}'])).saveAsParquetFile('/tmp/test2')
>>> > sqlCtx.jsonRDD(sc.parallelize(['{"record": {"children":
>>> []}}'])).saveAsParquetFile('/tmp/test3')
>>> > sqlCtx.jsonRDD(sc.parallelize(['{"record": *[{"children": "foobar"}]*
>>> }'])).saveAsParquetFile('/tmp/test4')
>>>
>>> *# this FAILS*
>>> > sqlCtx.jsonRDD(sc.parallelize(['{"record": *[{"children": null}]*
>>> }'])).saveAsParquetFile('/tmp/test5')
>>> Py4JJavaError: An error occurred while calling o706.saveAsParquetFile.
>>> : java.lang.RuntimeException: *Unsupported datatype NullType*
>>>
>>> *# this FAILS*
>>> > sqlCtx.jsonRDD(sc.parallelize(['{"record": *[{"children": []}]*
>>> }'])).saveAsParquetFile('/tmp/test6')
>>> Py4JJavaError: An error occurred while calling o719.saveAsParquetFile.
>>> : java.lang.RuntimeException: *Unsupported datatype NullType*
>>>
>>> Based on the documentation and the examples that work, it seems like the
>>> failing examples are probably meant to be supported features. I was unable
>>> to find an open issue for this. Does anybody know if there is an open
>>> issue, or whether an issue should be created?
>>>
>>> best,
>>> -Brad