The PR is https://github.com/apache/spark/pull/1840.
On Thu, Aug 7, 2014 at 1:48 PM, Yin Huai <yh...@databricks.com> wrote:

> Actually, the issue is if values of a field are always null (or this field
> is missing), we cannot figure out the data type. So, we use NullType (it is
> an internal data type). Right now, we have a step to convert the data type
> from NullType to StringType. This logic in the master has a bug.
>
> We will have a better story to handle NullType columns
> (https://issues.apache.org/jira/browse/SPARK-2695). But, we still will not
> expose NullType to users.
>
>
> On Thu, Aug 7, 2014 at 1:41 PM, Brad Miller <bmill...@eecs.berkeley.edu>
> wrote:
>
>> Thanks Yin!
>>
>> best,
>> -Brad
>>
>>
>> On Thu, Aug 7, 2014 at 1:39 PM, Yin Huai <yh...@databricks.com> wrote:
>>
>>> Hi Brad,
>>>
>>> It is a bug. I have filed
>>> https://issues.apache.org/jira/browse/SPARK-2908 to track it. It will
>>> be fixed soon.
>>>
>>> Thanks,
>>>
>>> Yin
>>>
>>>
>>> On Thu, Aug 7, 2014 at 10:55 AM, Brad Miller <bmill...@eecs.berkeley.edu>
>>> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I'm having a bit of trouble with nested data structures in pyspark with
>>>> saveAsParquetFile. I'm running master (as of yesterday) with this pull
>>>> request added: https://github.com/apache/spark/pull/1802.
>>>>
>>>> *# these all work*
>>>> > sqlCtx.jsonRDD(sc.parallelize(['{"record": null}'])).saveAsParquetFile('/tmp/test0')
>>>> > sqlCtx.jsonRDD(sc.parallelize(['{"record": []}'])).saveAsParquetFile('/tmp/test1')
>>>> > sqlCtx.jsonRDD(sc.parallelize(['{"record": {"children": null}}'])).saveAsParquetFile('/tmp/test2')
>>>> > sqlCtx.jsonRDD(sc.parallelize(['{"record": {"children": []}}'])).saveAsParquetFile('/tmp/test3')
>>>> > sqlCtx.jsonRDD(sc.parallelize(['{"record": [{"children": "foobar"}]}'])).saveAsParquetFile('/tmp/test4')
>>>>
>>>> *# this FAILS*
>>>> > sqlCtx.jsonRDD(sc.parallelize(['{"record": [{"children": null}]}'])).saveAsParquetFile('/tmp/test5')
>>>> Py4JJavaError: An error occurred while calling o706.saveAsParquetFile.
>>>> : java.lang.RuntimeException: *Unsupported datatype NullType*
>>>>
>>>> *# this FAILS*
>>>> > sqlCtx.jsonRDD(sc.parallelize(['{"record": [{"children": []}]}'])).saveAsParquetFile('/tmp/test6')
>>>> Py4JJavaError: An error occurred while calling o719.saveAsParquetFile.
>>>> : java.lang.RuntimeException: *Unsupported datatype NullType*
>>>>
>>>> Based on the documentation and the examples that work, it seems like
>>>> the failing examples are probably meant to be supported features. I was
>>>> unable to find an open issue for this. Does anybody know if there is an
>>>> open issue, or whether an issue should be created?
>>>>
>>>> best,
>>>> -Brad
>>>>
>>>
>>>
>>
>
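
For anyone reading this thread later: the step Yin describes above (infer a placeholder type for always-null fields, then convert it to a string type) can be illustrated with a small pure-Python sketch. This is not Spark's code. The type tags, infer_type, merge_types, replace_null and infer_schema are all hypothetical names made up for illustration. The failing examples have the always-null field nested inside an array of structs, so the sketch recurses into arrays and structs; the thread does not say whether that is the exact bug in master.

```python
import json

# Hypothetical type tags for this sketch; plain Python values,
# not Spark's DataType classes.
NULL = "null"
STRING = "string"


def infer_type(value):
    """Infer a toy type description for one parsed JSON value."""
    if value is None:
        return NULL
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return STRING
    if isinstance(value, list):
        # An empty array gives no evidence about its element type.
        element = NULL
        for item in value:
            element = merge_types(element, infer_type(item))
        return ("array", element)
    if isinstance(value, dict):
        return ("struct", {k: infer_type(v) for k, v in value.items()})
    raise TypeError("unsupported JSON value: %r" % (value,))


def merge_types(a, b):
    """Combine two inferred types; NULL yields to any concrete type."""
    if a == b:
        return a
    if a == NULL:
        return b
    if b == NULL:
        return a
    if isinstance(a, tuple) and isinstance(b, tuple) and a[0] == b[0]:
        if a[0] == "array":
            return ("array", merge_types(a[1], b[1]))
        fields = dict(a[1])
        for name, t in b[1].items():
            fields[name] = merge_types(fields[name], t) if name in fields else t
        return ("struct", fields)
    # Assumption of this sketch: conflicting types fall back to string.
    return STRING


def replace_null(t):
    """Replace NULL with STRING everywhere, including nested positions."""
    if t == NULL:
        return STRING
    if isinstance(t, tuple) and t[0] == "array":
        return ("array", replace_null(t[1]))
    if isinstance(t, tuple) and t[0] == "struct":
        return ("struct", {k: replace_null(v) for k, v in t[1].items()})
    return t


def infer_schema(json_lines):
    """Infer one schema for all records, with no NULL type left in it."""
    schema = ("struct", {})
    for line in json_lines:
        schema = merge_types(schema, infer_type(json.loads(line)))
    return replace_null(schema)
```

Calling infer_schema(['{"record": [{"children": null}]}']) on the test5 input leaves no "null" tag anywhere in the result, which is the property a writer that rejects NullType would need. A conversion that only looked at top-level fields would leave the nested placeholder in place.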