Re: trouble with saveAsParquetFile

Yin Huai Thu, 07 Aug 2014 13:49:38 -0700

Actually, the issue is that if the values of a field are always null (or the
field is missing), we cannot infer its data type. So we use NullType (an
internal data type) as a placeholder. Right now, we have a step that converts
NullType to StringType. That logic has a bug in master.
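
To make the idea concrete, here is a minimal sketch in plain Python of
inferring a schema with a NullType placeholder and then replacing that
placeholder with StringType at every nesting level. It is not Spark's actual
implementation: the type descriptors and the functions infer_type and
replace_null_type are hypothetical names invented for illustration, and
taking the array element type from the first element only is a simplifying
assumption.

```python
import json

# Hypothetical stand-ins for type descriptors; these are NOT Spark's classes.
NULL = "NullType"
STRING = "StringType"


def infer_type(value):
    """Infer a toy type descriptor from a parsed JSON value."""
    if value is None:
        # Nothing to go on: record the placeholder type.
        return NULL
    if isinstance(value, bool):
        return "BooleanType"
    if isinstance(value, int):
        return "LongType"
    if isinstance(value, float):
        return "DoubleType"
    if isinstance(value, str):
        return STRING
    if isinstance(value, list):
        # Simplifying assumption: only the first element is inspected.
        # An empty array gives no evidence about its element type.
        element = infer_type(value[0]) if value else NULL
        return ("array", element)
    if isinstance(value, dict):
        return ("struct", {k: infer_type(v) for k, v in value.items()})
    raise TypeError("unsupported JSON value: %r" % (value,))


def replace_null_type(t):
    """Replace the placeholder with StringType, recursing into nested types."""
    if t == NULL:
        return STRING
    if isinstance(t, tuple) and t[0] == "array":
        return ("array", replace_null_type(t[1]))
    if isinstance(t, tuple) and t[0] == "struct":
        return ("struct", {n: replace_null_type(ft) for n, ft in t[1].items()})
    return t


# The two shapes from the failing examples in the original report.
null_in_array = infer_type(json.loads('{"record": [{"children": null}]}'))
empty_in_array = infer_type(json.loads('{"record": [{"children": []}]}'))

fixed_null = replace_null_type(null_in_array)
fixed_empty = replace_null_type(empty_in_array)
```

In this sketch the placeholder can sit inside a struct that is itself inside
an array, so the replacement step has to recurse through both container
kinds. A conversion that skipped one of those positions would leave a
NullType behind for the writer to reject.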


We will have a better story for handling NullType columns (
https://issues.apache.org/jira/browse/SPARK-2695), but we still will not
expose NullType to users.


On Thu, Aug 7, 2014 at 1:41 PM, Brad Miller <bmill...@eecs.berkeley.edu>
wrote:

> Thanks Yin!
>
> best,
> -Brad
>
>
> On Thu, Aug 7, 2014 at 1:39 PM, Yin Huai <yh...@databricks.com> wrote:
>
>> Hi Brad,
>>
>> It is a bug. I have filed
>> https://issues.apache.org/jira/browse/SPARK-2908 to track it. It will be
>> fixed soon.
>>
>> Thanks,
>>
>> Yin
>>
>>
>> On Thu, Aug 7, 2014 at 10:55 AM, Brad Miller <bmill...@eecs.berkeley.edu>
>> wrote:
>>
>>> Hi All,
>>>
>>> I'm having a bit of trouble with nested data structures in pyspark with
>>> saveAsParquetFile.  I'm running master (as of yesterday) with this pull
>>> request added: https://github.com/apache/spark/pull/1802.
>>>
>>> # these all work
>>> > sqlCtx.jsonRDD(sc.parallelize(['{"record": null}'])).saveAsParquetFile('/tmp/test0')
>>> > sqlCtx.jsonRDD(sc.parallelize(['{"record": []}'])).saveAsParquetFile('/tmp/test1')
>>> > sqlCtx.jsonRDD(sc.parallelize(['{"record": {"children": null}}'])).saveAsParquetFile('/tmp/test2')
>>> > sqlCtx.jsonRDD(sc.parallelize(['{"record": {"children": []}}'])).saveAsParquetFile('/tmp/test3')
>>> > sqlCtx.jsonRDD(sc.parallelize(['{"record": [{"children": "foobar"}]}'])).saveAsParquetFile('/tmp/test4')
>>>
>>> # this FAILS
>>> > sqlCtx.jsonRDD(sc.parallelize(['{"record": [{"children": null}]}'])).saveAsParquetFile('/tmp/test5')
>>> Py4JJavaError: An error occurred while calling o706.saveAsParquetFile.
>>> : java.lang.RuntimeException: *Unsupported datatype NullType*
>>>
>>> # this FAILS
>>> > sqlCtx.jsonRDD(sc.parallelize(['{"record": [{"children": []}]}'])).saveAsParquetFile('/tmp/test6')
>>> Py4JJavaError: An error occurred while calling o719.saveAsParquetFile.
>>> : java.lang.RuntimeException: *Unsupported datatype NullType*
>>>
>>> Based on the documentation and the examples that work, it seems like the
>>> failing examples are probably meant to be supported features.  I was unable
>>> to find an open issue for this.  Does anybody know if there is an open
>>> issue, or whether an issue should be created?
>>>
>>> best,
>>> -Brad
>>>
>>
>>
>
