Re: trouble with saveAsParquetFile

Yin Huai Thu, 07 Aug 2014 13:54:11 -0700

The PR is https://github.com/apache/spark/pull/1840.



On Thu, Aug 7, 2014 at 1:48 PM, Yin Huai <yh...@databricks.com> wrote:

> Actually, the issue is that if the values of a field are always null (or the
> field is missing), we cannot infer its data type. So we use NullType (an
> internal data type). Right now, we have a step that converts the data type
> from NullType to StringType. This logic has a bug in master.
>
> We will have a better story for handling NullType columns (
> https://issues.apache.org/jira/browse/SPARK-2695). But we still will not
> expose NullType to users.
>
>
> On Thu, Aug 7, 2014 at 1:41 PM, Brad Miller <bmill...@eecs.berkeley.edu>
> wrote:
>
>> Thanks Yin!
>>
>> best,
>> -Brad
>>
>>
>> On Thu, Aug 7, 2014 at 1:39 PM, Yin Huai <yh...@databricks.com> wrote:
>>
>>> Hi Brad,
>>>
>>> It is a bug. I have filed
>>> https://issues.apache.org/jira/browse/SPARK-2908 to track it. It will
>>> be fixed soon.
>>>
>>> Thanks,
>>>
>>> Yin
>>>
>>>
>>> On Thu, Aug 7, 2014 at 10:55 AM, Brad Miller <bmill...@eecs.berkeley.edu>
>>> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I'm having a bit of trouble with nested data structures in pyspark with
>>>> saveAsParquetFile.  I'm running master (as of yesterday) with this pull
>>>> request added: https://github.com/apache/spark/pull/1802.
>>>>
>>>> # these all work
>>>> > sqlCtx.jsonRDD(sc.parallelize(['{"record": null}'])).saveAsParquetFile('/tmp/test0')
>>>> > sqlCtx.jsonRDD(sc.parallelize(['{"record": []}'])).saveAsParquetFile('/tmp/test1')
>>>> > sqlCtx.jsonRDD(sc.parallelize(['{"record": {"children": null}}'])).saveAsParquetFile('/tmp/test2')
>>>> > sqlCtx.jsonRDD(sc.parallelize(['{"record": {"children": []}}'])).saveAsParquetFile('/tmp/test3')
>>>> > sqlCtx.jsonRDD(sc.parallelize(['{"record": [{"children": "foobar"}]}'])).saveAsParquetFile('/tmp/test4')
>>>>
>>>> # this FAILS
>>>> > sqlCtx.jsonRDD(sc.parallelize(['{"record": [{"children": null}]}'])).saveAsParquetFile('/tmp/test5')
>>>> Py4JJavaError: An error occurred while calling o706.saveAsParquetFile.
>>>> : java.lang.RuntimeException: Unsupported datatype NullType
>>>>
>>>> # this FAILS
>>>> > sqlCtx.jsonRDD(sc.parallelize(['{"record": [{"children": []}]}'])).saveAsParquetFile('/tmp/test6')
>>>> Py4JJavaError: An error occurred while calling o719.saveAsParquetFile.
>>>> : java.lang.RuntimeException: Unsupported datatype NullType
>>>>
>>>> Based on the documentation and the examples that work, it seems like
>>>> the failing examples are probably meant to be supported features.  I was
>>>> unable to find an open issue for this.  Does anybody know if there is an
>>>> open issue, or whether an issue should be created?
>>>>
>>>> best,
>>>> -Brad
>>>>
>>>
>>>
>>
>
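
For readers following the thread, the NullType-to-StringType step described
in the quoted reply can be sketched in plain Python. This is not Spark's
code: the tuple-based type encoding and the names infer and null_to_string
are hypothetical, chosen only for illustration. The sketch assumes the
conversion has to recurse through both structs and arrays, which is
consistent with the failing examples (a null-typed field inside an array of
structs), although the thread does not state the exact cause of the bug.

```python
import json

# Hypothetical type encoding (not Spark's classes):
#   "null", "string", "boolean", "long", "double"  -> primitive types
#   ("array", element_type)                        -> array type
#   ("struct", {field_name: field_type})           -> struct type


def infer(value):
    """Infer a type from one parsed JSON value (simplified)."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool before int: bool subclasses int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        # Simplification: look at the first element only. An empty array
        # has nothing to inspect, so its element type stays "null".
        return ("array", infer(value[0]) if value else "null")
    if isinstance(value, dict):
        return ("struct", {k: infer(v) for k, v in value.items()})
    raise TypeError("unsupported JSON value: %r" % (value,))


def null_to_string(t):
    """Replace every "null" type with "string", at any nesting depth."""
    if t == "null":
        return "string"
    if isinstance(t, tuple) and t[0] == "array":
        return ("array", null_to_string(t[1]))
    if isinstance(t, tuple) and t[0] == "struct":
        return ("struct", {k: null_to_string(v) for k, v in t[1].items()})
    return t  # any other primitive type is left unchanged


# The input from the first failing example in the thread.
schema = infer(json.loads('{"record": [{"children": null}]}'))
print(null_to_string(schema))
```

In this sketch the "children" field is first inferred as "null" and comes
out as "string" after the conversion, so no null type is left for a writer
to reject.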
