[ https://issues.apache.org/jira/browse/SPARK-39168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brian Schaefer updated SPARK-39168:
-----------------------------------
    Description: 
Schema inference fails on the following case:
{code:python}
>>> data = [{"a": [1, None], "b": [None, 2]}]
>>> spark.createDataFrame(data)
ValueError: Some of types cannot be determined after inferring
{code}
This is because only the first value in the array is used to infer the element type for the array: [https://github.com/apache/spark/blob/b63674ea5f746306a96ab8c39c23a230a6cb9566/python/pyspark/sql/types.py#L1260]. The element type of the "b" array is therefore inferred as {{NullType}}, but I think it makes more sense to infer it as {{LongType}}.
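For illustration, the same data passes inference when the non-null value happens to come first, since only that first element is inspected (a hypothetical session transcript; the exact schema repr may vary by Spark version):
{code:python}
>>> spark.createDataFrame([{"a": [1, None], "b": [2, None]}]).schema
StructType([StructField('a', ArrayType(LongType(), True), True),
            StructField('b', ArrayType(LongType(), True), True)])
{code}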

One approach to address the above would be to infer the type from the first 
non-null value in the array. However, consider a case with structs:
{code:python}
>>> spark.conf.set("spark.sql.pyspark.inferNestedDictAsStruct.enabled", True)
>>> data = [{"a": [{"b": 1}, {"c": 2}]}]
>>> spark.createDataFrame(data).schema
StructType([StructField('a', ArrayType(StructType([StructField('b', LongType(), True)]), True), True)])
{code}
The element type of the "a" array is inferred as a struct with the single field "b". It would be more convenient to infer the element type as a struct with both fields, "b" and "c"; fields omitted from a given dictionary would become null values in the corresponding struct:
{code}
+----------------------+
|                     a|
+----------------------+
|[{1, null}, {null, 2}]|
+----------------------+
{code}
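For reference, the result above can already be produced today by passing an explicit schema that spells out the union of fields (a workaround sketch, not part of the proposal):
{code:python}
from pyspark.sql.types import ArrayType, LongType, StructField, StructType

# Explicit element struct containing both "b" and "c"; entries that
# omit a field are filled with null.
schema = StructType([
    StructField("a", ArrayType(StructType([
        StructField("b", LongType(), True),
        StructField("c", LongType(), True),
    ])), True),
])
spark.createDataFrame(data, schema).show()
{code}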
To support both of these cases, the type of each array element could be 
inferred, and those types could be merged, similar to the approach 
[here|https://github.com/apache/spark/blob/b63674ea5f746306a96ab8c39c23a230a6cb9566/python/pyspark/sql/session.py#L574-L576].
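A minimal sketch of that idea, reusing the private helpers {{_infer_type}} and {{_merge_type}} from {{pyspark.sql.types}} (internal APIs whose signatures here match Spark 3.2 and may change between releases; the function name below is hypothetical):
{code:python}
from functools import reduce

# _infer_type and _merge_type are private helpers in pyspark.sql.types.
from pyspark.sql.types import ArrayType, NullType, _infer_type, _merge_type

def _infer_array_type(values, infer_dict_as_struct=False):
    """Hypothetical replacement for first-element-only inference:
    infer a type for every element and merge the results pairwise."""
    element_types = (
        _infer_type(v, infer_dict_as_struct=infer_dict_as_struct)
        for v in values
    )
    # NullType merges away against any concrete type, and struct merges
    # take the union of their fields, so both cases above are covered.
    return ArrayType(reduce(_merge_type, element_types, NullType()), True)
{code}
{{_merge_type}} is the same helper the linked {{createDataFrame}} path uses to merge per-row schemas, so reusing it would keep array-element inference consistent with top-level row inference.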

> Consider all values in a Python list when inferring schema
> ----------------------------------------------------------
>
>                 Key: SPARK-39168
>                 URL: https://issues.apache.org/jira/browse/SPARK-39168
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark
>    Affects Versions: 3.2.1
>            Reporter: Brian Schaefer
>            Priority: Major


