[ 
https://issues.apache.org/jira/browse/SPARK-14584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-14584.
--------------------------------
       Resolution: Fixed
    Fix Version/s: 2.2.0

> Improve recognition of non-nullability in Dataset transformations
> -----------------------------------------------------------------
>
>                 Key: SPARK-14584
>                 URL: https://issues.apache.org/jira/browse/SPARK-14584
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Josh Rosen
>             Fix For: 2.2.0
>
>
> There are many cases where we can statically know that a field will never be 
> null. For instance, a field in a case class with a primitive type will never 
> return null. However, there are currently several cases in the Dataset API 
> where we do not properly recognize this non-nullability. For instance:
> {code}
> case class MyCaseClass(foo: Int)
> sc.parallelize(Seq(0)).toDS.map(MyCaseClass).printSchema
> {code}
> claims that the {{foo}} field is nullable even though this is impossible.
> I believe that this is due to the way that we reason about nullability when 
> constructing serializer expressions in ExpressionEncoders. The following 
> assertion will currently fail if added to ExpressionEncoder:
> {code}
>   require(schema.size == serializer.size)
>   schema.fields.zip(serializer).foreach { case (field, fieldSerializer) =>
>     require(field.dataType == fieldSerializer.dataType, s"Field 
> ${field.name}'s data type is " +
>       s"${field.dataType} in the schema but ${fieldSerializer.dataType} in 
> its serializer")
>     require(field.nullable == fieldSerializer.nullable, s"Field 
> ${field.name}'s nullability is " +
>       s"${field.nullable} in the schema but ${fieldSerializer.nullable} in 
> its serializer")
>   }
> {code}
> Most often, the schema claims that a field is non-nullable while the encoder 
> allows for nullability, but occasionally we see a mismatch in the datatypes 
> due to disagreements over the nullability of nested structs' fields (or 
> fields of structs in arrays).
> I think the problem is that when we're reasoning about nullability in a 
> struct's schema we consider its fields' nullability to be independent of the 
> nullability of the struct itself, whereas in the serializer expressions we 
> are considering those field extraction expressions to be nullable if the 
> input objects themselves can be nullable.
> I'm not sure what's the simplest way to fix this. One proposal would be to 
> leave the serializers unchanged and have ObjectOperator derive its output 
> attributes from an explicitly-passed schema rather than using the 
> serializers' attributes. However, I worry that this might introduce bugs in 
> case the serializer and schema disagree.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to