This rather innocent-looking optimization flag, nullable, has caused a lot of
bugs... Makes me wonder if we are better off without it.

On Oct 21, 2016 8:37 PM, "Muthu Jayakumar" <bablo...@gmail.com> wrote:

> Thanks Cheng Lian for opening the JIRA. I found this with Spark 2.0.0.
>
> Thanks,
> Muthu
>
> On Fri, Oct 21, 2016 at 3:30 PM, Cheng Lian <l...@databricks.com> wrote:
>
>> Yea, confirmed. While analyzing unions, we treat StructTypes with
>> different field nullabilities as incompatible types and throw this error.
>>
>> Opened https://issues.apache.org/jira/browse/SPARK-18058 to track this
>> issue. Thanks for reporting!
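>>
>> Until that's fixed, one possible workaround (an untested sketch, assuming the Spark 2.0 Scala API and the d1/d2 DataFrames shown below) is to rebuild one side of the union against the other side's schema, so that both carry identical nullability flags:
>>
>>   // Untested workaround sketch: re-apply d1's schema (which carries the more
>>   // permissive containsNull = true flags) to d2's rows before the union.
>>   val d2Relaxed = spark.createDataFrame(d2.rdd, d1.schema)
>>   val unioned = d1.union(d2Relaxed)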
>>
>> Cheng
>>
>> On 10/21/16 3:15 PM, Cheng Lian wrote:
>>
>> Hi Muthu,
>>
>> Which version of Spark are you using? This seems to be a bug in the
>> analysis phase.
>>
>> Cheng
>>
>> On 10/21/16 12:50 PM, Muthu Jayakumar wrote:
>>
>> Sorry for the late response. Here is what I am seeing...
>>
>>
>> Schema from parquet file.
>>
>> d1.printSchema()
>>
>> root
>>  |-- task_id: string (nullable = true)
>>  |-- task_name: string (nullable = true)
>>  |-- some_histogram: struct (nullable = true)
>>  |    |-- values: array (nullable = true)
>>  |    |    |-- element: double (containsNull = true)
>>  |    |-- freq: array (nullable = true)
>>  |    |    |-- element: long (containsNull = true)
>>
>> d2.printSchema() // Data created as an in-memory DataFrame and/or processed
>> before being written to the parquet file.
>>
>> root
>>  |-- task_id: string (nullable = true)
>>  |-- task_name: string (nullable = true)
>>  |-- some_histogram: struct (nullable = true)
>>  |    |-- values: array (nullable = true)
>>  |    |    |-- element: double (containsNull = false)
>>  |    |-- freq: array (nullable = true)
>>  |    |    |-- element: long (containsNull = false)
>>
>> d1.union(d2).printSchema()
>>
>> Exception in thread "main" org.apache.spark.sql.AnalysisException: unresolved operator 'Union;
>> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
>> at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58)
>> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:361)
>> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
>> at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
>> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
>> at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)
>> at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
>> at org.apache.spark.sql.Dataset.<init>(Dataset.scala:161)
>> at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
>> at org.apache.spark.sql.Dataset$.apply(Dataset.scala:59)
>> at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:2594)
>> at org.apache.spark.sql.Dataset.union(Dataset.scala:1459)
>>
>> Please advise,
>> Muthu
>>
>> On Thu, Oct 20, 2016 at 1:46 AM, Michael Armbrust <mich...@databricks.com> wrote:
>>
>>> What is the issue you see when unioning?
>>>
>>> On Wed, Oct 19, 2016 at 6:39 PM, Muthu Jayakumar <bablo...@gmail.com>
>>> wrote:
>>>
>>>> Hello Michael,
>>>>
>>>> Thank you for looking into this query. In my case there seems to be an
>>>> issue when I union a parquet file read from disk with another dataframe
>>>> that I construct in-memory. The only difference I see is containsNull =
>>>> true. In fact, I do not see any errors with a union on the simple schema
>>>> of "col1 thru col4" above. The problem seems to exist only on the
>>>> "some_histogram" column, which mixes containsNull = true and false.
>>>> Let me know if this helps.
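>>>>
>>>> For reference, a rough sketch of how the two schemas can be compared field by field (assuming the two sides are available as DataFrames d1 and d2; StructField is a case class, so printing the mismatched fields exposes the nested containsNull flags):
>>>>
>>>>   d1.schema.fields.zip(d2.schema.fields)
>>>>     .filter { case (f1, f2) => f1 != f2 }
>>>>     .foreach { case (f1, f2) => println(s"d1: $f1\nd2: $f2") }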
>>>>
>>>> Thanks,
>>>> Muthu
>>>>
>>>>
>>>>
>>>> On Wed, Oct 19, 2016 at 6:21 PM, Michael Armbrust <mich...@databricks.com> wrote:
>>>>
>>>>> Nullable = false is just a hint to the optimizer that it's impossible for
>>>>> there to be a null value in this column, so that it can avoid generating
>>>>> code for null-checks. When in doubt, we set nullable = true since it is
>>>>> always safer to check.
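>>>>>
>>>>> For example (a minimal sketch, assuming df is one of the DataFrames from your earlier mail), you can see the hint Spark has recorded for each column by dumping the nullable flag on each top-level field:
>>>>>
>>>>>   df.schema.fields.foreach { f =>
>>>>>     println(s"${f.name}: ${f.dataType.simpleString}, nullable = ${f.nullable}")
>>>>>   }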
>>>>>
>>>>> Why in particular are you trying to change the nullability of the
>>>>> column?
>>>>>
>>>>> On Wed, Oct 19, 2016 at 6:07 PM, Muthu Jayakumar <bablo...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hello there,
>>>>>>
>>>>>> I am trying to understand how and when a DataFrame (or Dataset)
>>>>>> sets nullable = true vs. false on a schema.
>>>>>>
>>>>>> Here is my observation from a sample code I tried...
>>>>>>
>>>>>>
>>>>>> scala> spark.createDataset(Seq((1, "a", 2.0d), (2, "b", 2.0d), (3,
>>>>>> "c", 2.0d))).toDF("col1", "col2", "col3").withColumn("col4",
>>>>>> lit("bla")).printSchema()
>>>>>> root
>>>>>>  |-- col1: integer (nullable = false)
>>>>>>  |-- col2: string (nullable = true)
>>>>>>  |-- col3: double (nullable = false)
>>>>>>  |-- col4: string (nullable = false)
>>>>>>
>>>>>>
>>>>>> scala> spark.createDataset(Seq((1, "a", 2.0d), (2, "b", 2.0d), (3,
>>>>>> "c", 2.0d))).toDF("col1", "col2", "col3").withColumn("col4",
>>>>>> lit("bla")).write.parquet("/tmp/sample.parquet")
>>>>>>
>>>>>> scala> spark.read.parquet("/tmp/sample.parquet").printSchema()
>>>>>> root
>>>>>>  |-- col1: integer (nullable = true)
>>>>>>  |-- col2: string (nullable = true)
>>>>>>  |-- col3: double (nullable = true)
>>>>>>  |-- col4: string (nullable = true)
>>>>>>
>>>>>>
>>>>>> The place where this seems to get me into trouble is when I try to
>>>>>> union one data structure created in-memory (where the corresponding
>>>>>> containsNull elements come out as 'false') with one read from a file
>>>>>> that starts out with a schema like the one below...
>>>>>>
>>>>>>  |-- some_histogram: struct (nullable = true)
>>>>>>  |    |-- values: array (nullable = true)
>>>>>>  |    |    |-- element: double (containsNull = true)
>>>>>>  |    |-- freq: array (nullable = true)
>>>>>>  |    |    |-- element: long (containsNull = true)
>>>>>>
>>>>>> Is there a way to convert this attribute from true to false without
>>>>>> running any mapping / udf on that column?
>>>>>>
>>>>>> Please advise,
>>>>>> Muthu
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>>
>>
>
