Nullable is just a hint to the optimizer: nullable=false tells it that it's impossible for there to be a null value in this column, so that it can avoid generating code for null checks. When in doubt, we set nullable=true, since it is always safer to check.
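
For readers who only need two schemas to agree before a union, here is a minimal sketch of one possible approach: rebuild the DataFrame from its underlying RDD with an adjusted schema. The helper names `setNullable` and `withNullability` are hypothetical and are not part of Spark. Spark does not check the data against the new flags, so claiming nullable=false on data that actually contains nulls may fail or give wrong results at runtime. Relaxing the in-memory side to nullable=true is the safer direction.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{ArrayType, DataType, MapType, StructType}

// Hypothetical helper: recursively rewrite every nullability flag in a type.
def setNullable(dt: DataType, nullable: Boolean): DataType = dt match {
  case s: StructType =>
    // Struct fields carry their own `nullable` flag.
    StructType(s.fields.map { f =>
      f.copy(dataType = setNullable(f.dataType, nullable), nullable = nullable)
    })
  case a: ArrayType =>
    // Arrays record element nullability as `containsNull`.
    ArrayType(setNullable(a.elementType, nullable), containsNull = nullable)
  case m: MapType =>
    // Maps record value nullability as `valueContainsNull`.
    MapType(
      setNullable(m.keyType, nullable),
      setNullable(m.valueType, nullable),
      valueContainsNull = nullable)
  case other => other // Atomic types have no nested flags.
}

// Hypothetical helper: re-apply the adjusted schema to the same rows.
// Assumption: the caller knows the data is compatible with the new flags.
def withNullability(spark: SparkSession, df: DataFrame, nullable: Boolean): DataFrame = {
  val schema = setNullable(df.schema, nullable).asInstanceOf[StructType]
  spark.createDataFrame(df.rdd, schema)
}
```

For example, `withNullability(spark, inMemoryDf, nullable = true)` would relax the in-memory schema so that it matches one read back from Parquet. Note that this goes through the RDD API, so it is not free.
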

Why in particular are you trying to change the nullability of the column?

On Wed, Oct 19, 2016 at 6:07 PM, Muthu Jayakumar <bablo...@gmail.com> wrote:

> Hello there,
>
> I am trying to understand how and when a DataFrame (or Dataset) sets
> nullable = true vs false on a schema.
>
> Here is my observation from some sample code I tried...
>
> scala> spark.createDataset(Seq((1, "a", 2.0d), (2, "b", 2.0d), (3, "c", 2.0d))).toDF("col1", "col2", "col3").withColumn("col4", lit("bla")).printSchema()
> root
>  |-- col1: integer (nullable = false)
>  |-- col2: string (nullable = true)
>  |-- col3: double (nullable = false)
>  |-- col4: string (nullable = false)
>
> scala> spark.createDataset(Seq((1, "a", 2.0d), (2, "b", 2.0d), (3, "c", 2.0d))).toDF("col1", "col2", "col3").withColumn("col4", lit("bla")).write.parquet("/tmp/sample.parquet")
>
> scala> spark.read.parquet("/tmp/sample.parquet").printSchema()
> root
>  |-- col1: integer (nullable = true)
>  |-- col2: string (nullable = true)
>  |-- col3: double (nullable = true)
>  |-- col4: string (nullable = true)
>
> The place where this seems to get me into trouble is when I try to union
> one data structure created in memory (notice that in the schema below the
> highlighted element is represented as 'false' for the in-memory schema)
> with one read from a file that starts out with a schema like the one below...
>
>  |-- some_histogram: struct (nullable = true)
>  |    |-- values: array (nullable = true)
>  |    |    |-- element: double (containsNull = true)
>  |    |-- freq: array (nullable = true)
>  |    |    |-- element: long (containsNull = true)
>
> Is there a way to convert this attribute from true to false without
> running any mapping / udf on that column?
>
> Please advise,
> Muthu