We don't yet update nullability information based on predicates, as we
don't actually leverage that information in many places yet.  Why do you
want to update the schema?
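
For illustration, a minimal sketch of that behavior (the column names and
the existing SQLContext `sqlContext` are assumptions for the example):

  import sqlContext.implicits._

  // Filtering on isNotNull removes the null rows, but the schema
  // metadata is left untouched: y is still reported as nullable = true.
  val df = Seq((1, Some(5)), (2, None: Option[Int])).toDF("x", "y")
  val filtered = df.filter(df("y").isNotNull)
  filtered.printSchema()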

On Thu, Jul 30, 2015 at 11:19 AM, martinibus77 <martin.se...@googlemail.com>
wrote:

> Hi all,
>
> 1. *Columns in dataframes can be nullable and not nullable. Having a
> nullable column of Doubles, I can use the following Scala code to filter
> all
> "non-null" rows:*
>
>   val df = ..... // some code that creates a DataFrame
>   df.filter( df("columnname").isNotNull() )
>
> +-+-----+----+
> |x|    a|   y|
> +-+-----+----+
> |1|hello|null|
> |2|  bob|   5|
> +-+-----+----+
>
> root
>  |-- x: integer (nullable = false)
>  |-- a: string (nullable = true)
>  |-- y: integer (nullable = true)
>
> And with the filter expression
> +-+---+-+
> |x|  a|y|
> +-+---+-+
> |2|bob|5|
> +-+---+-+
>
>
> Unfortunately, while this works for a nullable column (according to
> df.printSchema), it does not work for a column that is not nullable:
>
>
> +-+-----+----+
> |x|    a|   y|
> +-+-----+----+
> |1|hello|null|
> |2|  bob|   5|
> +-+-----+----+
>
> root
>  |-- x: integer (nullable = false)
>  |-- a: string (nullable = true)
>  |-- y: integer (nullable = false)
>
> +-+-----+----+
> |x|    a|   y|
> +-+-----+----+
> |1|hello|null|
> |2|  bob|   5|
> +-+-----+----+
>
> such that the output is not affected by the filter. Is this intended?
>
>
> 2. *What is the cheapest way (in terms of performance) to turn a
> non-nullable column into a nullable column?
> I came up with this:*
>
>   import org.apache.spark.sql.DataFrame
>   import org.apache.spark.sql.types.{StructField, StructType}
>
>   /**
>    * Set whether a column is nullable.
>    * @param df source DataFrame
>    * @param cn the column name to change
>    * @param nullable the flag to set, such that the column is either
>    * nullable or not
>    */
>   def setNullableStateOfColumn(df: DataFrame, cn: String,
>       nullable: Boolean): DataFrame = {
>     val schema = df.schema
>     val newSchema = StructType(schema.map {
>       case StructField(c, t, _, m) if c.equals(cn) =>
>         StructField(c, t, nullable = nullable, m)
>       case y: StructField => y
>     })
>     df.sqlContext.createDataFrame(df.rdd, newSchema)
>   }
>
> Is there a cheaper solution?
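>
> A possibly cheaper alternative would stay in the DataFrame API instead
> of round-tripping through df.rdd, which converts every row to an
> RDD[Row] and back. This is a sketch, not benchmarked, and the helper
> name makeNullable is made up for the example:
>
>   import org.apache.spark.sql.DataFrame
>   import org.apache.spark.sql.functions.{col, when}
>
>   // `when` without an `otherwise` produces null for unmatched rows,
>   // so Spark marks the resulting column as nullable even when the
>   // input column was not.
>   def makeNullable(df: DataFrame, cn: String): DataFrame =
>     df.withColumn(cn, when(col(cn).isNotNull, col(cn)))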
>
> 3. *Any comments?*
>
> Cheers and thx in advance,
>
> Martin
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-DataFrame-Nullable-column-and-filtering-tp24087.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>
