Dear Michael, dear all,

Motivation:
object OtherEntities {
  case class Record(x: Int, a: String)
  case class Mapping(x: Int, y: Int)

  val records  = Seq(Record(1, "hello"), Record(2, "bob"))
  val mappings = Seq(Mapping(2, 5))
}

Now I want to perform a *left outer join* on records and mappings, with the
JOIN criterion on columns (recordDF("x") === mappingDF("x")). The shorthand
is in *leftOuterJoinWithRemovalOfEqualColumn*:

val sqlContext = new SQLContext(sc)
// used to implicitly convert an RDD to a DataFrame
import sqlContext.implicits._

val recordDF  = sc.parallelize(OtherEntities.records, 4).toDF()
val mappingDF = sc.parallelize(OtherEntities.mappings, 4).toDF()

val joinedDF = recordDF.leftOuterJoinWithRemovalOfEqualColumn(mappingDF, "x")
joinedDF.filter(joinedDF("y").isNotNull).show

Currently, the output is

+-+-----+----+
|x|    a|   y|
+-+-----+----+
|1|hello|null|
|2|  bob|   5|
+-+-----+----+

instead of

+-+---+-+
|x|  a|y|
+-+---+-+
|2|bob|5|
+-+---+-+

The latter output can be achieved by changing nullable=false to nullable=true
as described in my first post. *Thus, I need this schema modification to make
outer joins work.*

Cheers and thanks,
Martin

2015-07-30 20:23 GMT+02:00 Michael Armbrust <mich...@databricks.com>:

> We don't yet update nullability information based on predicates, as we
> don't actually leverage this information in many places yet. Why do you
> want to update the schema?
>
> On Thu, Jul 30, 2015 at 11:19 AM, martinibus77 <
> martin.se...@googlemail.com> wrote:
>
>> Hi all,
>>
>> 1. *Columns in DataFrames can be nullable or not nullable. Given a
>> nullable column of Doubles, I can use the following Scala code to filter
>> all "non-null" rows:*
>>
>> val df = .....
>> // some code that creates a DataFrame
>> df.filter( df("columnname").isNotNull() )
>>
>> +-+-----+----+
>> |x|    a|   y|
>> +-+-----+----+
>> |1|hello|null|
>> |2|  bob|   5|
>> +-+-----+----+
>>
>> root
>>  |-- x: integer (nullable = false)
>>  |-- a: string (nullable = true)
>>  |-- y: integer (nullable = true)
>>
>> And with the filter expression:
>>
>> +-+---+-+
>> |x|  a|y|
>> +-+---+-+
>> |2|bob|5|
>> +-+---+-+
>>
>> Unfortunately, while this is true for a nullable column (according to
>> df.printSchema), it is not true for a column that is not nullable:
>>
>> +-+-----+----+
>> |x|    a|   y|
>> +-+-----+----+
>> |1|hello|null|
>> |2|  bob|   5|
>> +-+-----+----+
>>
>> root
>>  |-- x: integer (nullable = false)
>>  |-- a: string (nullable = true)
>>  |-- y: integer (nullable = false)
>>
>> +-+-----+----+
>> |x|    a|   y|
>> +-+-----+----+
>> |1|hello|null|
>> |2|  bob|   5|
>> +-+-----+----+
>>
>> such that the output is not affected by the filter. Is this intended?
>>
>> 2. *What is the cheapest way (in terms of performance) to turn a
>> non-nullable column into a nullable one? I came up with this:*
>>
>> /**
>>  * Set whether a column is nullable.
>>  * @param df source DataFrame
>>  * @param cn the name of the column to change
>>  * @param nullable the flag to set, such that the column is either
>>  *                 nullable or not
>>  */
>> def setNullableStateOfColumn( df: DataFrame, cn: String,
>>     nullable: Boolean) : DataFrame = {
>>   val schema = df.schema
>>   val newSchema = StructType(schema.map {
>>     case StructField( c, t, _, m) if c.equals(cn) =>
>>       StructField( c, t, nullable = nullable, m)
>>     case y: StructField => y
>>   })
>>   df.sqlContext.createDataFrame( df.rdd, newSchema)
>> }
>>
>> Is there a cheaper solution?
>>
>> 3.
>> *Any comments?*
>>
>> Cheers and thx in advance,
>>
>> Martin
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-DataFrame-Nullable-column-and-filtering-tp24087.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
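[Editor's note: *leftOuterJoinWithRemovalOfEqualColumn* is Martin's own helper, not part of the Spark API. The sketch below shows, under that assumption, how such a helper and the nullability fix from the thread could be written with the modern SparkSession API (Spark 2.x+), where `join` with `usingColumns` keeps a single copy of the join column. The object and method names here are illustrative, not the actual implementation from the thread.]

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{StructField, StructType}

object NullableJoinSketch {
  case class Record(x: Int, a: String)
  case class Mapping(x: Int, y: Int)

  // Rebuild the DataFrame with one column's nullability flipped --
  // the same approach as setNullableStateOfColumn in the thread.
  def setNullable(df: DataFrame, cn: String, nullable: Boolean): DataFrame = {
    val newSchema = StructType(df.schema.map {
      case StructField(c, t, _, m) if c == cn => StructField(c, t, nullable, m)
      case f => f
    })
    df.sparkSession.createDataFrame(df.rdd, newSchema)
  }

  implicit class JoinOps(left: DataFrame) {
    // Join on a shared column name. Spark's join(right, usingColumns,
    // joinType) variant emits only one copy of the join column, which is
    // the "removal of the equal column" the thread refers to.
    def leftOuterJoinWithRemovalOfEqualColumn(right: DataFrame,
                                              col: String): DataFrame =
      left.join(right, Seq(col), "left_outer")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("nullable-join-sketch").getOrCreate()
    import spark.implicits._

    val records  = Seq(Record(1, "hello"), Record(2, "bob")).toDF()
    val mappings = Seq(Mapping(2, 5)).toDF()

    val joined = records.leftOuterJoinWithRemovalOfEqualColumn(mappings, "x")
    // In modern Spark the right-side column y is already marked nullable
    // after an outer join, so the filter drops the null row as intended.
    joined.filter(joined("y").isNotNull).show()
    spark.stop()
  }
}
```

In current Spark versions the outer join itself marks the right-side columns nullable in the output schema, so the `isNotNull` filter behaves as Martin expected; on the 1.x versions discussed in the thread, the `setNullable`-style schema rewrite was the workaround.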