Dear Michael, dear all,

Motivation:
object OtherEntities {
  case class Record(x: Int, a: String)
  case class Mapping(x: Int, y: Int)

  val records  = Seq(Record(1, "hello"), Record(2, "bob"))
  val mappings = Seq(Mapping(2, 5))
}

Now I want to perform a *left outer join* on records and mappings, with the
JOIN criterion on columns (recordDF("x") === mappingDF("x")). The shorthand
is in *leftOuterJoinWithRemovalOfEqualColumn*:

val sqlContext = new SQLContext(sc)
// used to implicitly convert an RDD to a DataFrame
import sqlContext.implicits._

val recordDF  = sc.parallelize(OtherEntities.records, 4).toDF()
val mappingDF = sc.parallelize(OtherEntities.mappings, 4).toDF()

val joinedDF = recordDF.leftOuterJoinWithRemovalOfEqualColumn(mappingDF, "x")
joinedDF.filter(joinedDF("y").isNotNull).show

Currently, the output is

+-+-----+----+
|x|    a|   y|
+-+-----+----+
|1|hello|null|
|2|  bob|   5|
+-+-----+----+

instead of

+-+---+-+
|x|  a|y|
+-+---+-+
|2|bob|5|
+-+---+-+

The latter output can be achieved by changing nullable=false to nullable=true
as described in my first post. *Thus, I need this schema modification to make
outer joins work.*

Cheers and thanks,
Martin

2015-07-30 20:23 GMT+02:00 Michael Armbrust <mich...@databricks.com>:

> We don't yet update nullability information based on predicates, as we
> don't actually leverage this information in many places yet. Why do you
> want to update the schema?
>
> On Thu, Jul 30, 2015 at 11:19 AM, martinibus77 <
> martin.se...@googlemail.com> wrote:
>
>> Hi all,
>>
>> 1. *Columns in DataFrames can be nullable or not nullable. Given a
>> nullable column of Doubles, I can use the following Scala code to filter
>> all "non-null" rows:*
>>
>> val df = .....
>> // some code that creates a DataFrame
>> df.filter( df("columnname").isNotNull() )
>>
>> +-+-----+----+
>> |x|    a|   y|
>> +-+-----+----+
>> |1|hello|null|
>> |2|  bob|   5|
>> +-+-----+----+
>>
>> root
>>  |-- x: integer (nullable = false)
>>  |-- a: string (nullable = true)
>>  |-- y: integer (nullable = true)
>>
>> And with the filter expression:
>>
>> +-+---+-+
>> |x|  a|y|
>> +-+---+-+
>> |2|bob|5|
>> +-+---+-+
>>
>> Unfortunately, while this is true for a nullable column (according to
>> df.printSchema), it is not true for a column that is not nullable:
>>
>> +-+-----+----+
>> |x|    a|   y|
>> +-+-----+----+
>> |1|hello|null|
>> |2|  bob|   5|
>> +-+-----+----+
>>
>> root
>>  |-- x: integer (nullable = false)
>>  |-- a: string (nullable = true)
>>  |-- y: integer (nullable = false)
>>
>> +-+-----+----+
>> |x|    a|   y|
>> +-+-----+----+
>> |1|hello|null|
>> |2|  bob|   5|
>> +-+-----+----+
>>
>> such that the output is not affected by the filter. Is this intended?
>>
>> 2. *What is the cheapest way (in terms of performance) to turn a
>> non-nullable column into a nullable one? I came up with this:*
>>
>> /**
>>  * Set whether a column is nullable.
>>  * @param df source DataFrame
>>  * @param cn the name of the column to change
>>  * @param nullable the flag to set, such that the column is either
>>  *                 nullable or not
>>  */
>> def setNullableStateOfColumn( df: DataFrame, cn: String,
>>     nullable: Boolean) : DataFrame = {
>>   val schema = df.schema
>>   val newSchema = StructType(schema.map {
>>     case StructField( c, t, _, m) if c.equals(cn) =>
>>       StructField( c, t, nullable = nullable, m)
>>     case y: StructField => y
>>   })
>>   df.sqlContext.createDataFrame( df.rdd, newSchema)
>> }
>>
>> Is there a cheaper solution?
>>
>> 3.
>> *Any comments?*
>>
>> Cheers and thx in advance,
>>
>> Martin
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-DataFrame-Nullable-column-and-filtering-tp24087.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
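[Editor's note: *leftOuterJoinWithRemovalOfEqualColumn* is Martin's own helper, not part of the Spark API. The sketch below shows, under that assumption, how such a helper and the nullability fix from the thread could be written with the modern SparkSession API (Spark 2.x+), where `join` with `usingColumns` keeps a single copy of the join column. The object and method names here are illustrative, not the actual implementation from the thread.]

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{StructField, StructType}

object NullableJoinSketch {
  case class Record(x: Int, a: String)
  case class Mapping(x: Int, y: Int)

  // Rebuild the DataFrame with one column's nullability flipped --
  // the same approach as setNullableStateOfColumn in the thread.
  def setNullable(df: DataFrame, cn: String, nullable: Boolean): DataFrame = {
    val newSchema = StructType(df.schema.map {
      case StructField(c, t, _, m) if c == cn => StructField(c, t, nullable, m)
      case f => f
    })
    df.sparkSession.createDataFrame(df.rdd, newSchema)
  }

  implicit class JoinOps(left: DataFrame) {
    // Join on a shared column name. Spark's join(right, usingColumns,
    // joinType) variant emits only one copy of the join column, which is
    // the "removal of the equal column" the thread refers to.
    def leftOuterJoinWithRemovalOfEqualColumn(right: DataFrame,
                                              col: String): DataFrame =
      left.join(right, Seq(col), "left_outer")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("nullable-join-sketch").getOrCreate()
    import spark.implicits._

    val records  = Seq(Record(1, "hello"), Record(2, "bob")).toDF()
    val mappings = Seq(Mapping(2, 5)).toDF()

    val joined = records.leftOuterJoinWithRemovalOfEqualColumn(mappings, "x")
    // In modern Spark the right-side column y is already marked nullable
    // after an outer join, so the filter drops the null row as intended.
    joined.filter(joined("y").isNotNull).show()
    spark.stop()
  }
}
```

In current Spark versions the outer join itself marks the right-side columns nullable in the output schema, so the `isNotNull` filter behaves as Martin expected; on the 1.x versions discussed in the thread, the `setNullable`-style schema rewrite was the workaround.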