I checked it; it does seem to be a bug. Could you create a JIRA now, please?

---Original---
From: "Don Drake" <dondr...@gmail.com>
Date: 2017/2/7 01:26:59
To: "user" <user@spark.apache.org>
Subject: Re: Spark 2 - Creating datasets from dataframes with extra columns
This seems like a bug to me; the schemas should match.

scala> import org.apache.spark.sql.Encoders
import org.apache.spark.sql.Encoders

scala> val fEncoder = Encoders.product[F]
fEncoder: org.apache.spark.sql.Encoder[F] = class[f1[0]: string, f2[0]: string, f3[0]: string]

scala> fEncoder.schema == ds.schema
res2: Boolean = false

scala> ds.schema
res3: org.apache.spark.sql.types.StructType = StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), StructField(f3,StringType,true), StructField(c4,StringType,true))

scala> fEncoder.schema
res4: org.apache.spark.sql.types.StructType = StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), StructField(f3,StringType,true))

I'll open a JIRA.

-Don

On Thu, Feb 2, 2017 at 2:46 PM, Don Drake <dondr...@gmail.com> wrote:

In 1.6, when you created a Dataset from a DataFrame that had extra columns, the columns not in the case class were dropped from the Dataset.

For example, in 1.6, the column c4 is gone:

scala> case class F(f1: String, f2: String, f3: String)
defined class F

scala> import sqlContext.implicits._
import sqlContext.implicits._

scala> val df = Seq(("a","b","c","x"), ("d","e","f","y"), ("h","i","j","z")).toDF("f1", "f2", "f3", "c4")
df: org.apache.spark.sql.DataFrame = [f1: string, f2: string, f3: string, c4: string]

scala> val ds = df.as[F]
ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string, f3: string]

scala> ds.show
+---+---+---+
| f1| f2| f3|
+---+---+---+
|  a|  b|  c|
|  d|  e|  f|
|  h|  i|  j|
+---+---+---+

This seems to have changed in Spark 2.0 and also 2.1.

Spark 2.1.0:

scala> case class F(f1: String, f2: String, f3: String)
defined class F

scala> import spark.implicits._
import spark.implicits._

scala> val df = Seq(("a","b","c","x"), ("d","e","f","y"), ("h","i","j","z")).toDF("f1", "f2", "f3", "c4")
df: org.apache.spark.sql.DataFrame = [f1: string, f2: string ... 2 more fields]

scala> val ds = df.as[F]
ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string ... 2 more fields]

scala> ds.show
+---+---+---+---+
| f1| f2| f3| c4|
+---+---+---+---+
|  a|  b|  c|  x|
|  d|  e|  f|  y|
|  h|  i|  j|  z|
+---+---+---+---+

Is there a way to get a Dataset that conforms to the case class in Spark 2.1.0? Basically, I'm attempting to use the case class to define an output schema, and these extra columns are getting in the way.

Thanks.

-Don

--
Donald Drake
Drake Consulting
http://www.drakeconsulting.com/
https://twitter.com/dondrake
800-733-2143
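One workaround sketch until the JIRA is resolved: project the DataFrame onto exactly the fields the encoder knows about before calling .as[F]. This assumes a spark-shell session (so `spark` and its implicits are in scope); it works around the behavior rather than fixing it.

```scala
// Workaround sketch, assuming a spark-shell session where `spark` is the SparkSession.
import org.apache.spark.sql.{Dataset, Encoders}
import spark.implicits._

case class F(f1: String, f2: String, f3: String)

val df = Seq(("a", "b", "c", "x"), ("d", "e", "f", "y"), ("h", "i", "j", "z"))
  .toDF("f1", "f2", "f3", "c4")

// Derive the column list from the encoder's schema so it stays in sync
// with the case class, then select only those columns before converting.
val cols = Encoders.product[F].schema.fieldNames.map(df.col)
val ds: Dataset[F] = df.select(cols: _*).as[F]

// ds.schema now contains only f1, f2, f3; the extra c4 column is dropped.
```

Deriving the column names from Encoders.product[F].schema (rather than hard-coding them) means the projection follows the case class if fields are added later.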