I checked it; it seems to be a bug. Could you create a JIRA now, please?

---Original---
From: "Don Drake"<dondr...@gmail.com>
Date: 2017/2/7 01:26:59
To: "user"<user@spark.apache.org>;
Subject: Re: Spark 2 - Creating datasets from dataframes with extra columns


This seems like a bug to me; the schemas should match.

scala> import org.apache.spark.sql.Encoders
import org.apache.spark.sql.Encoders


scala> val fEncoder = Encoders.product[F]
fEncoder: org.apache.spark.sql.Encoder[F] = class[f1[0]: string, f2[0]: string, 
f3[0]: string]



scala> fEncoder.schema == ds.schema
res2: Boolean = false



scala> ds.schema
res3: org.apache.spark.sql.types.StructType = 
StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), 
StructField(f3,StringType,true), StructField(c4,StringType,true))


scala> fEncoder.schema
res4: org.apache.spark.sql.types.StructType = 
StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), 
StructField(f3,StringType,true))
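
For anyone else hitting this, a small guard (just a sketch, untested across Spark versions) makes the mismatch fail fast instead of silently carrying extra columns through:

val expected = Encoders.product[F].schema
// Fail fast if as[F] kept columns beyond the case class.
require(ds.schema == expected, s"schema mismatch: got ${ds.schema}, expected $expected")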



I'll open a JIRA.


-Don


On Thu, Feb 2, 2017 at 2:46 PM, Don Drake <dondr...@gmail.com> wrote:
In 1.6, when you created a Dataset from a Dataframe that had extra columns, the 
columns not in the case class were dropped from the Dataset.

For example in 1.6, the column c4 is gone:


 
scala> case class F(f1: String, f2: String, f3:String)
defined class F


scala> import sqlContext.implicits._
import sqlContext.implicits._


scala> val df = Seq(("a","b","c","x"), ("d", "e", "f","y"), ("h", "i", "j","z")).toDF("f1", "f2", "f3", "c4")
df: org.apache.spark.sql.DataFrame = [f1: string, f2: string, f3: string, c4: string]


scala> val ds = df.as[F]
ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string, f3: string]


scala> ds.show
+---+---+---+
| f1| f2| f3|
+---+---+---+
|  a|  b|  c|
|  d|  e|  f|
|  h|  i|  j|
+---+---+---+


This seems to have changed in Spark 2.0 and also 2.1:


Spark 2.1.0:


scala> case class F(f1: String, f2: String, f3:String)
defined class F


scala> import spark.implicits._
import spark.implicits._


scala> val df = Seq(("a","b","c","x"), ("d", "e", "f","y"), ("h", "i", 
"j","z")).toDF("f1", "f2", "f3", "c4")
df: org.apache.spark.sql.DataFrame = [f1: string, f2: string ... 2 more fields]


scala> val ds = df.as[F]
ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string ... 2 more fields]


scala> ds.show
+---+---+---+---+
| f1| f2| f3| c4|
+---+---+---+---+
|  a|  b|  c|  x|
|  d|  e|  f|  y|
|  h|  i|  j|  z|
+---+---+---+---+



Is there a way to get a Dataset that conforms to the case class in Spark 2.1.0? Basically, I'm attempting to use the case class to define an output schema, and these extra columns are getting in the way.
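
One workaround I'm experimenting with (a sketch, only tried on this toy example) is to select the case class's columns explicitly before converting, deriving the column list from the encoder so it can't drift out of sync with F:

import org.apache.spark.sql.Encoders
import org.apache.spark.sql.functions.col

// Keep only the fields F declares, then convert; c4 is dropped explicitly.
val fCols = Encoders.product[F].schema.fieldNames.map(col)
val dsNarrow = df.select(fCols: _*).as[F]

dsNarrow.schema should then line up with Encoders.product[F].schema, but it would be nice if as[F] handled this on its own.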


Thanks.


-Don


-- 
Donald Drake
Drake Consulting
http://www.drakeconsulting.com/
https://twitter.com/dondrake
800-733-2143

-- 
Donald Drake
Drake Consulting
http://www.drakeconsulting.com/
https://twitter.com/dondrake
800-733-2143
